Pattern matching \w or \W

Post by **dfhtextpipe** » Thu Oct 29, 2015 1:39 am

Just been caught out by the following subtle qualification in pattern matching with \w or \W (and any other "word" related patterns).

A "word" character is any letter or digit or the underscore character. The definition of letters and digits may vary, for example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.

Here's the problem:

I'm processing some French text files, yet my PC uses the English locale.
I need the "word" patterns to match any letter in the French alphabet, including characters with diacritics (and also the ligatures).
See https://en.wikipedia.org/wiki/French_orthography

Being British, I don't want to switch my PC to use the French locale.
That would affect the whole of Windows, which would be overkill and confusing.

It would therefore be useful to have a TextPipe filter to switch the effective locale for subsequent filters.
It should only affect how TextPipe works, without changing the whole of Windows.

Most of my activities are with UTF-8 encoded files.
For the avoidance of doubt, I don't want to change the code page for the files being processed.

It's rather strange that TextPipe has a dependency on the locale for the \w and \W patterns.
cf. In Notepad++, the patterns \w and \W don't seem to be locale dependent.

NB. French is only one example - I process text files from a wide variety of languages.
These occasionally include some with non-Roman scripts.

Best regards,

David

Post by **DataMystic Support** » Thu Oct 29, 2015 8:41 am

We use the PCRE library, and it has a locale dependency, which has both its uses and complications.

You can use [a-z0-9_]+ to capture words without locale differences.

It turns out that PCRE can cope with a different locale for each pattern. It relies on a locale name like "french" to build a separate set of character tables.

The PCRE locale support documentation - http://www.pcre.org/original/doc/html/p ... html#SEC13 - indicates:

"As more and more applications change to using Unicode, the need for this locale support is expected to die away."

I don't think we can build this unless we have more demand.

Post by **dfhtextpipe** » Thu Oct 29, 2015 9:42 pm

Thanks for the explanation, Simon

I needed to capture words from a French source text (UTF-8) that also includes all sorts of punctuation and some added markup.
btw. The text also includes some special characters with codepoint > 256.

Your proposal to use "[a-z0-9_]+" simply doesn't cut it for this task!
The class would need to be extended to include all the accented letters, etc.
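
For illustration only, an extended class might look like the Python sketch below. Python's `re` module is used here as a stand-in and is not TextPipe's engine; the name `FRENCH_EXTRAS` and the letter list are assumptions based on French orthography (accented letters plus the æ and œ ligatures). The list would need revisiting for every other language, which is the weakness of this approach.

```python
import re

# Assumed set of non-ASCII letters used in French (lower and upper case).
FRENCH_EXTRAS = "àâæçéèêëîïôœùûüÿÀÂÆÇÉÈÊËÎÏÔŒÙÛÜŸ"

# Explicit "word" class: ASCII letters, digits, underscore, plus the extras.
WORD = re.compile("[A-Za-z0-9_" + FRENCH_EXTRAS + "]+")

# The apostrophe is not in the class, so "l'été" splits into "l" and "été".
words = WORD.findall("Le cœur de l'été, déjà naïf")
print(words)
```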

I discovered the difficulty when I wrote a filter to extract and count all the hyphenated French words.

My first attempt used the extract pattern "[^-\w]((\w+-)+\w+)[^-\w]"

The central capture pattern "((\w+-)+\w+)" was designed to catch words with one or more hyphens.
The two outer classes were to ensure that "non-word" characters were excluded.
NB. At this stage, I wasn't bothered about French apostrophes.

Because \w only matches [A-Za-z0-9_] in TextPipe under my English locale, this filter missed every hyphenated word containing French characters with diacritics.
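
The difference can be reproduced with a minimal Python sketch, assuming Python's built-in `re` module as a stand-in (it is neither PCRE nor TextPipe). In Python, the `re.ASCII` flag restricts \w to [A-Za-z0-9_], while text patterns are Unicode-aware by default. The sample sentence is made up for illustration.

```python
import re

# The extract pattern from the post, unchanged.
PATTERN = r"[^-\w]((\w+-)+\w+)[^-\w]"

# Made-up sample text, padded with spaces so every word has a delimiter.
text = " Un arc-en-ciel, peut-être à Saint-Étienne. "

# ASCII-only \w: accented letters count as non-word characters.
ascii_hits = [m.group(1) for m in re.finditer(PATTERN, text, flags=re.ASCII)]

# Unicode-aware \w (Python's default for str patterns).
unicode_hits = [m.group(1) for m in re.finditer(PATTERN, text)]

print(ascii_hits)
print(unicode_hits)
```

Under these assumptions, only the Unicode-aware run picks up "peut-être" and "Saint-Étienne"; the ASCII-only run finds "arc-en-ciel" alone.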

When I tried the same regexp pattern with Notepad++ to count the number of matches, it actually found these OK.

I eventually resorted to using \S instead of \w, followed by a filter to remove the unwanted punctuation before the count-duplicates filter.

The extract pattern is now "[^-\S]((\S+-)+\S+)[^-\S]"
Then I carefully removed most of the punctuation, while retaining apostrophes, for example.

This exercise illustrates the problems caused by PCRE in TextPipe being locale dependent like this.

I had a look at the link to PCRE stuff.

PCRE handles caseless matching, and determines whether characters are letters, digits, or whatever, by reference to a set of tables, indexed by character code point. When running in UTF-8 mode, or in the 16- or 32-bit libraries, this applies only to characters with code points less than 256. By default, higher-valued code points never match escapes such as \w or \d. However, if PCRE is built with Unicode property support, all characters can be tested with \p and \P, or, alternatively, the PCRE_UCP option can be set when a pattern is compiled; this causes \w and friends to use Unicode property support instead of the built-in tables.

The use of locales with Unicode is discouraged. If you are handling characters with code points greater than 128, you should either use Unicode support, or use locales, but not try to mix the two.

It seems to me that Notepad++ may have used the PCRE library built with Unicode property support, whereas TextPipe does not.

Best regards,

David

Post by **DataMystic Support** » Thu Oct 29, 2015 11:24 pm

Thanks! I will look into it.

Post by **dfhtextpipe** » Fri Nov 06, 2015 7:25 pm

Unicode character properties
Unicode defines several properties for each character. Patterns in PCRE can match these properties. e.g. \p{Ps}.*?\p{Pe} would match a string beginning with any "opening punctuation" and ending with any "close punctuation" such as "[abc]". Since version 8.10, matching of certain "normal" metacharacters can be driven by Unicode properties when the compile option PCRE_UCP is set. The option can be set for a pattern by including (*UCP) at the start of pattern. The option alters behavior of the following metacharacters: \B, \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes. For example, the set of characters matched by \w (word characters) is expanded to include letters and accented letters as defined by Unicode properties. Such matching is slower than the normal (ASCII-only) non-UCP alternative. Note that the UCP option requires the PCRE library to have been built to include UTF-8 and Unicode property support. Support for UTF-16 is included in version 8.30 while support for UTF-32 was added in version 8.32.

See https://en.wikipedia.org/wiki/Perl_Comp ... xpressions
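
To make the \p{Ps}.*?\p{Pe} example in the quoted passage concrete, here is a rough Python sketch. Python's built-in `re` module does not support \p{...}, so this uses `unicodedata.category` instead; the helper name `first_bracketed_span` is hypothetical, and the code only approximates what a PCRE build with Unicode property support would do.

```python
import unicodedata


# Hypothetical helper: roughly what \p{Ps}.*?\p{Pe} would find.
def first_bracketed_span(text):
    for start, ch in enumerate(text):
        if unicodedata.category(ch) != "Ps":  # Ps = opening punctuation
            continue
        for end in range(start + 1, len(text)):
            if text[end] == "\n":  # "." does not cross a newline by default
                break
            if unicodedata.category(text[end]) == "Pe":  # Pe = closing punctuation
                return text[start:end + 1]
    return None


print(first_bracketed_span("see [abc] and (def)"))
print(first_bracketed_span("x（y）z"))  # full-width parentheses
```

Because the test is on the Unicode category rather than on specific characters, the same code handles ASCII brackets and full-width parentheses alike.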

David
