Pattern matching UTF-8 support

Posted: Fri Oct 14, 2011 10:54 pm
by dfhtextpipe
The help page includes the following:
2. In a pattern, the escape sequence \x{...}, where the contents of the braces is a string of hexadecimal digits, is interpreted as a UTF-8 character whose code number is the given hexadecimal number, for example: \x{1234}. If a non-hexadecimal digit appears between the braces, the item is not recognized. This escape sequence can be used either as a literal, or within a character class.
Yet when I enter \x{1234} in the "Find pattern (perl style)" field of a replace filter, with UTF-8 enabled, TextPipe reports that "the character value in the x{...} sequence is too large".

This makes the escape sequence impossible to use as documented.

I have many UTF-8 files containing the single character U+FEFF ZERO WIDTH NO-BREAK SPACE.
I would like to remove this character. How can I do it? TextPipe does not allow \x{feff}.

For U+FEFF, the equivalent UTF-8 byte sequence is \xEF\xBB\xBF, yet TextPipe doesn't find this pattern either!
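
For anyone who wants to check the byte values or needs a stopgap outside TextPipe, here is a minimal sketch in Python. It is not TextPipe or PCRE syntax: Python's re module writes the character as \ufeff rather than \x{feff}, and the helper names strip_feff_text and strip_feff_bytes are made up for this illustration. It confirms that U+FEFF encodes to EF BB BF in UTF-8 and removes the character either from decoded text or from raw bytes.

```python
import re

# U+FEFF ZERO WIDTH NO-BREAK SPACE (also used as the byte order mark).
ZWNBSP = "\ufeff"

# In UTF-8 this code point is encoded as the three bytes EF BB BF.
assert ZWNBSP.encode("utf-8") == b"\xef\xbb\xbf"


def strip_feff_text(text: str) -> str:
    """Remove every U+FEFF from text that has already been decoded."""
    return re.sub("\ufeff", "", text)


def strip_feff_bytes(data: bytes) -> bytes:
    """Remove every EF BB BF sequence from raw UTF-8 bytes."""
    return re.sub(rb"\xef\xbb\xbf", b"", data)


# Hypothetical sample with U+FEFF at the start and in the middle.
sample = "\ufeffHello\ufeff, world"

cleaned_text = strip_feff_text(sample)
cleaned_bytes = strip_feff_bytes(sample.encode("utf-8"))

print(cleaned_text)
print(cleaned_bytes.decode("utf-8"))
```

The byte-level version is only safe on valid UTF-8 input, where the sequence EF BB BF cannot occur as part of any other character.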

Re: Pattern matching UTF-8 support

Posted: Tue Oct 18, 2011 8:49 am
by DataMystic Support
Hi David,

I checked it out with the PCRE guys. You need to enable the UTF-8 flag for this to work.

Re: Pattern matching UTF-8 support

Posted: Fri Oct 28, 2011 4:20 am
by dfhtextpipe
I had enabled UTF-8 support. That's the point - I did all the right things but it still won't work.

David

Re: Pattern matching UTF-8 support

Posted: Fri Oct 28, 2011 10:20 am
by DataMystic Support
Hi David,

Just checked and you are right. This is a display validation issue only - the filter works properly, but the error message it gives is not correct.

It will be fixed in the next release.

Re: Pattern matching UTF-8 support

Posted: Sat Oct 29, 2011 1:01 am
by dfhtextpipe
Thanks for the response, Simon.

I was almost starting to imagine I was going mad, but you have allayed my fears!

David