Backwards compatibility & UTF-8 Byte Codes

dfhtextpipe · Post by **dfhtextpipe** » Wed Jul 24, 2013 3:57 am

Perl replace filters in which the replace pattern uses UTF-8 Byte Codes no longer work!

Prior to TextPipe version 9.5, this was the ONLY way that some Unicode characters could be used as UTF-8 supported replacements.

The serious implication is that the latest version of TextPipe is NOT backwards compatible!

Here's what my subfilter used to do.

Perl pattern [<<] with [\xE2\x80\x9C]
   [X] Match case
   [ ] Whole words only
   [ ] Case sensitive replace
   [ ] Prompt on replace
   [ ] Skip prompt if identical
   [ ] First only
   [ ] Extract matches
   Maximum text buffer size 4096
   [ ] Maximum match (greedy)
   [ ] Allow comments
   [ ] '.' matches newline
   [X] UTF-8 Support

 Further search/replace list phrases (CSV format):
 >>,\xE2\x80\x9D
 <,\xE2\x80\x98
 >,\xE2\x80\x99
 \x7c,

I now get gibberish instead of the correct replacements.
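
One plausible explanation for the gibberish, offered as an assumption rather than a confirmed diagnosis: if each byte code in the replace pattern is first read as a separate Unicode character and then converted to UTF-8, every byte above 0x7F gets encoded a second time. A minimal Python sketch of that effect (Python is used purely for illustration; nothing here reflects TextPipe's actual internals):

```python
# Intended replacement: the three UTF-8 bytes for U+201C (left double quotation mark).
intended = b"\xE2\x80\x9C"
assert intended.decode("utf-8") == "\u201C"

# Assumed failure mode: each byte is treated as its own code point
# (U+00E2, U+0080, U+009C) and then converted to UTF-8 again.
as_code_points = intended.decode("latin-1")
double_encoded = as_code_points.encode("utf-8")

print(double_encoded.hex(" "))  # six bytes instead of the intended three
```

If that is what happens, the output file would contain six bytes where three were intended, which a UTF-8 viewer shows as several unrelated characters.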

David

dfhtextpipe · Post by **dfhtextpipe** » Wed Jul 24, 2013 4:00 am

Version 9.5.1 release notes state:

When Utf8 Support is checked for replace filters, Unicode characters in the
search for and replace with fields are now correctly converted to UTF-8 prior
to being processed by the PCRE engine.

This desirable improvement should NOT have broken earlier functionality involving UTF-8 Byte Codes.

Post by **DataMystic Support** » Wed Jul 24, 2013 1:00 pm

Ok, so the forthcoming 9.5.2 should, on loading an older filter file, convert byte codes of replace strings to Unicode when utf-8 support is checked and perl regex mode is selected?

dfhtextpipe · Post by **dfhtextpipe** » Wed Jul 24, 2013 5:09 pm

A proper fix can't come soon enough - I was about to revert to version 9.5 from 9.5.1

A good reminder that software testing is not a second class activity.

However, I'm not convinced that your proposed method is a good solution.

There will always be situations in which UTF-8 Byte Codes are still required.
Not least among these are code points whose characters are invisible to the human eye when pasted into the replace field.
The same applies to non-Roman characters, for which you may not have Windows font support for the replacement characters.
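
To illustrate why byte codes help with invisible characters, here is a small Python sketch that prints the UTF-8 byte codes for a given character. The helper name `to_byte_codes` is hypothetical, and the zero-width space is just one example of a code point that cannot be seen once pasted:

```python
def to_byte_codes(text: str) -> str:
    """Return text as a run of \\xHH escapes, one per UTF-8 byte."""
    return "".join(f"\\x{b:02X}" for b in text.encode("utf-8"))

# U+200B ZERO WIDTH SPACE shows nothing when pasted into a field,
# but its byte codes are unambiguous.
print(to_byte_codes("\u200B"))  # \xE2\x80\x8B
```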

Font-support should never be assumed.
Programmers like me often work on non-Roman scripts. I may or may not have installed a suitable font.
I can usually check output results using a Unicode editor that does not depend on installed fonts, e.g. SC Unipad.

Far better would be to allow either UTF-8 Byte Codes or Unicode characters in the replace patterns.

If there is an ambiguity issue that makes this difficult to implement, then perhaps the way forward
would be to add a tick box option to the Perl patterns dialogue, to Allow UTF-8 Byte Codes (in replace).
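
As a rough sketch of what "allow either" could mean, the Python function below turns a replace pattern into output bytes, encoding literal characters as UTF-8 and passing `\xHH` escapes through as raw bytes. The function name `replacement_to_bytes` is hypothetical, and this is not a description of how TextPipe or PCRE handles replace patterns. It also ignores other escapes (an escaped backslash before `x`, for instance), which is exactly the kind of ambiguity a tick box would sidestep:

```python
import re

# Matches a two-digit hexadecimal escape such as \xE2.
_HEX_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})")

def replacement_to_bytes(pattern: str) -> bytes:
    """Literal characters become UTF-8; \\xHH escapes become single raw bytes."""
    out = bytearray()
    pos = 0
    for m in _HEX_ESCAPE.finditer(pattern):
        out += pattern[pos:m.start()].encode("utf-8")  # literal text before the escape
        out.append(int(m.group(1), 16))                # the escaped byte, unchanged
        pos = m.end()
    out += pattern[pos:].encode("utf-8")               # trailing literal text
    return bytes(out)

# Both spellings of the left double quotation mark yield the same bytes.
assert replacement_to_bytes(r"\xE2\x80\x9C") == replacement_to_bytes("\u201C")
```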

David

dfhtextpipe · Post by **dfhtextpipe** » Wed Jul 24, 2013 7:23 pm

PS. There is no mention in the PCRE help page of three-byte UTF-8 characters or four-byte UTF-8 characters.

The help page only makes reference to two-byte UTF-8 characters.

3. The original hexadecimal escape sequence, \xhh, matches a two-byte UTF-8 character if the value is greater than 127.

The same help page is primarily oriented to defining search patterns, and says very little about replace patterns.

cf. The example filter earlier in this issue made use of four patterns which are three-byte UTF-8 characters.

Yet with Unicode extending beyond the Basic Multilingual Plane, there should be adequate support for four-byte UTF-8 characters.
e.g. Those in the Supplementary Ideographic Plane

Range	Block	Code Points
20000..2A6DF	CJK Unified Ideographs Extension B	42,720
2A6E0..2A6FF	<reserved>	32
2A700..2B73F	CJK Unified Ideographs Extension C	4,160
2B740..2B81F	CJK Unified Ideographs Extension D	224
2B820..2F7FF	<reserved>	16,352
2F800..2FA1F	CJK Compatibility Ideographs Supplement	544
2FA20..2FFFF	<reserved>	1,504

These could easily come up if one were processing files involving Chinese scripts.
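
For reference, a short Python check showing that a code point from the table above needs four UTF-8 bytes, while the quotation mark used in the earlier filter needs three (Python is used only as a convenient calculator here):

```python
# U+20000 is the first code point of CJK Unified Ideographs Extension B.
ext_b_first = chr(0x20000).encode("utf-8")
print(len(ext_b_first), ext_b_first.hex(" "))  # 4 f0 a0 80 80

# U+201C, from the example filter, stays within the Basic Multilingual Plane.
left_quote = chr(0x201C).encode("utf-8")
print(len(left_quote), left_quote.hex(" "))    # 3 e2 80 9c
```

As byte codes, the first of these would be written `\xF0\xA0\x80\x80`.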

David

dfhtextpipe · Post by **dfhtextpipe** » Wed Jul 24, 2013 7:59 pm

Simon,

Based on the precautionary principle, I would never view it as a good idea to "on loading an older filter file, convert ...."

Unless you've tested all possible eventualities, there's no saying what unpredictable behaviour might ensue if changes are made to an older filter file on loading.

Best regards,

David

dfhtextpipe · Post by **dfhtextpipe** » Wed Jul 24, 2013 8:09 pm

There is also a possible misunderstanding lurking in this issue, due to the ambiguity of the new feature description.

"Converting the byte codes of replace strings to Unicode" could be done visibly or invisibly. Which is it?

Would what is displayed in the search for and replace with fields change?

Again, what about those situations in which UTF-8 Byte Codes MUST be used to avoid confusion with other PCRE escape characters?

We've discussed problems encountered during left-to-right parsing of PCRE patterns in some earlier issues.
It gets pretty complicated when processing USFM files (or RTF files as text) which already contain lots of backslash characters.

You know what an RTF file is. USFM is defined at http://paratext.org/usfm

David

Post by **DataMystic Support** » Wed Jul 24, 2013 10:29 pm

Hi David,

Perhaps it would be best NOT to apply the utf-8 transform to the replace text and just treat it as single bytes (ASCII/ANSI).

Post by **DataMystic Support** » Thu Jul 25, 2013 3:08 pm

This is the approach we're following for 9.5.2, to be released in an hour or so.

dfhtextpipe · Post by **dfhtextpipe** » Thu Jul 25, 2013 5:50 pm

Thanks, Simon.

Downloaded 9.5.2 and will report back after I've installed it.

David

dfhtextpipe · Post by **dfhtextpipe** » Thu Jul 25, 2013 8:33 pm

I have just tested v9.5.2 for my replace list filter in which the UTF-8 Byte Codes were in the replace column.

All seems OK now. If anything related comes up, I'll start a new issue.

Thanks for the speedy response.

David

Post by **DataMystic Support** » Fri Jul 26, 2013 9:00 am

I'm actually sorry it took so long!

DataMystic
