Backwards compatibility & UTF-8 Byte Codes


Moderators: DataMystic Support, Moderators

dfhtextpipe
Posts: 988
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Backwards compatibility & UTF-8 Byte Codes

Post by dfhtextpipe »

Perl replace filters in which the replace pattern uses UTF-8 Byte Codes no longer work!

Prior to TextPipe version 9.5, this was the ONLY way that some Unicode characters could be used as UTF-8 supported replacements.

The serious implication is that the latest version of TextPipe is NOT backwards compatible!

Here's what my subfilter used to do.

Code:

Perl pattern [<<] with [\xE2\x80\x9C]
   [X] Match case
   [ ] Whole words only
   [ ] Case sensitive replace
   [ ] Prompt on replace
   [ ] Skip prompt if identical
   [ ] First only
   [ ] Extract matches
   Maximum text buffer size 4096
   [ ] Maximum match (greedy)
   [ ] Allow comments
   [ ] '.' matches newline
   [X] UTF-8 Support

 Further search/replace list phrases (CSV format):
 >>,\xE2\x80\x9D
 <,\xE2\x80\x98
 >,\xE2\x80\x99
 \x7c,
 
I now get gibberish instead of the correct replacements.
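For reference, here is a minimal Python sketch (not TextPipe code) of what the byte codes in the replace column are meant to produce. The table name `REPLACE_LIST` and the helper `decode_replacements` are made up for illustration; the byte sequences are copied from the filter above, and the final `\x7c` entry, which has an empty replacement, is left out.

```python
# Replacements from the filter above, written as UTF-8 byte sequences.
REPLACE_LIST = {
    "<<": b"\xE2\x80\x9C",
    ">>": b"\xE2\x80\x9D",
    "<": b"\xE2\x80\x98",
    ">": b"\xE2\x80\x99",
}


def decode_replacements(table):
    """Map each search string to (character, 'U+XXXX') for its byte codes."""
    decoded = {}
    for search, raw in table.items():
        char = raw.decode("utf-8")  # three bytes -> one Unicode character
        decoded[search] = (char, f"U+{ord(char):04X}")
    return decoded


for search, (char, code_point) in decode_replacements(REPLACE_LIST).items():
    print(f"{search!r} -> {char} ({code_point})")
```

Each three-byte sequence decodes to a single curly quotation mark (U+201C, U+201D, U+2018, U+2019), which is the output the filter produced before 9.5.1.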

David
dfhtextpipe
Posts: 988
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Backwards compatibility & UTF-8 Byte Codes

Post by dfhtextpipe »

Version 9.5.1 release notes state:
When Utf8 Support is checked for replace filters, Unicode characters in the
search for and replace with fields are now correctly converted to UTF-8 prior
to being processed by the PCRE engine.
This desirable improvement should NOT have broken earlier functionality involving UTF-8 Byte Codes.
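One plausible explanation for the gibberish, and it is only an assumption since nothing in the release notes confirms it, is that each \xhh value in the replace field is now treated as a Unicode code point and then converted to UTF-8 a second time. A short Python sketch of that double encoding (the function name `double_encode` is hypothetical):

```python
def double_encode(byte_codes: bytes) -> bytes:
    """Treat every byte as a code point U+0000..U+00FF, then encode as UTF-8.

    This models the ASSUMED failure mode, not TextPipe's actual code.
    """
    as_characters = byte_codes.decode("latin-1")  # byte 0xE2 -> U+00E2, etc.
    return as_characters.encode("utf-8")


intended = b"\xE2\x80\x9C"          # UTF-8 for U+201C
mangled = double_encode(intended)

print(intended.hex(" "))  # e2 80 9c
print(mangled.hex(" "))   # c3 a2 c2 80 c2 9c
```

If that is what happens, the three bytes become six, and a UTF-8 viewer shows an accented letter followed by two control characters instead of a curly quote.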
David
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Backwards compatibility & UTF-8 Byte Codes

Post by DataMystic Support »

OK, so the forthcoming 9.5.2 should, on loading an older filter file, convert byte codes of replace strings to Unicode when UTF-8 support is checked and Perl regex mode is selected?
dfhtextpipe
Posts: 988
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Backwards compatibility & UTF-8 Byte Codes

Post by dfhtextpipe »

A proper fix can't come soon enough. I was about to revert from version 9.5.1 to 9.5.

A good reminder that software testing is not a second class activity.

However, I'm not convinced that your proposed method is a good solution.

There will always be situations in which UTF-8 Byte Codes are still required.
Not the least of these are code points whose characters are invisible to the human eye when pasted into the replace field.
Likewise for non-Roman characters, when you may not have Windows font support for the replacement characters.

Font-support should never be assumed.
Programmers like me often work on non-Roman scripts, and I may or may not have installed a suitable font.
I can usually check output results using a Unicode editor that does not depend on installed fonts, e.g. SC Unipad.

Far better would be to allow either UTF-8 Byte Codes or Unicode characters in the replace patterns.

If there is an ambiguity issue that makes this difficult to implement, then perhaps the way forward
would be to add a tick box option to the Perl patterns dialogue: Allow UTF-8 Byte Codes (in replace).
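To make the ambiguity concrete, here is a rough Python sketch of how such an option could behave. Everything in it is hypothetical (the helper `build_replacement` and the flag `allow_byte_codes` are invented for illustration and say nothing about how TextPipe is implemented). Literal Unicode characters are always written as UTF-8; the flag only decides whether a \xhh escape means a raw byte or a code point.

```python
import re

# Matches a \xhh escape (two hex digits). Escaped backslashes such as \\x41
# are NOT handled; this is a sketch, not a full PCRE replace-string parser.
_HEX_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})")


def build_replacement(text: str, allow_byte_codes: bool) -> bytes:
    """Turn the contents of a replace field into the bytes to be written."""
    out = bytearray()
    pos = 0
    for match in _HEX_ESCAPE.finditer(text):
        # Literal text before the escape is always encoded as UTF-8.
        out += text[pos:match.start()].encode("utf-8")
        value = int(match.group(1), 16)
        if allow_byte_codes:
            out.append(value)                  # \xE2 -> the single byte E2
        else:
            out += chr(value).encode("utf-8")  # \xE2 -> U+00E2 -> C3 A2
        pos = match.end()
    out += text[pos:].encode("utf-8")
    return bytes(out)


# With the option ticked, byte codes and a pasted character give the same bytes.
print(build_replacement(r"\xE2\x80\x9C", allow_byte_codes=True).hex(" "))
print(build_replacement("\u201c", allow_byte_codes=True).hex(" "))
```

With the flag off, the same `\xE2\x80\x9C` string would come out as six bytes, which is why the two interpretations cannot both be the default.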

David
dfhtextpipe
Posts: 988
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Backwards compatibility & UTF-8 Byte Codes

Post by dfhtextpipe »

PS. There is no mention in the PCRE help page of three-byte UTF-8 characters or four-byte UTF-8 characters.

The help page only makes reference to two-byte UTF-8 characters.
3. The original hexadecimal escape sequence, \xhh, matches a two-byte UTF-8 character if the value is greater than 127.
The same help page is primarily oriented to defining search patterns, and says very little about replace patterns.

cf. The example filter earlier in this issue made use of four patterns which are three-byte UTF-8 characters.

Yet with Unicode extending beyond the Basic Multilingual Plane, there should be adequate support for four-byte UTF-8 characters.
e.g. Those in the Supplementary Ideographic Plane

Code:

Range           Block                                      Code Points
20000..2A6DF    CJK Unified Ideographs Extension B         42,720
2A6E0..2A6FF    <reserved>                                 32
2A700..2B73F    CJK Unified Ideographs Extension C         4,160
2B740..2B81F    CJK Unified Ideographs Extension D         224
2B820..2F7FF    <reserved>                                 16,352
2F800..2FA1F    CJK Compatibility Ideographs Supplement    544
2FA20..2FFFF    <reserved>                                 1,504
These could easily come up if one were processing files involving Chinese scripts.
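As an illustration of the byte lengths involved, this small Python sketch prints the UTF-8 Byte Codes for a two-byte, a three-byte and a four-byte character. The helper name `utf8_byte_codes` is made up; the encoding itself is standard UTF-8 and does not depend on TextPipe.

```python
def utf8_byte_codes(code_point: int) -> str:
    """Return the UTF-8 encoding of a code point as a string of \\xhh escapes."""
    encoded = chr(code_point).encode("utf-8")
    return "".join(f"\\x{byte:02X}" for byte in encoded)


# U+00E9 (two bytes), U+201C (three bytes), U+20000 (four bytes,
# first code point of CJK Unified Ideographs Extension B).
for cp in (0x00E9, 0x201C, 0x20000):
    print(f"U+{cp:04X}: {utf8_byte_codes(cp)}")
```

Any code point in the ranges listed above needs four bytes, starting with \xF0.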

David
dfhtextpipe
Posts: 988
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Backwards compatibility & UTF-8 Byte Codes

Post by dfhtextpipe »

Simon,

Based on the precautionary principle, I would never view it as a good idea to "on loading an older filter file, convert ...."

Unless you've tested all possible eventualities, there's no saying what unpredictable behaviour might ensue if changes are made to an older filter file on loading.

Best regards,

David
dfhtextpipe
Posts: 988
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Backwards compatibility & UTF-8 Byte Codes

Post by dfhtextpipe »

There is also a possible misunderstanding lurking in this issue, due to the ambiguity of the new feature description.

"Converting the byte codes of replace strings to Unicode" could be done visibly or invisibly. Which is it?

Would what is displayed in the search for and replace with fields change?

Again, what about those situations in which UTF-8 Byte Codes MUST be used to avoid confusion with other PCRE escape characters?

We've discussed problems encountered during left-to-right parsing of PCRE patterns in some earlier issues.
It gets pretty complicated when processing USFM files (or RTF files as text), which already contain lots of backslash characters.

You know what an RTF file is. USFM is defined at http://paratext.org/usfm

David
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Backwards compatibility & UTF-8 Byte Codes

Post by DataMystic Support »

Hi David,

Perhaps it would be best NOT to apply the UTF-8 transform to the replace text and just treat it as single bytes (ASCII/ANSI).
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Backwards compatibility & UTF-8 Byte Codes

Post by DataMystic Support »

This is the approach we're following for 9.5.2, to be released in an hour or so.
dfhtextpipe
Posts: 988
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Backwards compatibility & UTF-8 Byte Codes

Post by dfhtextpipe »

Thanks, Simon.

Downloaded 9.5.2 and will report back after I've installed it.

David
dfhtextpipe
Posts: 988
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Backwards compatibility & UTF-8 Byte Codes

Post by dfhtextpipe »

I have just tested v9.5.2 for my replace list filter in which the UTF-8 Byte Codes were in the replace column.

All seems OK now. If anything related comes up, I'll start a new issue.

Thanks for the speedy response.

David
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Backwards compatibility & UTF-8 Byte Codes

Post by DataMystic Support »

I'm actually sorry it took so long!