Unicode Normalization bug

Moderators: DataMystic Support, Moderators
dfhtextpipe
Posts: 988
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Unicode Normalization bug

Post by dfhtextpipe »

I recently encountered a bug in Normalization to NFC for text containing Myanmar characters.
The bug affected composite characters each of which uses the same pair of combining characters:

့ MYANMAR SIGN DOT BELOW
် MYANMAR SIGN ASAT

I suspect that TextPipe uses out-of-date normalization algorithms or data.

Some background.

Software that includes normalization should be tested against the official Unicode Normalization Test file http://www.unicode.org/Public/UNIDATA/N ... onTest.txt (2.2MB) for the version of Unicode it supports.
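As a rough illustration of what such a test involves, here is a minimal Python sketch that checks the NFC rules for one line in the test file's format (five semicolon-separated columns: source, NFC, NFD, NFKC, NFKD). It uses Python's standard unicodedata module as the implementation under test, which is an assumption on my part and not what TextPipe uses. The sample lines are hypothetical, written in the file's format and not copied from it, and the function names are made up for this example.

```python
import unicodedata

def parse_fields(line):
    """Return the five columns of a test line as strings, or None for
    comment lines and '@Part' headers."""
    data = line.split("#", 1)[0].strip()
    if not data or data.startswith("@"):
        return None
    columns = data.split(";")[:5]
    return ["".join(chr(int(cp, 16)) for cp in col.split()) for col in columns]

def nfc_conforms(line):
    """Check the NFC conformance rules for one test line:
    c2 == NFC(c1) == NFC(c2) == NFC(c3) and c4 == NFC(c4) == NFC(c5)."""
    fields = parse_fields(line)
    if fields is None:
        return True  # nothing to check on this line
    c1, c2, c3, c4, c5 = fields

    def nfc(s):
        return unicodedata.normalize("NFC", s)

    return (c2 == nfc(c1) == nfc(c2) == nfc(c3)) and (c4 == nfc(c4) == nfc(c5))

# Hypothetical line in the test-file format, using the Myanmar sequence
# discussed in this thread.
sample = ("1000 103A 1037;1000 1037 103A;1000 1037 103A;"
          "1000 1037 103A;1000 1037 103A; # hypothetical example")

# The same line, but with an NFC column that was NOT reordered.
unordered = ("1000 103A 1037;1000 103A 1037;1000 1037 103A;"
             "1000 1037 103A;1000 1037 103A; # hypothetical example")

print(nfc_conforms(sample))
print(nfc_conforms(unordered))
```

A full run would loop over every line of the downloaded file and report the lines for which the check fails.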

The process of converting a string to NFC or NFD requires a stage called "canonical ordering", whereby characters are reordered in ascending order according to their canonical combining class [ccc]. See http://www.unicode.org/reports/tr15/?wi ... ption_Norm.

U+103A MYANMAR SIGN ASAT has ccc=9, whereas U+1037 MYANMAR SIGN DOT BELOW has ccc=7; therefore U+1037 is reordered before U+103A.

The bug is that TextPipe does not reorder these two codepoints.
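To make the reordering step concrete, here is a minimal Python sketch of canonical ordering on its own. It covers only the reordering stage (no decomposition or composition), and it takes the combining classes from Python's unicodedata module, so the values depend on the Unicode version bundled with your Python. The function name canonical_reorder is made up for this example.

```python
import unicodedata

def canonical_reorder(text):
    """Stable-sort each run of combining marks (ccc > 0) by ascending ccc.
    Characters with ccc = 0 (starters) never move and end the current run."""
    out = []
    run = []  # combining marks seen since the last starter
    for ch in text:
        if unicodedata.combining(ch) == 0:
            # sorted() is stable, so marks with equal ccc keep their order.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

# KA + ASAT (ccc=9) + DOT BELOW (ccc=7)
sequence = "\u1000\u103A\u1037"
reordered = canonical_reorder(sequence)

for ch in reordered:
    print(f"U+{ord(ch):04X} ccc={unicodedata.combining(ch)} {unicodedata.name(ch)}")
```

With a Unicode database that knows both marks, DOT BELOW (ccc=7) ends up before ASAT (ccc=9), which is the reordering TextPipe is missing.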

David
dfhtextpipe
Posts: 988
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Unicode Normalization bug

Post by dfhtextpipe »

Further details - comparing the current version of Unicode with the old one....

I tested the sequence U+1000 U+103A U+1037 with the ICU Normalization Browser. It is built on ICU ("International Components for Unicode"), the most widely used Unicode software library. The result confirms that the sequence does indeed normalize to U+1000 U+1037 U+103A, with reordering:

See http://bit.ly/nqYzQp.
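For anyone who wants to repeat the check locally, a similar test can be sketched with Python's standard unicodedata module. This is a different implementation from ICU, and the result depends on the Unicode version bundled with the Python build, which the script prints. The helper name codepoints is made up for this example.

```python
import unicodedata

def codepoints(s):
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

source = "\u1000\u103A\u1037"  # KA, ASAT, DOT BELOW
nfc = unicodedata.normalize("NFC", source)

print("Unicode data version:", unicodedata.unidata_version)
print("source:", codepoints(source))
print("NFC:   ", codepoints(nfc))  # NFC:    U+1000 U+1037 U+103A
```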

However, if you run the same test for Unicode 3.2 (released March 2002, and so almost 10 years out of date), there is no reordering:

See http://bit.ly/orZ7df.

NB. I used a URL shortener so that the parameters could be passed to the test page.
David
dfhtextpipe
Posts: 988
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Unicode Normalization bug

Post by dfhtextpipe »

The attached ZIP file contains a small UTF-8 text file containing 8 composite characters from the Myanmar block of Unicode.

To display Myanmar characters, you may wish to download and install the SIL Padauk font from
http://scripts.sil.org/cms/scripts/page ... &id=Padauk

David
Attachments
Test.Myanmar.NFC.zip
Myanmar test file.
(173 Bytes) Downloaded 607 times
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Unicode Normalization bug

Post by DataMystic Support »

Hi David,

We are working on an update for you, and once we have finished struggling through the AVL tree differences I will post a new beta for you to try.
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Unicode Normalization bug

Post by DataMystic Support »

Hi David,

TextPipe 8.9.8 has been released:

* Capture Text, Break on Value Change window now shows length of strings and current cursor position.
* Updated internal PCRE (Pattern Matching) engine to v8.13 and support for Unicode 6.0.0.
* Updated Unicode internal libraries to support Unicode 4.1 for Normalization etc.
* COM callees are now notified of Stack violations (e.g. during pattern execution) or other critical errors via the existing 'FilterWindow.errorText' variable.
* Only modified files are added back into Zip files such as .zip, .docx, .xlsx and .pptx.
dfhtextpipe
Posts: 988
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Unicode Normalization bug

Post by dfhtextpipe »

Just downloaded v8.9.8 and about to install it.

Will let you know how I fare.

David
dfhtextpipe
Posts: 988
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Unicode Normalization bug

Post by dfhtextpipe »

Having installed v8.9.8, I am pleased to confirm that when normalizing Burmese script to NFC, TextPipe now gives identical results to BabelPad.

Well done - and thanks especially for giving this issue such a high priority.

David
tuandq
Posts: 1
Joined: Sun Mar 25, 2012 5:04 pm

Re: Unicode Normalization bug

Post by tuandq »

Today I used the NFC filter in TextPipe Pro 9.1 (Evaluation) on some Vietnamese XML and Word XML files. Not only did the filter have no effect on the text, but the Word XML files were also corrupted after the filter was applied! The attached zip file includes a Vietnamese Unicode text file and a Vietnamese Unicode Word XML file. Please test with it!

Regards.

Tuandq.
Attachments
VNUnicode.zip
(2.83 KiB) Downloaded 578 times
dfhtextpipe
Posts: 988
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Unicode Normalization bug

Post by dfhtextpipe »

I did a character frequency analysis for the short XML file - see attached.

In what way was the file corrupted?

Were you running a TP filter on the XML file inside an MS Word .docx file?

David
Attachments
VNUnicode Char Freq.zip
Character frequency analysis (BabelPad)
(1.01 KiB) Downloaded 543 times