Unicode Normalization bug

Post by **dfhtextpipe** » Mon Oct 10, 2011 7:25 pm

I recently encountered a bug in Normalization to NFC for text containing Myanmar characters.
The bug affected composite characters each of which uses the same pair of combining characters:

့ MYANMAR SIGN DOT BELOW
် MYANMAR SIGN ASAT

I suspect that TextPipe uses out-of-date Normalization algorithms.

Some background.

Software that includes Normalization should be tested against the official Unicode Normalization Test http://www.unicode.org/Public/UNIDATA/N ... onTest.txt (2.2MB) for that version of Unicode.
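As a rough illustration of what such a test run involves, here is a minimal Python sketch. It assumes the documented layout of the test file: five semicolon-separated columns of hex code points per data line, with `#` comments and `@Part` headers. The names `parse_fields`, `check_line` and `SAMPLE` are made up for this example, and the sample line is hand-written in the same format, not copied from the file. Python's `unicodedata.normalize` stands in for the implementation under test.

```python
import unicodedata

def parse_fields(line):
    """Return the five test columns as strings, or None for comments and part headers."""
    data = line.split("#", 1)[0].strip()
    if not data or data.startswith("@"):
        return None
    cols = data.split(";")[:5]
    return ["".join(chr(int(cp, 16)) for cp in col.split()) for col in cols]

def check_line(line, normalize=unicodedata.normalize):
    """Return the failed invariants for one data line (empty list if all pass)."""
    fields = parse_fields(line)
    if fields is None:
        return []
    c1, c2, c3, c4, c5 = fields
    # Conformance rules stated in the header of the test file:
    # each expected column must equal the normalized form of the listed columns.
    rules = {
        "NFC": [(c2, (c1, c2, c3)), (c4, (c4, c5))],
        "NFD": [(c3, (c1, c2, c3)), (c5, (c4, c5))],
        "NFKC": [(c4, (c1, c2, c3, c4, c5))],
        "NFKD": [(c5, (c1, c2, c3, c4, c5))],
    }
    failures = []
    for form, checks in rules.items():
        for want, sources in checks:
            for src in sources:
                if normalize(form, src) != want:
                    failures.append((form, src, want))
    return failures

# Hand-written sample line (an assumption for illustration only).
SAMPLE = ("1000 103A 1037;1000 1037 103A;1000 1037 103A;"
          "1000 1037 103A;1000 1037 103A; # hand-made example")

print(check_line(SAMPLE))
```

To test a different normalizer, pass it as the `normalize` argument. It must accept a form name and a string in the same way as `unicodedata.normalize`.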

The process of converting a string to NFC or NFD requires a stage called "canonical ordering", whereby characters are reordered in ascending order according to their canonical combining class [ccc]. See http://www.unicode.org/reports/tr15/?wi ... ption_Norm.

U+103A MYANMAR SIGN ASAT has ccc=9, whereas U+1037 MYANMAR SIGN DOT BELOW has ccc=7; therefore U+1037 is reordered before U+103A.
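The reordering step can be sketched in a few lines of Python. This is only an illustration of canonical ordering on already decomposed text, not TextPipe's or ICU's code. The function name `canonical_order` is made up for this example, and the combining classes come from whatever Unicode version the local Python build ships.

```python
import unicodedata

def canonical_order(text):
    """Sort each run of combining marks (ccc > 0) by ascending ccc.

    The sort is stable, so marks with equal ccc keep their relative order.
    """
    out = []
    run = []
    for ch in text:
        if unicodedata.combining(ch) > 0:
            run.append(ch)
        else:
            # A starter (ccc = 0) ends the current run of combining marks.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

print(unicodedata.combining("\u1037"))  # ccc of MYANMAR SIGN DOT BELOW
print(unicodedata.combining("\u103A"))  # ccc of MYANMAR SIGN ASAT
reordered = canonical_order("\u1000\u103A\u1037")
print([f"U+{ord(ch):04X}" for ch in reordered])
```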

The bug is that TextPipe does not reorder these two codepoints.

David

Post by **dfhtextpipe** » Mon Oct 10, 2011 7:30 pm

Further details, comparing the current version of Unicode with an old one:

Testing the normalization of the sequence U+1000 U+103A U+1037 with the ICU Normalization Browser (built on the "International Components for Unicode" library, the most widely used Unicode software library), we can verify that it does indeed normalize to U+1000 U+1037 U+103A, with reordering:
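The same check can be repeated locally without the browser. The sketch below uses Python's standard `unicodedata` module rather than ICU, so it is only a cross-check against a second implementation. The helper name `codepoints` is made up for this example.

```python
import unicodedata

def codepoints(text):
    """Format a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

source = "\u1000\u103A\u1037"  # KA, ASAT, DOT BELOW

# Report which Unicode version this Python build implements.
print("Unicode data version:", unicodedata.unidata_version)

for form in ("NFC", "NFD"):
    print(form, codepoints(unicodedata.normalize(form, source)))
```

With any Python 3 build, both forms should print `U+1000 U+1037 U+103A`, matching the ICU result.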

See http://bit.ly/nqYzQp.

However, if you run the same test for Unicode 3.2 (released March 2002, and so almost 10 years out of date), there is no reordering:

See http://bit.ly/orZ7df.

NB. I used the URL shortener to allow parameters to be passed to the test page.

Post by **dfhtextpipe** » Mon Oct 10, 2011 7:34 pm

The attached ZIP file contains a small UTF-8 text file containing 8 composite characters from the Myanmar block of Unicode.

To display Myanmar characters, you may wish to download and install the SIL Padauk font from
http://scripts.sil.org/cms/scripts/page ... &id=Padauk

David

Post by **DataMystic Support** » Wed Oct 19, 2011 3:57 pm

Hi David,

We are working on an update for you, and once we have finished struggling through the AVL tree differences I will post a new beta for you to try.

Post by **DataMystic Support** » Mon Oct 24, 2011 8:54 pm

Hi David,

TextPipe 8.9.8 has been released:

* Capture Text, Break on Value Change window now shows length of strings and
current cursor position.
* Updated internal PCRE (Pattern Matching) engine to v8.13 and support for
Unicode 6.0.0.
* Updated Unicode internal libraries to support Unicode 4.1 for Normalization
etc.
* COM callees are now notified of Stack violations (e.g. during pattern
execution) or other critical errors via the existing 'FilterWindow.errorText'
variable.
* Only modified files are added back into Zip files such as .zip, .docx, .xlsx
and .pptx.

Post by **dfhtextpipe** » Fri Oct 28, 2011 4:22 am

Just downloaded v8.9.8 and about to install it.

Will let you know how I fare.

David

Post by **dfhtextpipe** » Sat Oct 29, 2011 1:04 am

Having installed v8.9.8 I am pleased to confirm that when normalizing Burmese script to NFC, TextPipe now gives identical results to BabelPad.

Well done - and thanks especially for giving this issue such a high priority.

David

Post by **tuandq** » Sun Mar 25, 2012 5:35 pm

Today I used the NFC filter in TextPipe Pro 9.1 (Evaluation) on some Vietnamese XML and Word XML files. Not only did the filter have no effect, but the Word XML files were also corrupted after the filter was applied! The attached zip file contains a Vietnamese Unicode text file and a Vietnamese Unicode Word XML file. Please test with them!

Regards.

Tuandq.

Post by **dfhtextpipe** » Tue Apr 03, 2012 7:25 pm

I did a character frequency analysis for the short XML file - see attached.

In what way was the file corrupted?

Were you running a TextPipe filter on the XML file inside an MS Word .docx file?

David
