Unicode Normalization bug
Moderators: DataMystic Support, Moderators
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Unicode Normalization bug
I recently encountered a bug in Normalization to NFC for text containing Myanmar characters.
The bug affected composite characters each of which uses the same pair of combining characters:
့ MYANMAR SIGN DOT BELOW
် MYANMAR SIGN ASAT
I suspect that TextPipe uses out-of-date Normalization algorithms.
Some background.
Software that includes Normalization should be tested against the official Unicode Normalization Test http://www.unicode.org/Public/UNIDATA/N ... onTest.txt (2.2MB) for the version of Unicode it implements.
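As a rough sketch of such a test in Python: each data line of the test file holds five semicolon-separated columns of hex code points (source, NFC, NFD, NFKC, NFKD), and for NFC the invariants are c2 == NFC(c1) == NFC(c2) == NFC(c3) and c4 == NFC(c4) == NFC(c5). The helper names `parse_columns` and `nfc_failures` are made up for illustration, the sample line is hand-written in the file's format rather than quoted from it, and `unicodedata.normalize` merely stands in for whatever implementation is under test.

```python
import unicodedata

def parse_columns(line):
    """Split one data line into its five columns, each decoded to a string."""
    data = line.split("#", 1)[0]              # drop the trailing comment
    fields = data.split(";")[:5]
    return ["".join(chr(int(cp, 16)) for cp in f.split()) for f in fields]

def nfc_failures(lines, to_nfc):
    """Return the data lines on which to_nfc breaks the NFC invariants."""
    failures = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "@")):   # comments and @Part headers
            continue
        c1, c2, c3, c4, c5 = parse_columns(line)
        ok = (all(to_nfc(c) == c2 for c in (c1, c2, c3))
              and all(to_nfc(c) == c4 for c in (c4, c5)))
        if not ok:
            failures.append(line)
    return failures

# Hypothetical sample line (KA + ASAT + DOT BELOW), written by hand for illustration.
SAMPLE = ("1000 103A 1037;1000 1037 103A;1000 1037 103A;"
          "1000 1037 103A;1000 1037 103A; # sample")

def nfc(text):
    return unicodedata.normalize("NFC", text)

print(nfc_failures([SAMPLE], nfc))            # normalizer under test
print(nfc_failures([SAMPLE], lambda s: s))    # a "normalizer" that never reorders
```

To run it against the real file, pass the open file object as `lines` and swap `nfc` for a wrapper around the software being tested.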
The process of converting a string to NFC or NFD includes a stage called "canonical ordering", whereby each run of combining characters is sorted into ascending order of canonical combining class [ccc]. See http://www.unicode.org/reports/tr15/?wi ... ption_Norm.
U+103A MYANMAR SIGN ASAT has ccc=9, whereas U+1037 MYANMAR SIGN DOT BELOW has ccc=7; therefore U+1037 is reordered before U+103A.
The bug is that TextPipe does not reorder these two codepoints.
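The reordering described above can be sketched in a few lines of Python. This is only the canonical ordering stage (no decomposition or composition), it takes its ccc values from Python's `unicodedata` module, and the function names are my own; it is not how TextPipe or any other product implements it.

```python
import unicodedata

def canonical_order(text):
    """Stable-sort each run of combining marks (ccc > 0) by ascending ccc."""
    out, run = [], []
    for ch in text:
        if unicodedata.combining(ch) == 0:      # a starter ends the current run
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))   # flush the last run
    return "".join(out)

def code_points(text):
    """Format a string as U+XXXX code points for easy comparison."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

typed = "\u1000\u103A\u1037"    # KA, ASAT (ccc=9), DOT BELOW (ccc=7)
print(code_points(canonical_order(typed)))   # U+1000 U+1037 U+103A
```

Python's `sorted` is stable, so marks with equal ccc keep their relative order, as canonical ordering requires.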
David
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Unicode Normalization bug
Further details, comparing the current version of Unicode with an old one:
Testing the sequence U+1000 U+103A U+1037 in the ICU Normalization Browser confirms that it does indeed normalize to U+1000 U+1037 U+103A, with reordering. (The browser is built on ICU, "International Components for Unicode", the most widely used Unicode software library.)
See http://bit.ly/nqYzQp.
However, if you run the same test for Unicode 3.2 (released March 2002, and so almost 10 years out of date), there is no reordering:
See http://bit.ly/orZ7df.
NB. I used a URL shortener so that the parameters could be passed to the test page.
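For anyone who cannot reach the ICU browser, a similar check can be run with Python's standard `unicodedata` module, which also reports the Unicode version its tables follow. This is a stand-in for the ICU test, not a reproduction of it, and the `code_points` helper is my own.

```python
import unicodedata

def code_points(text):
    """Format a string as U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

source = "\u1000\u103A\u1037"    # KA, ASAT, DOT BELOW

print("Unicode data version:", unicodedata.unidata_version)
print("source:", code_points(source))
for form in ("NFC", "NFD"):
    print(f"{form}:   ", code_points(unicodedata.normalize(form, source)))
```

With tables that know both combining marks, the NFC and NFD lines show DOT BELOW ahead of ASAT.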
David
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Unicode Normalization bug
The attached ZIP file contains a small UTF-8 text file containing 8 composite characters from the Myanmar block of Unicode.
To display Myanmar characters, you may wish to download and install the SIL Padauk font from
http://scripts.sil.org/cms/scripts/page ... &id=Padauk
David
- Attachments
-
- Test.Myanmar.NFC.zip
- Myanmar test file.
- (173 Bytes) Downloaded 613 times
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
Re: Unicode Normalization bug
Hi David,
We are working on an update for you, and once we have finished struggling through the AVL tree differences I will post a new beta for you to try.
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
Re: Unicode Normalization bug
Hi David,
TextPipe 8.9.8 has been released:
* Capture Text, Break on Value Change window now shows length of strings and
current cursor position.
* Updated internal PCRE (Pattern Matching) engine to v8.13 and support for
Unicode 6.0.0.
* Updated Unicode internal libraries to support Unicode 4.1 for Normalization
etc.
* COM callees are now notified of Stack violations (e.g. during pattern
execution) or other critical errors via the existing 'FilterWindow.errorText'
variable.
* Only modified files are added back into Zip files such as .zip, .docx, .xlsx
and .pptx.
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Unicode Normalization bug
Just downloaded v8.9.8 and about to install it.
Will let you know how I fare.
David
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Unicode Normalization bug
Having installed v8.9.8 I am pleased to confirm that when normalizing Burmese script to NFC, TextPipe now gives identical results to BabelPad.
Well done - and thanks especially for giving this issue such a high priority.
David
Re: Unicode Normalization bug
Today I used the NFC filter in TextPipe Pro 9.1 (Evaluation) on some Vietnamese XML and Word XML files. Not only did the filter have no effect on the text, but the Word XML files were also corrupted after the filter was applied! The attached zip file includes a Vietnamese Unicode text file and a Vietnamese Unicode Word XML file. Please test with it.
Regards.
Tuandq.
- Attachments
-
- VNUnicode.zip
- (2.83 KiB) Downloaded 582 times
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Unicode Normalization bug
I did a character frequency analysis for the short XML file - see attached.
In what way was the file corrupted?
Were you running a TP filter on the XML file inside an MS Word .docx file?
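For reference, a frequency table along these lines can be approximated in Python. The sketch below does not reproduce BabelPad's output format, the function names are my own, and the sample string is hypothetical rather than taken from the attached files. Listing each character's combining class makes it easy to see whether a text contains separate combining marks at all, and the `is_nfc` check shows whether NFC would change anything.

```python
import unicodedata
from collections import Counter

def char_frequencies(text):
    """Return (code point, name, ccc, count) rows, most frequent first."""
    rows = []
    for ch, count in Counter(text).most_common():
        name = unicodedata.name(ch, "<unnamed>")
        rows.append((f"U+{ord(ch):04X}", name, unicodedata.combining(ch), count))
    return rows

def is_nfc(text):
    """True if normalizing to NFC would leave the text unchanged."""
    return unicodedata.normalize("NFC", text) == text

# Hypothetical sample: "Việt" typed with separate combining marks.
sample = "Vie\u0302\u0323t"

for row in char_frequencies(sample):
    print(*row, sep="\t")
print("already NFC:", is_nfc(sample))
```

If `is_nfc` reports True for a file, an NFC filter that leaves it untouched is behaving correctly.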
David
- Attachments
-
- VNUnicode Char Freq.zip
- Character frequency analysis (BabelPad)
- (1.01 KiB) Downloaded 548 times