Unicode normalize filters?
Moderators: DataMystic Support, Moderators
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Unicode normalize filters?
Any progress on the critical issues in the four Unicode normalize filters that I communicated to you by email several months ago?
Best regards,
David
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact:
Re: Unicode normalize filters?
Sorry, not as yet
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Unicode normalize filters?
The Help page libiconv (under Advanced Topics) states:
About libiconv
libiconv is a GNU library used by TextPipe for some of its Unicode conversions.
C source code, .obj files and binaries are available for free from
http://www.gnu.org/software/libiconv/
It's incredible that the Unicode normalize filters have critical bugs if such a library is being used.
I have still had no satisfactory explanation of this issue.
Best regards,
David
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact:
Re: Unicode normalize filters?
TextPipe 10.5 is being released today and uses the Windows Unicode libraries for these functions.
Please let me know how they go!
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Unicode normalize filters?
I tested the 26 x 26 x 4 digraphs again and it all worked as it should when converted to NFC.
i.e. No unwarranted spurious characters in the output.
There are further tests that I could do, but that can wait a while.
btw. Did you notify the supplier of the defective Unicode library that something was seriously amiss?
Did no other TextPipe customer ever report the problem?
David
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Unicode normalize filters?
btw. Somebody told me last month that Microsoft had changed how they implement Unicode rendering in Windows 7 compared to earlier versions of Windows.
They no longer use Uniscribe, even though it is still being maintained.
See https://en.wikipedia.org/wiki/Uniscribe
and https://en.wikipedia.org/wiki/DirectWrite
Even so, there are some things that Windows 7 displays differently than the same text on Mac OS X.
Some glyphs in Biblical Hebrew are one application area where such differences are evident.
Best regards,
David
Last edited by dfhtextpipe on Thu Feb 08, 2018 6:13 am, edited 1 time in total.
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact:
Re: Unicode normalize filters?
Hi David - we've removed the library in question from our build as it hasn't been updated in 2 years.
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Unicode normalize filters?
Unicode Normalization using TextPipe 10.6.2 still has critical problems.
The filter Normalize to NFD completely removes some characters such as the following:
Code: Select all
U+0364 ͤ COMBINING LATIN SMALL LETTER E
U+0365 ͥ COMBINING LATIN SMALL LETTER I
U+036D ͭ COMBINING LATIN SMALL LETTER T
This is totally unwarranted.
I only discovered this inadvertently today while I was using TextPipe to analyse the XML file in this repo.
https://github.com/lemtom/coverdale
(And yes, I did first convert UTF-8 to UTF-16 LE, etc.)
Best regards,
David
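For comparison, here is a minimal sketch using Python's standard unicodedata module as an independent reference (it is not TextPipe and says nothing about TextPipe's internals). It suggests that a conforming NFD implementation leaves these three combining letters in place. The base letter "y" is only an illustrative choice.

```python
import unicodedata

# The three combining letters reported as removed by the filter.
MARKS = ["\u0364", "\u0365", "\u036D"]

def nfd(text):
    """Canonical decomposition (NFD) via the Python standard library."""
    return unicodedata.normalize("NFD", text)

for mark in MARKS:
    sample = "y" + mark  # base letter plus combining letter (illustrative only)
    result = nfd(sample)
    # These marks have no canonical decomposition, so NFD should not alter them.
    print("U+%04X %s kept=%s" % (ord(mark), unicodedata.name(mark), mark in result))
```

If any line reports kept=False, the normalizer under test is dropping data rather than decomposing it.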
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Unicode normalize filters?
Here's a suitable test file (the first code block below) that demonstrates that Normalize to NFD removes all combining characters.
Save it as a UTF-8 text file, then run the filter shown in the second code block with that file as the input.
Code: Select all
à
á
â
ã
ā
a̅
ă
ȧ
ä
ả
å
a̋
ǎ
a̍
a̎
ȁ
a̐
ȃ
a̒
a̓
a̔
a̕
a̖
a̗
a̘
a̙
a̚
a̛
a̜
a̝
a̞
a̟
a̠
a̡
a̢
ạ
a̤
ḁ
a̦
a̧
ą
a̩
a̪
a̫
a̬
a̭
a̮
a̯
a̰
a̱
a̲
a̳
a̴
a̵
a̶
a̷
a̸
a̹
a̺
a̻
a̼
a̽
a̾
a̿
à
á
a͂
a̓
ä
á
aͅ
a͆
a͇
a͈
a͉
a͊
a͋
a͌
a͍
a͎
a͐
a͑
a͒
a͓
a͔
a͕
a͖
a͗
a͘
a͙
a͚
a͛
a͜
a͝
a͞
a͟
a͠
a͡
a͢
aͣ
aͤ
aͥ
aͦ
aͧ
aͨ
aͩ
aͪ
aͫ
aͬ
aͭ
aͮ
aͯ
Code: Select all
Comment...
| Normalize to NFD
|
|--Convert from UTF-8 to UTF-16LE
|
|--NFD - Canonical Decomposition
|
+--Convert from UTF-16LE to UTF-8
David
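A similar test can be run outside TextPipe. The sketch below is only an approximation under stated assumptions: it rebuilds a comparable test file from the Combining Diacritical Marks block (U+0300..U+036F), applies the same three steps with Python's standard library, and checks that no combining marks are lost. The function names are hypothetical. A few marks legitimately change under NFD (for example U+0344 decomposes into two marks), so the check compares counts rather than exact equality.

```python
import unicodedata

def build_test_lines(first=0x0300, last=0x036F):
    """One line per code point in the range: the letter 'a' plus the mark."""
    return ["a" + chr(cp) for cp in range(first, last + 1)]

def pipeline(utf8_bytes):
    """UTF-8 -> UTF-16LE -> NFD -> UTF-8, mirroring the three filter steps."""
    utf16 = utf8_bytes.decode("utf-8").encode("utf-16-le")
    text = utf16.decode("utf-16-le")
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.encode("utf-8")

def count_marks(text):
    """Count code points whose general category is a mark (Mn, Mc or Me)."""
    return sum(1 for c in text if unicodedata.category(c).startswith("M"))

source = "\n".join(build_test_lines())
output = pipeline(source.encode("utf-8")).decode("utf-8")

marks_in = count_marks(source)
marks_out = count_marks(output)
print("marks before:", marks_in, "marks after:", marks_out)
print("base letters kept:", output.count("a") == source.count("a"))
```

A normalizer that strips marks would report a smaller count after the pipeline than before it.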
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Unicode normalize filters?
There's something critically wrong in the Normalize to NFD filter.
It's far worse than reported in my previous comments.
Here are some Latin letters with various diacritics:
Code: Select all
ÀÁÂÃÄÅÇÈÉËÌÍÎÏÑÒÓÔÖØÙÚÛÜÝ
àáâãäåçèéêëìíîïñòóôõöøùúûüýÿ
ĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğ
ĠġĢģĤĥĦħĨĩĪīĬĭĮįİĴĵĶķĹĺĻļĽľĿ
With the same filter, this becomes:
Code: Select all
AAAAAACEEEIIIINOOOOØUUUUY
aaaaaaceeeeiiiinoooooøuuuuy GgGgHhĦħIiIiIiIiIJjKkLlLlLlĿ
NB. The long gap is filled with 68 NUL codes! (though phpBB has replaced these by spaces).
All but a few of the diacritics were stripped, leaving only these three characters unchanged:
Code: Select all
Ħħ Ŀ
Aside: This output file displays very differently when opened with BabelPad.
Code: Select all
䅁䅁䅁䕃䕅䥉䥉低住썏喘啕奕慡慡慡散敥楥楩湩潯潯썯疸畵祵杇杇案ꛄꟄ楉楉楉楉䩉䭪䱫䱬䱬쑬
By way of comparison, this is what the output should be (after 106 normalizations):
Code: Select all
ÀÁÂÃÄÅÇÈÉËÌÍÎÏÑÒÓÔÖØÙÚÛÜÝ
àáâãäåçèéêëìíîïñòóôõöøùúûüýÿ
ĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğ
ĠġĢģĤĥĦħĨĩĪīĬĭĮįİĴĵĶķĹĺĻļĽľĿ
NB. phpBB has renormalized the pasted text to NFC, so I would need to supply the actual text file for a proper comparison.
btw. These characters are unchanged after normalization to NFD by means other than TextPipe.
Code: Select all
Øø Đđ Ħħ Ŀ
Normalize to NFD should not be used for serious work in its current state.
btw. I confirm that when the NFD filter is disabled, the output and input files are identical,
so the Unicode conversions between UTF-8 and UTF-16 LE are not at fault.
Best regards,
David
Last edited by dfhtextpipe on Fri Feb 09, 2018 4:50 am, edited 5 times in total.
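As a reference point, the sketch below uses Python's unicodedata module (again, not TextPipe) to show what NFD is expected to do with a few of the letters above: a precomposed letter becomes a base letter plus a combining mark, and letters with no canonical decomposition stay as they are. The last part illustrates one possible explanation for the BabelPad display, namely that single-byte output was read as UTF-16LE; that is an assumption, not something confirmed in this thread.

```python
import unicodedata

def nfd_codepoints(text):
    """Return the NFD form of text as a list of 'U+XXXX' strings."""
    return ["U+%04X" % ord(c) for c in unicodedata.normalize("NFD", text)]

# U+00C0 (A with grave) decomposes into A plus the combining grave accent.
a_grave = nfd_codepoints("\u00C0")
print(a_grave)

# These letters have no canonical decomposition, so NFD leaves them unchanged.
UNCHANGED = "\u00D8\u00F8\u0110\u0111\u0126\u0127\u013F"
unchanged_ok = unicodedata.normalize("NFD", UNCHANGED) == UNCHANGED
print("unchanged letters preserved:", unchanged_ok)

# Assumption: two ASCII bytes 'AA' read as one UTF-16LE code unit give U+4141,
# which matches the first character of the BabelPad display quoted above.
misread = "AA".encode("ascii").decode("utf-16-le")
print("U+%04X" % ord(misread))
```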
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Unicode normalize filters?
Hi Simon,
If it's likely that it will take a lot of time and effort to fix the bugs in the Unicode Normalize filters, it would be preferable to disable them in the next release and restore them only after they have been properly fixed and thoroughly tested.
Best regards,
David
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Unicode normalize filters?
Hi Simon,
Anything on the horizon for this longstanding critical problem?
Best regards,
David
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact:
Re: Unicode normalize filters?
Hi David,
Yes, you'll be pleased to know we have a fix for this now.
Unrelated to the bug, one thing I noticed in this code, which I don't believe is necessary, is that after conversion it removes all NonSpacingMarks and CombiningMarks from the text.
This has been disabled for now.
Can you advise if this is expected behaviour?
Thanks,
Simon
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Unicode normalize filters?
Thanks Simon,
That's good news at last.
CombiningMarks and NonSpacingMarks should never be removed as part of Unicode Normalization.
btw. Am I correct in assuming that these are just alternative names for Combining Diacritical Marks and Spacing Modifier Letters?
Those would include the Unicode blocks
Code: Select all
Block Name                                Range       Code Points  Characters  Unicode Version
Spacing Modifier Letters                  02B0..02FF  80           80          1.0.0
Combining Diacritical Marks               0300..036F  112          112         1.0.0
Combining Diacritical Marks Extended      1AB0..1AFF  80           15          7.0
Combining Diacritical Marks Supplement    1DC0..1DFF  64           63          4.1
Combining Diacritical Marks for Symbols   20D0..20FF  48           33          1.0.0
Combining Half Marks                      FE20..FE2F  16           16          1.1
Not necessarily comprehensively listed.
Stripping [all|some] diacritics might be a useful process for some users, but any such option should be an independent feature.
David
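On the naming question: in the Unicode Character Database, nonspacing mark (Mn) and spacing combining mark (Mc) are general categories assigned per character, whereas the names in the table above are blocks. The sketch below assumes that the library's NonSpacingMarks and CombiningMarks correspond to Mn and Mc, which is a guess rather than anything confirmed here. It prints the category of a few sample code points and shows a hypothetical, separate strip-diacritics step built on top of NFD.

```python
import unicodedata

def category(cp):
    """General category of a code point, e.g. 'Mn', 'Mc', 'Lm', 'Sk'."""
    return unicodedata.category(chr(cp))

# Sample code points (illustrative picks, one per line).
SAMPLES = [
    0x0300,  # COMBINING GRAVE ACCENT (Combining Diacritical Marks block)
    0x0903,  # DEVANAGARI SIGN VISARGA (a spacing combining mark)
    0x02B0,  # MODIFIER LETTER SMALL H (Spacing Modifier Letters block)
    0x02C2,  # MODIFIER LETTER LEFT ARROWHEAD (Spacing Modifier Letters block)
]
for cp in SAMPLES:
    print("U+%04X %s %s" % (cp, category(cp), unicodedata.name(chr(cp))))

def strip_marks(text):
    """Hypothetical independent feature: decompose, then drop all marks (M*)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed
                   if not unicodedata.category(c).startswith("M"))

stripped = strip_marks("\u00C0\u00E9\u0126")  # A-grave, e-acute, H-stroke
print(stripped)
```

Characters in the Spacing Modifier Letters block come out as Lm or Sk rather than as marks, so a category-based filter would not touch them. Applied after NFD, a mark-stripping step like this turns a letter such as À into A while leaving Ħ alone, which resembles the output reported earlier in the thread.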
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact:
Re: Unicode normalize filters?
Unsure - the library code for this relies on a category table extracted from the Unicode database. It's best that the code stays disabled for now.