Unicode normalize filters?
Posted: Wed Sep 06, 2017 7:39 pm
by dfhtextpipe
Any progress on the critical issues in the four Unicode normalize filters that I communicated to you by email several months ago?
Best regards,
David
Re: Unicode normalize filters?
Posted: Wed Sep 06, 2017 9:27 pm
by DataMystic Support
Sorry, not as yet.
Re: Unicode normalize filters?
Posted: Thu Sep 28, 2017 12:26 am
by dfhtextpipe
The Help page libiconv (under Advanced Topics) states:
About libiconv
libiconv is a GNU library used by TextPipe for some of its Unicode conversions.
C-Source code, .obj files and binaries are available for free from
http://www.gnu.org/software/libiconv/
It's incredible that the Unicode normalize filters have critical bugs if such a library is being used.
I have still had no satisfactory explanation of this issue.
Best regards,
David
Re: Unicode normalize filters?
Posted: Fri Oct 20, 2017 5:49 am
by DataMystic Support
TextPipe 10.5 is being released today and uses the Windows Unicode libraries for these functions.
Please let me know how they go!
Re: Unicode normalize filters?
Posted: Sat Oct 21, 2017 12:01 am
by dfhtextpipe
I tested the 26 x 26 x 4 digraphs again and it all worked as it should when converted to NFC.
i.e. No unwarranted spurious characters in the output.
There are further tests that I could do, but that can wait a while.
btw. Did you notify the supplier of the defective Unicode library that something was seriously amiss?
Did no other TextPipe customer ever report the problem?
David
Re: Unicode normalize filters?
Posted: Sat Oct 21, 2017 12:09 am
by dfhtextpipe
btw. Somebody told me last month that Microsoft had changed how they implement Unicode rendering in Windows 7 compared to earlier versions of Windows.
They no longer use Uniscribe, even though it is still being maintained.
See https://en.wikipedia.org/wiki/Uniscribe
and https://en.wikipedia.org/wiki/DirectWrite
Even so, there are some things that Windows 7 displays differently than the same text on Mac OS X.
Some glyphs in Biblical Hebrew are one application area where such differences are evident.
Best regards,
David
Re: Unicode normalize filters?
Posted: Mon Oct 23, 2017 7:19 am
by DataMystic Support
Hi David - we've removed the library in question from our build as it hasn't been updated in 2 years.
Re: Unicode normalize filters?
Posted: Thu Feb 08, 2018 6:12 am
by dfhtextpipe
Unicode Normalization using TextPipe 10.6.2 still has critical problems.
The filter Normalize to NFD completely removes some characters such as the following:
Code: Select all
U+0364 ͤ COMBINING LATIN SMALL LETTER E
U+0365 ͥ COMBINING LATIN SMALL LETTER I
U+036D ͭ COMBINING LATIN SMALL LETTER T
This is totally unwarranted.
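For comparison, here is a minimal sketch using Python's standard unicodedata module (my own choice for illustration, not something TextPipe uses). It checks whether these three code points have a canonical decomposition and whether they survive NFD when attached to a base letter:

```python
import unicodedata

# The three combining letters reported as removed by the filter.
MARKS = ["\u0364", "\u0365", "\u036D"]

def survives_nfd(base: str, mark: str) -> bool:
    """Return True if `mark` is still present after NFD of base + mark."""
    return mark in unicodedata.normalize("NFD", base + mark)

for mark in MARKS:
    print(
        f"U+{ord(mark):04X}",
        unicodedata.name(mark),
        "decomposition:", repr(unicodedata.decomposition(mark)),
        "survives NFD:", survives_nfd("a", mark),
    )
```

An empty decomposition string means the character has no decomposition mapping, so a conforming NFD pass has no reason to alter or drop it.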
I only discovered this inadvertently today while I was using TextPipe to analyse the XML file in this repo.
https://github.com/lemtom/coverdale
(And yes, I did first convert UTF-8 to UTF-16 LE, etc.)
Best regards,
David
Re: Unicode normalize filters?
Posted: Thu Feb 08, 2018 8:07 am
by dfhtextpipe
Here's a suitable test file that demonstrates that Normalize to NFD removes all combining characters.
Code: Select all
à
á
â
ã
ā
a̅
ă
ȧ
ä
ả
å
a̋
ǎ
a̍
a̎
ȁ
a̐
ȃ
a̒
a̓
a̔
a̕
a̖
a̗
a̘
a̙
a̚
a̛
a̜
a̝
a̞
a̟
a̠
a̡
a̢
ạ
a̤
ḁ
a̦
a̧
ą
a̩
a̪
a̫
a̬
a̭
a̮
a̯
a̰
a̱
a̲
a̳
a̴
a̵
a̶
a̷
a̸
a̹
a̺
a̻
a̼
a̽
a̾
a̿
à
á
a͂
a̓
ä
á
aͅ
a͆
a͇
a͈
a͉
a͊
a͋
a͌
a͍
a͎
a͐
a͑
a͒
a͓
a͔
a͕
a͖
a͗
a͘
a͙
a͚
a͛
a͜
a͝
a͞
a͟
a͠
a͡
a͢
aͣ
aͤ
aͥ
aͦ
aͧ
aͨ
aͩ
aͪ
aͫ
aͬ
aͭ
aͮ
aͯ
Save the above as a UTF-8 text file, then run this filter with that as the input file.
Code: Select all
Comment...
| Normalize to NFD
|
|--Convert from UTF-8 to UTF-16LE
|
|--NFD - Canonical Decomposition
|
+--Convert from UTF-16LE to UTF-8
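As a reference point, the same three stages can be sketched outside TextPipe. This is only an approximation under stated assumptions: it uses Python's standard codecs and unicodedata module, and the helper names normalize_to_nfd and count_combining are hypothetical. The idea is that the number of combining characters should never go down after NFD.

```python
import unicodedata

def normalize_to_nfd(utf8_bytes: bytes) -> bytes:
    """Mirror the three filter stages with Python's standard library."""
    # Stage 1: convert from UTF-8 to UTF-16LE.
    utf16 = utf8_bytes.decode("utf-8").encode("utf-16-le")
    # Stage 2: NFD, canonical decomposition.
    text = unicodedata.normalize("NFD", utf16.decode("utf-16-le"))
    # Stage 3: convert from UTF-16LE back to UTF-8.
    return text.encode("utf-16-le").decode("utf-16-le").encode("utf-8")

def count_combining(text: str) -> int:
    """Count characters with a non-zero canonical combining class."""
    return sum(1 for ch in text if unicodedata.combining(ch) != 0)

# Tiny stand-in for the test file: a + grave, a + combining e, precomposed à.
sample = "a\u0300\na\u0364\n\u00E0\n".encode("utf-8")
result = normalize_to_nfd(sample).decode("utf-8")
print(count_combining(sample.decode("utf-8")), "->", count_combining(result))
```

With this stand-in input the count rises from 2 to 3, because the precomposed à is split into a base letter plus a combining grave accent.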
Re: Unicode normalize filters?
Posted: Fri Feb 09, 2018 12:59 am
by dfhtextpipe
There's something critically wrong in the Normalize to NFD filter.
It's far worse than reported in my previous comments.
Here are some Latin letters with various diacritics:
Code: Select all
ÀÁÂÃÄÅÇÈÉËÌÍÎÏÑÒÓÔÖØÙÚÛÜÝ
àáâãäåçèéêëìíîïñòóôõöøùúûüýÿ
ĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğ
ĠġĢģĤĥĦħĨĩĪīĬĭĮįİĴĵĶķĹĺĻļĽľĿ
With the same filter, this becomes:
Code: Select all
AAAAAACEEEIIIINOOOOØUUUUY
aaaaaaceeeeiiiinoooooøuuuuy GgGgHhĦħIiIiIiIiIJjKkLlLlLlĿ
NB. The long gap is filled with 68 NUL codes! (though phpBB has replaced these by spaces).
All but a few of the diacritics were stripped, leaving only these three characters unchanged:
Aside: This output file displays very differently when opened with BabelPad.
Code: Select all
䅁䅁䅁䕃䕅䥉䥉低住썏喘啕奕慡慡慡散敥楥楩湩潯潯썯疸畵祵杇杇案ꛄꟄ楉楉楉楉䩉䭪䱫䱬䱬쑬
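One possible explanation for this display, offered as an assumption rather than anything confirmed in this thread, is that the viewer is reading the UTF-8 output bytes as UTF-16LE, so each pair of bytes becomes one code unit. A minimal Python sketch of that misreading, applied to the first output line:

```python
# First line of the filter output; Ø is written as an escape for clarity.
output_line = "AAAAAACEEEIIIINOOOO\u00D8UUUUY"

# Encode as UTF-8 (what the filter chain writes), then misread as UTF-16LE.
utf8_bytes = output_line.encode("utf-8")
misread = utf8_bytes.decode("utf-16-le")

print(misread)
print([f"U+{ord(ch):04X}" for ch in misread])
```

Under this reading, the bytes 41 41 ("AA") become U+4141, and the two UTF-8 bytes of Ø (C3 98) are split across neighbouring code units.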
By way of comparison, this is what the output should be (after 106 normalizations):
Code: Select all
ÀÁÂÃÄÅÇÈÉËÌÍÎÏÑÒÓÔÖØÙÚÛÜÝ
àáâãäåçèéêëìíîïñòóôõöøùúûüýÿ
ĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğ
ĠġĢģĤĥĦħĨĩĪīĬĭĮįİĴĵĶķĹĺĻļĽľĿ
NB. phpBB has renormalized the pasted text to NFC, so I would need to supply the actual text file for a proper comparison.
btw. These characters are unchanged after normalization to NFD by means other than TextPipe.
Normalize to NFD should not be used for serious work in its current state.
btw. I confirm that when the NFD filter is disabled, the output and input files are identical,
so the Unicode conversions between UTF-8 and UTF-16 LE are not at fault.
Best regards,
David
Re: Unicode normalize filters?
Posted: Fri Feb 09, 2018 3:59 am
by dfhtextpipe
Hi Simon,
If it's likely that it will take a lot of time and effort to fix the bugs in the Unicode Normalize filters, it would be preferable to disable them in the next release and restore them only after they have been properly fixed and thoroughly tested.
Best regards,
David
Re: Unicode normalize filters?
Posted: Wed May 02, 2018 2:34 am
by dfhtextpipe
Hi Simon,
Anything on the horizon for this longstanding critical problem?
Best regards,
David
Re: Unicode normalize filters?
Posted: Fri May 04, 2018 8:29 am
by DataMystic Support
Hi David,
Yes, you'll be pleased to know we have a fix for this now.
Unrelated to the bug, I noticed one thing in this code that I don't believe is necessary: after conversion, it removes all NonSpacingMarks and CombiningMarks from the text.
This has been disabled for now.
Can you advise if this is expected behaviour?
Thanks,
Simon
Re: Unicode normalize filters?
Posted: Fri May 04, 2018 11:12 pm
by dfhtextpipe
Thanks Simon,
That's good news at last.
CombiningMarks and NonSpacingMarks should never be removed as part of Unicode Normalization.
btw. Am I correct in assuming that these are just alternative names for Combining Diacritical Marks and Spacing Modifier Letters?
Those would include the Unicode blocks
Code: Select all
Block Name                               Range       Code Points  Characters  Unicode Version
Spacing Modifier Letters                 02B0..02FF  80           80          1.0.0
Combining Diacritical Marks              0300..036F  112          112         1.0.0
Combining Diacritical Marks Extended     1AB0..1AFF  80           15          7.0
Combining Diacritical Marks Supplement   1DC0..1DFF  64           63          4.1
Combining Diacritical Marks for Symbols  20D0..20FF  48           33          1.0.0
Combining Half Marks                     FE20..FE2F  16           16          1.1
Not necessarily comprehensively listed.
Stripping [all|some] diacritics might be a useful process for some users, but any such option should be an independent feature.
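To illustrate the separation, here is a minimal sketch in Python (standard unicodedata module; the function names to_nfd and strip_marks are hypothetical and not TextPipe features). Normalization and diacritic stripping are kept as two independent steps, so the first never removes anything on its own.

```python
import unicodedata

def to_nfd(text: str) -> str:
    """Canonical decomposition only; no characters are discarded."""
    return unicodedata.normalize("NFD", text)

def strip_marks(text: str) -> str:
    """Optional, separate step: drop characters whose general category
    is a mark (Mn, Mc or Me) after decomposing."""
    return "".join(
        ch for ch in to_nfd(text)
        if not unicodedata.category(ch).startswith("M")
    )

word = "\u00E9a\u0364"             # é followed by a + COMBINING LATIN SMALL LETTER E
print(ascii(to_nfd(word)))         # decomposed, marks kept
print(ascii(strip_marks(word)))    # marks removed only on request
```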
David
Re: Unicode normalize filters?
Posted: Sun May 06, 2018 9:50 pm
by DataMystic Support
Unsure - the library code for this relies on a category table extracted from the Unicode database. Best that the code stays disabled for now.
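For what it's worth, general categories from the Unicode database can be inspected directly. The sketch below assumes Python's unicodedata module and says nothing about how the library's own table is built. It prints the category of one sample character from several of the blocks listed earlier, which suggests that block names and mark categories are not the same thing.

```python
import unicodedata

# One sample code point per block (sample choice is arbitrary).
SAMPLES = {
    "\u02B0": "Spacing Modifier Letters",
    "\u0301": "Combining Diacritical Marks",
    "\u0364": "Combining Diacritical Marks",
    "\u20D0": "Combining Diacritical Marks for Symbols",
    "\uFE20": "Combining Half Marks",
    "\u0903": "Devanagari (a spacing combining mark)",
}

def is_mark(ch: str) -> bool:
    """True for nonspacing (Mn), spacing combining (Mc) and enclosing (Me) marks."""
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")

for ch, block in SAMPLES.items():
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), is_mark(ch), block)
```

In this sample, U+02B0 from Spacing Modifier Letters is a modifier letter (Lm) rather than a mark, while the combining characters are nonspacing marks (Mn).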