Unicode normalize filters?

Moderators: DataMystic Support, Moderators

dfhtextpipe
Posts: 988
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Unicode normalize filters?

Post by dfhtextpipe »

Has there been any progress on the critical issues in the four Unicode normalize filters that I reported to you by email several months ago?

Best regards,

David
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Unicode normalize filters?

Post by DataMystic Support »

Sorry, not as yet

Re: Unicode normalize filters?

Post by dfhtextpipe »

The Help page "libiconv" (under Advanced Topics) states:
About libiconv

libiconv is a GNU library used by TextPipe for some of its Unicode conversions.

C-Source code, .obj files and binaries are available for free from

http://www.gnu.org/software/libiconv/
It's incredible that the Unicode normalize filters have critical bugs if such a library is being used.

I have still had no satisfactory explanation of this issue.

Best regards,

David

Re: Unicode normalize filters?

Post by DataMystic Support »

TextPipe 10.5 is being released today and uses the Windows Unicode libraries for these functions.

Please let me know how they go!

Re: Unicode normalize filters?

Post by dfhtextpipe »

I tested the 26 x 26 x 4 digraphs again, and everything worked as it should when converted to NFC,
i.e. there were no spurious characters in the output.
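
A rough way to repeat a check like this outside TextPipe, sketched in Python. It assumes that "26 x 26 x 4" means every two-letter combination of A-Z in four case patterns (aa, aA, Aa, AA); the post does not spell that out, so treat it as a guess. The helper name is made up for illustration, and only the standard library is used.

```python
import itertools
import string
import unicodedata


def ascii_digraphs():
    """Yield every two-letter A-Z digraph in four case patterns.

    Assumption: this is what "26 x 26 x 4" refers to.
    """
    for first, second in itertools.product(string.ascii_lowercase, repeat=2):
        yield first + second
        yield first + second.upper()
        yield first.upper() + second
        yield first.upper() + second.upper()


digraphs = list(ascii_digraphs())

# Plain ASCII text is already in NFC, so normalization must return it unchanged.
changed = [d for d in digraphs if unicodedata.normalize("NFC", d) != d]

print(len(digraphs), "digraphs tested,", len(changed), "changed by NFC")
```

Any non-empty `changed` list would point to the kind of spurious output described earlier in the thread.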

There are further tests that I could do, but that can wait a while.

btw. Did you notify the supplier of the defective Unicode library that something was seriously amiss?

Did no other TextPipe customer ever report the problem?

David

Re: Unicode normalize filters?

Post by dfhtextpipe »

btw. Somebody told me last month that Microsoft had changed how they implement Unicode rendering in Windows 7 compared to earlier versions of Windows.

They no longer use Uniscribe, even though it is still being maintained.

See https://en.wikipedia.org/wiki/Uniscribe

and https://en.wikipedia.org/wiki/DirectWrite

Even so, Windows 7 still displays some text differently from Mac OS X.
Certain glyphs in Biblical Hebrew are one area where such differences are evident.

Best regards,

David
Last edited by dfhtextpipe on Thu Feb 08, 2018 6:13 am, edited 1 time in total.

Re: Unicode normalize filters?

Post by DataMystic Support »

Hi David - we've removed the library in question from our build as it hasn't been updated in 2 years.

Re: Unicode normalize filters?

Post by dfhtextpipe »

Unicode Normalization using TextPipe 10.6.2 still has critical problems.

The filter Normalize to NFD completely removes some characters such as the following:

Code:

U+0364	ͤ	COMBINING LATIN SMALL LETTER E
U+0365	ͥ	COMBINING LATIN SMALL LETTER I
U+036D	ͭ	COMBINING LATIN SMALL LETTER T
This is totally unwarranted.
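
For comparison, here is a minimal Python sketch of what a conforming NFD implementation does with these three characters. It uses the standard `unicodedata` module rather than TextPipe, and attaching each mark to the base letter "a" is just an illustrative choice.

```python
import unicodedata

# The three combining characters from the listing above.
marks = ["\u0364", "\u0365", "\u036D"]

survivors = []
for mark in marks:
    text = "a" + mark  # base letter followed by the combining mark
    nfd = unicodedata.normalize("NFD", text)
    # These marks have no canonical decomposition, so NFD should keep them as-is.
    kept = mark in nfd
    if kept:
        survivors.append(mark)
    print(f"U+{ord(mark):04X} {unicodedata.name(mark)}: kept = {kept}")
```

All three marks should be reported as kept.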

I only discovered this inadvertently today while I was using TextPipe to analyse the XML file in this repo.
https://github.com/lemtom/coverdale

(And yes, I did first convert UTF-8 to UTF-16 LE, etc.)

Best regards,

David

Re: Unicode normalize filters?

Post by dfhtextpipe »

Here's a suitable test file that demonstrates that Normalize to NFD removes all combining characters.

Code:

à
á
â
ã
ā
a̅
ă
ȧ
ä
ả
å
a̋
ǎ
a̍
a̎
ȁ
a̐
ȃ
a̒
a̓
a̔
a̕
a̖
a̗
a̘
a̙
a̚
a̛
a̜
a̝
a̞
a̟
a̠
a̡
a̢
ạ
a̤
ḁ
a̦
a̧
ą
a̩
a̪
a̫
a̬
a̭
a̮
a̯
a̰
a̱
a̲
a̳
a̴
a̵
a̶
a̷
a̸
a̹
a̺
a̻
a̼
a̽
a̾
a̿
à
á
a͂
a̓
ä
á
aͅ
a͆
a͇
a͈
a͉
a͊
a͋
a͌
a͍
a͎
a͐
a͑
a͒
a͓
a͔
a͕
a͖
a͗
a͘
a͙
a͚
a͛
a͜
a͝
a͞
a͟
a͠
a͡
a͢
aͣ
aͤ
aͥ
aͦ
aͧ
aͨ
aͩ
aͪ
aͫ
aͬ
aͭ
aͮ
aͯ
Save the above as a UTF-8 text file, then run this filter with that as the input file.

Code:

Comment...
|  Normalize to NFD
|
|--Convert from UTF-8 to UTF-16LE
|   
|--NFD - Canonical Decomposition
|   
+--Convert from UTF-16LE to UTF-8
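
As a cross-check outside TextPipe, the same decode, normalize, re-encode sequence can be sketched in Python. This is only an approximation of the filter chain above: Python strings make the intermediate UTF-16LE step unnecessary, the helper name is hypothetical, and the generated input covers every code point in U+0300..U+036F, so it is similar to the pasted test file but not byte-identical.

```python
import unicodedata


def normalize_utf8_to_nfd(data: bytes) -> bytes:
    """Decode UTF-8, apply NFD, and re-encode as UTF-8 (hypothetical helper)."""
    return unicodedata.normalize("NFD", data.decode("utf-8")).encode("utf-8")


# Build a test input: "a" followed by each code point in U+0300..U+036F.
lines = ["a" + chr(cp) for cp in range(0x0300, 0x0370)]
source = "\n".join(lines).encode("utf-8")

result = normalize_utf8_to_nfd(source).decode("utf-8").split("\n")

# Every line should still be "a" followed by one or more nonspacing marks (Mn).
stripped = [
    line for line in result
    if len(line) < 2 or any(unicodedata.category(ch) != "Mn" for ch in line[1:])
]

print(len(result), "lines,", len(stripped), "lost their combining mark")
```

A correct NFD step may replace a few marks with their canonical equivalents (U+0340 becomes U+0300, for example), but it never leaves a bare "a".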
    
David

Re: Unicode normalize filters?

Post by dfhtextpipe »

There's something critically wrong in the Normalize to NFD filter.
It's far worse than reported in my previous comments.

Here are some Latin letters with various diacritics:

Code:

ÀÁÂÃÄÅÇÈÉËÌÍÎÏÑÒÓÔÖØÙÚÛÜÝ
àáâãäåçèéêëìíîïñòóôõöøùúûüýÿ
ĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğ
ĠġĢģĤĥĦħĨĩĪīĬĭĮįİĴĵĶķĹĺĻļĽľĿ
With the same filter, this becomes:

Code:

AAAAAACEEEIIIINOOOOØUUUUY
aaaaaaceeeeiiiinoooooøuuuuy                                                                    GgGgHhĦħIiIiIiIiIJjKkLlLlLlĿ
NB. The long gap is filled with 68 NUL codes (though phpBB has replaced them with spaces).
All but a few of the diacritics were stripped, leaving only these three characters unchanged:

Code:

Ħħ Ŀ
Aside: This output file displays very differently when opened with BabelPad.

Code:

䅁䅁䅁䕃䕅䥉䥉低住썏喘啕奕਍慡慡慡散敥楥楩湩潯潯썯疸畵祵杇杇案ꛄꟄ楉楉楉楉䩉䭪䱫䱬䱬쑬
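
One possible explanation for the BabelPad display, inferred from the characters shown rather than confirmed anywhere in this thread: the output looks like single-byte text being read as UTF-16 LE, so each pair of bytes is shown as one code unit. A short Python illustration of that effect:

```python
# Two ASCII bytes read as one UTF-16 LE code unit: low byte first, then high byte.
pair = "NO".encode("ascii")          # bytes 0x4E 0x4F
misread = pair.decode("utf-16-le")   # a single code unit, U+4F4E

print(f"{pair!r} read as UTF-16 LE gives U+{ord(misread):04X}")

# The same effect on the first 16 letters of the stripped output shown earlier.
sample = "AAAAAACEEEIIIINO".encode("ascii")
decoded = sample.decode("utf-16-le")
print(" ".join(f"U+{ord(ch):04X}" for ch in decoded))
```

The resulting code points (U+4141 three times, then U+4543, U+4545, U+4949 twice, U+4F4E) fall in the CJK ranges, which is consistent with the start of the BabelPad view.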
By way of comparison, this is what the output should be (after 106 normalizations):

Code:

ÀÁÂÃÄÅÇÈÉËÌÍÎÏÑÒÓÔÖØÙÚÛÜÝ
àáâãäåçèéêëìíîïñòóôõöøùúûüýÿ
ĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğ
ĠġĢģĤĥĦħĨĩĪīĬĭĮįİĴĵĶķĹĺĻļĽľĿ
NB. phpBB has renormalized the pasted text to NFC, so I would need to supply the actual text file for a proper comparison.

btw. These characters are unchanged after normalization to NFD by means other than TextPipe.

Code:

Øø Đđ Ħħ Ŀ
Normalize to NFD should not be used for serious work in its current state.

btw. I confirm that when the NFD filter is disabled, the output and input files are identical,
so the Unicode conversions between UTF-8 and UTF-16 LE are not at fault.
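
For reference, the expected behaviour can be reproduced with Python's standard `unicodedata` module. This is a sketch of what a conforming NFD step does with a few of the letters above; it says nothing about how TextPipe implements the filter. The code points are written numerically so that the source file itself cannot be renormalized in transit.

```python
import unicodedata

# À Á Â Ã Ä Å Ç È É as precomposed code points.
sample = "".join(chr(cp) for cp in (0xC0, 0xC1, 0xC2, 0xC3, 0xC4, 0xC5, 0xC7, 0xC8, 0xC9))
nfd = unicodedata.normalize("NFD", sample)

# Each precomposed letter becomes a base letter plus one combining mark,
# so the string doubles in length and recomposes to the original under NFC.
round_trip = unicodedata.normalize("NFC", nfd)
print(len(sample), "->", len(nfd), "round trip ok:", round_trip == sample)

# Ø ø Đ đ Ħ ħ Ŀ have no canonical decomposition, so NFD leaves them untouched.
unchanged = "".join(chr(cp) for cp in (0xD8, 0xF8, 0x110, 0x111, 0x126, 0x127, 0x13F))
unchanged_nfd = unicodedata.normalize("NFD", unchanged)
print("unchanged letters preserved:", unchanged_nfd == unchanged)
```

Nothing is stripped and no NUL codes appear: decomposition only lengthens the text.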

Best regards,

David
Last edited by dfhtextpipe on Fri Feb 09, 2018 4:50 am, edited 5 times in total.

Re: Unicode normalize filters?

Post by dfhtextpipe »

Hi Simon,

If it's likely that it will take a lot of time and effort to fix the bugs in the Unicode Normalize filters, it would be preferable to disable them in the next release and restore them only after they have been properly fixed and thoroughly tested.

Best regards,

David

Re: Unicode normalize filters?

Post by dfhtextpipe »

Hi Simon,

Anything on the horizon for this longstanding critical problem?

Best regards,

David

Re: Unicode normalize filters?

Post by DataMystic Support »

Hi David,

Yes, you'll be pleased to know we have a fix for this now.

Unrelated to the bug, I noticed one thing in this code that I don't believe is necessary: after conversion, it removes all NonSpacingMarks and CombiningMarks from the text.

This has been disabled for now.

Can you advise if this is expected behaviour?

Thanks,

Simon

Re: Unicode normalize filters?

Post by dfhtextpipe »

Thanks Simon,

That's good news at last.

CombiningMarks and NonSpacingMarks should never be removed as part of Unicode Normalization.

btw. Am I correct in assuming that these are just alternative names for Combining Diacritical Marks and Spacing Modifier Letters?
Those would include the Unicode blocks

Code:

Block Name                               Range       Code Points  Characters  Unicode Version
Spacing Modifier Letters                 02B0..02FF  80           80          1.0.0
Combining Diacritical Marks              0300..036F  112          112         1.0.0
Combining Diacritical Marks Extended     1AB0..1AFF  80           15          7.0
Combining Diacritical Marks Supplement   1DC0..1DFF  64           63          4.1
Combining Diacritical Marks for Symbols  20D0..20FF  48           33          1.0.0
Combining Half Marks                     FE20..FE2F  16           16          1.1
This list is not necessarily comprehensive.
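
On the naming question, one hedged observation: in the Unicode Character Database, "Nonspacing Mark" (Mn), "Spacing Mark" (Mc) and "Enclosing Mark" (Me) are General Category values assigned per character, whereas the table above lists blocks, which are code point ranges. Whether the library's NonSpacingMarks and CombiningMarks map onto those categories is an assumption here. A quick look at a few characters with Python's `unicodedata` module (the block labels are written in by hand):

```python
import unicodedata

# Character -> the block it lives in (labels typed by hand, not queried).
examples = {
    "\u02B0": "Spacing Modifier Letters",
    "\u0300": "Combining Diacritical Marks",
    "\u20DD": "Combining Diacritical Marks for Symbols",
    "\u0903": "Devanagari",
}

categories = {}
for ch, block in examples.items():
    categories[ch] = unicodedata.category(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} ({block}): {categories[ch]}")
```

Block and category are different axes: U+02B0 sits in Spacing Modifier Letters yet is a letter (Lm), while marks such as U+0903 (Mc) live in script blocks outside the table.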

Stripping [all|some] diacritics might be a useful process for some users, but any such option should be an independent feature.
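
To illustrate the separation being suggested, here is a small Python sketch in which normalization and mark stripping are two distinct functions. Both function names are invented for the example, and dropping only category Mn is one possible definition of "stripping diacritics", not the only one.

```python
import unicodedata


def to_nfd(text: str) -> str:
    """Canonical decomposition only; every mark is kept."""
    return unicodedata.normalize("NFD", text)


def strip_marks(text: str) -> str:
    """Separate, opt-in step: decompose, then drop nonspacing marks (Mn)."""
    return "".join(ch for ch in to_nfd(text) if unicodedata.category(ch) != "Mn")


word = "r\u00E9sum\u00E9"  # "résumé" with precomposed é
print(len(word), len(to_nfd(word)), strip_marks(word))
```

Calling `to_nfd` alone never loses information, and a user who wants plain base letters opts in to `strip_marks` explicitly.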

David

Re: Unicode normalize filters?

Post by DataMystic Support »

Unsure - the library code for this relies on a category table extracted from the Unicode database. It's best that the code stays disabled for now.