Unicode normalize filters?

Post by **dfhtextpipe** » Wed Sep 06, 2017 7:39 pm

Any progress on the critical issues in the four Unicode normalize filters that I communicated to you by email several months ago?

Best regards,

David

Post by **DataMystic Support** » Wed Sep 06, 2017 9:27 pm

Sorry, not as yet

Post by **dfhtextpipe** » Thu Sep 28, 2017 12:26 am

The Help page "libiconv" (under Advanced Topics) states:

About libiconv

libiconv is a GNU library used by TextPipe for some of its Unicode conversions.

C-Source code, .obj files and binaries are available for free from

http://www.gnu.org/software/libiconv/

It's incredible that the Unicode normalize filters have critical bugs if such a library is being used.

I have still had no satisfactory explanation of this issue.

Best regards,

David

Post by **DataMystic Support** » Fri Oct 20, 2017 5:49 am

TextPipe 10.5 is being released today and uses the Windows Unicode libraries for these functions.

Please let me know how they go!

Post by **dfhtextpipe** » Sat Oct 21, 2017 12:01 am

I tested the 26 x 26 x 4 digraphs again and it all worked as it should when converted to NFC.
i.e. No unwarranted spurious characters in the output.

There are further tests that I could do, but that can wait a while.

btw. Did you notify the supplier of the defective Unicode library that something was seriously amiss?

Did no other TextPipe customer ever report the problem?

David

Post by **dfhtextpipe** » Sat Oct 21, 2017 12:09 am

btw. Somebody told me last month that Microsoft had changed how they implement Unicode rendering in Windows 7 compared to earlier versions of Windows.

They no longer use Uniscribe, even though it is still being maintained.

See https://en.wikipedia.org/wiki/Uniscribe

and https://en.wikipedia.org/wiki/DirectWrite

Even so, there are some things that Windows 7 displays differently than the same text on Mac OS X.
Some glyphs in Biblical Hebrew are one application area where such differences are evident.

Best regards,

David

Post by **DataMystic Support** » Mon Oct 23, 2017 7:19 am

Hi David - we've removed the library in question from our build as it hasn't been updated in 2 years.

Post by **dfhtextpipe** » Thu Feb 08, 2018 6:12 am

Unicode Normalization using TextPipe 10.6.2 still has critical problems.

The filter Normalize to NFD completely removes some characters such as the following:

Code: Select all

U+0364	ͤ	COMBINING LATIN SMALL LETTER E
U+0365	ͥ	COMBINING LATIN SMALL LETTER I
U+036D	ͭ	COMBINING LATIN SMALL LETTER T

This is totally unwarranted.
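
For comparison, here is a minimal Python sketch of what a conforming normalizer does with these three characters. It uses the standard `unicodedata` module as a reference implementation (an assumption for illustration only; it says nothing about TextPipe's internals). The base letter `a` is just a hypothetical carrier for the combining mark.

```python
import unicodedata

# The three combining characters reported as being removed by the filter.
CODE_POINTS = (0x0364, 0x0365, 0x036D)

kept = {}
for cp in CODE_POINTS:
    mark = chr(cp)
    sample = "a" + mark  # hypothetical test string: base letter + mark
    nfd = unicodedata.normalize("NFD", sample)
    kept[cp] = mark in nfd
    print(
        f"U+{cp:04X} {unicodedata.name(mark)} "
        f"ccc={unicodedata.combining(mark)} kept_after_NFD={kept[cp]}"
    )
```

In this reference implementation each mark has a non-zero combining class and no decomposition mapping, so NFD leaves the sequence unchanged.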

I only discovered this inadvertently today while I was using TextPipe to analyse the XML file in this repo.
https://github.com/lemtom/coverdale

(And yes, I did first convert UTF-8 to UTF-16 LE, etc.)

Best regards,

David

Post by **dfhtextpipe** » Thu Feb 08, 2018 8:07 am

Here's a suitable test file that demonstrates that Normalize to NFD removes all combining characters.

Code: Select all

à
á
â
ã
ā
a̅
ă
ȧ
ä
ả
å
a̋
ǎ
a̍
a̎
ȁ
a̐
ȃ
a̒
a̓
a̔
a̕
a̖
a̗
a̘
a̙
a̚
a̛
a̜
a̝
a̞
a̟
a̠
a̡
a̢
ạ
a̤
ḁ
a̦
a̧
ą
a̩
a̪
a̫
a̬
a̭
a̮
a̯
a̰
a̱
a̲
a̳
a̴
a̵
a̶
a̷
a̸
a̹
a̺
a̻
a̼
a̽
a̾
a̿
à
á
a͂
a̓
ä
á
aͅ
a͆
a͇
a͈
a͉
a͊
a͋
a͌
a͍
a͎
a͐
a͑
a͒
a͓
a͔
a͕
a͖
a͗
a͘
a͙
a͚
a͛
a͜
a͝
a͞
a͟
a͠
a͡
a͢
aͣ
aͤ
aͥ
aͦ
aͧ
aͨ
aͩ
aͪ
aͫ
aͬ
aͭ
aͮ
aͯ

Save the above as a UTF-8 text file, then run this filter with that as the input file.

Code: Select all

Comment...
|  Normalize to NFD
|
|--Convert from UTF-8 to UTF-16LE
|   
|--NFD - Canonical Decomposition
|   
+--Convert from UTF-16LE to UTF-8
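
As a cross-check, the same three steps can be sketched in Python. This is only an illustrative stand-in for the filter chain above, assuming the standard `unicodedata` module as the reference normalizer; the function names `nfd_pipeline` and `mark_count` are hypothetical.

```python
import unicodedata


def nfd_pipeline(utf8_bytes: bytes) -> bytes:
    """Mimic the filter chain: UTF-8 -> UTF-16LE -> NFD -> UTF-8."""
    utf16 = utf8_bytes.decode("utf-8").encode("utf-16-le")  # step 1
    text = unicodedata.normalize("NFD", utf16.decode("utf-16-le"))  # step 2
    return text.encode("utf-8")  # step 3


def mark_count(text: str) -> int:
    """Count characters whose general category is a mark (Mn, Mc, Me)."""
    return sum(1 for ch in text if unicodedata.category(ch).startswith("M"))


# Three lines in the style of the test file: one precomposed letter
# and two base + combining-mark sequences.
sample = "\u00e0\na\u0305\na\u0364\n".encode("utf-8")
result = nfd_pipeline(sample).decode("utf-8")

print([f"U+{ord(ch):04X}" for ch in result if ch != "\n"])
print("marks in output:", mark_count(result))
```

With a conforming normalizer every line of the output still ends in a combining mark; the precomposed letter is split into base plus mark rather than losing its accent.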

Post by **dfhtextpipe** » Fri Feb 09, 2018 12:59 am

There's something critically wrong in the Normalize to NFD filter.
It's far worse than reported in my previous comments.

Here are some Latin letters with various diacritics:

Code: Select all

ÀÁÂÃÄÅÇÈÉËÌÍÎÏÑÒÓÔÖØÙÚÛÜÝ
àáâãäåçèéêëìíîïñòóôõöøùúûüýÿ
ĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğ
ĠġĢģĤĥĦħĨĩĪīĬĭĮįİĴĵĶķĹĺĻļĽľĿ

With the same filter, this becomes:

Code: Select all

AAAAAACEEEIIIINOOOOØUUUUY
aaaaaaceeeeiiiinoooooøuuuuy                                                                    GgGgHhĦħIiIiIiIiIJjKkLlLlLlĿ

NB. The long gap is filled with 68 NUL codes! (though phpBB has replaced these by spaces).
All but a few of the diacritics were stripped, leaving only these three characters unchanged:

Code: Select all

Ħħ Ŀ

Aside: This output file displays very differently when opened with BabelPad.

Code: Select all

䅁䅁䅁䕃䕅䥉䥉低住썏喘啕奕਍慡慡慡散敥楥楩湩潯潯썯疸畵祵杇杇案ꛄꟄ楉楉楉楉䩉䭪䱫䱬䱬쑬
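
One possible explanation for the BabelPad display, offered as an assumption rather than a confirmed diagnosis: the viewer may be reading the UTF-8 output bytes as UTF-16LE code units. The Python sketch below reinterprets the first stripped output line (with an assumed CRLF line ending) in that way.

```python
# First line of the stripped output, followed by a CRLF line ending
# (the CRLF is an assumption about the file on disk).
first_line = "AAAAAACEEEIIIINOOOOØUUUUY\r\n"

# What the file would contain as UTF-8 bytes.
raw = first_line.encode("utf-8")

# Reinterpret each pair of bytes as one little-endian UTF-16 code unit.
misread = raw.decode("utf-16-le")

print(len(raw), "bytes ->", len(misread), "code units")
print(misread)
```

The result matches the start of the text shown above, which suggests the NUL run and missing BOM may be leading BabelPad to guess the wrong encoding.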

By way of comparison, this is what the output should be (after 106 normalizations):

Code: Select all

ÀÁÂÃÄÅÇÈÉËÌÍÎÏÑÒÓÔÖØÙÚÛÜÝ
àáâãäåçèéêëìíîïñòóôõöøùúûüýÿ
ĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğ
ĠġĢģĤĥĦħĨĩĪīĬĭĮįİĴĵĶķĹĺĻļĽľĿ

NB. phpBB has renormalized the pasted text to NFC, so I would need to supply the actual text file for a proper comparison.

btw. These characters are unchanged after normalization to NFD by means other than TextPipe.

Code: Select all

Øø Đđ Ħħ Ŀ
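
A short Python sketch, assuming the standard `unicodedata` module as one such "other means", shows why these characters stay as they are: none of them has a canonical decomposition mapping.

```python
import unicodedata

UNCHANGED = "ØøĐđĦħĿ"

report = {}
for ch in UNCHANGED:
    nfd = unicodedata.normalize("NFD", ch)
    # decomposition() returns "" when there is no mapping at all, and a
    # "<compat> ..." string for compatibility-only mappings, which NFD ignores.
    report[ch] = (nfd == ch, unicodedata.decomposition(ch))
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"unchanged={report[ch][0]} decomposition={report[ch][1]!r}")

# By contrast, a letter with a canonical mapping is split into base + mark.
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", "À")])
```

Ŀ is the odd one out: it has a compatibility mapping only, so it would change under NFKD but not under NFD.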

Normalize to NFD should not be used for serious work in its current state.

btw. I confirm that when the NFD filter is disabled, the output and input files are identical,
so the Unicode conversions between UTF-8 and UTF-16 LE are not at fault.

Best regards,

David

Post by **dfhtextpipe** » Fri Feb 09, 2018 3:59 am

Hi Simon,

If it's likely that it will take a lot of time and effort to fix the bugs in the Unicode Normalize filters, it would be preferable to disable them in the next release and restore them only after they have been properly fixed and thoroughly tested.

Best regards,

David

Post by **dfhtextpipe** » Wed May 02, 2018 2:34 am

Hi Simon,

Anything on the horizon for this longstanding critical problem?

Best regards,

David

Post by **DataMystic Support** » Fri May 04, 2018 8:29 am

Hi David,

Yes, you'll be pleased to know we have a fix for this now.

Unrelated to the bug, I noticed that after conversion this code removes all NonSpacingMarks and CombiningMarks from the text, which I don't believe is necessary.

This has been disabled for now.

Can you advise if this is expected behaviour?

Thanks,

Simon

Post by **dfhtextpipe** » Fri May 04, 2018 11:12 pm

Thanks Simon,

That's good news at last.

CombiningMarks and NonSpacingMarks should never be removed as part of Unicode Normalization.

btw. Am I correct in assuming that these are just alternative names for Combining Diacritical Marks and Spacing Modifier Letters?
Those would include the Unicode blocks

Code: Select all

Block Name	Range	Code Points	Characters	Unicode Version
Spacing Modifier Letters	02B0..02FF	80	80	1.0.0
Combining Diacritical Marks	0300..036F	112	112	1.0.0
Combining Diacritical Marks Extended	1AB0..1AFF	80	15	7.0
Combining Diacritical Marks Supplement	1DC0..1DFF	64	63	4.1
Combining Diacritical Marks for Symbols	20D0..20FF	48	33	1.0.0
Combining Half Marks	FE20..FE2F	16	16	1.1

Not necessarily comprehensively listed.

Stripping [all|some] diacritics might be a useful process for some users, but any such option should be an independent feature.
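
To illustrate the separation, here is a minimal Python sketch with normalization and diacritic stripping as two independent functions. The names `normalize_nfd` and `strip_marks` are hypothetical, and the choice to strip only general category Mn is an assumption about what "stripping diacritics" should mean.

```python
import unicodedata


def normalize_nfd(text: str) -> str:
    """Canonical decomposition only; never discards any character."""
    return unicodedata.normalize("NFD", text)


def strip_marks(text: str) -> str:
    """Separate, opt-in step: decompose, then drop nonspacing marks (Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


word = "Àé"
print([f"U+{ord(c):04X}" for c in normalize_nfd(word)])  # marks retained
print(strip_marks(word))                                  # marks removed
print(strip_marks("Øø"))                                  # no marks to remove
```

Keeping the two apart means a user who only asks for NFD gets text that is canonically equivalent to the input, while stripping remains an explicit choice.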

David

Post by **DataMystic Support** » Sun May 06, 2018 9:50 pm

Unsure - the library code for this relies on a category table extracted from the Unicode database. It's best that this code stays disabled for now.
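
If the category table in question holds Unicode general categories (an assumption; the library's actual table is not shown in this thread), then the names would refer to per-character properties such as Mn and Mc rather than to blocks. A small Python sketch using the standard `unicodedata` module shows the difference for a few hand-picked sample characters:

```python
import unicodedata

# Sample characters chosen for illustration only.
SAMPLES = ["\u02B0", "\u0301", "\u0903", "\u20DD"]

categories = {}
for ch in SAMPLES:
    categories[ch] = unicodedata.category(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {categories[ch]}")
```

Here U+02B0 from the Spacing Modifier Letters block is a letter (Lm), not a mark, while marks of category Mn, Mc and Me turn up in many blocks beyond the Combining Diacritical Marks ranges.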
