Unicode support beyond the Basic Multilingual Plane?

Posted: Mon Dec 10, 2018 8:40 pm
by dfhtextpipe
It rather looks as though TextPipe does not support Unicode characters beyond the Basic Multilingual Plane.

cf. Unicode 11 defines code points up to Plane 16 (100000..10FFFF, Supplementary Private Use Area-B).
See https://www.unicode.org/versions/Unicode11.0.0/

I've just been testing the filter Convert Numeric HTML/XML Entities to text using the trial run area.

Codes beyond the BMP are improperly converted. e.g.

𑊰
becomes
ኰ
which is U+12B0 ETHIOPIC SYLLABLE KWA
The proper conversion should be U+112B0 KHUDAWADI LETTER A

Thus files containing NCRs with more than 4 hex digits would be converted with errors in the output.
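
To illustrate the symptom, here is a minimal Python sketch (not TextPipe code). `decode_ncr` converts numeric character references across the full Unicode range. `decode_ncr_truncated` is a hypothetical reproduction that keeps only the low 16 bits of the code point. That truncation is only my assumption about the cause; it happens to match the output I observed. The function names and the example reference `&#x112B0;` are mine, for illustration.

```python
import re

# Matches hexadecimal (&#x...;) and decimal (&#...;) numeric character references.
NCR = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def _code_point(match: re.Match) -> int:
    """Return the integer code point named by a matched reference."""
    if match.group(1) is not None:
        return int(match.group(1), 16)
    return int(match.group(2), 10)


def decode_ncr(text: str) -> str:
    """Decode references over the whole range U+0001..U+10FFFF."""
    def repl(match: re.Match) -> str:
        cp = _code_point(match)
        is_surrogate = 0xD800 <= cp <= 0xDFFF
        if cp == 0 or cp > 0x10FFFF or is_surrogate:
            return match.group(0)  # leave invalid references untouched
        return chr(cp)
    return NCR.sub(repl, text)


def decode_ncr_truncated(text: str) -> str:
    """Hypothetical buggy decoder: keeps only the low 16 bits (assumption)."""
    return NCR.sub(lambda m: chr(_code_point(m) & 0xFFFF), text)


if __name__ == "__main__":
    sample = "&#x112B0;"
    for decoder in (decode_ncr, decode_ncr_truncated):
        result = decoder(sample)
        print(decoder.__name__, [f"U+{ord(ch):04X}" for ch in result])
```

With this sketch, the full-range decoder yields U+112B0, while the truncated one yields U+12B0, because 0x112B0 & 0xFFFF = 0x12B0.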

When will TextPipe become more fully compliant with the latest Unicode standard?

Best regards,

David

Re: Unicode support beyond the Basic Multilingual Plane?

Posted: Mon May 06, 2019 10:39 am
by DataMystic Support
We are currently looking into what is required here.

Re: Unicode support beyond the Basic Multilingual Plane?

Posted: Mon Mar 02, 2020 10:09 pm
by dfhtextpipe
What's New in TextPipe v11 – 12 December, 2019
==============================================
...
  • Upgraded Unicode support to Unicode 12.1.
...

Thanks!

David

Re: Unicode support beyond the Basic Multilingual Plane?

Posted: Mon Mar 02, 2020 10:14 pm
by dfhtextpipe
Bug alert!

Convert Numeric HTML/XML Entities to text converted

𑊰 ꨀ

to

; ;

NB. I have also just tried the same entity data in a UTF-8 input file as well as in the Trial Run area.
The output file simply contained a semicolon, just as in the Trial Run area.

This has now become a very serious software bug!

Regards,

David

Re: Unicode support beyond the Basic Multilingual Plane?

Posted: Tue Mar 03, 2020 12:03 am
by dfhtextpipe
I have also retested the similar filter called Convert HTML/XML entities to text.

Please refer to https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references

It's apparent from the Help page entitled Convert HTML/XML entities to text that this filter only supports the HTML 4.0 entity set.

It would therefore be a further essential improvement to expand the covered entities to the larger set of character entity references in HTML 5.0, complete with the alternative names for some of these.

Furthermore, I have just tested all 252 covered entities with the filter.
Measured against the HTML 5.0 standard, two of these are now improperly converted by TextPipe 11.4:

Entity	Character	TextPipe	Exact?
&lang;	⟨	〈	FALSE
&rang;	⟩	〉	FALSE

&lang; should be U+27E8 (moved to this code point in HTML 5.0; in HTML 4.0 it was mapped to U+2329, decimal 9001).
&rang; should be U+27E9 (moved to this code point in HTML 5.0; in HTML 4.0 it was mapped to U+232A, decimal 9002).
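
For anyone wanting to cross-check the two mappings independently, here is a small Python sketch. It uses the standard library only as a convenient reference and says nothing about how TextPipe is implemented. `html.entities.name2codepoint` holds the HTML 4 table, and `html.entities.html5` holds the HTML 5 table, which `html.unescape` uses. The helper name `compare_mappings` is hypothetical.

```python
import html
from html.entities import html5, name2codepoint


def compare_mappings(names):
    """Return (name, HTML 4 code point, HTML 5 code point) for single-character entities."""
    rows = []
    for name in names:
        html4_cp = name2codepoint[name]   # HTML 4 table, keyed without ';'
        html5_text = html5[name + ";"]    # HTML 5 table, keyed with ';'
        rows.append((name, html4_cp, ord(html5_text)))
    return rows


if __name__ == "__main__":
    for name, cp4, cp5 in compare_mappings(["lang", "rang"]):
        status = "same" if cp4 == cp5 else "moved"
        print(f"&{name};  HTML 4: U+{cp4:04X} ({cp4})  HTML 5: U+{cp5:04X}  {status}")
    # html.unescape follows the HTML 5 table.
    print(f"U+{ord(html.unescape('&lang;')):04X}")
```

Running the same comparison over every key in `name2codepoint` would be one way to list any other entities whose code point differs between the two tables.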
Best regards,

David

Re: Unicode support beyond the Basic Multilingual Plane?

Posted: Wed Mar 04, 2020 2:15 am
by dfhtextpipe
See also http://www.datamystic.com/forums/viewtopic.php?f=17&t=2505

Re: Unicode support beyond the Basic Multilingual Plane?

Posted: Sat Mar 14, 2020 2:36 am
by dfhtextpipe
Hi Simon,

Anything to report on this critical issue and the related one?

Best regards,

David

Re: Unicode support beyond the Basic Multilingual Plane?

Posted: Wed Apr 29, 2020 9:03 am
by DataMystic Support
Hi David,

This is being prepared for v11.6.

Regards,

Simon

Re: Unicode support beyond the Basic Multilingual Plane?

Posted: Wed Apr 29, 2020 5:30 pm
by dfhtextpipe
Excellent news!

David