Unicode support beyond the Basic Multilingual Plane?
Posted: Mon Dec 10, 2018 8:40 pm
by dfhtextpipe
It rather looks as though TextPipe does not support Unicode characters beyond the Basic Multilingual Plane.
cf.
Unicode 11 added Plane 16 to the standard. 100000..10FFFF Supplementary Private Use Area-B
See https://www.unicode.org/versions/Unicode11.0.0/
I've just been testing the filter Convert Numeric HTML/XML Entities to text using the Trial Run area.
Codes beyond the BMP are improperly converted. For example, &#x112B0; becomes ኰ, which is
U+12B0 ETHIOPIC SYLLABLE KWA
The proper conversion should be
U+112B0 KHUDAWADI LETTER A
Thus files containing NCRs with more than 4 hex digits would be converted with errors in the output.
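To make the symptom concrete, here is a minimal Python sketch of the difference between decoding the full code point and keeping only the low 16 bits. The truncation is only an assumption that happens to reproduce the result above; it is not a confirmed description of TextPipe's internals, and the function names are made up for illustration.

```python
import re

# Matches decimal (&#70320;) and hexadecimal (&#x112B0;) numeric character references.
NCR_PATTERN = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def _code_point(match: re.Match) -> int:
    """Return the numeric value of a matched NCR."""
    hex_digits, dec_digits = match.groups()
    return int(hex_digits, 16) if hex_digits else int(dec_digits)


def decode_ncrs(text: str) -> str:
    """Replace each NCR with the character for its full code point."""
    def replace(match: re.Match) -> str:
        cp = _code_point(match)
        # Leave references outside the Unicode range, or in the
        # surrogate range, untouched rather than guessing.
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            return match.group(0)
        return chr(cp)
    return NCR_PATTERN.sub(replace, text)


def decode_ncrs_truncated(text: str) -> str:
    """Hypothetical buggy variant: keeps only the low 16 bits (assumption)."""
    return NCR_PATTERN.sub(lambda m: chr(_code_point(m) & 0xFFFF), text)


if __name__ == "__main__":
    sample = "&#x112B0;"
    print(f"U+{ord(decode_ncrs(sample)):04X}")
    print(f"U+{ord(decode_ncrs_truncated(sample)):04X}")
```

With the sample `&#x112B0;`, the first function yields KHUDAWADI LETTER A and the truncating variant yields ETHIOPIC SYLLABLE KWA, which matches what the Trial Run area showed.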
When will TextPipe become more fully compliant with the latest Unicode standard?
Best regards,
David
Re: Unicode support beyond the Basic Multilingual Plane?
Posted: Mon May 06, 2019 10:39 am
by DataMystic Support
We are currently looking into what is required here.
Re: Unicode support beyond the Basic Multilingual Plane?
Posted: Mon Mar 02, 2020 10:09 pm
by dfhtextpipe
What's New in TextPipe v11 – 12 December, 2019
==============================================
...
- Upgraded Unicode support to Unicode 12.1.
...
Thanks!
David
Re: Unicode support beyond the Basic Multilingual Plane?
Posted: Mon Mar 02, 2020 10:14 pm
by dfhtextpipe
Bug alert!
Convert Numeric HTML/XML Entities to text converted
to
NB. I have also just tried using this same example of entity data in a UTF-8 input file as well as in the Trial Run area.
The output file simply had a semicolon, just like the Trial Run area.
This has now become a very serious software bug!
Regards,
David
Re: Unicode support beyond the Basic Multilingual Plane?
Posted: Tue Mar 03, 2020 12:03 am
by dfhtextpipe
I have also retested the similar filter called Convert HTML/XML entities to text.
Please refer to https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
It's apparent from the Help page entitled Convert HTML/XML entities to text that this filter only supports HTML 4.0.
It would therefore be a further essential improvement to expand the covered entities to the larger set of character entity references in HTML 5.0, complete with the alternative names for some of these.
Furthermore, I have just tested all 252 covered entities with the filter.
With regard to the HTML 5.0 standard, two of these are now improperly converted by TextPipe 11.4:
Entity   Character  TextPipe  Exact?
&lang;   ⟨          〈         FALSE
&rang;   ⟩          〉         FALSE

&lang; should be U+27E8 (moved to this code point in HTML 5.0; previously, in HTML 4.0, it was mapped to U+2329 (9001)).
&rang; should be U+27E9 (moved to this code point in HTML 5.0; previously, in HTML 4.0, it was mapped to U+232A (9002)).
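For anyone wanting to cross-check the two mappings independently, here is a small sketch using Python's standard library, which ships both tables: `html.entities.name2codepoint` (the HTML 4.01 names) and `html.entities.html5` (the HTML 5 names). The helper name `compare_entity` is my own, purely for illustration.

```python
from html.entities import html5, name2codepoint


def compare_entity(name: str) -> tuple[int, int]:
    """Return (HTML 4.01 code point, HTML 5 code point) for a named entity.

    Assumes the name exists in both tables and maps to a single
    character in HTML 5, which holds for lang and rang.
    """
    html4_cp = name2codepoint[name]
    html5_cp = ord(html5[name + ";"])
    return html4_cp, html5_cp


if __name__ == "__main__":
    for entity in ("lang", "rang"):
        old, new = compare_entity(entity)
        print(f"&{entity}; HTML 4: U+{old:04X} ({old})  HTML 5: U+{new:04X} ({new})")
```

The same loop could be run over every key in `name2codepoint` to list any other entities whose mapping differs between the two standards.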
Best regards,
David
Re: Unicode support beyond the Basic Multilingual Plane?
Posted: Wed Mar 04, 2020 2:15 am
by dfhtextpipe
See also http://www.datamystic.com/forums/viewtopic.php?f=17&t=2505
Re: Unicode support beyond the Basic Multilingual Plane?
Posted: Sat Mar 14, 2020 2:36 am
by dfhtextpipe
Hi Simon,
Anything to report on this critical issue and the related one?
Best regards,
David
Re: Unicode support beyond the Basic Multilingual Plane?
Posted: Wed Apr 29, 2020 9:03 am
by DataMystic Support
Hi David,
This is being prepared for v11.6.
Regards,
Simon
Re: Unicode support beyond the Basic Multilingual Plane?
Posted: Wed Apr 29, 2020 5:30 pm
by dfhtextpipe
Excellent news!
David