DataMystic

Posted: **Mon Dec 10, 2018 8:40 pm**

It rather looks as though TextPipe does not support Unicode characters beyond the Basic Multilingual Plane.

cf. Unicode 11, whose code space extends up to Plane 16 (100000..10FFFF, Supplementary Private Use Area-B).
See https://www.unicode.org/versions/Unicode11.0.0/

I've just been testing the filter Convert Numeric HTML/XML Entities to text using the trial run area.

Code points beyond the BMP are converted incorrectly, e.g.

Code: Select all

&#x112B0;

becomes

Code: Select all

ኰ

which is U+12B0 ETHIOPIC SYLLABLE KWA
The proper conversion should be U+112B0 KHUDAWADI LETTER A

Thus any file containing numeric character references (NCRs) with more than 4 hex digits would be converted incorrectly.
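The observed result is consistent with the value being cut down to its low 16 bits, although that is only a guess about what happens inside TextPipe and not a confirmed cause. A minimal Python sketch of the arithmetic, using only the standard library, compares a correct decoding with that hypothetical truncation and shows the UTF-16 surrogate pair that a supplementary-plane code point needs:

```python
import html
import unicodedata

ncr = "&#x112B0;"
code_point = int(ncr[3:-1], 16)  # hex digits between "&#x" and ";"

# Correct decoding keeps the full value (here 17 significant bits).
correct = html.unescape(ncr)

# Hypothetical faulty decoding (an assumption, not TextPipe's known logic):
# keep only the low 16 bits of the value.
truncated = chr(code_point & 0xFFFF)

# In UTF-16, a code point above U+FFFF is stored as two surrogate code units.
utf16 = correct.encode("utf-16-be")
surrogates = [utf16[i:i + 2].hex().upper() for i in range(0, len(utf16), 2)]

print(f"correct:   U+{ord(correct):04X} {unicodedata.name(correct)}")
print(f"truncated: U+{ord(truncated):04X} {unicodedata.name(truncated)}")
print("UTF-16 code units:", surrogates)
```

The truncated value is U+12B0, which matches the character shown in the output above.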

When will TextPipe become more fully compliant with the latest Unicode standard?

Best regards,

David

Posted: **Mon May 06, 2019 10:39 am**

We are currently looking into what is required here.

Posted: **Mon Mar 02, 2020 10:09 pm**

What's New in TextPipe v11 – 12 December, 2019
==============================================
...

Upgraded Unicode support to Unicode 12.1.

...

Thanks!

David

Posted: **Mon Mar 02, 2020 10:14 pm**

Bug alert!

Convert Numeric HTML/XML Entities to text converted

Code: Select all

&#x112B0; &#xAA00;

to

Code: Select all

; ;

NB. I have also just tried the same entity data in a UTF-8 input file as well as in the Trial Run area.
The output file contained only the semicolons, just like the Trial Run area.
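For comparison, here is a rough sketch of the output one would expect for this test string. The `decode_ncrs` helper and its regular expression are my own illustration of a reference decoder, not anything taken from TextPipe:

```python
import re

# Matches hexadecimal (&#x...;) and decimal (&#...;) numeric character references.
NCR_PATTERN = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")

def decode_ncrs(text: str) -> str:
    """Replace every valid NCR in text with the character it names."""
    def replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        value = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Leave surrogates and values above U+10FFFF untouched,
        # since they are not valid Unicode scalar values.
        if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
            return match.group(0)
        return chr(value)
    return NCR_PATTERN.sub(replace, text)

sample = "&#x112B0; &#xAA00;"
decoded = decode_ncrs(sample)
print([f"U+{ord(ch):04X}" for ch in decoded])
```

The expected result is three characters: U+112B0, a space, and U+AA00. Note that U+AA00 lies inside the BMP, so the v11 output loses even a 4-digit reference.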

This has now become a very serious software bug!

Regards,

David

Posted: **Tue Mar 03, 2020 12:03 am**

I have also retested the similar filter called Convert HTML/XML entities to text.

Please refer to https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references

It's apparent from the Help page entitled Convert HTML/XML entities to text that this filter only supports HTML 4.0.

It would therefore be a further essential improvement to expand the covered entities to the larger set of character entity references in HTML 5.0, complete with the alternative names for some of these.

Furthermore, I have just tested all 252 covered entities with the filter.
Measured against the HTML 5.0 standard, two of these are now converted incorrectly by TextPipe 11.4:

Code: Select all

Entity	Character	TextPipe	Exact?
&lang;	⟨	〈	FALSE
&rang;	⟩	〉	FALSE

Code: Select all

&lang; should be U+27E8 (moved to its current code point in HTML 5.0; previously, in HTML 4.0, it was mapped to U+2329 (9001)).
&rang; should be U+27E9 (moved to its current code point in HTML 5.0; previously, in HTML 4.0, it was mapped to U+232A (9002)).
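The two mappings can be cross-checked against Python's standard library, which ships both tables: `html.entities.name2codepoint` holds the HTML 4 definitions and `html.entities.html5` holds the HTML 5 ones. This is only a quick sanity check of the code points, not a statement about how TextPipe stores its entity table:

```python
import html
from html.entities import html5, name2codepoint

for name in ("lang", "rang"):
    html4_cp = name2codepoint[name]      # HTML 4 definition (code point as int)
    html5_cp = ord(html5[name + ";"])    # HTML 5 definition (key includes ";")
    print(
        f"&{name}; HTML 4: U+{html4_cp:04X} ({html4_cp})  "
        f"HTML 5: U+{html5_cp:04X} ({html5_cp})"
    )

# html.unescape follows the HTML 5 table.
print([f"U+{ord(ch):04X}" for ch in html.unescape("&lang;&rang;")])
```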

Best regards,

David

Posted: **Wed Mar 04, 2020 2:15 am**

See also http://www.datamystic.com/forums/viewtopic.php?f=17&t=2505

Posted: **Sat Mar 14, 2020 2:36 am**

Hi Simon,

Anything to report on this critical issue and the related one?

Best regards,

David

Posted: **Wed Apr 29, 2020 9:03 am**

Hi David,

This is being prepared for v11.6.

Regards,

Simon

Posted: **Wed Apr 29, 2020 5:30 pm**

Excellent news!

David

Unicode support beyond the Basic Multilingual Plane?
