Subtle bug in Convert Numeric HTML/XML Entities to text
Moderators: DataMystic Support, Moderators
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Subtle bug in Convert Numeric HTML/XML Entities to text
Hi Simon,
I have just encountered some very rare errors in the output of the filter Convert Numeric HTML/XML Entities to text.
My input file contains 31,286 NCRs that are to be converted.
These are scattered throughout a UTF-8 text file that has 36,888 lines and a total of 7,219,971 characters.
The vast majority of the NCRs are converted correctly.
However, there were 5 locations where the conversion was incorrect.
I can only conclude that there's a subtle software bug in TextPipe Standard 10.7.2.
The problem is in understanding the root cause, because the same NCRs are converted correctly in other locations in the file.
The errors were detected by comparing the output file with one obtained by using BabelPad version 1.0.0.4 to Convert NCRs to Unicode.
This issue is critical, seeing as the TextPipe conversion is producing these rare but wrong results.
The attached .diff file was generated using WinMerge. Not ideal, but it does provide the context.
NB. I can readily send you a copy of the input file by email so that you might investigate further in detail.
Best regards,
David
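For anyone wanting to repeat this kind of cross-check without a diff tool, here is a minimal Python sketch. It is an illustration only, not the method used above: the names `count_ncrs` and `differing_lines` are hypothetical, and it assumes both converted outputs have already been read into strings as UTF-8.

```python
import re

# Matches decimal (&#233;) and hexadecimal (&#xE9;) numeric character references.
NCR_PATTERN = re.compile(r"&#(?:[0-9]+|[xX][0-9A-Fa-f]+);")


def count_ncrs(text):
    """Count the numeric character references present in text."""
    return len(NCR_PATTERN.findall(text))


def differing_lines(text_a, text_b):
    """Return (line_number, line_a, line_b) for each 1-based line that differs.

    Only lines present in both texts are compared; if the line counts
    differ, the extra trailing lines of the longer text are ignored.
    """
    pairs = zip(text_a.splitlines(), text_b.splitlines())
    return [(n, a, b) for n, (a, b) in enumerate(pairs, start=1) if a != b]
```

Running `count_ncrs` on the input confirms how many NCRs there are to convert, and `differing_lines` on the two outputs lists the locations where the converters disagree.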
- Attachments
-
- Reina1569.Bible.UTF-8.diff.zip
- Zip contains .diff generated by WinMerge 0.86
- (950 Bytes) Downloaded 1990 times
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Subtle bug in Convert Numeric HTML/XML Entities to text
Since posting the issue, I have found a workaround.
If the filter Convert Numeric HTML/XML Entities to text is placed under a Restrict to each line in turn filter, the output file is without error.
I thereby conclude that the bug must have some relation to data caching or memory management.
David
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Subtle bug in Convert Numeric HTML/XML Entities to text
A workaround is one thing, but were you able to confirm whether or not my conjecture about data caching was valid?
If so, is there anything that can be done to prevent the bad output that I observed without such a workaround?
David
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact:
Re: Subtle bug in Convert Numeric HTML/XML Entities to text
Hi David - can you please send the original file and the TextPipe filter?
The filter is a simple byte-by-byte state machine, expecting
&#<decimal digits>; or
&#x<hex digits>; or
It handles 2-byte unicode characters properly.
I don't understand why *any* changes are being made to your file as the non-&# data should pass straight through.
Simon
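To make the description above concrete, here is a minimal Python sketch of a state machine that recognises `&#<decimal digits>;` and `&#x<hex digits>;` and passes everything else straight through. It is not TextPipe's code: it works on decoded characters rather than bytes, and the names `convert_ncrs` and `_decode`, the state labels, and the choice to leave invalid or unterminated references unchanged are all assumptions made for illustration.

```python
DEC_DIGITS = "0123456789"
HEX_DIGITS = "0123456789abcdefABCDEF"


def _decode(pending):
    """Turn a buffered '&#233' or '&#xE9' (no ';') into a character."""
    body = pending[2:]
    base = 16 if body[0] in "xX" else 10
    digits = body[1:] if base == 16 else body
    significant = digits.lstrip("0")
    # Too many digits, or zero: not a valid code point, so leave unchanged.
    if not significant or len(significant) > 7:
        return pending + ";"
    code = int(significant, base)
    if code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
        return pending + ";"
    return chr(code)


def convert_ncrs(text):
    out = []
    pending = ""  # characters of a possible NCR seen so far
    state = "TEXT"
    for ch in text:
        if state == "TEXT":
            if ch == "&":
                pending, state = "&", "AMP"
            else:
                out.append(ch)
        elif state == "AMP" and ch == "#":
            pending, state = pending + ch, "HASH"
        elif state == "HASH" and ch in "xX":
            pending, state = pending + ch, "HEX_START"
        elif state in ("HASH", "DEC") and ch in DEC_DIGITS:
            pending, state = pending + ch, "DEC"
        elif state in ("HEX_START", "HEX") and ch in HEX_DIGITS:
            pending, state = pending + ch, "HEX"
        elif state in ("DEC", "HEX") and ch == ";":
            out.append(_decode(pending))
            pending, state = "", "TEXT"
        else:
            # Not an NCR after all: emit the buffer unchanged, then
            # treat the current character as ordinary input.
            out.append(pending)
            if ch == "&":
                pending, state = "&", "AMP"
            else:
                out.append(ch)
                pending, state = "", "TEXT"
    out.append(pending)  # flush an unterminated candidate at end of input
    return "".join(out)
```

Because only the buffered `&#...` candidate is ever rewritten, any text outside a well-formed reference comes out exactly as it went in, which is the pass-through behaviour described above.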
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact:
Re: Subtle bug in Convert Numeric HTML/XML Entities to text
Hi David,
This definitely is an issue with TextPipe 10.7.2.
But testing with TextPipe 10.8, the issue is no longer there.
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Subtle bug in Convert Numeric HTML/XML Entities to text
That's good news!
I await TextPipe 10.8 with bated breath.
David
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Subtle bug in Convert Numeric HTML/XML Entities to text
Related question, but the other way round...
Do any of the new libraries used in TextPipe 11.x support the conversion of Unicode characters to numerical entities?
This would obviate needing to resort to a map.
cf. The Help for the filter Convert Numeric HTML/XML Entities to text includes:
"To convert plain text to HTML entities (ie convert in the opposite direction), use a Map."
Regards,
David
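For reference, the opposite direction is straightforward outside TextPipe. The following is a minimal Python sketch, not a TextPipe feature: the function name `to_numeric_entities` and its `use_hex` parameter are hypothetical, and it assumes every character above ASCII (code point 128 and up) should become a numeric entity.

```python
def to_numeric_entities(text, use_hex=False):
    """Replace every non-ASCII character with a numeric character reference."""
    parts = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            parts.append(ch)  # ASCII passes through unchanged
        elif use_hex:
            parts.append(f"&#x{code:X};")
        else:
            parts.append(f"&#{code};")
    return "".join(parts)
```

For the decimal form, Python's standard `xmlcharrefreplace` error handler gives the same result: `text.encode("ascii", "xmlcharrefreplace").decode("ascii")`.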