Speed of Unicode Normalization

Is there anything you can do to improve the speed of Unicode normalization?
I just added a Normalize to NFC filter to process a set of 27 Arabic UTF-8 text files,
and the predicted end time was already over 60 minutes while it was still processing the first file!
It's much, much slower than the same function in BabelPad! See
http://www.babelstone.co.uk/Software/BabelPad.html
Maybe there's something you can learn from BabelStone?
PS. When the predicted end time reached 90 minutes, I hit the cancel button.
David
TextPipe Standard user
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Speed of Unicode Normalization
OK - I did something wrong. I should have read the help page, which reads:
Found under Filters\Unicode (Standard and Pro).
Applies a Unicode NFC - Canonical Decomposition, followed by Canonical Composition - transformation to incoming Unicode text (UTF16-LE).
Output is also Unicode UTF16-LE.

I should have used:
Code:
Comment...
| Normalize Unicode to NFC
|
|--Convert from UTF-8 to UTF-16LE
|
|--NFC - Canonical Decomposition, followed by Canonical Composition
|
+--Convert from UTF-16LE to UTF-8
This works OK and is speedy.

Therefore the real gripe is that TextPipe attempted to apply NFC to UTF-8 (which it cannot do) without reporting that this is impossible.
I would therefore suggest that the Normalization filters detect the input encoding,
and report an error for anything other than UTF-16 (LE) as the input stream.
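For reference, the equivalent of that corrected chain - decode UTF-8, apply NFC, re-encode as UTF-8 - is only a few lines in, say, Python. This is just a sketch of the same idea, not TextPipe's implementation, and the file names are made up:

Code:
# Sketch only: UTF-8 in -> NFC -> UTF-8 out. File names are hypothetical.
import unicodedata

with open("arabic_input.txt", encoding="utf-8") as src:
    text = src.read()

# NFC = Canonical Decomposition followed by Canonical Composition,
# the same transformation the TextPipe filter applies to UTF-16LE text.
normalized = unicodedata.normalize("NFC", text)

with open("arabic_output.txt", "w", encoding="utf-8") as dst:
    dst.write(normalized)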
David
now wearing my beta-tester hat
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
Re: Speed of Unicode Normalization
Hi David,
One of TextPipe's strengths is that it makes no assumptions about the data it is processing - it just does what it is told.
This is, at the same time, one of its weaknesses. I envisage a situation where TextPipe could detect input file types (where possible) and then allow that information to flow through the filter list, to catch these kinds of issues.
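Conceptually, that detection could start as something as simple as sniffing the byte-order mark before the filter list runs. Here is a Python sketch of the idea, under the assumption that a BOM is actually present (BOM-less files would still need a heuristic or a user setting) - it is not how TextPipe is implemented:

Code:
# Sketch of BOM-based encoding detection; assumes a BOM is present.
def sniff_encoding(path):
    with open(path, "rb") as f:
        head = f.read(4)
    if head.startswith(b"\xff\xfe\x00\x00"):
        return "utf-32-le"
    if head.startswith(b"\xff\xfe"):
        return "utf-16-le"
    if head.startswith(b"\xfe\xff"):
        return "utf-16-be"
    if head.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    return "unknown"

# A normalization filter could then refuse unsuitable input up front
# ("input.txt" is a hypothetical file name):
if sniff_encoding("input.txt") != "utf-16-le":
    raise ValueError("NFC filter expects UTF-16LE input; add a conversion filter first.")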