Speed of Unicode Normalization

Posted: Thu Dec 30, 2010 8:16 pm
by dfhtextpipe
Is there anything you can do to improve the speed of Unicode Normalization?

I just added a Normalize to NFC filter to process a set of 27 Arabic UTF-8 text files,
and I'm seeing the predicted end time as over 60 minutes, while it was still processing the first file!

It's much much slower than the same function in BabelPad! See
http://www.babelstone.co.uk/Software/BabelPad.html

Maybe there's something you can learn from BabelStone?

PS. When the predicted end time reached 90 minutes, I hit the cancel button.

David
TextPipe Standard user

Re: Speed of Unicode Normalization

Posted: Thu Dec 30, 2010 9:13 pm
by dfhtextpipe
OK - I did something wrong. I should have read the help page. This reads,
Found under Filters\Unicode (Standard and Pro)

Applies a Unicode NFC - Canonical Decomposition, followed by Canonical Composition transformation to incoming Unicode text (UTF16-LE).

Output is also Unicode UTF16-LE.
I should have used

Comment...
|  Normalize Unicode to NFC
|
|--Convert from UTF-8 to UTF-16LE
|   
|--NFC - Canonical Decomposition, followed by Canonical Composition
|   
+--Convert from UTF-16LE to UTF-8
    
This works OK and is speedy.
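
For comparison, here is a minimal Python sketch of the same three-step chain outside TextPipe. It is not TextPipe's implementation: Python's unicodedata.normalize works on decoded str values rather than on UTF-16LE bytes, and the function name normalize_utf8_to_nfc is made up for illustration.

```python
import unicodedata


def normalize_utf8_to_nfc(data: bytes) -> bytes:
    """Decode UTF-8 bytes, apply NFC, and re-encode as UTF-8.

    Hypothetical helper: it mirrors the decode / normalize / encode
    order of the filter list, not TextPipe's internals.
    """
    text = data.decode("utf-8")                   # step 1: bytes -> text
    composed = unicodedata.normalize("NFC", text)  # step 2: NFC
    return composed.encode("utf-8")               # step 3: text -> bytes


# ALEF (U+0627) followed by HAMZA ABOVE (U+0654) composes to U+0623 under NFC.
sample = "\u0627\u0654".encode("utf-8")
result = normalize_utf8_to_nfc(sample)
```

The point is the ordering: normalization is applied to decoded text, never to the raw UTF-8 bytes.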

Therefore the real gripe is that TextPipe attempted to apply NFC to UTF-8 (which it cannot do) without reporting that it's impossible.

I would therefore suggest that the Normalization filters should detect the encoding of the input stream,
and report an error message for anything other than UTF-16LE.
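
A rough Python sketch of what such a guard might look like is below. It is only a heuristic under stated assumptions, since UTF-8 and UTF-16LE cannot always be told apart from the bytes alone; the names looks_like_utf8 and require_utf16le are hypothetical and nothing here reflects how TextPipe works internally.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: a UTF-8 BOM, or valid UTF-8 containing a non-ASCII byte."""
    if data.startswith(UTF8_BOM):
        return True
    try:
        data.decode("utf-8")  # strict decode; fails on invalid sequences
    except UnicodeDecodeError:
        return False
    # Pure-ASCII bytes are ambiguous, so only multi-byte sequences count.
    return any(b >= 0x80 for b in data)


def require_utf16le(data: bytes) -> None:
    """Raise ValueError if the input is unlikely to be UTF-16LE (assumed policy)."""
    if len(data) % 2 != 0:
        raise ValueError("input has an odd byte count, so it cannot be UTF-16LE")
    if looks_like_utf8(data):
        raise ValueError("input looks like UTF-8; convert it to UTF-16LE before normalizing")
```

A guard like this would have stopped the run on the first file with a message instead of a 90-minute estimate, though it can still be fooled by pure-ASCII input.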

David
now wearing my beta-tester hat

Re: Speed of Unicode Normalization

Posted: Mon Jan 03, 2011 11:48 am
by DataMystic Support
Hi David,

One of TextPipe's strengths is that it makes no assumptions about the data it is processing - it just does what it is told.

This is, at the same time, one of its weaknesses. I envisage a situation where TextPipe can detect input file types (where possible), and then allow that information to flow through the filter list, so that these kinds of issues can be detected.
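
As a purely illustrative sketch of that idea (not a description of any existing or planned TextPipe feature), the Python below lets a detected encoding flow through a list of filters that each declare what they expect and what they produce. The Filter class, its fields, and check_filter_list are all hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Filter:
    name: str
    expects: Optional[str]   # None means "accepts any encoding"
    produces: Optional[str]  # None means "passes the input encoding through"


def check_filter_list(detected: str, filters: List[Filter]) -> List[str]:
    """Walk the filter list and report every encoding mismatch found."""
    problems = []
    current = detected
    for f in filters:
        if f.expects is not None and f.expects != current:
            problems.append(f"{f.name}: expects {f.expects}, but receives {current}")
        if f.produces is not None:
            current = f.produces
    return problems


nfc = Filter("NFC", "utf-16-le", "utf-16-le")
to_utf16 = Filter("UTF-8 to UTF-16LE", "utf-8", "utf-16-le")
to_utf8 = Filter("UTF-16LE to UTF-8", "utf-16-le", "utf-8")

bad = check_filter_list("utf-8", [nfc])                      # one problem reported
good = check_filter_list("utf-8", [to_utf16, nfc, to_utf8])  # no problems
```

With this kind of check, the original filter list would be flagged before any file is processed, while the corrected three-step list passes.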