Speed of Unicode Normalization

Get help with installation and running here.

Moderators: DataMystic Support, Moderators

Post Reply
dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Speed of Unicode Normalization

Post by dfhtextpipe »

Is there anything you can do to improve the speed of Unicode Normalization?

I just added a Normalize to NFC filter to process a set of 27 Arabic UTF-8 text files,
and the predicted end time was over 60 minutes while it was still processing the first file!

It's much, much slower than the same function in BabelPad! See
http://www.babelstone.co.uk/Software/BabelPad.html

Maybe there's something you can learn from BabelStone ?

PS. When the predicted end time reached 90 minutes, I hit the cancel button.

David
TextPipe Standard user
dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Speed of Unicode Normalization

Post by dfhtextpipe »

OK - I did something wrong. I should have read the help page. It reads:
Found under Filters\Unicode (Standard and Pro)

Applies a Unicode NFC - Canonical Decomposition, followed by Canonical Composition transformation to incoming Unicode text (UTF16-LE).

Output is also Unicode UTF16-LE.
I should have used

Code:

Comment...
|  Normalize Unicode to NFC
|
|--Convert from UTF-8 to UTF-16LE
|   
|--NFC - Canonical Decomposition, followed by Canonical Composition
|   
+--Convert from UTF-16LE to UTF-8
    
This works OK and is speedy.
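For anyone following along, the same decode / normalize / re-encode pipeline can be sketched in a few lines of Python. This is only an illustration of what the three filters above do, not TextPipe's implementation; the function name is my own.

```python
import unicodedata

def normalize_utf8_to_nfc(data: bytes) -> bytes:
    """Decode UTF-8, apply NFC (canonical decomposition followed
    by canonical composition), then re-encode as UTF-8."""
    text = data.decode("utf-8")              # Convert from UTF-8
    nfc = unicodedata.normalize("NFC", text) # NFC normalization
    return nfc.encode("utf-8")               # Convert back to UTF-8

# Example: 'e' (U+0065) + combining acute (U+0301) composes to 'é' (U+00E9)
print(normalize_utf8_to_nfc("e\u0301".encode("utf-8")) == "é".encode("utf-8"))
```

The key point is the same as in the filter list: normalization operates on decoded text, so the conversion steps on either side are what make it valid on UTF-8 files.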

Therefore the real gripe is that TextPipe attempted to apply NFC to UTF-8 (which it cannot do) without reporting that this is unsupported.

I would therefore suggest that the Normalization filters should detect the input encoding,
and report an error for anything other than UTF-16 (LE) as the input stream.
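To illustrate the kind of check I mean, here is a minimal BOM-based sketch (the function name is hypothetical, and a real filter would need better heuristics, since UTF-16LE files often have no BOM):

```python
import codecs

def looks_like_utf16le(data: bytes) -> bool:
    """Heuristic check for a UTF-16LE byte-order mark (FF FE).
    Absence of a BOM does not prove the data is not UTF-16LE."""
    return data.startswith(codecs.BOM_UTF16_LE)

print(looks_like_utf16le(b"\xff\xfeH\x00i\x00"))  # BOM present
print(looks_like_utf16le("Hi".encode("utf-8")))   # plain UTF-8, no BOM
```

Even a simple check like this would have been enough to raise an error instead of silently grinding through the wrong encoding.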

David
now wearing my beta-tester hat
User avatar
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia
Contact:

Re: Speed of Unicode Normalization

Post by DataMystic Support »

Hi David,

One of TextPipe's strengths is that it makes no assumptions about the data it is processing - it just does what it is told.

This is, at the same time, one of its weaknesses. I envisage a situation where TextPipe can detect input file types (where possible), and then allow that information to flow through the filter list - to detect these kinds of issues.
Post Reply