Normalization of a UTF-8 file?

Posted: Wed Mar 13, 2013 10:23 pm
by dfhtextpipe
Would it be feasible to apply Unicode normalization filters directly to UTF-8 encoded files? If not, why not?

It seems rather slow and inefficient to have to convert a UTF-8 file to UTF-16 LE before using an NFC normalization filter, and then convert it back to UTF-8 afterwards.

Especially when [say] the proportion of combining characters within the file is relatively low.

David
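
For illustration only, and not a description of how TextPipe works internally: a minimal Python sketch of the idea in this post, normalizing UTF-8 data to NFC while leaving input untouched when it is already normalized. The function name `nfc_utf8` is hypothetical, and `unicodedata.is_normalized` assumes Python 3.8 or later. Python still decodes the bytes to its own string form, so this shows the "skip what doesn't need work" shortcut rather than a true byte-level normalizer.

```python
import unicodedata


def nfc_utf8(data: bytes) -> bytes:
    """Return UTF-8 bytes normalized to NFC (hypothetical helper)."""
    # Pure ASCII is always already in NFC, so it can be returned without decoding.
    if data.isascii():
        return data
    text = data.decode("utf-8")
    # Quick check: if the text is already NFC, hand back the original bytes.
    if unicodedata.is_normalized("NFC", text):
        return data
    return unicodedata.normalize("NFC", text).encode("utf-8")
```

For example, `nfc_utf8("e\u0301".encode("utf-8"))` composes the letter and its combining acute accent into the single code point U+00E9.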

Re: Normalization of a UTF-8 file?

Posted: Thu Mar 14, 2013 6:21 am
by DataMystic Support
Hi David,

We rely on functions that operate on UTF-16 LE only (and we don't plan to rewrite them).

Are you finding this very slow? Or is it more a question of making it transparent to the user, i.e. having this conversion done in the background?

Re: Normalization of a UTF-8 file?

Posted: Fri Mar 15, 2013 7:26 pm
by dfhtextpipe
Normalization of a UTF-8 file of length 5,521,778 bytes took almost 4 minutes.

The number of combining characters in the input file was 195,986 - which is approximately 3% of the total.

The time penalty arises from having to double the number of bytes required to represent the other 97% of the file during the conversion of UTF-8 to UTF-16 LE.

In fact the time penalty occurs twice, because these characters also have to be converted back again to UTF-8, even though they didn't need normalizing in the first place.

Hence my remark in the initial posting.

David
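
The size argument above can be checked with a small Python sketch. It assumes the bulk of the text is ASCII, which takes 1 byte per character in UTF-8 and 2 bytes in UTF-16 LE; the helper name `encoded_sizes` is hypothetical. It only compares byte counts and says nothing about where the time in the reported run was actually spent.

```python
def encoded_sizes(text: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 LE byte count) for the same text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# ASCII text doubles in size when converted to UTF-16 LE.
ascii_sizes = encoded_sizes("hello")

# A base letter plus U+0301 (combining acute accent) grows less:
# U+0301 already needs 2 bytes in UTF-8.
combining_sizes = encoded_sizes("e\u0301")
```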

Re: Normalization of a UTF-8 file?

Posted: Mon Mar 18, 2013 5:20 pm
by DataMystic Support
Hi David,

Understood. Would you be able to send me a compressed sample file to benchmark against?

Re: Normalization of a UTF-8 file?

Posted: Thu Mar 21, 2013 8:23 pm
by dfhtextpipe
Hi Simon,

I could arrange that when I have a few moments to spare.

David

Re: Normalization of a UTF-8 file?

Posted: Mon Mar 25, 2013 2:48 pm
by DataMystic Support
One other question David,

TextPipe only considers files with UTF-8 BOMs to be UTF-8, so ANSI files are not considered UTF-8.

I believe that the Unicode spec says that UTF-8 files do not need a BOM. Do you think that TextPipe's Restrict to UTF-8 files should be changed to reflect this?

i.e.
1. Rename the existing filter to Restrict to UTF-8 BOM files
2. Create a new filter for Restrict to UTF-8 files, which allows any files that do not look like UTF-16 or UTF-32.

What do you think?
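
As a rough sketch of what a BOM-independent check might look like (this is not TextPipe's filter, and the function name `looks_like_utf8` is hypothetical): reject data that starts with a UTF-16 or UTF-32 BOM, then accept it only if it decodes as strict UTF-8. This is somewhat stricter than "does not look like UTF-16 or UTF-32", since it also rejects ANSI files containing bytes that are not valid UTF-8.

```python
import codecs


def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: True if data could be UTF-8, with or without a BOM."""
    wide_boms = (
        codecs.BOM_UTF32_LE,
        codecs.BOM_UTF32_BE,
        codecs.BOM_UTF16_LE,
        codecs.BOM_UTF16_BE,
    )
    # A UTF-16 or UTF-32 BOM at the start rules out UTF-8.
    if data.startswith(wide_boms):
        return False
    # Strict decoding raises on byte sequences that are not valid UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Two limitations of this sketch: pure ASCII files pass, because ASCII is a subset of UTF-8, and BOM-less UTF-16 containing only ASCII-range characters also passes, because NUL bytes are valid UTF-8. A real filter would probably want an extra check for the latter.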

Re: Normalization of a UTF-8 file?

Posted: Tue Mar 26, 2013 4:03 am
by dfhtextpipe
Simon,

I can only report my own experience and common practice.

Many of the files that I handle are encoded as UTF-8 without BOM.
And if the input files are not thus encoded, most of the output files from my filters are.

UTF-8 files with a BOM are seen less often, and you already have a filter to Remove BOM.

David
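
For readers outside TextPipe, removing a UTF-8 BOM amounts to dropping three leading bytes (EF BB BF). A minimal Python sketch, with the hypothetical helper name `strip_utf8_bom`; it is not the implementation of the Remove BOM filter mentioned above.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```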