Normalization of a UTF-8 file?
Moderators: DataMystic Support, Moderators
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Normalization of a UTF-8 file?
Might it be feasible to be able to apply Unicode normalization filters directly to UTF-8 encoded files? If not, why not?
It seems rather slow and inefficient to have to convert a UTF-8 file to UTF-16 LE first, apply the normalize-to-NFC filter, and then convert back to UTF-8 afterwards, especially when, say, the proportion of combining characters in the file is relatively low.
David
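(For illustration, a minimal Python sketch of the round trip being described; this is not TextPipe's actual code, `normalize_nfc_via_utf16` is a hypothetical name, and the encode/decode calls stand in for the internal UTF-8/UTF-16 LE conversions:)

```python
import unicodedata

def normalize_nfc_via_utf16(utf8_bytes: bytes) -> bytes:
    # Conversion 1: UTF-8 -> UTF-16 LE; every ASCII byte doubles in size.
    utf16 = utf8_bytes.decode("utf-8").encode("utf-16-le")
    # Normalize to NFC while the text is in its UTF-16 form.
    nfc = unicodedata.normalize("NFC", utf16.decode("utf-16-le"))
    # Conversion 2: back to UTF-8, even for text that needed no change.
    return nfc.encode("utf-8")
```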
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact:
Re: Normalization of a UTF-8 file?
Hi David,
We take advantage of functions that operate on UTF-16 LE only (and we don't plan to rewrite them).
Are you finding this very slow? Or is it more a question of making it transparent to the user, i.e. having the conversion done in the background?
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Normalization of a UTF-8 file?
Normalization of a UTF-8 file of length 5,521,778 bytes took almost 4 minutes.
The number of combining characters in the input file was 195,986, which is approximately 3% of the total.
The time penalty arises from having to double the number of bytes required to represent the other 97% of the file during the conversion from UTF-8 to UTF-16 LE.
In fact the time penalty occurs twice, because those characters also have to be converted back to UTF-8, even though they didn't need normalizing in the first place.
Hence my remark in the initial posting.
David
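(To make the implied optimization concrete: combining characters are never ASCII, so pure-ASCII runs are already in NFC and could be copied through untouched; only non-ASCII runs, plus one preceding byte in case a combining mark follows an ASCII base, need the normalizer. A minimal sketch, assuming valid UTF-8 input; `normalize_nfc_sparse` is a hypothetical name, not a TextPipe function:)

```python
import unicodedata

def normalize_nfc_sparse(utf8_bytes: bytes) -> bytes:
    out = bytearray()
    i, n = 0, len(utf8_bytes)
    while i < n:
        # Skip over an ASCII run; ASCII text is already in NFC.
        j = i
        while j < n and utf8_bytes[j] < 0x80:
            j += 1
        if j == n:
            out += utf8_bytes[i:]  # trailing ASCII: copy through untouched
            break
        # Hold back one ASCII byte as a possible base character: a combining
        # mark opening the next run may compose with it (e + U+0301 -> U+00E9).
        base = max(i, j - 1)
        out += utf8_bytes[i:base]
        # Extend over the run of non-ASCII bytes; these are complete UTF-8
        # sequences, since every byte of a multi-byte sequence is >= 0x80.
        k = j
        while k < n and utf8_bytes[k] >= 0x80:
            k += 1
        segment = utf8_bytes[base:k].decode("utf-8")
        out += unicodedata.normalize("NFC", segment).encode("utf-8")
        i = k
    return bytes(out)
```

On a file where only about 3% of the characters are non-ASCII, almost all of the input takes the cheap copy path.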
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact:
Re: Normalization of a UTF-8 file?
Hi David,
Understood. Would you be able to send me a compressed sample file to benchmark against?
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Normalization of a UTF-8 file?
Hi Simon,
I could arrange that when I have a few moments to spare.
David
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact:
Re: Normalization of a UTF-8 file?
One other question David,
TextPipe only considers files with a UTF-8 BOM to be UTF-8, so ANSI files are not treated as UTF-8.
I believe the Unicode spec says that UTF-8 files do not need a BOM. Do you think that TextPipe's 'Restrict to UTF-8 files' filter should be changed to reflect this?
i.e.
1. Rename the existing filter to 'Restrict to UTF-8 BOM files'.
2. Create a new filter, 'Restrict to UTF-8 files', which allows any file that does not look like UTF-16 or UTF-32.
What do you think?
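(A sketch of what option 2's detection might look like, given a leading sample of the file's bytes; `looks_like_utf8` is a hypothetical name, and the NUL-byte test is only a heuristic for BOM-less UTF-16/UTF-32:)

```python
def looks_like_utf8(sample: bytes) -> bool:
    # Test UTF-32 BOMs before UTF-16, since the UTF-32 LE BOM
    # (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
    if sample.startswith((b"\xff\xfe\x00\x00", b"\x00\x00\xfe\xff")):
        return False  # UTF-32 LE / BE
    if sample.startswith((b"\xff\xfe", b"\xfe\xff")):
        return False  # UTF-16 LE / BE
    # BOM-less UTF-16/UTF-32 text is full of NUL bytes; UTF-8 and ANSI
    # text almost never contains them.
    if b"\x00" in sample:
        return False
    # Optional, and stricter than "anything not UTF-16/32": require that
    # the bytes actually decode as UTF-8.  (A truncated sample may end
    # mid-sequence, so this check is approximate.)
    try:
        sample.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```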
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Normalization of a UTF-8 file?
Simon,
I can only report my own experience and common practice.
Many of the files that I handle are encoded as UTF-8 without BOM.
And if the input files are not thus encoded, most of the output files from my filters are.
UTF-8 files with a BOM are seen less often, and you already have a filter to Remove BOM.
David
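(For what it's worth, the UTF-8 BOM is just the three bytes EF BB BF, so a Remove BOM style operation is tiny; a sketch, not TextPipe's implementation:)

```python
UTF8_BOM = b"\xef\xbb\xbf"

def strip_utf8_bom(data: bytes) -> bytes:
    # Drop a leading UTF-8 BOM if present; otherwise return the data unchanged.
    return data[len(UTF8_BOM):] if data.startswith(UTF8_BOM) else data
```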