Normalization of a UTF-8 file?

Get help with installation and running here.

Moderators: DataMystic Support, Moderators

dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Normalization of a UTF-8 file?

Post by dfhtextpipe »

Might it be feasible to be able to apply Unicode normalization filters directly to UTF-8 encoded files? If not, why not?

It seems rather slow and inefficient to have to first convert a UTF-8 file to UTF-16 LE before using a normalization to NFC filter, and then back again to UTF-8 afterwards.

Especially when [say] the proportion of combining characters within the file is relatively low.
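To make the request concrete, here is a minimal Python sketch of what normalizing UTF-8 input to NFC amounts to outside TextPipe. It assumes Python 3.8+ and the standard `unicodedata` module; the function name `normalize_utf8_to_nfc` is made up for this illustration and says nothing about how TextPipe's filters work internally.

```python
import unicodedata

def normalize_utf8_to_nfc(data: bytes) -> bytes:
    """Return the UTF-8 input re-encoded with its text normalized to NFC."""
    text = data.decode("utf-8")
    # Quick check first: if the text is already NFC, hand back the input as-is.
    if unicodedata.is_normalized("NFC", text):
        return data
    return unicodedata.normalize("NFC", text).encode("utf-8")

# 'e' followed by U+0301 COMBINING ACUTE ACCENT composes to U+00E9.
decomposed = "cafe\u0301".encode("utf-8")
composed = normalize_utf8_to_nfc(decomposed)
```

Python still decodes to an internal string form here, so this only shows the round trip hidden behind one call, not a normalizer that works on the UTF-8 bytes themselves.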

David
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Normalization of a UTF-8 file?

Post by DataMystic Support »

Hi David,

We rely on functions that operate on UTF-16 LE only (and we don't plan to rewrite them).

Are you finding this very slow? Or is it more a question of making it transparent to the user, i.e. having this conversion done in the background?
dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Normalization of a UTF-8 file?

Post by dfhtextpipe »

Normalization of a UTF-8 file of length 5,521,778 bytes took almost 4 minutes.

The number of combining characters in the input file was 195,986, which is approximately 3% of the total.

The time penalty arises from having to double the number of bytes required to represent the other 97% of the file
during the conversion of UTF-8 to UTF-16 LE.

In fact the time penalty occurs twice, because these characters also have to be converted back again to UTF-8,
even though they didn't need normalizing in the first place.
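The idea of leaving untouched text alone can be sketched in Python as below. This is only an illustration under stated assumptions: the function name `nfc_skipping_ascii_lines` is hypothetical, it works line by line, and it relies on the fact that ASCII-only text is already in NFC. It would only save work on files where many lines are pure ASCII, and it is not a description of anything TextPipe does.

```python
import unicodedata

def nfc_skipping_ascii_lines(data: bytes) -> bytes:
    """Normalize UTF-8 bytes to NFC, passing ASCII-only lines through unchanged."""
    out = []
    for line in data.splitlines(keepends=True):
        if line.isascii():
            # ASCII-only text is already in NFC, so no decoding is needed.
            out.append(line)
        else:
            text = line.decode("utf-8")
            out.append(unicodedata.normalize("NFC", text).encode("utf-8"))
    return b"".join(out)

# First line is pure ASCII; second line has 'e' + U+0301 COMBINING ACUTE ACCENT.
sample = "plain ascii line\ncafe\u0301\n".encode("utf-8")
result = nfc_skipping_ascii_lines(sample)
```

Splitting at line ends is safe for NFC because a line break never combines with a following mark.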

Hence my remark in the initial posting.

David
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Normalization of a UTF-8 file?

Post by DataMystic Support »

Hi David,

Understood. Would you be able to send me a compressed sample file to benchmark against?
dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Normalization of a UTF-8 file?

Post by dfhtextpipe »

Hi Simon,

I could arrange that when I have a few moments to spare.

David
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Normalization of a UTF-8 file?

Post by DataMystic Support »

One other question David,

TextPipe only considers files with a UTF-8 BOM to be UTF-8, so ANSI files are not treated as UTF-8.

I believe that the Unicode spec says that UTF-8 files do not need a BOM. Do you think that TextPipe's Restrict to UTF-8 files should be changed to reflect this?

i.e.
1. Rename the existing filter to Restrict to UTF-8 BOM files
2. Create a new filter for Restrict to UTF-8 files, which allows any files that do not look like UTF-16 or UTF-32.
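As a rough illustration of the two checks, here is a Python sketch. The function names `has_utf8_bom` and `looks_like_utf8` are hypothetical, and the strict decoding test is an extra assumption that goes beyond "does not look like UTF-16 or UTF-32"; none of this is TextPipe's actual logic.

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    """True if the data starts with the UTF-8 byte order mark (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)

def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: no UTF-16/UTF-32 BOM, and the bytes decode as strict UTF-8."""
    other_boms = (
        codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE,
        codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE,
    )
    if data.startswith(other_boms):
        return False
    try:
        # Assumption added for this sketch: reject byte sequences that are not valid UTF-8.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Plain ASCII passes this test, since ASCII is valid UTF-8, but ANSI files containing bytes above 0x7F usually fail the decoding step. UTF-16 or UTF-32 data without a BOM is not detected by this sketch.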

What do you think?
dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Normalization of a UTF-8 file?

Post by dfhtextpipe »

Simon,

I can only report my own experience and common practice.

Many of the files that I handle are encoded as UTF-8 without BOM.
And if the input files are not thus encoded, most of the output files from my filters are.

UTF-8 files with a BOM are seen less often, and you already have a filter to Remove BOM.

David