Normalization of a UTF-8 file?
Moderators: DataMystic Support, Moderators
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Normalization of a UTF-8 file?
Might it be feasible to be able to apply Unicode normalization filters directly to UTF-8 encoded files? If not, why not?
It seems rather slow and inefficient to have to convert a UTF-8 file to UTF-16 LE first, apply the normalize-to-NFC filter, and then convert back to UTF-8 afterwards, especially when, say, the proportion of combining characters in the file is relatively low.
David
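(For illustration, a minimal Python sketch of the round trip being described; this is not TextPipe's actual code, `normalize_nfc_via_utf16` is a hypothetical name, and the encode/decode calls stand in for the internal UTF-8/UTF-16 LE conversions:)

```python
import unicodedata

def normalize_nfc_via_utf16(utf8_bytes: bytes) -> bytes:
    # Conversion 1: UTF-8 -> UTF-16 LE; every ASCII byte doubles in size.
    utf16 = utf8_bytes.decode("utf-8").encode("utf-16-le")
    # Normalize to NFC while the text is in its UTF-16 form.
    nfc = unicodedata.normalize("NFC", utf16.decode("utf-16-le"))
    # Conversion 2: back to UTF-8, even for text that needed no change.
    return nfc.encode("utf-8")
```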
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact:
Re: Normalization of a UTF-8 file?
Hi David,
We take advantage of functions that operate on UTF-16 LE only (and we don't plan to rewrite them).
Are you finding this very slow? Or is it more a question of making it transparent to the user, i.e. having the conversion done in the background?
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Normalization of a UTF-8 file?
Normalization of a UTF-8 file of length 5,521,778 bytes took almost 4 minutes.
The number of combining characters in the input file was 195,986, which is approximately 3% of the total.
The time penalty arises from having to double the number of bytes required to represent the other 97% of the file during the conversion from UTF-8 to UTF-16 LE.
In fact the time penalty occurs twice, because those characters also have to be converted back to UTF-8, even though they didn't need normalizing in the first place.
Hence my remark in the initial posting.
David
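(To make the implied optimization concrete: combining characters are never ASCII, so pure-ASCII runs are already in NFC and could be copied through untouched; only non-ASCII runs, plus one preceding byte in case a combining mark follows an ASCII base, need the normalizer. A minimal sketch, assuming valid UTF-8 input; `normalize_nfc_sparse` is a hypothetical name, not a TextPipe function:)

```python
import unicodedata

def normalize_nfc_sparse(utf8_bytes: bytes) -> bytes:
    out = bytearray()
    i, n = 0, len(utf8_bytes)
    while i < n:
        # Skip over an ASCII run; ASCII text is already in NFC.
        j = i
        while j < n and utf8_bytes[j] < 0x80:
            j += 1
        if j == n:
            out += utf8_bytes[i:]  # trailing ASCII: copy through untouched
            break
        # Hold back one ASCII byte as a possible base character: a combining
        # mark opening the next run may compose with it (e + U+0301 -> U+00E9).
        base = max(i, j - 1)
        out += utf8_bytes[i:base]
        # Extend over the run of non-ASCII bytes; these are complete UTF-8
        # sequences, since every byte of a multi-byte sequence is >= 0x80.
        k = j
        while k < n and utf8_bytes[k] >= 0x80:
            k += 1
        segment = utf8_bytes[base:k].decode("utf-8")
        out += unicodedata.normalize("NFC", segment).encode("utf-8")
        i = k
    return bytes(out)
```

On a file where only about 3% of the characters are non-ASCII, almost all of the input takes the cheap copy path.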
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact:
Re: Normalization of a UTF-8 file?
Hi David,
Understood. Would you be able to send me a compressed sample file to benchmark against?
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Normalization of a UTF-8 file?
Hi Simon,
I could arrange that when I have a few moments to spare.
David
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact:
Re: Normalization of a UTF-8 file?
One other question David,
TextPipe only considers files with a UTF-8 BOM to be UTF-8, so ANSI files are not treated as UTF-8.
I believe the Unicode spec says that UTF-8 files do not need a BOM. Do you think that TextPipe's 'Restrict to UTF-8 files' filter should be changed to reflect this?
i.e.
1. Rename the existing filter to 'Restrict to UTF-8 BOM files'.
2. Create a new filter, 'Restrict to UTF-8 files', which allows any file that does not look like UTF-16 or UTF-32.
What do you think?
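(A sketch of what option 2's detection might look like, given a leading sample of the file's bytes; `looks_like_utf8` is a hypothetical name, and the NUL-byte test is only a heuristic for BOM-less UTF-16/UTF-32:)

```python
def looks_like_utf8(sample: bytes) -> bool:
    # Test UTF-32 BOMs before UTF-16, since the UTF-32 LE BOM
    # (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
    if sample.startswith((b"\xff\xfe\x00\x00", b"\x00\x00\xfe\xff")):
        return False  # UTF-32 LE / BE
    if sample.startswith((b"\xff\xfe", b"\xfe\xff")):
        return False  # UTF-16 LE / BE
    # BOM-less UTF-16/UTF-32 text is full of NUL bytes; UTF-8 and ANSI
    # text almost never contains them.
    if b"\x00" in sample:
        return False
    # Optional, and stricter than "anything not UTF-16/32": require that
    # the bytes actually decode as UTF-8.  (A truncated sample may end
    # mid-sequence, so this check is approximate.)
    try:
        sample.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```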
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Normalization of a UTF-8 file?
Simon,
I can only report my own experience and common practice.
Many of the files that I handle are encoded as UTF-8 without BOM.
And if the input files are not thus encoded, most of the output files from my filters are.
UTF-8 files with a BOM are seen less often, and you already have a filter to Remove BOM.
David
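(For what it's worth, the UTF-8 BOM is just the three bytes EF BB BF, so a Remove BOM style operation is tiny; a sketch, not TextPipe's implementation:)

```python
UTF8_BOM = b"\xef\xbb\xbf"

def strip_utf8_bom(data: bytes) -> bytes:
    # Drop a leading UTF-8 BOM if present; otherwise return the data unchanged.
    return data[len(UTF8_BOM):] if data.startswith(UTF8_BOM) else data
```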