When the Count duplicate lines filter is used on UTF-8 encoded input data that is in a particular non-Roman script, such as the Gurmukhi writing system used for Eastern Punjabi, what collation algorithm determines the sort order of the output lines?
I successfully made a filter to extract and count all the words from a Punjabi Bible translation, and I am puzzled by the order of the counted words.
By way of comparison: for a language that uses Roman script, such as Shona, it's clear that the output is sorted on the words as if they were ANSI data, even though the file being processed is UTF-8.
It gets more complicated, even with Roman-script data, when the alphabet makes use of code points beyond \xFF.
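As a rough illustration of the orderings in question, here is a minimal Python sketch. It is not the filter's actual implementation, which is exactly what is unknown here; the sample words are arbitrary. It compares sorting on raw UTF-8 bytes with sorting on Unicode code points. For valid UTF-8 these two orders coincide, so a plain byte-wise comparison would place ASCII first, then other Latin letters, then Gurmukhi.

```python
# Arbitrary sample words (not taken from any real data set).
words = [
    "zebra",
    "\u0A15",      # GURMUKHI LETTER KA
    "apple",
    "\u014Boma",   # begins with LATIN SMALL LETTER ENG (U+014B), beyond \xFF
    "\u0A73",      # GURMUKHI URA
]

# Order 1: compare the raw UTF-8 bytes, as a byte-oriented sort would.
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))

# Order 2: compare by Unicode code point (Python's default for str).
by_code_points = sorted(words)

# UTF-8 is designed so that these two orders agree.
print(by_bytes == by_code_points)
```

Neither order is a linguistic collation: both simply follow the layout of the Unicode code charts.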
Best regards,
David
Count duplicate lines filter and non-Roman script
Moderators: DataMystic Support, Moderators
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Count duplicate lines filter and non-Roman script
Any update to report on this?
Aside: Is the same collation algorithm used in the Sort filter?
Might it be feasible to enhance both filters by adding a drop-down list to the UI for selecting a writing-system-specific collation order?
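In the simplest case, a selectable writing-system-specific order could amount to a lookup table that ranks each letter. The Python sketch below is hypothetical: `GURMUKHI_ORDER` holds only the first six letters of the traditional Gurmukhi alphabet, and `collation_key` is an illustrative helper, not part of any product. Full collation, such as the Unicode Collation Algorithm with language tailorings, also has to deal with vowel signs, nukta forms and multi-level comparison, all of which this ignores.

```python
# Illustrative subset of the traditional alphabet order:
# ura, aira, iri, sassa, haha, kakka.
GURMUKHI_ORDER = "\u0A73\u0A05\u0A72\u0A38\u0A39\u0A15"
RANK = {ch: i for i, ch in enumerate(GURMUKHI_ORDER)}

def collation_key(word):
    """Rank listed letters by table position; others follow, by code point."""
    return [(0, RANK[ch]) if ch in RANK else (1, ord(ch)) for ch in word]

def code_points(items):
    """Render single-letter strings as U+XXXX labels for display."""
    return [f"U+{ord(s):04X}" for s in items]

letters = ["\u0A15", "\u0A38", "\u0A73", "\u0A05"]  # ka, sa, ura, a

in_code_point_order = sorted(letters)
in_table_order = sorted(letters, key=collation_key)

print(code_points(in_code_point_order))
print(code_points(in_table_order))
```

With the table, ura sorts first, as it does in the alphabet, whereas code point order puts it last of the four.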
David
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Count duplicate lines filter and non-Roman script
Help for this filter includes:
"The file need NOT be sorted prior to this filter."
That may well be, but the output of using this filter on an input stream with a mixture of Unicode blocks comes out in a very strange order!
The output of this filter then needs to be piped into the Sort filter and UTF-8 sorted on the text column corresponding to the start of the lines that were counted.
IMHO, this filter's UI should include a dropdown for selecting the sort method.
This would obviate the need for using the sort filter afterwards.
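The two-step workaround above (count, then sort on the text column) can be sketched in Python as follows. This is only a model of the idea, not of either filter's internals: `count_duplicate_lines` and `sort_on_text` are hypothetical names, and "UTF-8 sorted" is assumed here to mean comparing the UTF-8 bytes of the text.

```python
from collections import Counter

def count_duplicate_lines(lines):
    """Return (count, text) rows in first-seen order; input need not be sorted."""
    counts = Counter(lines)
    return [(n, text) for text, n in counts.items()]

def sort_on_text(rows):
    """Sort rows on the text column, comparing its UTF-8 bytes (an assumption)."""
    return sorted(rows, key=lambda row: row[1].encode("utf-8"))

# Unsorted input mixing Latin letters with GURMUKHI LETTER KA (U+0A15).
lines = ["b", "\u0A15", "a", "b", "\u0A15", "b"]

counted = count_duplicate_lines(lines)
result = sort_on_text(counted)
```

A sort-method dropdown on the counting filter would, in effect, fold the second function into the first.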
David
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
Re: Count duplicate lines filter and non-Roman script
Hi David, this sounds like it is specific to your use case. We've added some text to the help file to address it.