Count duplicate lines filter and non-Roman script
Posted: Wed Sep 27, 2017 6:33 pm
When the Count duplicate lines filter is used on UTF-8 encoded input data that is in a particular non-Roman script, such as the Gurmukhi writing system used for Eastern Punjabi, what collation algorithm determines the sort order of the output lines?
I successfully made a filter to extract and count all the words from a Punjabi Bible translation, and I am puzzled by the order of the counted words.
By comparison, for a language that uses Roman script, such as Shona, it's clear that the output is sorted as if the words were ANSI data, even though the file being processed is UTF-8.
It gets more complicated, even with Roman-script data, when the alphabet uses code points beyond \xFF.
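To make the question concrete, here is a minimal Python sketch of what a plain byte-wise sort would do. It assumes the filter compares the raw UTF-8 bytes of each line, which is only a guess and not confirmed behaviour of the filter. The sample words and the function names `bytewise_sort` and `codepoint_sort` are hypothetical and not taken from the actual data.

```python
# Hypothetical sample words: plain Latin, Latin beyond \xFF, and Gurmukhi.
words = ["zebra", "ŋoma", "apple", "ਪਾਣੀ", "éclair", "ਅਤੇ"]


def bytewise_sort(items):
    """Sort strings by their raw UTF-8 bytes (the assumed filter behaviour)."""
    return sorted(items, key=lambda s: s.encode("utf-8"))


def codepoint_sort(items):
    """Sort strings by Unicode code point (Python's default str ordering)."""
    return sorted(items)


by_bytes = bytewise_sort(words)
by_codepoint = codepoint_sort(words)

# UTF-8 is designed so that byte order and code point order agree.
print(by_bytes == by_codepoint)
for word in by_bytes:
    print(word)
```

If that assumption holds, every accented or extended Latin letter sorts after plain "z", and Gurmukhi words come out in code point order, which need not match the order a Punjabi dictionary would use.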
Best regards,
David