Locale-sensitive filters and multilingual texts?

Posted: Sat Jun 02, 2012 6:15 am
by dfhtextpipe
Several TextPipe filters are sensitive to the locale, especially those that involve sorting or case comparison.

Locales are currently configured as part of the regional settings of the Windows operating system.

Yet a monoglot programmer may be tasked with processing multilingual text files,
i.e. several different projects, each for a specific language.
Furthermore, the programmer may largely bring IT skills to these projects, rather than skills in each of the languages.

It makes next to no sense for the programmer to keep changing the locale at the OS level.
This just leads to incomprehensible GUIs for all of the programmer's Windows applications.
It may also lead to having to put up with unfamiliar keyboard layouts for different alphabets and syllabaries, etc.

To work using TextPipe in such circumstances, it would be much better for each locale-sensitive TextPipe filter
to include options for specifying the locale to use for that filter.

e.g. If you are processing a text written in French, German or Turkish, then the chosen filter would have an option to select one
of these locales from the whole range of locales that TextPipe is designed to support.
  • I mention French because vowels have accents and because of the cedilla, etc.
  • I mention German because vowels can have umlauts and because of the character ß (U+00DF LATIN SMALL LETTER SHARP S).
  • I mention Turkish because of the dotted and dotless I in the Turkish alphabet.
Such locales are not restricted to the ANSI range of characters.
Yet even extending the available locales to cover more of the Latin-alphabet-based locales would be a good start.

Block Name	Range	Code Points	Characters	Unicode Version
Basic Latin	0000..007F	128	128	1.0.0
Latin-1 Supplement	0080..00FF	128	128	1.0.0
Latin Extended-A	0100..017F	128	128	1.0.0
Latin Extended-B	0180..024F	208	208	1.0.0
Is this something that you would be prepared to develop as an enhancement to TextPipe?

Best regards,
David

Re: Locale-sensitive filters and multilingual texts?

Posted: Sun Jun 10, 2012 1:59 pm
by DataMystic Support
Hi David,

In short, yes; the question is how to do it.

Our earlier discussion would seem a good way to solve this: each filter would know its own input and output encoding, and could then suggest the appropriate intermediate filter to match the encodings. You could choose to override this, or perhaps only apply it when required, like the 'Remove prompting' option.

Can you give me two filters that would benefit from this approach?

Case-changing filters might be a good option. However, to prevent an explosion of language-specific filters within TextPipe, we would most likely convert incoming text to Unicode, apply the case conversion using the underlying Windows API, and then convert the text back to its original encoding. This might introduce round-trip issues.
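
To make the round-trip concern concrete, here is a minimal Python sketch. It is not TextPipe code: Python's built-in `str.upper()` stands in for the Windows API call, and the helper names `upper_via_unicode` and `try_upper` are made up for illustration. It shows two things that can go wrong: the byte length can change (ß becomes SS), and the upper-case form may not exist in the original code page (ÿ becomes Ÿ, which Latin-1 cannot encode, although Windows-1252 can).

```python
def upper_via_unicode(data: bytes, encoding: str) -> bytes:
    """Decode to Unicode, upper-case, then re-encode in the original encoding."""
    text = data.decode(encoding)
    return text.upper().encode(encoding)


def try_upper(data: bytes, encoding: str) -> str:
    """Return 'ok', or the name of the error raised by the round trip."""
    try:
        upper_via_unicode(data, encoding)
        return "ok"
    except UnicodeEncodeError:
        return "UnicodeEncodeError"


if __name__ == "__main__":
    # Length changes: 6 bytes in, 7 bytes out.
    print(upper_via_unicode("straße".encode("cp1252"), "cp1252"))
    # U+00FF upper-cases to U+0178, which Latin-1 cannot represent.
    print(try_upper(b"\xff", "latin-1"))
    # The same byte survives the round trip in Windows-1252.
    print(try_upper(b"\xff", "cp1252"))
```

A real filter would need a policy for the failing case, for example leaving the character unchanged or reporting it.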

Can you give me a real-world example of how you see this working?

Re: Locale-sensitive filters and multilingual texts?

Posted: Mon Jun 11, 2012 1:54 am
by dfhtextpipe
Hi Simon,

You may safely assume that all the input files are already Unicode.
Generally I work on text files that are encoded as UTF-8 (without BOM).
I rarely need to work with files encoded using Windows Code Pages (or anything else for that matter).
And there are already filters to convert these to Unicode.

The relevant TextPipe filter features are therefore those that use either
  • (a) case change operations or case-sensitive selections, or
  • (b) sorting in the defined alphabetical (or symbol) order for each specific language.
You should assume that (b) also includes Count Duplicates, as this has an implicit sort for its outputs.

Sorting even European languages has complications when the alphabet contains letters with diacritics and/or ligatures or digraphs.
See http://en.wikipedia.org/wiki/Alphabetic ... onventions

There may be some other aspects that I haven't fully thought out, such as languages that do not use spaces as word boundaries.

Best regards,
David

Re: Locale-sensitive filters and multilingual texts?

Posted: Mon Jun 11, 2012 2:13 am
by dfhtextpipe
Further information:

In many Latin scripted languages, accented letters are not counted as part of the alphabet, but in some they are!
Welsh is an example of the former. See http://en.wikipedia.org/wiki/Welsh_orthography
Azerbaijani is an example of the latter. See http://en.wikipedia.org/wiki/Azerbaijani_alphabet
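
As a rough illustration of the difference, here is a Python sketch of two sort keys. Everything in it is an assumption for the sake of example rather than anything TextPipe provides: the function names are hypothetical, the sample words are my own, the input is assumed to be lower case, and the Welsh key deliberately ignores the Welsh digraph letters (ch, dd, ll, etc.), which a real collation would also have to handle.

```python
import unicodedata

# Azerbaijani Latin alphabet in its own order (note x after h, q after k).
AZ_ALPHABET = "abcçdeəfgğhxıijkqlmnoöprsştuüvyz"
AZ_RANK = {letter: index for index, letter in enumerate(AZ_ALPHABET)}


def strip_marks(word: str) -> str:
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def key_ignoring_accents(word: str):
    """Accents are not separate letters: compare base letters first,
    and use the accented spelling only to break ties."""
    return (strip_marks(word), word)


def key_by_alphabet(word: str, rank=AZ_RANK):
    """Accented letters are letters in their own right: rank each character
    by its position in the alphabet; unknown characters sort last."""
    return [rank.get(c, len(rank)) for c in word]


if __name__ == "__main__":
    welsh = ["tyfu", "tŷ", "tân", "tan"]
    print(sorted(welsh))                            # plain code point order
    print(sorted(welsh, key=key_ignoring_accents))  # accents ignored

    azeri = ["şəhər", "su", "xalq", "həyat", "qız", "kitab", "çay", "dağ"]
    print(sorted(azeri))                            # plain code point order
    print(sorted(azeri, key=key_by_alphabet))       # alphabet order
```

With the first key, tŷ sorts before tyfu; with the second, çay sorts first and xalq sorts before kitab, neither of which a plain code point sort would produce.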

David

Re: Locale-sensitive filters and multilingual texts?

Posted: Mon Jun 11, 2012 2:14 am
by dfhtextpipe

Re: Locale-sensitive filters and multilingual texts?

Posted: Mon Jun 11, 2012 2:15 am
by dfhtextpipe
And most important of all, http://en.wikipedia.org/wiki/Unicode_co ... _algorithm

David

PS. I had to split my reply merely because of the number of URLs.

Re: Locale-sensitive filters and multilingual texts?

Posted: Mon Jun 11, 2012 2:23 am
by dfhtextpipe
You referred to the need "to prevent an explosion of language-specific filters within TextPipe".

I see these as avoidable, provided each of the existing relevant filters gains a drop-down selector to specify the language of the input Unicode text files.

However, one cannot simply apply what Windows does unless one knows which language the input file is written in.
For example, if you are filtering something written in Turkish or Azerbaijani, then your filter needs to know about the dotted and dotless letter I. See
http://en.wikipedia.org/wiki/Dotted_and_dotless_I
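
A minimal Python sketch of what such a language parameter might do for case changes. This is only an illustration under my own assumptions: the function names and the two-letter language codes are hypothetical, and only the Turkish and Azerbaijani tailoring is filled in. Python's default `upper()` and `lower()` are language-neutral, so the sketch swaps the four I letters before calling them.

```python
# Hypothetical per-language tailoring tables (only "tr" and "az" are filled in).
# U+0130 is capital I with dot above; U+0131 is small dotless i.
_DOTTED_UPPER = {"i": "\u0130", "\u0131": "I"}
_DOTTED_LOWER = {"I": "\u0131", "\u0130": "i"}

UPPER_TAILORING = {"tr": _DOTTED_UPPER, "az": _DOTTED_UPPER}
LOWER_TAILORING = {"tr": _DOTTED_LOWER, "az": _DOTTED_LOWER}


def upper_for(text: str, lang: str) -> str:
    """Upper-case text, applying the language's special mappings first."""
    table = str.maketrans(UPPER_TAILORING.get(lang, {}))
    return text.translate(table).upper()


def lower_for(text: str, lang: str) -> str:
    """Lower-case text, applying the language's special mappings first."""
    table = str.maketrans(LOWER_TAILORING.get(lang, {}))
    return text.translate(table).lower()


if __name__ == "__main__":
    print(upper_for("istanbul", "tr"))  # dotted capital I at the start
    print(upper_for("istanbul", "de"))  # ordinary capital I
    print(lower_for("ISPARTA", "tr"))   # dotless small i at the start
```

The same input therefore gives different output depending on the language selected, which is why the filter cannot guess.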

David

Re: Locale-sensitive filters and multilingual texts?

Posted: Mon Jun 11, 2012 2:30 am
by dfhtextpipe
Example of such a task:

For any given Bible translation for which the digital text is available, generate a Count Duplicates-style word list.
This involves implicit sorting (not to mention how to deal with the various punctuation marks that can occur within words).
The sorting should be in the required collation order for the language of the translation.
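
In outline, and only as a sketch, the task looks something like this in Python. The names `WORD_RE` and `word_list` are made up, the rule for punctuation inside words (apostrophes and hyphens only) is an assumption that will not suit every language, and the text is assumed to be NFC-normalised so that accented letters are single characters. The `sort_key` parameter is where a language-specific collation key would be plugged in.

```python
import re
from collections import Counter

# Runs of letters, optionally joined by an apostrophe or hyphen inside the word.
# [^\W\d_] means "a word character that is neither a digit nor an underscore".
WORD_RE = re.compile(r"[^\W\d_]+(?:['’\-][^\W\d_]+)*")


def word_list(text: str, sort_key=lambda word: word):
    """Return (word, count) pairs, ordered by the supplied sort key."""
    counts = Counter(WORD_RE.findall(text))
    return [(word, counts[word]) for word in sorted(counts, key=sort_key)]


if __name__ == "__main__":
    verse = "In the beginning God created the heaven and the earth."
    # str.lower is only a stand-in for a real per-language collation key.
    for word, count in word_list(verse, sort_key=str.lower):
        print(count, word)
```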

David