
Character cAsE filters and the Turkish alphabet

Posted: Tue Mar 03, 2020 10:12 pm
by dfhtextpipe
The help pages for the various Character cAsE filters state the following:
"This filter expects UTF-8 data and will handle foreign character sets."
This is not quite true, in that there are exceptions in some bicameral alphabets such as Turkish and Northern Azeri.
Both these alphabets include the following two letters:

Code: Select all

U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE : i dot
U+0131 LATIN SMALL LETTER DOTLESS I
So for example pasting the following into the Trial Run area:

Code: Select all

İı
running the tOGGLE cASE filter makes no change.
On the other hand, it does change most accented Latin letters, e.g.

Code: Select all

Š
to

Code: Select all

š
Perhaps the sentence in the Help pages should be qualified, e.g.
"This filter expects UTF-8 data and will handle some foreign character sets."
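I'm not sure how you might implement the proper case rules for the Turkish alphabet, etc.
These filters would first need the user to specify the writing system context (e.g. Turkish or Azeri), because the correct mapping for i and I depends on the language, not just on the character.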
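
To illustrate what "writing system context" would mean in practice, here is a rough Python sketch. It is only an illustration under my own assumptions, not how TextPipe works internally: the function names and the `lang` parameter are hypothetical. The idea is to apply the Turkic dotted/dotless i mappings first when the user says the text is Turkish or Azeri, and then fall back to the default Unicode mappings for everything else.

```python
# Hypothetical sketch: language-aware case filters.
# Names (to_upper, to_lower, toggle_case, lang) are illustrative assumptions.

TURKIC_LANGS = {"tr", "az"}  # Turkish, Azeri

# Explicit mappings for the dotted/dotless i pairs.
_TURKIC_UPPER = str.maketrans({"i": "\u0130", "\u0131": "I"})
_TURKIC_LOWER = str.maketrans({"I": "\u0131", "\u0130": "i"})


def to_upper(text: str, lang: str = "") -> str:
    """Uppercase text, applying the Turkic i rules first if requested."""
    if lang in TURKIC_LANGS:
        text = text.translate(_TURKIC_UPPER)
    return text.upper()


def to_lower(text: str, lang: str = "") -> str:
    """Lowercase text, applying the Turkic i rules first if requested."""
    if lang in TURKIC_LANGS:
        text = text.translate(_TURKIC_LOWER)
    return text.lower()


def toggle_case(text: str, lang: str = "") -> str:
    """Swap the case of each cased character, one character at a time."""
    out = []
    for ch in text:
        if ch.isupper():
            out.append(to_lower(ch, lang))
        elif ch.islower():
            out.append(to_upper(ch, lang))
        else:
            out.append(ch)
    return "".join(out)
```

With `lang="tr"`, `toggle_case("\u0130\u0131", "tr")` turns İı into iI, and `to_upper("istanbul", "tr")` keeps the dot on the capital. The sketch ignores decomposed input (I followed by U+0307 COMBINING DOT ABOVE), which a real implementation would also have to handle.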

Furthermore, I would guess that you've not given any consideration to extending these Character cAsE filters to cover the small letters in the Cherokee Supplement block, which was added in Unicode 8.0 (June 2015).


Best regards,
David

Re: Character cAsE filters and the Turkish alphabet

Posted: Wed Sep 23, 2020 10:48 pm
by DataMystic Support
Thanks David - we have made the clarification change to the help file.

Can you please provide sample text for the Cherokee supplement block?
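
One way to produce sample text for that block is sketched below in Python. It is only an illustration: the helper name is hypothetical, and it assumes a Python build whose Unicode database is at version 8.0 or later (Python 3.5 and newer). The Cherokee Supplement block is U+AB70 to U+ABBF, and its small letters pair with the capital letters starting at U+13A0 in the main Cherokee block.

```python
# Hypothetical helper for generating Cherokee test data.
# Assumes the interpreter's Unicode database is version 8.0 or later.

CHEROKEE_SUPPLEMENT = range(0xAB70, 0xABC0)  # U+AB70..U+ABBF, small letters


def cherokee_small_sample(count: int = 6) -> str:
    """Return the first `count` small letters of the Cherokee Supplement block."""
    return "".join(chr(cp) for cp in list(CHEROKEE_SUPPLEMENT)[:count])


sample_lower = cherokee_small_sample()
sample_upper = sample_lower.upper()  # maps into U+13A0.. in the main block

for lo, up in zip(sample_lower, sample_upper):
    print(f"U+{ord(lo):04X} {lo} <-> U+{ord(up):04X} {up}")
```

A case filter that supports the block should convert `sample_lower` to `sample_upper` and back again.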