Please add Unicode support to the Text to Word List filter
Moderators: DataMystic Support, Moderators
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Please add Unicode support to the Text to Word List filter
Please consider adding Unicode support to the Text to Word List filter.
Ideally, this should be feasible for UTF-8 encoded input files. How difficult would this be?
Notes:
Special consideration would be needed for the Narrow No-Break Space (U+202F).
This was introduced in Unicode 3.0 for Mongolian, to separate a suffix from the word stem without indicating a word boundary.
Special provision would be required for languages that do not use spaces (etc) to separate words.
See http://en.wikipedia.org/wiki/Category:W ... boundaries
I'd suggest imposing a maximum word length limit in the options dialogue. This could be used to trigger an error message, or whatever.
The advice "Normally you would follow this filter with a Sort and Remove Duplicates filter." would not be appropriate for non-ANSI word lists.
David
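A rough sketch of the behaviour requested above, assuming Python and its standard `unicodedata` module. The names `is_word_char`, `text_to_word_list` and `max_len` are hypothetical and are not part of the existing filter; unlike the existing filter, this sketch makes no provision for hyphens.

```python
import unicodedata

NNBSP = "\u202f"  # Narrow No-Break Space: joins a suffix to its word stem


def is_word_char(ch, extra=frozenset({NNBSP})):
    # Letters (L*) and combining marks (M*) count as word characters,
    # plus any explicitly listed extras such as U+202F.
    return ch in extra or unicodedata.category(ch)[0] in "LM"


def text_to_word_list(text, max_len=64):
    """Split text into words, raising an error for over-long words."""
    words, current = [], []
    for ch in text + "\n":  # the trailing newline flushes the last word
        if is_word_char(ch):
            current.append(ch)
        elif current:
            word = "".join(current).strip(NNBSP)  # no stray leading/trailing NNBSP
            current = []
            if not word:
                continue
            if len(word) > max_len:
                raise ValueError(f"word exceeds {max_len} characters: {word[:20]}...")
            words.append(word)
    return words
```

Here `text_to_word_list("café\u202fbar, déjà vu")` keeps `café` and `bar` together as one word, because U+202F is listed as a non-delimiter. Raising an error is only one possible response to an over-long word.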
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact:
Re: Please add Unicode support to the Text to Word List filt
Hi David,
Yes, this could be done, but it would be designed to work with UTF-16. A UTF-8 conversion filter at the start and end would solve the UTF-8 issue.
Are we just talking about the Unicode isLetter property, with an exception for Narrow No-Break Space?
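To illustrate where the conversion filters would sit, here is a minimal Python sketch. It assumes that `str.isalpha()` is an acceptable stand-in for the Unicode letter property, and all function names are hypothetical. Python strings hold code points rather than UTF-16 code units, so this shows only the shape of the pipeline, not how the filter itself would be coded.

```python
from itertools import groupby

NNBSP = "\u202f"  # the one exception: treated as part of a word


def utf8_to_utf16le(data: bytes) -> bytes:
    return data.decode("utf-8").encode("utf-16-le")


def utf16le_to_utf8(data: bytes) -> bytes:
    return data.decode("utf-16-le").encode("utf-8")


def word_list_utf16le(data: bytes) -> bytes:
    """UTF-16 in, UTF-16 out: one word per line."""
    text = data.decode("utf-16-le")
    words = []
    for is_word, run in groupby(text, key=lambda ch: ch.isalpha() or ch == NNBSP):
        word = "".join(run).strip(NNBSP)
        if is_word and word:
            words.append(word)
    return "\n".join(words).encode("utf-16-le")


def word_list_utf8(data: bytes) -> bytes:
    # Conversion filter at the start, word list in the middle, conversion at the end.
    return utf16le_to_utf8(word_list_utf16le(utf8_to_utf16le(data)))
```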
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Please add Unicode support to the Text to Word List filt
There ought to be some other notable exceptions:
For example, the Word Joiner (U+2060). This too should count as a word character rather than a word boundary.
See http://en.wikipedia.org/wiki/Space_%28punctuation%29 for further background.
For instance, I'm currently making use of the WJ within a transliteration filter, to ensure that the converse filter gives 100% accuracy for a round trip.
David
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Please add Unicode support to the Text to Word List filt
You could even have tick box options for such additional filter inclusions, just like you have elsewhere for Perl pattern matching options.
☐ Include No-Break Space (U+00A0)
☐ Include Figure Space (U+2007)
☒ Include Narrow No-Break Space (U+202F)
☒ Include Word Joiner (U+2060)
☐ Include Zero Width No-Break Space (U+FEFF) deprecated
Such a UI feature would provide user choice - and thus provide added value to your customers.
Programming these tick boxes in from when you first add Unicode support would also mean that it would be easier to extend the feature if and when any similar addition is requested.
David
PS. I'm happy with it being implemented for UTF-16 and understand why this is simpler to code.
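The tick boxes above could map onto a small options record along these lines. This is a sketch in Python with hypothetical names (`IncludeOptions`, `included_chars`); the defaults simply mirror the boxes shown as ticked above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncludeOptions:
    no_break_space: bool = False             # U+00A0
    figure_space: bool = False               # U+2007
    narrow_no_break_space: bool = True       # U+202F
    word_joiner: bool = True                 # U+2060
    zero_width_no_break_space: bool = False  # U+FEFF (deprecated)

    def included_chars(self):
        """Return the set of extra characters to treat as part of a word."""
        table = {
            "no_break_space": "\u00a0",
            "figure_space": "\u2007",
            "narrow_no_break_space": "\u202f",
            "word_joiner": "\u2060",
            "zero_width_no_break_space": "\ufeff",
        }
        return frozenset(ch for name, ch in table.items() if getattr(self, name))


def is_word_char(ch, options):
    # A letter, or one of the characters the user has ticked.
    return ch.isalpha() or ch in options.included_chars()
```

Adding a further tick box later would then mean one new field and one new table entry.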
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Please add Unicode support to the Text to Word List filt
Simon,
There are some languages in which the apostrophe character is classed as a letter of the alphabet, for example as a glottal stop.
See http://en.wikipedia.org/wiki/Apostrophe ... ottal_stop
Read further to the subsequent subsections. ...
In Turkish, proper nouns are capitalized and an apostrophe is inserted between the noun and any following suffix,
e.g. İstanbul'da ("in Istanbul"), contrasting with okulda ("in school").
Extending my suggestion for tick boxes, we could therefore borrow a UI technique from Excel's Text to Columns Wizard, by having an "Other" tick box at the bottom.
See attached image. How about that for lateral thinking?
David
- Attachments
-
- Excel Text to Columns Wizard (provided for illustration, under fair use conditions)
- Excel_Dialog_Text_to_Columns_Wizard.png (31.39 KiB) Viewed 16047 times
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact:
Re: Please add Unicode support to the Text to Word List filt
Whew David - I had no idea it would be so complicated!
A simple filter with one numeric parameter is certainly far less time-consuming to add than one with its own form (saving/loading/registry), COM API changes, documentation as well as the filter itself!
Anyway, what is the better approach? - defining the delimiters (and allowing for optional extras), or defining what is a letter - which would avoid numbers with numbers in them.
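The two approaches can be compared side by side. This is a sketch in Python; the delimiter set and the function names are illustrative assumptions, not the filter's actual behaviour.

```python
DEFAULT_DELIMITERS = frozenset(" \t\r\n.,;:!?\"()[]")  # illustrative set only


def _split(text, is_word_char):
    """Collect maximal runs of characters for which is_word_char is true."""
    words, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


def split_on_delimiters(text, delimiters=DEFAULT_DELIMITERS):
    # Approach 1: anything that is not a delimiter belongs to a word.
    return _split(text, lambda ch: ch not in delimiters)


def split_on_letters(text):
    # Approach 2: only letters belong to a word.
    return _split(text, str.isalpha)
```

For the input `Order 66 units, mp3 files.` the delimiter approach yields `Order`, `66`, `units`, `mp3`, `files`, while the letter approach yields `Order`, `units`, `mp`, `files`. Note that in this sketch the letter approach truncates `mp3` to `mp` rather than dropping it, so a real implementation would have to decide which of those is wanted.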
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Please add Unicode support to the Text to Word List filt
Hi Simon,
It is indeed rather complicated, and the more I think about it, the more options I can come up with.
For example, how does the existing filter treat soft hyphens?
The help states, "Hyphenated words are recognised as single words, provided that they aren't broken across lines."
Yet I'd assume that the reference here is to plain hyphens only.
I can easily test what happens, but this is advance notice that we'd probably want to include a tick box for including soft hyphens.
Likewise, in text files transliterated from a non-Roman script for a language that uses a syllabary, we might wish to include the middle dot as a syllable separator.
Including it would help ensure that a reverse converter from the Latin script transliteration back to the non-Roman script is 100% accurate.
Even for some Latin script languages (e.g. Catalan) this code point is used to separate syllables.
See http://en.wikipedia.org/wiki/Interpunct
David
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Please add Unicode support to the Text to Word List filt
PS. Just to make you smile....
Here's a nice single word containing middle dots:
Llan·fair·pwll·gwyn·gyll·go·ger·y·chwyrn·drob·wll·llantys·ilio·gogo·goch
For details see http://en.wikipedia.org/wiki/Llanfairpwllgwyngyll
Without using the IPA (and without resorting to Anglicizing), one can thus separate the syllables to make this Welsh place name easier to pronounce.
Hyphens would be too strong here, and the actual place name is not hyphenated.
David
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Please add Unicode support to the Text to Word List filt
Simon,
You wrote: "which would avoid numbers with numbers in them."
I think you'd meant to write: "which would avoid words with numbers in them."
David
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Please add Unicode support to the Text to Word List filt
Hi Simon.
My existing Text to Word List filter didn't cope properly with soft hyphens,
presumably because U+00AD is beyond ASCII, being part of Windows-1252 (aka ANSI).
There's no clue that characters U+00A0 to U+00FF are unsupported by the Count Duplicate Lines filter,
which follows the Text to Word List subfilter in my two-stage filter.
The contrast with the Sort filter is brought to your attention:
Sort Type
The sort type controls the method by which items are sorted. The available options are:
· ANSI sort (case insensitive)
· ANSI sort (case sensitive) - faster than case insensitive as no case-mapping is performed
· ASCII sort (case insensitive)
· ASCII sort (case sensitive) - faster than case insensitive as no case-mapping is performed
...
So before extending the Text to Word List filter to cope with Unicode in general,
please could you first extend the Count Duplicate Lines filter to support ANSI.
Meanwhile, I'll tweak my two-stage filter to investigate further.
David
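As an illustration of what ANSI support in a duplicate count might amount to, here is a minimal Python sketch. It assumes the input is Windows-1252 bytes; `count_duplicate_lines` is a hypothetical name and does not describe how the existing filter is implemented.

```python
from collections import Counter


def count_duplicate_lines(data: bytes, encoding: str = "cp1252") -> Counter:
    """Count identical non-empty lines after decoding the bytes."""
    text = data.decode(encoding)
    lines = text.replace("\r\n", "\n").split("\n")
    return Counter(line for line in lines if line)


# Two identical words containing U+00E9, and one containing a soft hyphen (U+00AD).
sample = "café\r\ncafé\r\nco\u00adop\r\n".encode("cp1252")
counts = count_duplicate_lines(sample)
```

Decoding the same bytes as strict ASCII raises `UnicodeDecodeError` in Python, because every byte above 0x7F falls outside ASCII.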
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Please add Unicode support to the Text to Word List filt
If you do decide to implement the "Include Other" tick box, please ensure that the form field for allowing the user to enter the "other" is wide enough to permit a fairly long character class.
Also, watch out for a user defined "other" that includes a space, and issue a suitable warning.
It makes no sense for the words in a word list to include a space, as that would defeat the purpose of this filter in most cases.
Remember, we are not providing a means to provide user-specified delimiters, but rather a means to provide non-delimiters.
David
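The warning could be as simple as the following Python sketch. The name `check_other_chars` and the choice of which whitespace characters to flag are assumptions; the no-break spaces offered as tick boxes earlier in this thread are deliberately not flagged.

```python
# Ordinary whitespace that should never be part of a word in a word list.
BREAKING_WHITESPACE = {
    " ": "space",
    "\t": "tab",
    "\r": "carriage return",
    "\n": "line feed",
}


def check_other_chars(other):
    """Return a list of warnings for a user-entered 'Other' field."""
    warnings = []
    for ch in dict.fromkeys(other):  # each distinct character once, in entry order
        if ch in BREAKING_WHITESPACE:
            warnings.append(
                f"U+{ord(ch):04X} ({BREAKING_WHITESPACE[ch]}) "
                "would allow words to contain whitespace"
            )
    return warnings
```

An empty list means nothing to warn about; for example, an entry of an apostrophe plus a middle dot passes, while an entry containing a plain space produces one warning.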
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Please add Unicode support to the Text to Word List filt
Currently, the Text to Word List filter only makes special provision for the hyphen-minus character.
It would be sensible (even for English text sources) to make optional provision for the apostrophe.
As it stands, words ending with apostrophe s are split into the main word and s. Yet s is not a word in its own right.
A complete provision that would extend to non-English text sources would need to address the differing and various ways in which the apostrophe is used in other languages.
See http://en.wikipedia.org/wiki/Apostrophe
David
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Please add Unicode support to the Text to Word List filt
My ad hoc workaround for apostrophe s is to temporarily replace it by AAAs, then revert after Count Duplicate Lines.
David
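For anyone wanting to reproduce the workaround outside the product, the idea can be sketched in Python as follows. The splitting and counting steps here are stand-ins for the Text to Word List and Count Duplicate Lines filters, and the function name is hypothetical.

```python
from collections import Counter
from itertools import groupby

PLACEHOLDER = "AAAs"  # temporary, letters-only stand-in for apostrophe s


def count_words_keeping_possessives(text):
    # 1. Hide the apostrophe so the word splitter sees letters only.
    protected = text.replace("'s", PLACEHOLDER)
    # 2. Split into runs of letters (stand-in for Text to Word List).
    words = ["".join(run) for is_word, run in groupby(protected, key=str.isalpha) if is_word]
    # 3. Count duplicates (stand-in for Count Duplicate Lines).
    counts = Counter(words)
    # 4. Revert the placeholder in the results.
    return Counter({w.replace(PLACEHOLDER, "'s"): n for w, n in counts.items()})
```

The obvious weakness is that any genuine occurrence of `AAAs` in the source text would be converted too, and the curly apostrophe (U+2019) is not handled.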
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Please add Unicode support to the Text to Word List filt
Likewise, there's a need to cope with using the apostrophe (or the right single quotation mark, U+2019) in words such as these:
- aren’t can’t couldn’t didn’t doesn’t don’t hadn’t hasn’t haven’t isn’t shouldn’t wasn’t weren’t won’t wouldn’t
- I’m
David
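One possible rule, sketched in Python with hypothetical names: treat U+0027 or U+2019 as part of a word only when it sits between two letters, so that quotation marks around a word are still dropped.

```python
APOSTROPHES = frozenset("'\u2019")  # U+0027 and U+2019


def split_words(text):
    """Split into runs of letters, keeping an apostrophe that is inside a word."""
    words, current = [], []
    for i, ch in enumerate(text):
        next_is_letter = i + 1 < len(text) and text[i + 1].isalpha()
        # Inner apostrophe: a word is already in progress and a letter follows.
        inner_apostrophe = ch in APOSTROPHES and bool(current) and next_is_letter
        if ch.isalpha() or inner_apostrophe:
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words
```

With this rule `aren’t`, `I’m` and `David's` each stay whole, as does the Turkish `İstanbul'da` mentioned earlier in the thread. A trailing apostrophe, as in a plural possessive, is still dropped.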
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact:
Re: Please add Unicode support to the Text to Word List filt
Whew! These won't make it into the next release, because it really needs a file format overhaul that we have not had time to do for 12 months or more.
It is certainly looming closer than ever now.