Please add Unicode support to the Text to Word List filter
Posted: Sat May 12, 2012 7:45 pm
by dfhtextpipe
Please consider adding Unicode support to the Text to Word List filter.
Ideally, this should be feasible for UTF-8 encoded input files. How difficult would this be?
Notes:
Special consideration would be needed for the Narrow No-Break Space (U+202F).
This was introduced in Unicode 3.0 for Mongolian, to separate a suffix from the word stem without indicating a word boundary.
Special provision would be required for languages that do not use spaces (etc) to separate words.
See
http://en.wikipedia.org/wiki/Category:W ... boundaries
I'd suggest imposing a maximum word length limit in the options dialogue. This could be used to trigger an error message, or whatever.
The advice "Normally you would follow this filter with a Sort and Remove Duplicates filter." would not be appropriate for non-ANSI word lists.
David
Re: Please add Unicode support to the Text to Word List filt
Posted: Mon May 14, 2012 1:04 pm
by DataMystic Support
Hi David,
Yes, this could be done, but it would be designed to work with UTF-16. A UTF-8 conversion filter at the start and end would solve the UTF-8 issue.
Are we just talking about the Unicode isLetter property, with an exception for Narrow No-Break Space?
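For illustration only, here is a minimal Python sketch of that idea: a word is a run of characters whose Unicode general category is a letter (L*), plus a configurable set of exceptions. The names `is_word_char` and `text_to_word_list` are hypothetical, and the sketch works on Python code points rather than UTF-16 code units, so it is not how the filter itself would be implemented.

```python
import unicodedata

NARROW_NO_BREAK_SPACE = "\u202f"  # general category Zs, so not a letter by itself


def is_word_char(ch, exceptions=frozenset({NARROW_NO_BREAK_SPACE})):
    """True for Unicode letters (general category L*) and for listed exceptions."""
    return unicodedata.category(ch).startswith("L") or ch in exceptions


def text_to_word_list(text, exceptions=frozenset({NARROW_NO_BREAK_SPACE})):
    """Split text into runs of word characters; everything else is a boundary."""
    words, current = [], []
    for ch in text:
        if is_word_char(ch, exceptions):
            current.append(ch)
        elif current:
            # A non-word character closes the word collected so far.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words
```

With the default exceptions, a stem and suffix joined by U+202F stay together as one entry; passing `exceptions=frozenset()` splits them, and digits always act as boundaries because they are not in category L*.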
Re: Please add Unicode support to the Text to Word List filt
Posted: Tue May 15, 2012 12:38 am
by dfhtextpipe
There ought to be some other notable exceptions:
For example, the Word Joiner (U+2060). This too should count as a word character rather than a word boundary.
See
http://en.wikipedia.org/wiki/Space_%28punctuation%29 for further background.
For context, I'm currently making use of the WJ within a transliteration filter, to ensure that the converse filter gives 100% accuracy for a round trip.
David
Re: Please add Unicode support to the Text to Word List filt
Posted: Tue May 15, 2012 12:53 am
by dfhtextpipe
You could even have tick box options for such additional filter inclusions, just like you have elsewhere for Perl pattern matching options.
☐ Include No-Break Space (U+00A0)
☐ Include Figure Space (U+2007)
☒ Include Narrow No-Break Space (U+202F)
☒ Include Word Joiner (U+2060)
☐ Include Zero Width No-Break Space (U+FEFF) deprecated
Such a UI feature would provide user choice - and thus provide added value to your customers.
Programming these tick boxes in from when you first add Unicode support would also mean that it would be easier to extend the feature if and when any similar addition is requested.
David
PS. I'm happy with it being implemented for UTF-16 and understand why this is simpler to code.
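As a rough sketch of how the tick boxes above might map onto a set of extra word characters, the Python below models each option as a boolean with the defaults shown in the list. The class and field names are hypothetical and do not correspond to any existing option names in the product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordCharOptions:
    """One boolean per tick box; defaults mirror the ticked/unticked boxes above."""
    include_no_break_space: bool = False             # U+00A0
    include_figure_space: bool = False               # U+2007
    include_narrow_no_break_space: bool = True       # U+202F
    include_word_joiner: bool = True                 # U+2060
    include_zero_width_no_break_space: bool = False  # U+FEFF (deprecated)

    def extra_word_chars(self):
        """Return the code points that should count as word characters."""
        pairs = [
            (self.include_no_break_space, "\u00a0"),
            (self.include_figure_space, "\u2007"),
            (self.include_narrow_no_break_space, "\u202f"),
            (self.include_word_joiner, "\u2060"),
            (self.include_zero_width_no_break_space, "\ufeff"),
        ]
        return frozenset(ch for enabled, ch in pairs if enabled)
```

Adding a further option later would then mean one new field and one new pair, which is the extensibility argument made above.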
Re: Please add Unicode support to the Text to Word List filt
Posted: Tue May 15, 2012 1:05 am
by dfhtextpipe
Simon,
There are some languages in which the apostrophe character is classed as a letter of the alphabet, e.g. as a glottal stop.
See
http://en.wikipedia.org/wiki/Apostrophe ... ottal_stop
Read further to the subsequent subsections. ...
In Turkish, proper nouns are capitalized and an apostrophe is inserted between the noun and any following suffix,
e.g. İstanbul'da ("in Istanbul"), contrasting with okulda ("in school").
Extending my suggestion for tick boxes, we could therefore borrow a UI technique from Excel's Text to Columns Wizard, by having an "Other" tick box at the bottom.
See attached image. How about that for lateral thinking?
David
Re: Please add Unicode support to the Text to Word List filt
Posted: Tue May 15, 2012 2:15 pm
by DataMystic Support
Whew David - I had no idea it would be so complicated!
A simple filter with one numeric parameter is certainly far less time-consuming to add than one with its own form (saving/loading/registry), COM API changes, documentation as well as the filter itself!
Anyway, what is the better approach? - defining the delimiters (and allowing for optional extras), or defining what is a letter - which would avoid numbers with numbers in them.
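To make the two approaches concrete, here is a small Python sketch that runs both on the same input. The function names and the default delimiter set are assumptions chosen for illustration, not the filter's actual behaviour.

```python
def split_runs(text, is_word_char):
    """Collect maximal runs of characters for which is_word_char(ch) is true."""
    words, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


def split_on_delimiters(text, delimiters=" \t\r\n.,;:!?"):
    """Approach 1: anything that is not a listed delimiter belongs to a word."""
    return split_runs(text, lambda ch: ch not in delimiters)


def split_on_letters(text):
    """Approach 2: only Unicode letters (str.isalpha) belong to a word."""
    return split_runs(text, str.isalpha)
```

For the input `"Route66 opens 2day."`, the delimiter approach keeps `Route66` and `2day` intact, while the letter approach yields `Route`, `opens` and `day`. Conversely, `İstanbul'da` survives as one entry only under the delimiter approach, unless the apostrophe is added as an extra word character.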
Re: Please add Unicode support to the Text to Word List filt
Posted: Tue May 15, 2012 6:47 pm
by dfhtextpipe
Hi Simon,
It is indeed rather complicated, and the more I think about it, the more options I can come up with.
e.g. How does the existing filter treat soft hyphens?
The help states, "Hyphenated words are recognised as single words, provided that they aren't broken across lines."
Yet I'd assume that the reference here is to plain hyphens only.
I can easily test what happens, but this is advance notice that we'd probably want to include a tick box for including soft hyphens.
Likewise, in text files transliterated from a non-Roman script for a language that uses a syllabary, we might wish to include the middle dot as a syllable separator.
Including it would be useful so that a reverse converter from the Latin script transliteration back to the non-Roman script is 100% accurate.
Even for some Latin script languages (e.g. Catalan) this code point is used to separate syllables.
See
http://en.wikipedia.org/wiki/Interpunct
David
Re: Please add Unicode support to the Text to Word List filt
Posted: Tue May 15, 2012 8:09 pm
by dfhtextpipe
PS. Just to make you smile....
Here's a nice single word containing middle dots:
Llan·fair·pwll·gwyn·gyll·go·ger·y·chwyrn·drob·wll·llantys·ilio·gogo·goch
For details see
http://en.wikipedia.org/wiki/Llanfairpwllgwyngyll
Without using the IPA (or resorting to Anglicizing), one can thus separate the syllables to make this Welsh place name easier to pronounce.
Hyphens would be too strong here, and the actual place name is not hyphenated.
David
Re: Please add Unicode support to the Text to Word List filt
Posted: Tue May 15, 2012 8:12 pm
by dfhtextpipe
Simon,
"which would avoid numbers with numbers in them."
I think you'd meant to write:
"which would avoid words with numbers in them."
David
Re: Please add Unicode support to the Text to Word List filt
Posted: Tue May 15, 2012 9:05 pm
by dfhtextpipe
Hi Simon.
My existing Text to Word List filter didn't cope properly with soft hyphens, presumably because U+00AD is beyond ASCII, being part of Windows-1252 (aka ANSI).
There's no clue that characters U+00A0 to U+00FF are unsupported by the Count Duplicate Lines filter, which follows the Text to Word List subfilter in my two stage filter.
The contrast with the Sort filter is brought to your attention:
Sort Type
The sort type controls the method by which items are sorted. The available options are:
· ANSI sort (case insensitive)
· ANSI sort (case sensitive) - faster than case insensitive as no case-mapping is performed
· ASCII sort (case insensitive)
· ASCII sort (case sensitive) - faster than case insensitive as no case-mapping is performed
...
So before extending the Text to Word List filter to cope with Unicode in general, please could you first extend the Count Duplicate Lines filter to support ANSI.
Meanwhile, I'll tweak my two stage filter to investigate further.
David
Re: Please add Unicode support to the Text to Word List filt
Posted: Thu May 17, 2012 5:53 pm
by dfhtextpipe
If you do decide to implement the Include Other tick box, please ensure that the form field in which the user enters the "other" characters is wide enough to permit a fairly long character class.
Also, watch out for a user defined "other" that includes a space, and issue a suitable warning.
It makes no sense for the words in a word list to include a space, as that would defeat the purpose of this filter in most cases.
Remember, we are not providing a means to provide user-specified delimiters, but rather a means to provide non-delimiters.
David
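The check described in this post could look roughly like the Python sketch below. The function name `warn_on_spaces` is hypothetical, and treating tab and line breaks the same way as a plain space is an assumption beyond what was asked for.

```python
def warn_on_spaces(other):
    """Return one warning per plain whitespace character found in the 'Other' field."""
    warnings = []
    for ch in sorted(set(other)):
        # Assumption: tab, CR and LF are flagged as well as the plain space U+0020.
        if ch in " \t\r\n":
            warnings.append(
                f"U+{ord(ch):04X} would let a single list entry span several words"
            )
    return warnings
```

An entry such as `'’` produces no warnings, while `' ` (apostrophe plus space) produces one that names U+0020.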
Re: Please add Unicode support to the Text to Word List filt
Posted: Mon May 28, 2012 8:15 pm
by dfhtextpipe
Currently, the Text to Word List filter only makes special provision for the hyphen-minus character.
It would be sensible (even for English text sources) to make optional provision for the apostrophe.
As it stands, words ending with apostrophe s are split into the main word and s. Yet s is not a word in its own right.
A complete provision that would extend to non-English text sources would need to address the differing and various ways in which the apostrophe is used in other languages.
See
http://en.wikipedia.org/wiki/Apostrophe
Re: Please add Unicode support to the Text to Word List filt
Posted: Mon May 28, 2012 10:42 pm
by dfhtextpipe
My ad hoc workaround for apostrophe s is to temporarily replace it by AAAs, then revert after Count Duplicate Lines.
David
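The workaround can be sketched in Python as follows. This is only an illustration of the replace, split, count, revert sequence. The function name is hypothetical, the word splitter is a simple letters-only stand-in for the real filter, and the sketch assumes the source text never contains the literal string AAAs.

```python
from collections import Counter

PLACEHOLDER = "AAA"  # assumed not to occur in the source text


def count_words_keeping_apostrophe_s(text):
    """Count words while keeping a trailing 's attached to its word."""
    # Step 1: hide apostrophe-s behind a letters-only placeholder.
    protected = text.replace("'s", PLACEHOLDER + "s")

    # Step 2: stand-in for Text to Word List (runs of letters only).
    words, current = [], []
    for ch in protected:
        if ch.isalpha():
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))

    # Step 3: stand-in for Count Duplicate Lines, then revert the placeholder.
    restored = Counter()
    for word, n in Counter(words).items():
        restored[word.replace(PLACEHOLDER + "s", "'s")] += n
    return restored
```

One limitation of this sketch is that an opening single quote directly before a word starting with s (as in 'so') is also replaced, which glues the quote onto that word.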
Re: Please add Unicode support to the Text to Word List filt
Posted: Wed May 30, 2012 6:39 am
by dfhtextpipe
Likewise, there's a need to cope with the apostrophe (or the right single quotation mark, U+2019) in words such as these:
- aren’t can’t couldn’t didn’t doesn’t don’t hadn’t hasn’t haven’t isn’t shouldn’t wasn’t weren’t won’t wouldn’t
and
.
David
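One possible rule for such contractions, sketched in Python: treat an apostrophe or right single quotation mark as part of a word only when it sits between two letters. The function name is hypothetical and the rule is an assumption. It keeps don’t and dog's whole, but it would not handle languages where the apostrophe can begin or end a word.

```python
APOSTROPHES = {"'", "\u2019"}  # apostrophe and right single quotation mark


def words_with_inner_apostrophes(text):
    """Split on non-letters, keeping apostrophes that are flanked by letters."""
    words, current = [], []
    for i, ch in enumerate(text):
        # An apostrophe counts only if a word is in progress and a letter follows.
        inner = (
            ch in APOSTROPHES
            and bool(current)
            and i + 1 < len(text)
            and text[i + 1].isalpha()
        )
        if ch.isalpha() or inner:
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words
```

Quotation marks used as quotes, as in 'hello', are dropped because they are not flanked by letters on both sides.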
Re: Please add Unicode support to the Text to Word List filt
Posted: Wed May 30, 2012 12:54 pm
by DataMystic Support
Whew! These won't make it into the next release, because it really needs a file format overhaul that we have not had time to do for 12 months or more.
It is certainly looming closer than ever now.