Please add Unicode support to the Text to Word List filter

Posted: Sat May 12, 2012 7:45 pm
by dfhtextpipe
Please consider adding Unicode support to the Text to Word List filter.

Ideally, this should be feasible for UTF-8 encoded input files. How difficult would this be?

Notes:
Special consideration would be needed for the Narrow No-Break Space (U+202F).
This was introduced in Unicode 3.0 for Mongolian, to separate a suffix from the word stem without indicating a word boundary.

Special provision would be required for languages that do not use spaces (etc.) to separate words.
See http://en.wikipedia.org/wiki/Category:W ... boundaries
I'd suggest adding a maximum word length limit to the options dialogue. Exceeding it could trigger an error message or some other action.

The advice "Normally you would follow this filter with a Sort and Remove Duplicates filter." would not be appropriate for non-ANSI word lists.


David

Re: Please add Unicode support to the Text to Word List filt

Posted: Mon May 14, 2012 1:04 pm
by DataMystic Support
Hi David,

Yes, this could be done, but it would be designed to work with UTF-16. A UTF-8 conversion filter at the start and end would solve the UTF-8 issue.
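
A minimal Python sketch of that wrapping, assuming nothing about TextPipe's own conversion filters. The function names and the `process_utf16` callback are hypothetical, and UTF-16LE without a byte order mark is an arbitrary choice for illustration.

```python
from typing import Callable


def run_utf16_filter_on_utf8(data: bytes,
                             process_utf16: Callable[[bytes], bytes]) -> bytes:
    """Wrap a UTF-16-only step with UTF-8 conversions at the start and end."""
    utf16_in = data.decode("utf-8").encode("utf-16-le")    # conversion at the start
    utf16_out = process_utf16(utf16_in)                    # the UTF-16 step itself
    return utf16_out.decode("utf-16-le").encode("utf-8")   # conversion at the end


def upper_utf16(buf: bytes) -> bytes:
    # Hypothetical stand-in for the real filter: upper-case the text.
    return buf.decode("utf-16-le").upper().encode("utf-16-le")
```

Any step that takes and returns UTF-16LE bytes could be passed in place of `upper_utf16`.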

Are we just talking about the Unicode isLetter property, with an exception for Narrow No-Break Space?
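
One way to read that proposal, as a rough Python sketch rather than anything TextPipe does: a character belongs to a word if it is a letter or if it is on a small exception list. `split_words` and `EXTRA_WORD_CHARS` are hypothetical names, and Python's `str.isalpha()` stands in for the isLetter test. It is true only for the Unicode letter categories, so digits and combining marks still end a word here.

```python
NARROW_NO_BREAK_SPACE = "\u202f"

# Characters that are not letters but should not end a word (assumed list).
EXTRA_WORD_CHARS = frozenset({NARROW_NO_BREAK_SPACE})


def split_words(text, extra_word_chars=EXTRA_WORD_CHARS):
    """Split text into words: runs of letters plus the listed exceptions."""
    words, current = [], []
    for ch in text:
        if ch.isalpha() or ch in extra_word_chars:
            current.append(ch)
        elif current:
            # Any other character is treated as a word boundary.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words
```

With the default exception list, a stem and suffix joined by U+202F come out as one entry, while an ordinary space or comma still separates words.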

Re: Please add Unicode support to the Text to Word List filt

Posted: Tue May 15, 2012 12:38 am
by dfhtextpipe
There ought to be some other notable exceptions:

For example, the Word Joiner (U+2060). This too should count as a word character rather than a word boundary.

See http://en.wikipedia.org/wiki/Space_%28punctuation%29 for further background.

For context, I'm currently using the WJ within a transliteration filter, to ensure that the converse filter gives 100% accuracy on a round trip.

David

Re: Please add Unicode support to the Text to Word List filt

Posted: Tue May 15, 2012 12:53 am
by dfhtextpipe
You could even have tick box options for such additional filter inclusions, just like you have elsewhere for Perl pattern matching options.

☐ Include No-Break Space (U+00A0)
☐ Include Figure Space (U+2007)
☒ Include Narrow No-Break Space (U+202F)
☒ Include Word Joiner (U+2060)
☐ Include Zero Width No-Break Space (U+FEFF) deprecated

Such a UI feature would provide user choice - and thus provide added value to your customers.
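
As a rough illustration of how such tick boxes might map onto a set of extra word characters, here is a Python sketch. `WordCharOptions` and its field names are hypothetical, and the defaults simply mirror the ticked and unticked boxes listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordCharOptions:
    """One boolean per tick box; defaults follow the list in this post."""
    include_no_break_space: bool = False             # U+00A0
    include_figure_space: bool = False               # U+2007
    include_narrow_no_break_space: bool = True       # U+202F
    include_word_joiner: bool = True                 # U+2060
    include_zero_width_no_break_space: bool = False  # U+FEFF

    def extra_word_chars(self) -> frozenset:
        """Return the characters that should count as part of a word."""
        pairs = [
            (self.include_no_break_space, "\u00a0"),
            (self.include_figure_space, "\u2007"),
            (self.include_narrow_no_break_space, "\u202f"),
            (self.include_word_joiner, "\u2060"),
            (self.include_zero_width_no_break_space, "\ufeff"),
        ]
        return frozenset(ch for enabled, ch in pairs if enabled)
```

Adding a further tick box later would then mean one more field and one more entry in `pairs`.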

Building in these tick boxes when you first add Unicode support would also make the feature easier to extend if and when any similar addition is requested.

David

PS. I'm happy with it being implemented for UTF-16 and understand why this is simpler to code.

Re: Please add Unicode support to the Text to Word List filt

Posted: Tue May 15, 2012 1:05 am
by dfhtextpipe
Simon,

There are some languages in which the apostrophe character is classed as a letter of the alphabet, e.g. as a glottal stop.
See http://en.wikipedia.org/wiki/Apostrophe ... ottal_stop

Read on through the subsequent subsections. ...
In Turkish, proper nouns are capitalized and an apostrophe is inserted between the noun and any following suffix,
e.g. İstanbul'da ("in Istanbul"), contrasting with okulda ("in school").


Extending my suggestion for tick boxes, we could therefore borrow a UI technique from Excel's Text to Columns Wizard, by having an "Other" tick box at the bottom.

See attached image. How about that for lateral thinking?

David

Re: Please add Unicode support to the Text to Word List filt

Posted: Tue May 15, 2012 2:15 pm
by DataMystic Support
Whew David - I had no idea it would be so complicated!

A simple filter with one numeric parameter is certainly far less time-consuming to add than one with its own form (saving/loading/registry), COM API changes, documentation as well as the filter itself!

Anyway, what is the better approach? - defining the delimiters (and allowing for optional extras), or defining what is a letter - which would avoid numbers with numbers in them.

Re: Please add Unicode support to the Text to Word List filt

Posted: Tue May 15, 2012 6:47 pm
by dfhtextpipe
Hi Simon,

It is indeed rather complicated, and the more I think about it, the more options I can come up with.

e.g. How does the existing filter treat soft hyphens?
The help states, "Hyphenated words are recognised as single words, provided that they aren't broken across lines."
Yet I'd assume that the reference here is to plain hyphens only.
I can easily test what happens, but this is advance notice that we'd probably want to include a tick box for including soft hyphens.

Likewise, in text files transliterated from a non-Roman script for a language that uses a syllabary, we might wish to include the middle dot (U+00B7) as a syllable separator.
Including it would help ensure that a reverse converter from the Latin-script transliteration back to the non-Roman script is 100% accurate.

Even in some Latin-script languages (e.g. Catalan) this code point is used inside words.
See http://en.wikipedia.org/wiki/Interpunct
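
A small Python sketch of the idea, under the assumption that the middle dot should count as part of a word only when it sits between letters. The names are hypothetical, and `[^\W\d_]` is used as an approximation of "a letter": it matches word characters other than decimal digits and the underscore.

```python
import re

MIDDLE_DOT = "\u00b7"
LETTERS = r"[^\W\d_]+"   # approximately: one or more letters

# A word is a run of letters, optionally continued by middle-dot-plus-letters.
WORD = re.compile(LETTERS + "(?:" + MIDDLE_DOT + LETTERS + ")*")


def words_with_middle_dot(text):
    """Return words, keeping middle dots that sit between letters."""
    return WORD.findall(text)
```

A middle dot standing on its own, with spaces either side, is still treated as a separator by this pattern.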

David

Re: Please add Unicode support to the Text to Word List filt

Posted: Tue May 15, 2012 8:09 pm
by dfhtextpipe
PS. Just to make you smile....

Here's a nice single word containing middle dots:

Llan·fair·pwll·gwyn·gyll·go·ger·y·chwyrn·drob·wll·llantys·ilio·gogo·goch
For details see http://en.wikipedia.org/wiki/Llanfairpwllgwyngyll

Without using the IPA (or resorting to Anglicizing), one can thus separate the syllables to make this Welsh place name easier to pronounce.
Hyphens would be too strong here, and the actual place name is not hyphenated.

David

Re: Please add Unicode support to the Text to Word List filt

Posted: Tue May 15, 2012 8:12 pm
by dfhtextpipe
Simon,

"which would avoid numbers with numbers in them."

I think you'd meant to write:

"which would avoid words with numbers in them."

David

Re: Please add Unicode support to the Text to Word List filt

Posted: Tue May 15, 2012 9:05 pm
by dfhtextpipe
Hi Simon.

My existing Text to Word List filter didn't cope properly with soft hyphens,
presumably because U+00AD is beyond ASCII, being part of Windows-1252 (aka ANSI).
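
To illustrate the presumed cause (this is not TextPipe's actual code, only an assumed ASCII-only rule written in Python), compare a word pattern limited to ASCII letters and the hyphen-minus with one that also admits U+00AD:

```python
import re

SOFT_HYPHEN = "\u00ad"

# Assumed ASCII-only rule: letters A-Z, a-z and the hyphen-minus.
ASCII_ONLY = re.compile(r"[A-Za-z-]+")

# Same rule with the soft hyphen added as a word character.
WITH_SOFT_HYPHEN = re.compile("[A-Za-z" + SOFT_HYPHEN + "-]+")

sample = "co" + SOFT_HYPHEN + "operate well-known"

ascii_words = ASCII_ONLY.findall(sample)           # soft hyphen splits the word
extended_words = WITH_SOFT_HYPHEN.findall(sample)  # soft hyphen stays inside it
```

Under the ASCII-only rule the soft hyphen acts as a delimiter, so the first word is split in two, whereas the plain hyphen in the second word is kept.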

There's no clue that characters U+00A0 to U+00FF are unsupported by the Count Duplicate Lines filter,
which follows the Text to Word List subfilter in my two stage filter.

Contrast this with the Sort filter, whose help spells out which character sets are supported:
Sort Type

The sort type controls the method by which items are sorted. The available options are:
· ANSI sort (case insensitive)
· ANSI sort (case sensitive) - faster than case insensitive as no case-mapping is performed
· ASCII sort (case insensitive)
· ASCII sort (case sensitive) - faster than case insensitive as no case-mapping is performed
...
So before extending the Text to Word List filter to cope with Unicode in general,
please could you first extend the Count Duplicate Lines filter to support ANSI?

Meanwhile, I'll tweak my two stage filter to investigate further.

David

Re: Please add Unicode support to the Text to Word List filt

Posted: Thu May 17, 2012 5:53 pm
by dfhtextpipe
If you do decide to implement the Include Other tickbox, please ensure that the form field in which the user enters the "other" characters is wide enough to hold a fairly long character class.

Also, watch out for a user defined "other" that includes a space, and issue a suitable warning.
It makes no sense for the words in a word list to include a space, as that would defeat the purpose of this filter in most cases.

Remember, we are not offering a way to specify delimiters, but rather a way to specify non-delimiters.
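
A sketch of such a check in Python, with hypothetical names throughout. It warns only about ordinary whitespace (space, tab, line breaks), on the assumption that the special spaces such as U+202F remain legitimate choices for "other".

```python
# Ordinary whitespace that should trigger a warning (assumed list).
PLAIN_WHITESPACE = {
    " ": "space",
    "\t": "tab",
    "\n": "line feed",
    "\r": "carriage return",
}


def check_other_chars(other):
    """Return a list of warning strings for a user-supplied 'other' field."""
    warnings = []
    for ch in other:
        if ch in PLAIN_WHITESPACE:
            warnings.append(
                "'Other' contains a %s (U+%04X); words in the list could "
                "then contain it." % (PLAIN_WHITESPACE[ch], ord(ch))
            )
    return warnings
```

An empty result means no warning is needed. The caller decides whether a warning blocks the setting or merely informs the user.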

David

Re: Please add Unicode support to the Text to Word List filt

Posted: Mon May 28, 2012 8:15 pm
by dfhtextpipe
Currently, the Text to Word List filter only makes special provision for the hyphen-minus character.

It would be sensible (even for English text sources) to make optional provision for the apostrophe.

As it stands, words ending with apostrophe s are split into the main word and s. Yet s is not a word in its own right.

A complete provision that would extend to non-English text sources would need to address the differing and various ways in which the apostrophe is used in other languages.

See http://en.wikipedia.org/wiki/Apostrophe

Re: Please add Unicode support to the Text to Word List filt

Posted: Mon May 28, 2012 10:42 pm
by dfhtextpipe
My ad hoc workaround for apostrophe s is to temporarily replace it by AAAs, then revert after Count Duplicate Lines.
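
The same workaround sketched in Python, with a simple ASCII pattern and `collections.Counter` standing in for the Text to Word List and Count Duplicate Lines filters. The function name is hypothetical, and the sketch assumes the placeholder AAAs never occurs in the source text.

```python
import re
from collections import Counter

PLACEHOLDER = "AAAs"  # assumed not to occur in the source text


def count_words_keeping_apostrophe_s(text):
    """Count words, treating a trailing 's as part of the word."""
    protected = text.replace("'s", PLACEHOLDER)     # hide 's before splitting
    words = re.findall(r"[A-Za-z-]+", protected)    # stand-in word list step
    counts = Counter(words)                         # stand-in duplicate count
    # Revert the placeholder once counting is done.
    return {w.replace(PLACEHOLDER, "'s"): n for w, n in counts.items()}
```

Note that a blanket replacement of 's also touches an opening single quote followed by s, so quoted text would need separate handling.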

David

Re: Please add Unicode support to the Text to Word List filt

Posted: Wed May 30, 2012 6:39 am
by dfhtextpipe
Likewise, there's a need to cope with the apostrophe (or the right single quotation mark, U+2019) in words such as these:
  • aren’t can’t couldn’t didn’t doesn’t don’t hadn’t hasn’t haven’t isn’t shouldn’t wasn’t weren’t won’t wouldn’t
  • I’m
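
One possible rule, sketched in Python with hypothetical names: accept U+0027 or U+2019 as part of a word only when it has letters on both sides. `[^\W\d_]` approximates "a letter" (word characters other than decimal digits and the underscore).

```python
import re

APOSTROPHES = "'\u2019"   # apostrophe and right single quotation mark
LETTERS = r"[^\W\d_]+"    # approximately: one or more letters

# Letters, optionally continued by apostrophe-plus-letters.
WORD = re.compile(LETTERS + "(?:[" + APOSTROPHES + "]" + LETTERS + ")*")


def words_with_apostrophes(text):
    """Return words, keeping apostrophes that sit between letters."""
    return WORD.findall(text)
```

Because the apostrophe must be followed by a letter, quotation marks around a word are dropped, but so is a trailing apostrophe, which a fuller solution would have to treat separately.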

David

Re: Please add Unicode support to the Text to Word List filt

Posted: Wed May 30, 2012 12:54 pm
by DataMystic Support
Whew! These won't make it into the next release, because they really need a file format overhaul that we have not had time to do for 12 months or more.
That overhaul is certainly looming closer than ever now.