Please add Unicode support to the Text to Word List filter

dfhtextpipe · Post by **dfhtextpipe** » Sat May 12, 2012 7:45 pm

Please consider adding Unicode support to the Text to Word List filter.

Ideally, this should be feasible for UTF-8 encoded input files. How difficult would this be?

Notes:
Special consideration would be needed for the Narrow No-Break Space (U+202F).
This was introduced in Unicode 3.0 for Mongolian, to separate a suffix from the word stem without indicating a word boundary.

Special provision would be required for languages that do not use spaces (etc.) to separate words.
See http://en.wikipedia.org/wiki/Category:W ... boundaries
I'd suggest imposing a maximum word length limit in the options dialogue. This could be used to trigger an error message, or whatever.
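
A minimal sketch of what such a limit might look like, in Python purely for illustration (TextPipe itself is not scripted this way). The names `check_word_lengths`, `WordTooLongError` and the default of 40 characters are hypothetical, not part of any existing filter:

```python
class WordTooLongError(ValueError):
    """Raised when a candidate word exceeds the configured limit."""


def check_word_lengths(words, max_len=40):
    """Return the words as a list, raising if any is longer than max_len.

    max_len stands in for the value the user would set in the options
    dialogue; 40 is an arbitrary illustrative default.
    """
    checked = []
    for word in words:
        if len(word) > max_len:
            raise WordTooLongError(
                f"word of {len(word)} characters exceeds the limit of {max_len}"
            )
        checked.append(word)
    return checked
```

For text in a script without spaces, an unsegmented run of characters would quickly exceed the limit, so the error acts as a signal that ordinary delimiter-based splitting is not working for that input.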

The advice "Normally you would follow this filter with a Sort and Remove Duplicates filter." would not be appropriate for non-ANSI word lists.

David

Post by **DataMystic Support** » Mon May 14, 2012 1:04 pm

Hi David,

Yes, this could be done, but it would be designed to work with UTF-16. A UTF-8 conversion filter at the start and end would solve the UTF-8 issue.

Are we just talking about the Unicode isLetter property, with an exception for Narrow No-Break Space?
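
To make the question concrete, here is a rough Python sketch of the "letter property plus exceptions" approach. It is not TextPipe code; `text_to_words` and `extra_word_chars` are hypothetical names, and Python's `str.isalpha()` is used as a stand-in for an isLetter test (it is true for the Unicode general categories Lu, Ll, Lt, Lm and Lo):

```python
NNBSP = "\u202f"  # Narrow No-Break Space


def text_to_words(text, extra_word_chars=frozenset({NNBSP})):
    """Split text into words.

    A character belongs to a word if it is a Unicode letter or if it is
    listed in extra_word_chars; every other character is a delimiter.
    """
    words = []
    current = []
    for ch in text:
        if ch.isalpha() or ch in extra_word_chars:
            current.append(ch)
        elif current:
            # A delimiter ends the word collected so far.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


# NNBSP keeps stem and suffix together; an ordinary space still splits.
print(text_to_words("ab\u202fcd ef"))
```

Two caveats with this sketch: combining marks are not letters under this test, so decomposed text would need normalising first, and the existing filter's special handling of hyphens is not reproduced here.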

dfhtextpipe · Post by **dfhtextpipe** » Tue May 15, 2012 12:38 am

There ought to be some other notable exceptions:

For example, the Word Joiner (U+2060). This too should count as a word character rather than a word boundary.

See http://en.wikipedia.org/wiki/Space_%28punctuation%29 for further background.

For instance, I'm currently making use of the WJ within a transliteration filter, to ensure that the converse filter gives 100% accuracy for a round trip.

David

dfhtextpipe · Post by **dfhtextpipe** » Tue May 15, 2012 12:53 am

You could even have tick box options for such additional filter inclusions, just like you have elsewhere for Perl pattern matching options.

☐ Include No-Break Space (U+00A0)
☐ Include Figure Space (U+2007)
☒ Include Narrow No-Break Space (U+202F)
☒ Include Word Joiner (U+2060)
☐ Include Zero Width No-Break Space (U+FEFF) deprecated

Such a UI feature would provide user choice - and thus provide added value to your customers.

Programming these tick boxes in from when you first add Unicode support would also mean that it would be easier to extend the feature if and when any similar addition is requested.

David

PS. I'm happy with it being implemented for UTF-16 and understand why this is simpler to code.

dfhtextpipe · Post by **dfhtextpipe** » Tue May 15, 2012 1:05 am

Simon,

There are some languages in which the apostrophe character is classed as a letter of the alphabet, e.g. as a glottal stop.
See http://en.wikipedia.org/wiki/Apostrophe ... ottal_stop

Read on through the subsequent subsections, e.g.:
In Turkish, proper nouns are capitalized and an apostrophe is inserted between the noun and any following suffix,
e.g. İstanbul'da ("in Istanbul"), contrasting with okulda ("in school").

Extending my suggestion for tick boxes, we could therefore borrow a UI technique from Excel's Text to Columns Wizard, by having an "Other" tick box at the bottom.

See attached image. How about that for lateral thinking?

David

Post by **DataMystic Support** » Tue May 15, 2012 2:15 pm

Whew David - I had no idea it would be so complicated!

A simple filter with one numeric parameter is certainly far less time-consuming to add than one with its own form (saving/loading/registry), COM API changes, documentation as well as the filter itself!

Anyway, what is the better approach? - defining the delimiters (and allowing for optional extras), or defining what is a letter - which would avoid numbers with numbers in them.

dfhtextpipe · Post by **dfhtextpipe** » Tue May 15, 2012 6:47 pm

Hi Simon,

It is indeed rather complicated, and the more I think about it, the more options I can come up with.

e.g. How does the existing filter treat soft hyphens?
The help states, "Hyphenated words are recognised as single words, provided that they aren't broken across lines."
Yet I'd assume that the reference here is to plain hyphens only.
I can easily test what happens, but this is advance notice that we'd probably want to include a tick box for including soft hyphens.

Likewise, in text files transliterated from a non-Roman script for a language that uses a syllabary, we might wish to include the middle dot as a syllable separator.
These could be useful to include such that a reverse converter from the Latin script transliteration back to the non-Roman script is 100% accurate.

Even for some Latin script languages (e.g. Catalan) this code point is used to separate syllables.
See http://en.wikipedia.org/wiki/Interpunct

David

dfhtextpipe · Post by **dfhtextpipe** » Tue May 15, 2012 8:09 pm

PS. Just to make you smile....

Here's a nice single word containing middle dots:

Llan·fair·pwll·gwyn·gyll·go·ger·y·chwyrn·drob·wll·llantys·ilio·gogo·goch

For details see http://en.wikipedia.org/wiki/Llanfairpwllgwyngyll

Without using the IPA (or resorting to Anglicizing), one can thus separate the syllables to make this Welsh place name easier to pronounce.
Hyphens would be too strong here, and the actual place name is not hyphenated.

David

dfhtextpipe · Post by **dfhtextpipe** » Tue May 15, 2012 8:12 pm

Simon,

"which would avoid numbers with numbers in them."

I think you meant to write:

"which would avoid words with numbers in them."

David

dfhtextpipe · Post by **dfhtextpipe** » Tue May 15, 2012 9:05 pm

Hi Simon.

My existing Text to Word List filter didn't cope properly with soft hyphens,
presumably because U+00AD is beyond ASCII, being part of Windows-1252 (aka ANSI).

There's no indication in the help that characters U+00A0 to U+00FF are unsupported by the Count Duplicate Lines filter,
which follows the Text to Word List subfilter in my two stage filter.

Contrast this with the help for the Sort filter:

Sort Type

The sort type controls the method by which items are sorted. The available options are:
· ANSI sort (case insensitive)
· ANSI sort (case sensitive) - faster than case insensitive as no case-mapping is performed
· ASCII sort (case insensitive)
· ASCII sort (case sensitive) - faster than case insensitive as no case-mapping is performed
...

So before extending the Text to Word List filter to cope with Unicode in general,
please could you first extend the Count Duplicate Lines filter to support ANSI.

Meanwhile, I'll tweak my two stage filter to investigate further.

David

dfhtextpipe · Post by **dfhtextpipe** » Thu May 17, 2012 5:53 pm

If you do decide to implement the Include Other tick box, please ensure that the form field in which the user enters the "other" characters is wide enough to permit a fairly long character class.

Also, watch out for a user defined "other" that includes a space, and issue a suitable warning.
It makes no sense for the words in a word list to include a space, as that would defeat the purpose of this filter in most cases.

Remember, we are not providing a means to provide user-specified delimiters, but rather a means to provide non-delimiters.
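
A small sketch of that check, again in Python for illustration only. The function name `validate_other_chars` and the choice of which characters count as "a space" (ordinary space, tab, line feed, carriage return) are assumptions; no-break spaces such as U+00A0 and U+202F are deliberately not flagged, since the tick boxes above allow them on purpose:

```python
# Ordinary breaking whitespace that should never be a non-delimiter.
# This particular set is an assumption for the sketch.
BREAKING_WHITESPACE = {" ", "\t", "\n", "\r"}


def validate_other_chars(other):
    """Return (chars, warnings) for a user-supplied 'Other' string.

    chars is the set of extra non-delimiter characters; warnings is a
    list of messages to show the user (empty if nothing looks wrong).
    """
    chars = set(other)
    warnings = []
    bad = sorted(chars & BREAKING_WHITESPACE)
    if bad:
        codes = ", ".join(f"U+{ord(c):04X}" for c in bad)
        warnings.append(
            f"'Other' contains whitespace ({codes}); "
            "words in the list would then include spaces"
        )
    return chars, warnings
```

Whether to reject the input outright or merely warn is a design choice; the sketch only warns, in line with "issue a suitable warning".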

David

dfhtextpipe · Post by **dfhtextpipe** » Mon May 28, 2012 8:15 pm

Currently, the Text to Word List filter only makes special provision for the hyphen-minus character.

It would be sensible (even for English text sources) to make optional provision for the apostrophe.

As it stands, words ending with apostrophe s are split into the main word and s. Yet s is not a word in its own right.

A complete provision that would extend to non-English text sources would need to address the differing and various ways in which the apostrophe is used in other languages.

See http://en.wikipedia.org/wiki/Apostrophe

dfhtextpipe · Post by **dfhtextpipe** » Mon May 28, 2012 10:42 pm

My ad hoc workaround for apostrophe s is to temporarily replace it by AAAs, then revert after Count Duplicate Lines.
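
Expressed as a Python sketch rather than as TextPipe filters (so the function name and the regular expressions are illustrative assumptions, with an ASCII-only splitter standing in for the current Text to Word List behaviour), the round trip looks roughly like this:

```python
import re
from collections import Counter

PLACEHOLDER = "AAA"  # assumed never to occur in the real text


def count_words_keeping_possessives(text):
    """Count words, keeping a trailing apostrophe-s attached to its word."""
    # Step 1: protect 's that directly follows a letter at the end of a word.
    protected = re.sub(r"(?<=[A-Za-z])'s\b", PLACEHOLDER + "s", text)
    # Step 2: split on anything that is not an ASCII letter, then count.
    counts = Counter(re.findall(r"[A-Za-z]+", protected))
    # Step 3: revert the placeholder once counting is done.
    return {
        re.sub(PLACEHOLDER + "s$", "'s", word): n for word, n in counts.items()
    }


print(count_words_keeping_possessives("David's book and David's pen"))
```

The obvious weakness is that any word genuinely ending in "AAAs" would be altered on the way back, which is why this is only an ad hoc measure.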

David

dfhtextpipe · Post by **dfhtextpipe** » Wed May 30, 2012 6:39 am

Likewise, there's a need to cope with the apostrophe (or the right single quotation mark, U+2019) in words such as these:

aren’t can’t couldn’t didn’t doesn’t don’t hadn’t hasn’t haven’t isn’t shouldn’t wasn’t weren’t won’t wouldn’t

and I’m.
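
One way to handle these, sketched in Python as an illustration rather than a proposal for the actual implementation: treat an apostrophe (U+0027 or U+2019) as part of a word only when it has letters on both sides. The names `LETTER`, `WORD_RE` and `words_with_apostrophes` are made up for this example, and `[^\W\d_]` is only an approximation of "any letter":

```python
import re

# Roughly "any letter": a word character that is not a digit or underscore.
LETTER = r"[^\W\d_]"
# ASCII apostrophe and right single quotation mark.
APOSTROPHES = "'\u2019"
# Letters, optionally followed by (apostrophe + letters) groups.
WORD_RE = re.compile(rf"{LETTER}+(?:[{APOSTROPHES}]{LETTER}+)*")


def words_with_apostrophes(text):
    """Return words, keeping apostrophes that sit between two letters."""
    return WORD_RE.findall(text)


print(words_with_apostrophes("I\u2019m sure they don\u2019t, 'quoted' words"))
```

Because the apostrophe must be followed by a letter, quotation marks around a word are still dropped, while don’t, I’m and İstanbul'da each stay whole. A word-final apostrophe, as in a plural possessive, would not be kept by this rule.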

David

Post by **DataMystic Support** » Wed May 30, 2012 12:54 pm

Whew! These won't make it into the next release, because it really needs a file format overhaul that we have not had time to do for 12 months or more.
It is certainly looming closer than ever now.

DataMystic