Bug in Find whole words only option for replace list filter

Post by **dfhtextpipe** » Fri May 04, 2018 6:35 am

This is in the context of using the Replace list filter with Pattern (perl) as find type.

My tab-delimited external Replace list contains this as one of the many lines:

Code:

\xF3	o

It's designed to replace the accented letter "ó" by the unaccented letter "o" and is set to apply with
Match case, Find whole words only, UTF-8 support.

I just found that instead of replacing only the two single-letter words that were intended,
it also replaced the "ó" at the end of 55 words that ended with the letters "ñó".
viz.

Code:

enseñó soñó riñó engañó ciñó constriñó apañó dañó

The letter U+00F1 LATIN SMALL LETTER N WITH TILDE "ñ" seems to be seen in this context as if it were a non-word character!
How else can one interpret this very unexpected result?

This is surely a software bug!

Aside: The input file contains Spanish text. My Windows locale is English (UK).

Best regards,

David

Post by **DataMystic Support** » Wed May 09, 2018 8:18 am

Hi David,

The same issue occurs for perl pattern on its own, outside of the search/replace list.

Currently, TextPipe determines which characters are word characters on startup, and retains this throughout. It does this for (ANSI) characters 0..255, and hence does not appreciate a UTF-8 view of the world. Hence it fails for this character: Windows must be telling TextPipe that it is not a word character.
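To illustrate the effect described above, here is a minimal Python sketch. It is not TextPipe's code: `replace_whole_word`, `ascii_only_word_char` and `unicode_word_char` are hypothetical names, and the ASCII-only table is merely an assumed stand-in for a fixed word-character table that fails to classify "ñ" as a word character.

```python
def replace_whole_word(text, target, repl, is_word_char):
    """Replace `target` only where both neighbours are non-word characters."""
    out = []
    i = 0
    n = len(target)
    while i < len(text):
        if text.startswith(target, i):
            before_ok = i == 0 or not is_word_char(text[i - 1])
            after_ok = i + n == len(text) or not is_word_char(text[i + n])
            if before_ok and after_ok:
                out.append(repl)
                i += n
                continue
        out.append(text[i])
        i += 1
    return "".join(out)


def ascii_only_word_char(ch):
    # Assumption: a narrow table that only knows ASCII letters, digits and "_".
    return ch.isascii() and (ch.isalnum() or ch == "_")


def unicode_word_char(ch):
    # Unicode-aware classification: accented letters count as word characters.
    return ch.isalnum() or ch == "_"


sample = "\u00f3 ense\u00f1\u00f3 so\u00f1\u00f3"  # "ó enseñó soñó"
narrow = replace_whole_word(sample, "\u00f3", "o", ascii_only_word_char)
wide = replace_whole_word(sample, "\u00f3", "o", unicode_word_char)
print(narrow)
print(wide)
```

With the narrow table the trailing "ó" of each word is replaced as well, because "ñ" looks like a word separator; with the Unicode-aware check only the standalone "ó" changes.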

The best approach I can see right now is, instead of relying on this Word Characters array being checked at the start and end of each potential match, to prefix and append the \b regex assertion to each pattern, and allow the regex engine to use its internal Unicode tables for what is and is not a word character.

If you could disable 'Whole Word' and instead add \b around your pattern, it would be interesting to see if it gives correct output for other use cases. It works fine for the case you've given here.
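As a rough check of the \b idea, the following sketch uses Python's `re` module, which is Unicode-aware for `str` patterns. This is an assumption for illustration only: TextPipe's Perl-pattern engine may differ in detail, and the sample text is taken from the words quoted earlier in the thread.

```python
import re

# \b...\b lets the regex engine decide what counts as a word character.
pattern = re.compile(r"\b\xF3\b")

sample = "\u00f3 ense\u00f1\u00f3 so\u00f1\u00f3"  # "ó enseñó soñó"
result = pattern.sub("o", sample)
print(result)
```

Here only the standalone "ó" is replaced, since the engine treats "ñ" as a word character and so finds no boundary between "ñ" and "ó".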

Simon

Post by **dfhtextpipe** » Mon May 14, 2018 1:22 am

Hi Simon,

Then even the Windows view of which ANSI characters are word characters is faulty, seeing as ñ (U+00F1 aka \xF1) is within the decimal range 0-255 and it's not a punctuation mark!
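For reference, a small Python check of how Unicode classifies this character, and how it is stored in UTF-8. This only uses the standard `unicodedata` module and says nothing about what Windows itself reports.

```python
import unicodedata

ch = "\u00f1"  # ñ
code_point = ord(ch)
category = unicodedata.category(ch)   # general category of the character
name = unicodedata.name(ch)
utf8_bytes = ch.encode("utf-8")       # how the character is stored in a UTF-8 file

print(hex(code_point), category, name, utf8_bytes)
```

The character is a lowercase letter with code point 0xF1, but in a UTF-8 file it is stored as two bytes, neither of which is 0xF1. A per-byte lookup in a 0..255 table would therefore never see "ñ" as such.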

I'll try with the \b suggestion when I get time - having read what this does; I've not made use of this before.

Code:

  \b     matches at a word boundary
  \B     matches when not at a word boundary

Though this might work for the corner case reported, it seems a tedious fag to have to wrap each search word in the external replace list.
It would be simpler to remove the corner case from the list and deal with it using a separate replace list filter.
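If wrapping by hand is too tedious, the list could in principle be preprocessed by a small script. The sketch below is a hypothetical helper (`wrap_with_word_boundaries` is not a TextPipe feature), assuming each line is a tab-delimited find/replace pair and that every find pattern begins and ends with a word character. The second sample entry is made up for illustration.

```python
def wrap_with_word_boundaries(lines):
    """Wrap the find column of each tab-delimited entry in \\b ... \\b."""
    wrapped = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or "\t" not in line:
            # Leave blank or malformed lines untouched.
            wrapped.append(line)
            continue
        find, replace = line.split("\t", 1)
        wrapped.append("\\b" + find + "\\b\t" + replace)
    return wrapped


entries = ["\\xF3\to", "\\xE1\ta"]  # second entry is a hypothetical example
result = wrap_with_word_boundaries(entries)
for entry in result:
    print(entry)
```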

David

Post by **dfhtextpipe** » Tue Mar 03, 2020 12:37 am

This bug has been fixed in TextPipe 11.4 or earlier.

Tried it using the Trial Run area.

David
