Replace list filter inserts spurious character Â

Post by **dfhtextpipe** » Thu Jul 21, 2011 8:38 pm

I'm using a replace list tab file as follows:

<<	«
>>	»

The replacement characters are U+00AB and U+00BB respectively, but TextPipe Standard v8.9.2 inserts a spurious character Â (U+00C2) in front of each replacement character.
The same thing happens when the replace list is a CSV file.

Where is this spurious character coming from?
I think there is a very serious bug in the latest version, possibly as a consequence of "Updated internal pattern matching libraries."
I just uninstalled v8.9.2 and re-installed v8.8.2, and the problem disappears!

The XHTML file to be processed is encoded UTF-8 (without BOM), as is the replacement list file.

David

Post by **DataMystic Support** » Fri Jul 22, 2011 10:41 am

The correct UTF-8 encoding of U+00AB is \xC2\xAB
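As a quick check of the byte values involved, here is a minimal Python sketch (not part of TextPipe; the helper name `utf8_hex` is made up for illustration). It shows that both U+00AB and U+00BB encode to two bytes in UTF-8, each starting with 0xC2, and that the byte 0xC2 read on its own as Latin-1 is the character Â.

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of a character as space-separated hex pairs."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))

# U+00AB and U+00BB are the two guillemets used in the replace list.
for ch in ("\u00ab", "\u00bb"):
    print(f"U+{ord(ch):04X} -> {utf8_hex(ch)}")

# The UTF-8 lead byte 0xC2, read as a single Latin-1 byte, is U+00C2.
lead_byte_as_latin1 = bytes([0xC2]).decode("latin-1")
print(f"0xC2 as Latin-1 -> U+{ord(lead_byte_as_latin1):04X}")
```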

The latest release of TP allows the search/replace lists (.tab and .csv) to be Unicode; however, most of TP's internals are not Unicode-aware. Each line from your search/replace list is converted from a UTF-16LE string to a UTF-8 string before being processed. The same now also applies to rows from the grid.

Previous versions of TP loaded the search/replace lists naively. If your search/replace list was saved as ANSI then you would not get any extra \xC2. If saved as UTF-8 (with or without BOM) the extra \xC2 will be in the file.
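One plausible way a stray Â can end up in the output is double encoding: bytes that are already UTF-8 get read as a single-byte ANSI code page and are then converted to UTF-8 a second time. The Python sketch below only illustrates that general effect. It assumes Windows-1252 as the ANSI code page and uses made-up helper names; it is not a description of TP's actual code path.

```python
def misread_as_ansi(text: str, ansi: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode those bytes as a single-byte code page.

    This mimics a loader that treats a UTF-8 file as ANSI (assumed: cp1252).
    """
    return text.encode("utf-8").decode(ansi)


def repair(text: str, ansi: str = "cp1252") -> str:
    """Reverse the misreading: recover the original bytes and decode them as UTF-8."""
    return text.encode(ansi).decode("utf-8")


original = "\u00ab"                      # the intended replacement character
garbled = misread_as_ansi(original)      # two characters: U+00C2 followed by U+00AB
double_encoded = garbled.encode("utf-8") # what a second UTF-8 conversion would write

print([f"U+{ord(c):04X}" for c in garbled])
print(double_encoded.hex())
print(repair(garbled) == original)
```

The same round trip works for U+00BB. The repair step only succeeds when every garbled character exists in the assumed code page, so it is a diagnostic aid rather than a general fix.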

Let me know which way you would like to go on this.

Post by **dfhtextpipe** » Fri Jul 22, 2011 6:04 pm

Hi Simon,

I have accumulated a considerable number of replace lists since I began using TextPipe.
Almost all of them are encoded as UTF-8 without BOM.

You're not telling me that I should have to change most of them and check/debug previously working filters, are you?
That would be a huge chore that I have not planned as part of my workload.

Many of them are used as Perl pattern replacements, with UTF-8 support ticked.
The latter is because the Files to Process are mostly UTF-8 themselves.
So if I don't use something with UTF-8 support, the files being processed get corrupted.

The phrase "backwards compatibility" springs to mind.
Surely, you owe it to your customers at least to provide an option that will not cause well-established filters to break?

Moreover, what about the longstanding descriptions in the pattern matching reference?

2. In a pattern, the escape sequence \x{...}, where the contents of the braces is a string of hexadecimal digits, is interpreted as a UTF-8 character whose code number is the given hexadecimal number, for example: \x{1234}. If a non-hexadecimal digit appears between the braces, the item is not recognized. This escape sequence can be used either as a literal, or within a character class.

3. The original hexadecimal escape sequence, \xhh, matches a two-byte UTF-8 character if the value is greater than 127.

According to this, as an alternative, I should have been able to use \x{00AB} and \x{00BB} in the replace list file. Yet in version 8.9.2 this doesn't work either.
Neither did \x{ab} and \x{bb}, so the help reference I've been using to solve issues with naive replacements is also "broken".
That was one of the first things that I tried before reverting to v8.8.2 to find out what on earth was happening.

Best regards,
David


Post by **DataMystic Support** » Mon Jul 25, 2011 10:54 pm

Ok! We have changed this back for 8.9.3.

Post by **dfhtextpipe** » Wed Jul 27, 2011 1:08 am

When will 8.9.3 become available?

David

Post by **DataMystic Support** » Wed Jul 27, 2011 1:11 am

In about one hour...

Post by **dfhtextpipe** » Wed Jul 27, 2011 4:26 am

8.9.3 just installed.

No more spurious character Â after running my example filter. Issue solved.

Thanks.

David
