Replace list filter inserts spurious character Â

Post by dfhtextpipe »

I'm using a replace list tab file as follows:

<<	«
>>	»
The replacement characters are U+00AB and U+00BB respectively.

but TextPipe Standard v8.9.2 inserts a spurious character Â (U+00C2) in front of the replacement characters.
btw. The same thing happens when the replace list is a CSV file.

Where is this spurious character coming from?
I think there is a very serious bug in the latest version!
i.e. As a consequence of "Updated internal pattern matching libraries."
cf. I just uninstalled v8.9.2 and re-installed v8.8.2 and the problem disappears!

The XHTML file to be processed is encoded UTF-8 (without BOM), as is the replacement list file.

David

Post by DataMystic Support »

The correct UTF-8 encoding of U+00AB is \xC2\xAB

The latest release of TP allows the search/replace lists (.tab and .csv) to be Unicode; however, most of TP's internals are not Unicode-aware. Each line from your search/replace list is converted from a UTF-16LE string to a UTF-8 string before being processed. The same now also applies to rows from the grid.

Previous versions of TP loaded the search/replace lists naively. If your search/replace list was saved as ANSI then you would not get any extra \xC2. If saved as UTF-8 (with or without BOM) the extra \xC2 will be in the file.
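
For anyone following along, here is a minimal Python sketch of how an extra Â can appear in front of «. It is not TextPipe's code; it only assumes that bytes which are already UTF-8 get read one byte per character (Latin-1 is used below as a stand-in for ANSI) and are then encoded to UTF-8 a second time.

```python
# Illustration only: this is not TextPipe's implementation.
guillemet = "\u00ab"                         # « LEFT-POINTING DOUBLE ANGLE QUOTATION MARK

# Correct UTF-8 encoding of U+00AB: two bytes, C2 AB.
utf8_bytes = guillemet.encode("utf-8")

# Assumption: the UTF-8 bytes are misread one byte per character.
misread = utf8_bytes.decode("latin-1")       # two characters: U+00C2 U+00AB

# Encoding that misread string to UTF-8 again doubles it up.
double_encoded = misread.encode("utf-8")     # four bytes: C3 82 C2 AB

print(utf8_bytes.hex(" "))
print(misread)
print(double_encoded.hex(" "))
```

If the replacement ends up as the four-byte sequence, a UTF-8 viewer shows Â« rather than «, which would match the symptom reported above.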

Let me know which way you would like to go on this.

Post by dfhtextpipe »

Hi Simon,

I have accumulated a considerable number of replace lists since I began using TextPipe.
Almost all of them are encoded as UTF-8 without BOM.

You're not telling me that I should have to change most of them and check/debug previously working filters, are you?
That would be a huge chore that I have not planned as part of my workload.

Many of them are used as Perl pattern replacements, with UTF-8 support ticked.
The latter is because the Files to Process are mostly UTF-8 themselves.
So if I don't use something with UTF-8 support, the files being processed get corrupted.

The phrase "backwards compatibility" springs to mind.
Surely, you owe it to your customers at least to provide an option that will not cause well-established filters to break?

Moreover, what about the longstanding descriptions in the pattern matching reference?
2. In a pattern, the escape sequence \x{...}, where the contents of the braces is a string of hexadecimal digits, is interpreted as a UTF-8 character whose code number is the given hexadecimal number, for example: \x{1234}. If a non-hexadecimal digit appears between the braces, the item is not recognized. This escape sequence can be used either as a literal, or within a character class.

3. The original hexadecimal escape sequence, \xhh, matches a two-byte UTF-8 character if the value is greater than 127.
According to this, as an alternative, I should have been able to use \x{00AB} and \x{00BB} in the replace list file. Yet in version 8.9.2 this doesn't work either.
Neither did \x{ab} and \x{bb}, so the help reference I've been using to solve issues with naive replacements is also "broken".
That was one of the first things that I tried before reverting to v8.8.2 to find out what on earth was happening.
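
To make the distinction concrete, here is a rough sketch in Python rather than TextPipe. Python's `re` module is not the engine TextPipe uses and does not accept the `\x{...}` form, so this only illustrates the general difference between matching `\xhh` as a code point and matching it as a single byte.

```python
import re

text = "\u00abquote\u00bb"      # the string «quote»
data = text.encode("utf-8")     # the same text as UTF-8 bytes

# Text (Unicode-aware) pattern: \xab and \xbb denote code points U+00AB and U+00BB.
codepoint_hits = re.findall(r"[\xab\xbb]", text)

# Bytes pattern: \xab is one byte, i.e. only the second half of the UTF-8 pair C2 AB.
byte_hits = re.findall(rb"\xab", data)

# To match the whole character byte-wise, the lead byte C2 has to be spelled out.
pair_hits = re.findall(rb"\xc2[\xab\xbb]", data)

print(codepoint_hits, byte_hits, pair_hits)
```

A byte-wise replacement that only touches the AB or BB byte leaves a stray C2 behind, which is why UTF-8 support matters when the files being processed are UTF-8.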

Best regards,
David

Post by DataMystic Support »

Ok! We have changed this back for 8.9.3.

Post by dfhtextpipe »

When will 8.9.3 become available?

David

Post by DataMystic Support »

In about one hour...

Post by dfhtextpipe »

8.9.3 just installed.

No more spurious character Â after running my example filter. Issue solved.

Thanks.

David