addNumberFilter is mishandling Unicode files

rconn · Post by **rconn** » Wed Apr 25, 2012 3:51 am

Using the TextPipe Engine 9.1:

If I call addNumberFilter(2,80) (telling it to wrap at 80 columns) and pass the engine a Unicode file (with a BOM), the output file is still Unicode and has the BOM, but everything is wrapped at 40 columns -- and the inserted CR/LF's are ASCII, not Unicode.
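
A minimal Python sketch of how this symptom can arise, assuming (not confirmed anywhere in this thread) that the wrap counts bytes rather than characters and inserts single-byte CR/LF. `wrap_bytes` is a hypothetical stand-in, not TextPipe code.

```python
# Hypothetical stand-in for a byte-oriented wrap filter (not TextPipe code).
def wrap_bytes(data: bytes, width: int) -> bytes:
    """Insert ASCII CR/LF after every `width` bytes."""
    chunks = [data[i:i + width] for i in range(0, len(data), width)]
    return b"\r\n".join(chunks)

text = "x" * 100
utf16 = text.encode("utf-16-le")      # 2 bytes per character, BOM left out
wrapped = wrap_bytes(utf16, 80)

# 80 bytes of UTF-16 hold only 40 characters.
first_line = wrapped.split(b"\r\n")[0]
chars_on_first_line = len(first_line) // 2

# The bytes 0D 0A read as one UTF-16LE code unit give U+0A0D, not CR + LF.
decoded = wrapped.decode("utf-16-le")
print(chars_on_first_line, "\r\n" in decoded, "\u0a0d" in decoded)
```

Under that assumption, a width of 80 gives 40 characters per line, and the inserted break is no longer a line break once the file is read back as UTF-16.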

Is there something else that needs to be set to handle Unicode?

rconn · Post by **rconn** » Wed Apr 25, 2012 4:06 am

I just tried it with the TextPipe GUI (Wrap filter), and it does the same thing. (The GUI also refuses to process the Unicode text file unless I tell it not to skip binary files -- should it be treating a Unicode text file as binary?)

Post by **DataMystic Support** » Fri Apr 27, 2012 10:05 am

Hi Rex,

TextPipe doesn't do anything you don't tell it to, so it won't convert the file to and from Unicode either.

So at the top of your filter list, add a Convert UTF16LE to UTF-8 filter, and at the bottom, the reverse.
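
The "convert at the top, convert back at the bottom" arrangement can be sketched in Python as follows. This only illustrates the data flow; `wrap_ascii_bytes` is a hypothetical stand-in for a byte-oriented filter, and none of this is TextPipe code.

```python
import codecs

def wrap_ascii_bytes(data: bytes, width: int) -> bytes:
    """Hypothetical byte-oriented wrap: CR/LF after every `width` bytes."""
    return b"\r\n".join(data[i:i + width] for i in range(0, len(data), width))

def wrap_utf16_file_bytes(raw: bytes, width: int) -> bytes:
    """UTF-16 (with BOM) -> UTF-8 -> byte filter -> UTF-16LE (with BOM)."""
    text = raw.decode("utf-16")                  # honours and strips the BOM
    utf8 = text.encode("utf-8")                  # top of list: to UTF-8
    filtered = wrap_ascii_bytes(utf8, width)     # the byte-oriented stage
    restored = filtered.decode("utf-8")          # bottom of list: back again
    return codecs.BOM_UTF16_LE + restored.encode("utf-16-le")

raw = codecs.BOM_UTF16_LE + ("x" * 100).encode("utf-16-le")
out = wrap_utf16_file_bytes(raw, 80)
line_lengths = [len(line) for line in out.decode("utf-16").split("\r\n")]
print(line_lengths)
```

The sample text is ASCII-range on purpose, so one character is one byte while the middle stage runs.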

And yes, TextPipe will see this as a binary file, although that should probably be changed.

rconn · Post by **rconn** » Sun Apr 29, 2012 2:17 pm

Hi Simon:

I didn't want TextPipe to convert the file from / to Unicode; I wanted it to remain in Unicode but insert the (Unicode!) CR/LF's at the new column width. Instead, the file remained in Unicode but had ASCII CR/LF strings inserted, converting the file from Unicode to gibberish.

I can check the input file for Unicode and convert it to UTF-8 for addNumberFilter (and convert it back for the output file). Do I need to do the same thing for all the other filters? Is TextPipe treating all input as ASCII and/or UTF-8, regardless of the BOM?
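
For the "check the input file for Unicode" step, a minimal sketch of BOM sniffing in Python. `sniff_bom` is a hypothetical helper name; it only looks for a BOM and does not guess the encoding of BOM-less files.

```python
import codecs

# Longer BOMs first: the UTF-32LE BOM begins with the UTF-16LE BOM bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(head: bytes):
    """Return (encoding, bom_length) for a leading BOM, or (None, 0)."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name, len(bom)
    return None, 0

print(sniff_bom(b"\xff\xfeh\x00i\x00"))
```

The caller would read the first four bytes of the file, pass them to `sniff_bom`, and add the conversion stages only when an encoding is returned.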

rconn · Post by **rconn** » Mon Apr 30, 2012 12:07 am

DataMystic Support wrote: TextPipe doesn't do anything you don't tell it to, so it won't convert the file to and from Unicode either.

So at the top of your filter list, add a Convert UTF16LE to UTF-8 filter, and at the bottom, the reverse.

Hmm -- I tried to add this to my code, but there doesn't appear to be any (documented?) TextPipe filter API to convert Unicode to UTF-8. (It is apparently doable in the TextPipe GUI, but it isn't mentioned in the API documentation.)

So:

1) How can I do this?

2) How is TextPipe handling the text internally -- as UTF-8 or ANSI? (If UTF-8, then I assume that the non-English characters won't get mangled during the Unicode->Utf-8->Unicode conversion?)
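
On the encoding side of question 2: UTF-8 and UTF-16 can both represent every Unicode character, so the conversion pair itself is lossless for well-formed input. A quick Python check of that property (this says nothing about what TextPipe does internally):

```python
# Accented Latin, Greek, Japanese, and one character outside the BMP.
sample = "na\u00efve \u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac \u65e5\u672c\u8a9e \U0001F600"

utf16 = sample.encode("utf-16-le")
utf8 = utf16.decode("utf-16-le").encode("utf-8")   # Unicode -> UTF-8
back = utf8.decode("utf-8").encode("utf-16-le")    # UTF-8 -> Unicode

round_trip_ok = back == utf16
print(round_trip_ok)
```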

Thanks for your help.

rconn · Post by **rconn** » Mon Apr 30, 2012 5:02 am

OK, after poking around some more I found the undocumented addUnicodeConversionFilter2 API, which hopefully will do what I need (once I figure out the possible string arguments).

But I'm still wondering about #2 -- how is TextPipe handling the text internally? The Unicode support appears somewhat half-hearted -- the addSimpleFilter API supports Unicode (if you set the Unicode flag), but most of the other APIs don't seem to recognize a Unicode text file. I can do a unilateral conversion of all Unicode UTF-16 to UTF-8, run the filters, and then convert back again (provided TextPipe will recognize the UTF-8 format), but I'll have to add a special case for the addSimpleFilter since it wants to do some of the conversions itself.

Post by **DataMystic Support** » Mon Apr 30, 2012 9:11 am

Hi Rex,

There are very few filters (even amongst the Simple Filters collection) that handle Unicode directly. The API call allows this parameter to be passed but it is unused in most cases. Here is the updated documentation:

addSimpleFilter( type : integer; isUnicode : boolean ) : treenode

Adds a simple filter type, requiring no special parameters.

type - the type of filter to add.
1 Convert ASCII to EBCDIC
2 Convert EBCDIC to ASCII
3 Convert ANSI to OEM
4 Convert OEM to ANSI
5 Convert to UPPERCASE (*)
6 Convert to lowercase (*)
7 Convert to Title Case (*)
8 Convert to Sentence Case
9 Convert to tOGGLE cASE
10 Remove blank lines
11 Remove blanks from End of Line
12 Remove blanks from Start of Line
13 Remove binary characters
14 Remove ANSI codes
15 Convert IBM drawing characters
16 Remove HTML and SGML
17 Remove backspaces
18 Resolve backspaces
19 Remove multiple whitespace
20 UUEncode
21 Hex Encode
22 Hex Decode
23 MIME Encode (Base 64)
24 MIME Decode (Base 64)
25 MIME Encode (Quoted printable)
26 MIME Decode (Quoted printable)
27 UUDecode
28 Extract email addresses
29 Unscramble (ROT13)
30 Hex dump
32 XXEncode
33 XXDecode
34 Reverse line order
35 Remove email headers
36 Decimal dump
37 HTTP Encode
38 HTTP Decode
39 Randomize lines
40 Create word list
41 Reverse each line
42 Convert to RanDOm case
43 Extract URLs
44 ANSI to Unicode
45 Unicode to ANSI
46 Display debug window
47 Word concordance
48 Delete all
49 Restrict to each line in turn
50 Convert CSV to Tab-delimited
51 Convert CSV to XML
52 Convert Tab-delimited to CSV
53 Convert Tab-delimited to XML
54 Convert CSV (with column headers) to XML
55 Convert Tab-delimited (with column headers) to XML
56 Convert CSV (with column headers) to Tab-delimited
57 Convert Tab-delimited (with column headers) to CSV
58 Restrict to file name
59 Convert Word documents to text (*)
60 Swap UTF-16 word order
61 Swap UTF-32 word order
62 Remove BOM (Byte Order Mark)
63 Make Big Endian
64 Make Little Endian
65 Compress to Packed Decimal
66 Compress to Zoned Decimal
67 Expand Binary Number to EBCDIC
68 Expand Binary Number to ASCII
69 NFC - Canonical Decomposition, followed by Canonical Composition
70 NFD - Canonical Decomposition
71 NFKD - Compatibility Decomposition
72 NFKC - Compatibility Decomposition, followed by Canonical Composition
73 Decompose
74 Compose
75 Convert numeric HTML Entities to text
76 Convert PDF documents to text
77 Restrict to ANSI files
78 Restrict to Unicode UTF16 files
79 Restrict to Unicode UTF32 files
80 Convert Excel spreadsheets to text (*)
81 Shred file
82 Unicode to Escaped ASCII
83 Restrict to Unicode Files
84 T-filter

isUnicode - for those filters that support it (shown above with a *), this flag indicates that the filter will be dealing with Unicode data. Default false.
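
A small Python sketch of how calling code might decide the isUnicode argument from the list above. The set holds the types marked with (*); `simple_filter_args` is a hypothetical helper, not part of the TextPipe API.

```python
# Filter types marked with (*) in the list above.
UNICODE_AWARE_TYPES = frozenset({5, 6, 7, 59, 80})

def simple_filter_args(filter_type: int, data_is_unicode: bool):
    """Return the (type, isUnicode) pair to pass to addSimpleFilter."""
    is_unicode = data_is_unicode and filter_type in UNICODE_AWARE_TYPES
    return filter_type, is_unicode

print(simple_filter_args(5, True))    # Convert to UPPERCASE
print(simple_filter_args(10, True))   # Remove blank lines
```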

----
Internally, TextPipe works with text on a character-by-character basis (ANSI/ASCII), although UTF-8 works fine.

Here is the missing API definition:

addUnicodeConversionFilter2( convertFrom, convertTo : string; errorChar : char) : treenode

convertFrom - the string description of the Input encoding e.g. 'UTF-16'
convertTo - the string description of the Output encoding, e.g. 'UTF-8'
errorChar (optional) - the character to output if there is no match in the destination encoding, default space.
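
A sketch of the call order this implies for the wrap case from the first post. `RecordingEngine` is a hypothetical stand-in that only logs calls, so the snippet runs without TextPipe. The method names, `addNumberFilter(2, 80)` and the 'UTF-16' / 'UTF-8' strings come from this thread; the Python parameter names `kind` and `value` are made up.

```python
class RecordingEngine:
    """Hypothetical stand-in for the TextPipe engine object; it only logs calls."""

    def __init__(self):
        self.calls = []

    def addUnicodeConversionFilter2(self, convertFrom, convertTo, errorChar=" "):
        self.calls.append(("addUnicodeConversionFilter2", convertFrom, convertTo, errorChar))

    def addNumberFilter(self, kind, value):
        # Parameter names are illustrative; the thread only shows (2, 80).
        self.calls.append(("addNumberFilter", kind, value))

def add_wrap_with_conversions(engine, width):
    """Sandwich the wrap filter between two conversion filters."""
    engine.addUnicodeConversionFilter2("UTF-16", "UTF-8")
    engine.addNumberFilter(2, width)
    engine.addUnicodeConversionFilter2("UTF-8", "UTF-16")

engine = RecordingEngine()
add_wrap_with_conversions(engine, 80)
for call in engine.calls:
    print(call)
```

With the real engine object, the same three calls would be made on it in the same order.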

dfhtextpipe · Post by **dfhtextpipe** » Sat May 05, 2012 11:47 pm

Simon,

Will the above be added to the reference manual?

David

Post by **DataMystic Support** » Mon May 07, 2012 11:26 am

Yes! Already done for the next release.

rconn · Post by **rconn** » Tue May 15, 2012 1:28 pm

DataMystic Support wrote: Internally, TextPipe works with text on a character-by-character basis (ANSI/ASCII), although UTF-8 works fine.

UTF-8 *mostly* works, but there are some places (like in the column wrapping filters) where TextPipe counts 2 & 3-byte UTF-8 characters as 2 and 3 characters, so the right column is wrong.
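
A short Python illustration of the difference between counting UTF-8 bytes and counting characters when wrapping. Both helpers are hypothetical and are not TextPipe code.

```python
def wrap_by_chars(text: str, width: int):
    """Split into pieces of at most `width` characters."""
    return [text[i:i + width] for i in range(0, len(text), width)]

def wrap_by_utf8_bytes(text: str, width: int):
    """Split the UTF-8 encoding into pieces of at most `width` bytes."""
    data = text.encode("utf-8")
    return [data[i:i + width] for i in range(0, len(data), width)]

text = "\u00e9" * 10                  # ten 'é', each 2 bytes in UTF-8
by_chars = [len(s) for s in wrap_by_chars(text, 8)]
by_bytes = [len(b.decode("utf-8")) for b in wrap_by_utf8_bytes(text, 8)]
print(by_chars, by_bytes)
```

Counting bytes puts 4 characters on each full line instead of 8. With 3-byte characters and a width that is not a multiple of 3, a byte-based split can also land in the middle of a character.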

Any chance of this getting fixed in the next build?

Post by **DataMystic Support** » Thu May 17, 2012 9:18 am

Hi Rex,

TextPipe makes no assumptions about the data coming in. Where it needs to, it assumes ANSI (e.g. for an insert columns filter, line wrapping, etc.) unless you set up your filters to convert it or process it otherwise, such as by using a Unicode search/replace filter.

Changing this assumption could break existing filters. However, it might be time to start changing those assumptions.

e.g. internally, TP could handle everything as UTF-8, UTF16 or UTF32, and then down-convert at the end (having detected the input file type with 100% accuracy at the start!).
Even UTF16 has variable-length characters (surrogate pairs take two 16-bit code units).
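
The UTF16 point can be checked in a few lines of Python: a character outside the Basic Multilingual Plane takes two 16-bit code units. `utf16_units` is a hypothetical helper name.

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit code units needed to encode `text` as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

bmp_char = "A"                 # U+0041, inside the BMP
astral_char = "\U0001F600"     # U+1F600, outside the BMP

print(len(bmp_char), utf16_units(bmp_char))
print(len(astral_char), utf16_units(astral_char))
```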

However, search/replaces designed to operate at the byte level would then need to be specially marked.

There is no easy answer - the solution must break as few filters as possible.

dfhtextpipe · Post by **dfhtextpipe** » Thu May 17, 2012 4:31 pm

Simon,

Changing the level at which TextPipe operates could be fraught with huge problems.
You've already mentioned that you don't want to break anything.
The last thing you want to do is to cause huge issues for established customers.

Before going down this route, it would be sensible to internally go through a PFMEA (potential failure modes and effects analysis) for the proposed design change[s].

See http://en.wikipedia.org/wiki/Failure_mo ... s_analysis

This is something that we found very useful in the manufacturing industry environment where I worked.
And for some major customers, having FMEA is a mandatory procedure for the development process.

David

Post by **DataMystic Support** » Fri May 18, 2012 11:55 am

We've considered an approach where each filter knows what type of text encoding it expects as input, and what type of text it outputs.

Then TextPipe could (if the user desired) place implicit conversions between each filter to match up any inconsistencies, hence freeing the user from having to manage this. These filters would not be visible in the filter list - they would get added at compile time.

So at input, TextPipe could detect ansi, utf-8, utf16 or utf32 (fairly reliably).
Output could be set to match the input format, or to force all output to one encoding (which is a common use of TP).

If a text stream was utf-8, and the user had a Unicode search/replace, TP could insert an implicit Convert UTF-8 to UTF16 filter. The output would depend on what the following stage was.

The default for all old filters would be to not insert these conversions (as we always want to be backward compatible).

This would mean that (currently) adding an Add Line Number filter (an ANSI filter) would force a down-convert of any incoming Unicode data. If the Output encoding was set to match the input file encoding, then the output would be up-converted back to Unicode before saving. This would avoid CR/LFs being inserted that did not match the file encoding, as Rex has encountered.
Ideally, the Add Line Number filter would only have Unicode program code, so there would never be down-conversion - only up-conversion from ANSI, and then a final stage to match the input.
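
To make the proposal concrete, here is a rough Python sketch of such a compile-time pass. It is only one reading of the idea described in this post; `Filter`, `plan` and the encoding labels are hypothetical, and nothing here reflects actual TextPipe internals.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Filter:
    name: str
    expects: str   # encoding the filter reads
    emits: str     # encoding the filter writes

def plan(filters, input_encoding, output_encoding=None):
    """Return step names with implicit conversions inserted where encodings differ."""
    target = output_encoding or input_encoding   # default: match the input file
    current = input_encoding
    steps = []
    for f in filters:
        if f.expects != current:
            steps.append(f"Convert {current} to {f.expects}")
        steps.append(f.name)
        current = f.emits
    if current != target:
        steps.append(f"Convert {current} to {target}")
    return steps

steps = plan([Filter("Add Line Number", "ANSI", "ANSI")], "UTF-16")
for step in steps:
    print(step)
```

For a UTF-16 input and a single ANSI filter, the plan is a down-convert, the filter, then an up-convert back to the input encoding. Passing `output_encoding` would cover the "force all output to one encoding" case.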

How does that sound, Rex and David?

dfhtextpipe · Post by **dfhtextpipe** » Wed May 23, 2012 12:42 am

Simon,

I'll reply later, after finding time to think about it.

David
