TextPipe and Unicode (UTF-16LE) files

Post by **niccolo** » Tue Jul 18, 2006 4:24 am

I've heard a lot about TextPipe and decided to try it. I downloaded 7.63 t&b and tried to do simple things with Unicode files, but couldn't. It seems TextPipe doesn't fully understand them. I tried to remove trailing spaces: nothing happened. I tried doing the same with \t+\n (the trailing characters are mostly tabs): nothing again. None of the other Perl patterns work here either, although they worked without any problem in Uedit and EmEditor. Why? Please don't suggest converting the files to ANSI, because they contain characters from three character sets: non-standard Western, Cyrillic, and Greek.
The help, where it covers working with Unicode files, is worse than very bad.
Maybe a Unicodepipe is needed?

Post by **DataMystic Support** » Wed Jul 19, 2006 10:08 am

Hi there,

TextPipe has specific filters to deal with Unicode (UTF-16LE) data, such as the Unicode search/replace and Unicode pattern filters. For backward compatibility, the original ANSI/ASCII-based filters have not been modified.

So, if you'd like to use the Remove Trailing Spaces filter (which is ASCII), first convert the file to UTF-8, apply the filters, then convert it back.

The initial conversion to UTF-8 is the key here. TextPipe is used for a lot of mainframe data files, so converting EBCDIC to Unicode for internal processing is not an option until the Mainframe record structure has been unravelled.
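
As an illustration of the convert, filter, convert back sequence described above, here is a minimal Python sketch (it is not TextPipe itself). The function name and the regex are assumptions made for the example. The idea it shows: a byte-oriented filter is safe on UTF-8 because space, tab, CR and LF bytes never occur inside a multi-byte UTF-8 sequence, whereas in UTF-16LE every ASCII character is followed by a 0x00 byte, so a byte pattern such as \t+\n has nothing to match.

```python
import re

UTF16LE_BOM = b"\xff\xfe"

def remove_trailing_whitespace_utf16le(raw: bytes) -> bytes:
    """Hypothetical helper: UTF-16LE -> UTF-8 -> byte filter -> UTF-16LE."""
    # Remember whether the input started with a UTF-16LE BOM.
    has_bom = raw.startswith(UTF16LE_BOM)
    body = raw[len(UTF16LE_BOM):] if has_bom else raw

    # Step 1: convert to UTF-8 so that ASCII bytes stand alone.
    utf8 = body.decode("utf-16-le").encode("utf-8")

    # Step 2: byte-oriented filter, removing spaces/tabs before a line end
    # (LF or CRLF) or before the end of the data.
    utf8 = re.sub(rb"[ \t]+(?=\r?\n|\Z)", b"", utf8)

    # Step 3: convert back to UTF-16LE and restore the BOM if there was one.
    out = utf8.decode("utf-8").encode("utf-16-le")
    return UTF16LE_BOM + out if has_bom else out
```

Line endings are left untouched, so a CRLF file stays CRLF.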

Post by **niccolo** » Wed Dec 12, 2007 8:51 am

I have downloaded the trial version of TextPipe 8.

Task:
I need to create a sorted word list from a UTF-8 txt file (I've taken your previous recommendations into account) containing German and Russian words (umlauts and Cyrillic).

Filters used:
Extract matches \w+
Sort ANSI

And the result:

In the trial output everything seems OK, but the resulting file has an unknown encoding.
Opening it as ANSI makes the Russian text completely unreadable. Opening it as UTF-8 shows that all the Cyrillic words are damaged and can't be used.

What the hell? Who is wrong here, me or the program?
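
For comparison, the task as described (extract \w+ matches from UTF-8 text, then sort them) can be sketched outside TextPipe in a few lines of Python. This is only an illustration under stated assumptions: the function name is made up, duplicates are dropped, and the sort is by Unicode code point rather than by any language-specific collation.

```python
import re

def sorted_wordlist(utf8_data: bytes) -> list[str]:
    """Hypothetical helper: unique words from UTF-8 data, sorted by code point."""
    # "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.
    text = utf8_data.decode("utf-8-sig")
    # On str patterns, \w is Unicode-aware, so umlauts and Cyrillic match.
    words = set(re.findall(r"\w+", text))
    return sorted(words)
```

Writing the result back with `"\n".join(words).encode("utf-8")` keeps the file in UTF-8.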

Post by **DataMystic Support** » Wed Dec 12, 2007 3:10 pm

You may need to add a new UTF-8 BOM to the resulting file - use
Filters\Add\File Header
with text of

\xEF\xBB\xBF
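
The three bytes above are the UTF-8 byte order mark. As a rough equivalent of the File Header step outside TextPipe, a Python sketch might look like this; the function name is an assumption for illustration, while `codecs.BOM_UTF8` is the standard-library constant for those bytes.

```python
import codecs

def ensure_utf8_bom(data: bytes) -> bytes:
    """Hypothetical helper: prepend the UTF-8 BOM unless it is already there."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data
```

The BOM only labels the file as UTF-8 for editors that look for it; it does not change or repair the bytes that follow.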

Post by **niccolo** » Wed Dec 12, 2007 4:32 pm

The problem is not the unknown encoding, which a BOM would solve. The problem is corrupted Cyrillic text in the file. What can be done about that?

Post by **DataMystic Support** » Wed Dec 12, 2007 7:43 pm

No, the problem may be that sorting moves the line with the BOM further into the file, hence a new BOM is required.

Anyway, please email us your filter and a sample file.
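
The effect described here, a byte-wise sort carrying the BOM away from the start of the file, can be reproduced with a short Python sketch. It assumes a plain byte-order sort, which may or may not be what TextPipe's ANSI sort does, and the function name is hypothetical.

```python
import codecs

def sort_lines_keep_bom(data: bytes) -> bytes:
    """Hypothetical helper: sort lines byte-wise, keeping the BOM at the start."""
    had_bom = data.startswith(codecs.BOM_UTF8)
    body = data[len(codecs.BOM_UTF8):] if had_bom else data
    lines = sorted(body.splitlines())
    out = b"".join(line + b"\n" for line in lines)
    return codecs.BOM_UTF8 + out if had_bom else out

sample = codecs.BOM_UTF8 + "zebra\nApfel\nслово\n".encode("utf-8")

# Naive sort: the BOM is glued to the first line ("zebra") and, because
# 0xEF sorts after ASCII and after the Cyrillic lead bytes, that line
# ends up last.
naive = sorted(sample.splitlines())
bom_line_moved = naive[-1].startswith(codecs.BOM_UTF8)
```

Stripping the BOM before sorting and putting it back afterwards avoids both the misplaced BOM and the mis-sorted first line.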

Post by **dfhtextpipe** » Wed Dec 12, 2007 9:20 pm

Is niccolo using TextPipe Standard or TextPipe Professional?

For the task in hand, does it matter which?

Post by **niccolo** » Thu Dec 13, 2007 3:41 am

DFH - TextPipe Pro trial 8.

Here is the sample, the filters used (the first with sorting, the second creating a simple word list) and the results. In both result files the Cyrillic words are corrupted, but everything is OK in the trial run window. It's not a BOM problem.

http://rapidshare.com/files/76100564/pack.zip.html

I've solved this problem with other software, but why is it that whenever I decide to try TextPipe there is always a problem with this? When will native Unicode support, with regexes etc., be implemented?

Post by **niccolo** » Fri Dec 14, 2007 2:36 am

I have just found that the trial run area isn't so good either: all German words lose their umlauts.

So TextPipe may be a good tool for English, but for multilanguage files it should be used with care.

Post by **DataMystic Support** » Fri Dec 14, 2007 5:37 am

No - if you read the help, the trial run area handles either ANSI or Unicode UTF-16 text (check the box).

If you use any other format you will lose data.
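
To illustrate the kind of loss meant here, this Python sketch shows what happens in general when UTF-8 bytes are read as a single-byte "ANSI" code page. Windows-1252 is assumed as the code page purely for the example, and nothing here claims to reproduce how the trial run area works internally.

```python
def view_utf8_as_ansi(text: str, codepage: str = "cp1252") -> str:
    """Hypothetical helper: encode as UTF-8, then misread as an ANSI code page."""
    utf8_bytes = text.encode("utf-8")
    # Bytes with no meaning in the code page become U+FFFD, so the
    # original text can no longer be recovered from the result.
    return utf8_bytes.decode(codepage, errors="replace")

garbled = view_utf8_as_ansi("Grüße слово")
```

Each umlaut turns into two unrelated Western characters, and some of the Cyrillic bytes have no Windows-1252 mapping at all, which is where data is actually lost.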

Post by **dfhtextpipe** » Fri Dec 14, 2007 8:11 am

I have been using TextPipe Standard to process lots of UTF-8 files, all with success, including many with non-Latin characters, such as Cyrillic, Chinese, Thai, Amharic, Japanese, Hebrew.

Only the trial area has those restrictions, just as Simon already explained.

Post by **niccolo** » Fri Dec 14, 2007 8:50 pm

DFH - if everything works for you, maybe you can explain where I'm going wrong in my example?

And regarding TextPipe: in the regex line I can insert symbols that are not in the system locale encoding, but in the filter list such symbols look corrupted. When will this problem be solved?

Post by **dfhtextpipe** » Sat Dec 15, 2007 1:46 am

The link you posted took me to a page wanting me to pay for an account. Please make it easier for other members to help you.

Post by **niccolo** » Sat Dec 15, 2007 2:21 am

DFH - if you don't use a proxy there should be no problem getting the file.

Copy the link into your browser and press Enter. On the page that opens, press FREE.
Another window then appears where you are asked to enter the code shown in a small picture (No premium Please enter). Type it in the box below and press the Download via ....... button.

Post by **dfhtextpipe** » Sat Dec 15, 2007 4:59 am

I didn't see the buttons before - thanks for the help.
