TextPipe and Unicode (UTF-16LE) files


Moderators: DataMystic Support, Moderators

niccolo
Posts: 8
Joined: Mon Jul 17, 2006 3:20 pm

TextPipe and Unicode (UTF-16LE) files

Post by niccolo »

I've heard a lot about TextPipe and decided to try it. I downloaded 7.63 t&b, tried to do simple things with Unicode files, and couldn't. It seems it doesn't fully understand them. I tried to remove trailing spaces: nothing. I tried doing it with \t+\n (they are mostly tabs): nothing again. None of the other Perl patterns work here either, though they worked without any problem in Uedit and Emeditor. Why is that? Please don't suggest converting the files to ANSI, because the files contain characters from three character sets: non-standard Western, Cyrillic, and Greek.
The help, where it discusses working with Unicode files, is worse than very bad.
Maybe a UnicodePipe is needed?
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Post by DataMystic Support »

Hi there,

TextPipe has specific filters to deal with Unicode (UTF-16LE) data, such as the Unicode search/replace and Unicode pattern filters. For backward compatibility, the original ANSI/ASCII-based filters have not been modified.

So, if you'd like to use the Remove Trailing Spaces filter (which is ASCII), first convert the file to UTF-8, apply the filters, then convert it back.

The initial conversion to UTF-8 is the key here. TextPipe is used for a lot of mainframe data files, so converting EBCDIC to Unicode for internal processing is not an option until the mainframe record structure has been unravelled.
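
A minimal Python sketch of the round trip described above, for readers who want to see why the UTF-8 step matters. It is not TextPipe's implementation; the function name `strip_trailing_ws_utf16le` is made up for illustration. The point is that in UTF-8 the bytes for space and tab never occur inside a multi-byte character, so a byte-oriented (ASCII) filter can remove them safely, whereas in UTF-16LE every space is the two bytes 0x20 0x00.

```python
import re

# Byte-oriented pattern, standing in for an ASCII-only filter:
# runs of spaces/tabs that sit directly before a line end or the end of the data.
TRAILING_WS = re.compile(rb"[ \t]+(?=\r?\n|$)")


def strip_trailing_ws_utf16le(data: bytes) -> bytes:
    """Hypothetical helper: UTF-16LE in, UTF-16LE out."""
    # Step 1: convert UTF-16LE to UTF-8.
    utf8 = data.decode("utf-16-le").encode("utf-8")
    # Step 2: apply the byte-oriented filter to the UTF-8 bytes.
    cleaned = TRAILING_WS.sub(b"", utf8)
    # Step 3: convert back to UTF-16LE.
    return cleaned.decode("utf-8").encode("utf-16-le")


sample = "абв \t\r\nGrüße\t\t\nαβγ  ".encode("utf-16-le")
result = strip_trailing_ws_utf16le(sample).decode("utf-16-le")
```

Here `result` keeps the Cyrillic, German, and Greek text intact and only the trailing spaces and tabs are gone.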
niccolo
Posts: 8
Joined: Mon Jul 17, 2006 3:20 pm

No progress for multilanguage files in version 8

Post by niccolo »

I have downloaded the trial version of TextPipe 8.

Task
I need to create a sorted word list from a UTF-8 (I've taken your previous recommendations into account) text file containing German and Russian words (umlauts and Cyrillic).

Use Extract matches \w+
Sort ANSI

And the result?

In the trial output everything seems OK, but the resulting file has an unknown encoding.
Opening it as ANSI makes the Russian text completely unreadable. Opening it as UTF-8 shows that all Cyrillic words are damaged and can't be used.

What the hell? Who is wrong here, me or the program?
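
For comparison, the task described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not a reconstruction of what TextPipe does: `sorted_wordlist` is a hypothetical helper, and the sort order is plain code-point order. The last line shows one general way word extraction can damage non-ASCII text, namely applying a byte-oriented `\w` to UTF-8 data; whether that is what happened in this case is not established in the thread.

```python
import re


def sorted_wordlist(utf8_bytes: bytes) -> list[str]:
    """Hypothetical helper: extract \\w+ matches from UTF-8 data and sort them."""
    # "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.
    text = utf8_bytes.decode("utf-8-sig")
    # On decoded text, \w is Unicode-aware, so umlauts and Cyrillic count as word characters.
    return sorted(re.findall(r"\w+", text))


sample = "Haus Äpfel слово Zug яблоко".encode("utf-8")
words = sorted_wordlist(sample)

# A byte-oriented \w only knows ASCII letters, digits and underscore,
# so it cuts a word apart at every non-ASCII character.
byte_matches = re.findall(rb"\w+", "Grüße".encode("utf-8"))
```

With this input, `words` is sorted by code point (ASCII first, then `Ä`, then Cyrillic), while `byte_matches` contains only the ASCII fragments of the German word.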
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Post by DataMystic Support »

You may need to add a new UTF-8 BOM to the resulting file - use
Filters\Add\File Header
with text of

\xEF\xBB\xBF
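
Outside TextPipe, the same idea (prepend the three UTF-8 BOM bytes when they are missing) can be sketched in Python. The helper name `ensure_utf8_bom` is made up for this example; `codecs.BOM_UTF8` is the standard-library constant for those three bytes.

```python
import codecs


def ensure_utf8_bom(data: bytes) -> bytes:
    """Hypothetical helper: return data with exactly one leading UTF-8 BOM."""
    if data.startswith(codecs.BOM_UTF8):
        # Already marked, leave it alone so the BOM is not doubled.
        return data
    return codecs.BOM_UTF8 + data


marked = ensure_utf8_bom("слово\n".encode("utf-8"))
```

Note that a BOM only labels the encoding for editors that look for it. It does not change any of the bytes that follow.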
niccolo
Posts: 8
Joined: Mon Jul 17, 2006 3:20 pm

Post by niccolo »

The problem is not the unknown encoding, which a BOM solves. The problem is the corrupted Cyrillic text in the file. What can be done about that?
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Post by DataMystic Support »

No, the problem may be that sorting moves the line with the BOM further into the file, hence a new BOM is required.

Anyway, please email us your filter and a sample file.
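
The BOM displacement mentioned above is easy to reproduce in a small Python sketch. This only illustrates the general effect of a byte-wise line sort on a file that starts with a BOM; it makes no claim about TextPipe's sort filter, and `sort_lines_keep_bom` is a hypothetical helper.

```python
import codecs

data = codecs.BOM_UTF8 + b"zebra\napple\n"

# Naive byte-wise sort: the BOM is glued to the first line ("zebra"),
# and because 0xEF sorts after ASCII letters, that line moves to the end.
naive = sorted(data.splitlines())


def sort_lines_keep_bom(raw: bytes) -> bytes:
    """Hypothetical helper: sort lines byte-wise but keep the BOM at the start."""
    had_bom = raw.startswith(codecs.BOM_UTF8)
    body = raw[len(codecs.BOM_UTF8):] if had_bom else raw
    lines = sorted(body.splitlines())
    out = b"\n".join(lines) + (b"\n" if lines else b"")
    return (codecs.BOM_UTF8 if had_bom else b"") + out


fixed = sort_lines_keep_bom(data)
```

In `naive` the BOM ends up in front of the last line, so the file no longer starts with it. In `fixed` the BOM is set aside before sorting and put back at the start afterwards.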
dfhtextpipe
Posts: 988
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Standard or Pro?

Post by dfhtextpipe »

Is niccolo using TextPipe Standard or TextPipe Professional?

For the task in hand, does it matter which?
niccolo
Posts: 8
Joined: Mon Jul 17, 2006 3:20 pm

Post by niccolo »

DFH - TextPipe Pro 8 trial.

Here are the sample, the filters used (the first with sorting, the second a simple word list), and the results. In both result files the Cyrillic words are corrupted, but everything is OK in the trial run window. It's not a BOM problem.

http://rapidshare.com/files/76100564/pack.zip.html

I've solved this problem with other software, but why is it that whenever I decide to try TextPipe there is always a problem with this? When will native Unicode support, with regexes etc., be implemented?
niccolo
Posts: 8
Joined: Mon Jul 17, 2006 3:20 pm

Post by niccolo »

I've just found that the trial run area isn't so good either: all German words lose their umlauts.

So TextPipe may be a good tool for English, but for multilanguage files it should be used with care.
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Post by DataMystic Support »

No - if you read the help, the trial run area handles either ANSI or Unicode UTF-16 text (check the box).

If you use any other format you will lose data.
dfhtextpipe
Posts: 988
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

I have been using TextPipe to process lots of UTF-8 files

Post by dfhtextpipe »

I have been using TextPipe Standard to process lots of UTF-8 files, all with success, including many with non-Latin characters, such as Cyrillic, Chinese, Thai, Amharic, Japanese, Hebrew.

Only the trial area has those restrictions, just as Simon already explained.
niccolo
Posts: 8
Joined: Mon Jul 17, 2006 3:20 pm

Post by niccolo »

DFH - if everything works for you, maybe you can explain where I went wrong in my example?

And regarding TextPipe: in the regex line I can insert symbols that are not in the system locale encoding, but in the filter list such symbols look corrupted. When will this problem be solved?
dfhtextpipe
Posts: 988
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Don't have a rapidshare account

Post by dfhtextpipe »

The link you posted took me to a page wanting me to pay for an account. Please make it easier for other members to help you.
niccolo
Posts: 8
Joined: Mon Jul 17, 2006 3:20 pm

Post by niccolo »

DFH - if you don't use a proxy there should be no problem getting the file.

Copy the link into your browser and press Enter. On the screen that opens, press FREE.
Another window then appears where you are asked to enter the code shown in a small picture (No premium Please enter). Type it in the box below and press the Download via ....... button.
dfhtextpipe
Posts: 988
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Downloaded it now, thanks!

Post by dfhtextpipe »

I didn't see the buttons before - thanks for the help.