Delete non-word characters

canis · Post by **canis** » Wed Jan 23, 2008 9:51 pm

Is it possible to delete non-word characters from the start and the end of each line?
Thank you.

dfhtextpipe · Post by **dfhtextpipe** » Thu Jan 24, 2008 2:40 am

Use a Perl pattern replace list:

^(\W+)     replace with nothing
(\W+)$     replace with nothing
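
For readers who want to try the two patterns outside TextPipe, here is a minimal sketch in Python. It is not TextPipe's own syntax; the function names `strip_non_word_edges` and `strip_lines` are made up for illustration. Each line is processed separately so that `\W`, which also matches newline characters, cannot reach across line boundaries.

```python
import re

# The same two patterns as in the replace list above.
LEADING = re.compile(r"^\W+")   # run of non-word characters at the start
TRAILING = re.compile(r"\W+$")  # run of non-word characters at the end


def strip_non_word_edges(line):
    """Remove non-word characters from both ends of a single line."""
    line = LEADING.sub("", line)
    return TRAILING.sub("", line)


def strip_lines(text):
    """Apply the edge cleanup to every line of a multi-line string."""
    return "\n".join(strip_non_word_edges(line) for line in text.splitlines())


print(strip_non_word_edges(":: . , TEXT ; . . , ."))
```

Note that `\W` treats the underscore as a word character, so leading or trailing underscores are left in place.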

canis · Post by **canis** » Thu Jan 24, 2008 4:22 pm

It doesn't work

It removes only the first or the last non-word character.

For example I have

:: . , TEXT ; . . , .

and I need only

TEXT

dfhtextpipe · Post by **dfhtextpipe** » Thu Jan 24, 2008 9:55 pm

Click on the button next to the Perl pattern (labelled with 3 dots).
Ensure that greedy matching is ticked.


Filter List
-----------
Filter options
|  [ ] Log to file
|  [X] Append to logfile
|  Log filename: textpipe.log
|  Threshold 500
|
|--Input from file(s)
|     [ ] Confirm before processing each file
|     [ ] Confirm before processing read/only files
|     [ ] Delete input files after processing
|     Process binary files
|   
|--Comment...
|     Remove non Word characters from start and end of lines
|   
|--Perl pattern [^(\W+)] with []
|     [ ] Match case
|     [ ] Whole words only
|     [ ] Case sensitive replace
|     [ ] Prompt on replace
|     [ ] Skip prompt if identical
|     [ ] First only
|     [ ] Extract matches
|     Maximum text buffer size 4096
|     [X] Maximum match (greedy)
|     [ ] Allow comments
|     [ ] '.' matches newline
|     [ ] UTF-8 Support
|--Perl pattern [(\W+)$] with []
|     [ ] Match case
|     [ ] Whole words only
|     [ ] Case sensitive replace
|     [ ] Prompt on replace
|     [ ] Skip prompt if identical
|     [ ] First only
|     [ ] Extract matches
|     Maximum text buffer size 4096
|     [X] Maximum match (greedy)
|     [ ] Allow comments
|     [ ] '.' matches newline
|     [ ] UTF-8 Support
|   
+--Output to file(s)
      [ ] Only update date on changed files
      [X] Keep original file's date and time
      [ ] Append mode
      [ ] Change extension to: .txt
    Backup mode    

Files List
----------

This works with your example.
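
The effect of the greedy option can be reproduced roughly in Python, where `+` is greedy by default and `+?` is the non-greedy form. This is only a sketch of the general regex behaviour, assuming TextPipe's option maps onto the same distinction; TextPipe's engine may differ in detail.

```python
import re

sample = ":: . , TEXT ; . . , ."

# Greedy: \W+ consumes the whole run of leading non-word characters.
greedy = re.sub(r"^\W+", "", sample)

# Non-greedy: \W+? is satisfied with a single character, so only the
# first non-word character at the start of the line is removed.
lazy = re.sub(r"^\W+?", "", sample)

print(repr(greedy))
print(repr(lazy))
```

The non-greedy result matches the symptom reported earlier in the thread, where only one character was removed.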

canis · Post by **canis** » Thu Jan 24, 2008 11:30 pm

Thanks, it works!!!

I have one more question: can I use an IF condition?

For example, I have a line with Latin and Cyrillic characters:

some text, <cyrillic1> some text <cyrillic2> <cyrillic3> some text <cyrillic4>...

If there are more than X characters between two Cyrillic words, I need to place a carriage return after the last Cyrillic word before the gap.
For example in this case I need:

some text, <cyrillic1>
some text <cyrillic2> <cyrillic3>
some text <cyrillic4>...

I know how to identify Cyrillic words: [а-я].
Is it possible to use an IF condition, or is there another way to make such a transformation?
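
One way to express the condition described above is a replacement callback that looks at the length of each gap between Cyrillic words. The sketch below is in Python rather than TextPipe, and it assumes the text has already been decoded to Unicode; the names `break_long_gaps` and `max_gap` are hypothetical, with `max_gap` standing in for X.

```python
import re

CYR = "а-яА-ЯёЁ"  # Russian letters; extend for other Cyrillic alphabets

# A gap is a run of non-Cyrillic characters with a Cyrillic letter
# directly before it and directly after it.
GAP = re.compile(rf"(?<=[{CYR}])[^{CYR}\n]+(?=[{CYR}])")


def break_long_gaps(line, max_gap):
    """Start a new line after a Cyrillic word when the gap that
    follows it is longer than max_gap characters."""
    def fix(match):
        gap = match.group(0)
        if len(gap) > max_gap:
            # Break right after the Cyrillic word; drop the leading space.
            return "\n" + gap.lstrip()
        return gap
    return GAP.sub(fix, line)


text = "some text, слово some text ещё одно some text конец..."
print(break_long_gaps(text, 3))
```

With a threshold of 3, the single space between two adjacent Cyrillic words is kept, while the longer Latin stretches trigger a line break, which mirrors the three-line result asked for in the post.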

dfhtextpipe · Post by **dfhtextpipe** » Fri Jan 25, 2008 12:19 am

Coping with Cyrillic and Latin in the same line is much more difficult. The Cyrillic text could be encoded as Unicode, as Codepage 1251 (MS-Windows ANSI), or in various other encodings such as KOI8-R (common on Unix systems) or MacCyrillic (used on Apple Macintosh).
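
To illustrate why the encoding matters, the following Python sketch encodes one Cyrillic word as Codepage 1251 and then decodes the same bytes under two different assumptions. The sample word is arbitrary; `cp1251` and `koi8_r` are the codec names used by Python's standard library.

```python
word = "текст"

# Five letters become five single bytes in Codepage 1251.
raw = word.encode("cp1251")

# Decoding with the right codepage restores the word.
as_cp1251 = raw.decode("cp1251")

# Decoding the same bytes as KOI8-R yields different letters,
# so a pattern written for one encoding will not match the other.
as_koi8r = raw.decode("koi8_r")

print(raw)
print(as_cp1251)
print(as_koi8r)
```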

Since you are processing text found in emails, presumably you can't be sure what platform the text originated from.

Do you already know how the Cyrillic text is encoded?
Do you anticipate coping with any other scripts apart from Latin and Cyrillic?

btw. TextPipe Standard can process Unicode files, but the special filters needed to convert between Unicode text encodings are in TextPipe Pro.

canis · Post by **canis** » Fri Jan 25, 2008 12:33 am

All text is in windows-1251.

But is it really important? I think it is possible to identify Cyrillic words using the Perl pattern [а-я].
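
In a single-byte encoding such as windows-1251 a character range does map onto a contiguous block of byte values, which is why this approach can work. A rough Python sketch on raw bytes follows; the helper name `cyrillic_words` is made up for illustration. In Codepage 1251 the letters А-Я and а-я occupy bytes 0xC0 to 0xFF, while Ё and ё sit outside that block at 0xA8 and 0xB8, so a plain [а-я] range misses ё.

```python
import re

# Byte-level pattern for Russian letters in Codepage 1251:
# 0xC0-0xFF covers А-Я and а-я, 0xA8 is Ё, 0xB8 is ё.
CYRILLIC_CP1251 = re.compile(rb"[\xC0-\xFF\xA8\xB8]+")


def cyrillic_words(raw):
    """Return the Cyrillic words found in windows-1251 encoded bytes."""
    return [m.decode("cp1251") for m in CYRILLIC_CP1251.findall(raw)]


data = "some text, слово more ёж".encode("cp1251")
print(cyrillic_words(data))
```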

2. This script will be the last in a sequence of 3-5 scripts. I often use TextPipe with Cyrillic, and I have no need to convert text that uses Latin and Cyrillic in the same lines.

DataMystic
