Delete non-word characters

Get help with installation and running here.

Moderators: DataMystic Support, Moderators

canis
Posts: 9
Joined: Fri Jan 18, 2008 5:45 pm

Delete non-word characters

Post by canis »

Is it possible to delete non-word characters from the start and the end of each line?
Thank You.
dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Deleting non-word characters

Post by dfhtextpipe »

Use a Perl pattern replace list

^(\W+)     by nothing
(\W+)$     by nothing
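
For readers who want to try the two patterns outside TextPipe, here is a minimal sketch in Python. It assumes Python's `re` module as a stand-in for TextPipe's Perl pattern engine (the two may differ in detail), and the function name `strip_non_word_edges` is made up for this illustration. Each line is handled separately, because `\W` also matches newline characters.

```python
import re

# Same idea as the two replace patterns above, merged into one alternation:
# a run of non-word characters at the start OR at the end of a line.
EDGE_NON_WORD = re.compile(r"^\W+|\W+$")


def strip_non_word_edges(text: str) -> str:
    """Remove leading and trailing non-word characters from every line."""
    # Split on "\n" and process line by line so that a match can never
    # run across a line boundary.
    return "\n".join(EDGE_NON_WORD.sub("", line) for line in text.split("\n"))


print(strip_non_word_edges("--hello world!!\n  (second line) "))
```

Note that in Python 3 `\W` is Unicode-aware for `str` input, so Cyrillic letters count as word characters and are left alone.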
canis
Posts: 9
Joined: Fri Jan 18, 2008 5:45 pm

Post by canis »

It doesn't work :( It removes only the first or the last non-word character.

For example I have

:: . , TEXT ; . . , .

and I need only

TEXT
dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Perl patterns - greedy or not greedy

Post by dfhtextpipe »

Click on the button next to the Perl pattern (labelled with 3 dots).
Ensure that greedy matching is ticked.

Filter List
-----------
Filter options
|  [ ] Log to file
|  [X] Append to logfile
|  Log filename: textpipe.log
|  Threshold 500
|
|--Input from file(s)
|     [ ] Confirm before processing each file
|     [ ] Confirm before processing read/only files
|     [ ] Delete input files after processing
|     Process binary files
|   
|--Comment...
|     Remove non Word characters from start and end of lines
|   
|--Perl pattern [^(\W+)] with []
|     [ ] Match case
|     [ ] Whole words only
|     [ ] Case sensitive replace
|     [ ] Prompt on replace
|     [ ] Skip prompt if identical
|     [ ] First only
|     [ ] Extract matches
|     Maximum text buffer size 4096
|     [X] Maximum match (greedy)
|     [ ] Allow comments
|     [ ] '.' matches newline
|     [ ] UTF-8 Support
|   
|--Perl pattern [(\W+)$] with []
|     [ ] Match case
|     [ ] Whole words only
|     [ ] Case sensitive replace
|     [ ] Prompt on replace
|     [ ] Skip prompt if identical
|     [ ] First only
|     [ ] Extract matches
|     Maximum text buffer size 4096
|     [X] Maximum match (greedy)
|     [ ] Allow comments
|     [ ] '.' matches newline
|     [ ] UTF-8 Support
|   
+--Output to file(s)
      [ ] Only update date on changed files
      [X] Keep original file's date and time
      [ ] Append mode
      [ ] Change extension to: .txt
    Backup mode    

Files List
----------

This works with your example.
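
To illustrate why the greedy setting matters, here is a small sketch in Python. It assumes that TextPipe's "Maximum match (greedy)" option corresponds to ordinary greedy versus lazy quantifiers, and it uses Python's `re` module rather than TextPipe itself, so treat it as an analogy and not a description of TextPipe's internals.

```python
import re

sample = ":: . , TEXT ; . . , ."

# Greedy quantifier: \W+ takes the whole leading run of non-word characters.
greedy_start = re.sub(r"^\W+", "", sample)

# Lazy quantifier: \W+? stops after the first character it can match,
# so only one leading character is removed.
lazy_start = re.sub(r"^\W+?", "", sample)

# Applying the greedy end-of-line pattern afterwards leaves just the word.
both_ends = re.sub(r"\W+$", "", greedy_start)

print(repr(greedy_start))  # 'TEXT ; . . , .'
print(repr(lazy_start))    # ': . , TEXT ; . . , .'
print(repr(both_ends))     # 'TEXT'
```

The lazy result matches the symptom reported earlier in the thread, where only a single character was removed.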
canis
Posts: 9
Joined: Fri Jan 18, 2008 5:45 pm

Post by canis »

Thanks, it works!

One more question: can I use an IF condition?

For example, I have a line with Latin and Cyrillic characters:

some text, <cyrillic1> some text <cyrillic2> <cyrillic3> some text <cyrillic4>...

If there are more than X characters between two Cyrillic words, I need to insert a carriage return after the first of them (the last Cyrillic word before the gap).
For example, in this case I need:

some text, <cyrillic1>
some text <cyrillic2> <cyrillic3>
some text <cyrillic4>...


I know how to identify Cyrillic words: [а-я].
Is it possible to use an IF condition, or is there another way to do this transformation?
dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Cyrillic and Latin text in the same line

Post by dfhtextpipe »

Coping with Cyrillic and Latin in the same line is much more difficult. The Cyrillic text could be encoded as Unicode, as Codepage 1251 (MS-Windows ANSI), or in another encoding such as KOI8-R (common on Unix systems) or Mac Cyrillic (Apple Macintosh).

Since you are processing text found in emails, presumably you can't be sure which platform the text originated from.

Do you already know how the Cyrillic text is encoded?
Do you anticipate coping with any other scripts apart from Latin and Cyrillic?

By the way, TextPipe Standard can process Unicode files, but the special filters needed to convert between Unicode text encodings are in TextPipe Pro.
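
As a rough illustration of why the encoding question matters, the sketch below (plain Python, nothing TextPipe-specific) encodes one Cyrillic word as Codepage 1251 and then reads the same bytes back under different assumptions. The sample word is arbitrary.

```python
word = "Привет"

# One byte per letter in Codepage 1251, two bytes per letter in UTF-8.
cp1251_bytes = word.encode("cp1251")
utf8_bytes = word.encode("utf-8")

# Decoding with the right codec round-trips; decoding the same bytes
# as KOI8-R yields different Cyrillic letters instead of an error.
as_cp1251 = cp1251_bytes.decode("cp1251")
as_koi8 = cp1251_bytes.decode("koi8_r")

print(len(cp1251_bytes), len(utf8_bytes))
print(as_cp1251 == word, as_koi8 == word)
```

Because a wrong guess produces plausible-looking letters without raising an error, a pattern that looks for Cyrillic characters can silently match the wrong text if the encoding is not known in advance.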
canis
Posts: 9
Joined: Fri Jan 18, 2008 5:45 pm

Post by canis »

All text is in windows-1251.

But is it really important? I think it is possible to identify Cyrillic words using the Perl pattern [а-я].

2. This script will be the last in a sequence of 3-5 scripts. I often use TextPipe with Cyrillic, and I have never had to convert text that mixes Latin and Cyrillic in the same lines.
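
For anyone landing on this thread later, here is one possible sketch of the conditional line break, written in Python instead of as a TextPipe filter. It rests on several assumptions: the text has already been decoded from windows-1251 into a string, a Cyrillic word is any run of the letters А-Я, а-я, Ё and ё (ё lies outside the range а-я), the gap is counted in characters including spaces, and a plain "\n" is used as the line break. The function name `break_after_cyrillic` and the parameter `max_gap` are invented for this example.

```python
import re

# Assumption: a "Cyrillic word" is a run of Russian letters.
# Ё and ё are listed explicitly because they fall outside А-Я and а-я.
CYRILLIC_WORD = re.compile(r"[А-Яа-яЁё]+")


def break_after_cyrillic(line: str, max_gap: int) -> str:
    """Break the line after a Cyrillic word when more than max_gap
    characters separate it from the next Cyrillic word."""
    words = list(CYRILLIC_WORD.finditer(line))
    pieces = []
    pos = 0
    for current, following in zip(words, words[1:]):
        gap = line[current.end():following.start()]
        if len(gap) > max_gap:
            # Close the current output line right after this word...
            pieces.append(line[pos:current.end()])
            pos = current.end()
            # ...and drop the whitespace that would start the next line.
            while pos < len(line) and line[pos].isspace():
                pos += 1
    pieces.append(line[pos:])
    return "\n".join(pieces)


sample = "some text, раз some text два три some text четыре..."
print(break_after_cyrillic(sample, max_gap=3))
```

With `max_gap=3` the sample is split after "раз" and after "три", which mirrors the three-line layout asked for above. The "IF" lives in ordinary code here; whether the same condition can be expressed in a single TextPipe pattern is not something this sketch settles.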