Delete non-word characters
Posted: Wed Jan 23, 2008 9:51 pm
by canis
Is it possible to delete non-word characters from Start of Line and from End of Line?
Thank You.
Deleting non-word characters
Posted: Thu Jan 24, 2008 2:40 am
by dfhtextpipe
Use a Perl pattern replace list
^(\W+) by nothing
(\W+)$ by nothing
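For readers outside TextPipe, the same two replacements can be sketched in Python. This is only an illustration: the function name `strip_nonword_edges` is made up for this example, and Python's `\W` is Unicode-aware, so it may not behave exactly like TextPipe's Perl pattern engine.

```python
import re

def strip_nonword_edges(line: str) -> str:
    """Remove runs of non-word characters at the start and end of one line."""
    line = re.sub(r"^\W+", "", line)   # leading run, same as ^(\W+) by nothing
    return re.sub(r"\W+$", "", line)   # trailing run, same as (\W+)$ by nothing

sample = ":: . , TEXT ; . . , ."
print(strip_nonword_edges(sample))     # TEXT
```

The function works on a single line at a time. Since `\W` also matches newline characters, running the patterns over a whole multi-line buffer could join lines that consist only of punctuation.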
Posted: Thu Jan 24, 2008 4:22 pm
by canis
It doesn't work.
It removes only the first or last non-word character.
For example I have
:: . , TEXT ; . . , .
and I need only
TEXT
Perl patterns - greedy or not greedy
Posted: Thu Jan 24, 2008 9:55 pm
by dfhtextpipe
Click on the button next to the Perl pattern (labelled with 3 dots) and ensure that greedy matching is ticked.
Filter List
-----------
Filter options
| [ ] Log to file
| [X] Append to logfile
| Log filename: textpipe.log
| Threshold 500
|
|--Input from file(s)
| [ ] Confirm before processing each file
| [ ] Confirm before processing read/only files
| [ ] Delete input files after processing
| Process binary files
|
|--Comment...
| Remove non Word characters from start and end of lines
|
|--Perl pattern [^(\W+)] with []
| [ ] Match case
| [ ] Whole words only
| [ ] Case sensitive replace
| [ ] Prompt on replace
| [ ] Skip prompt if identical
| [ ] First only
| [ ] Extract matches
| Maximum text buffer size 4096
| [X] Maximum match (greedy)
| [ ] Allow comments
| [ ] '.' matches newline
| [ ] UTF-8 Support
|--Perl pattern [(\W+)$] with []
| [ ] Match case
| [ ] Whole words only
| [ ] Case sensitive replace
| [ ] Prompt on replace
| [ ] Skip prompt if identical
| [ ] First only
| [ ] Extract matches
| Maximum text buffer size 4096
| [X] Maximum match (greedy)
| [ ] Allow comments
| [ ] '.' matches newline
| [ ] UTF-8 Support
|
+--Output to file(s)
[ ] Only update date on changed files
[X] Keep original file's date and time
[ ] Append mode
[ ] Change extension to: .txt
Backup mode
Files List
----------
This works with your example.
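The effect of the greedy setting on the leading pattern can be sketched in Python, where `+` is greedy and `+?` is non-greedy (lazy). This is an assumption-laden illustration of the general regex behaviour, not a description of how TextPipe implements its option.

```python
import re

line = ":: . , TEXT ; . . , ."

# Greedy: \W+ takes the whole leading run of non-word characters.
greedy = re.sub(r"^\W+", "", line)

# Non-greedy: \W+? stops after the first non-word character.
lazy = re.sub(r"^\W+?", "", line)

print(greedy)  # TEXT ; . . , .
print(lazy)    # : . , TEXT ; . . , .
```

The non-greedy result matches the symptom reported earlier in the thread: only the first non-word character is removed.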
Posted: Thu Jan 24, 2008 11:30 pm
by canis
Thanks, it works!
There is one more question: can I use an IF condition?
For example, I have a line with Latin and Cyrillic characters:
some text, <cyrillic1> some text <cyrillic2> <cyrillic3> some text <cyrillic4>...
If there are more than X characters between Cyrillic words, I need to place a carriage return after the last Cyrillic word.
For example in this case I need:
some text, <cyrillic1>
some text <cyrillic2> <cyrillic3>
some text <cyrillic4>...
I know how to identify Cyrillic words: [а-я].
Is it possible to use an IF condition, or how else can I make such a transform?
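One way to express this kind of condition is a replacement callback that looks at the gap between two Cyrillic words. The sketch below is in Python rather than TextPipe, and it assumes the text has already been decoded to Unicode; the names `break_after_long_gaps` and `max_gap` are hypothetical, with `max_gap` playing the role of X. The gap length here counts every non-Cyrillic character between the two words, including spaces.

```python
import re

CYR = "А-Яа-яЁё"  # body of a character class covering Russian letters

# A Cyrillic word, then the non-Cyrillic gap, followed by another Cyrillic letter.
PAIR = re.compile(rf"([{CYR}]+)([^{CYR}]+)(?=[{CYR}])")

def break_after_long_gaps(line: str, max_gap: int) -> str:
    """Insert a line break after a Cyrillic word when the gap before
    the next Cyrillic word is longer than max_gap characters."""
    def repl(m: re.Match) -> str:
        word, gap = m.group(1), m.group(2)
        if len(gap) > max_gap:
            # Break after the word and drop the whitespace that followed it.
            return word + "\n" + gap.lstrip()
        return m.group(0)  # short gap: leave the text unchanged
    return PAIR.sub(repl, line)

text = "some text, раз some text два три some text четыре..."
print(break_after_long_gaps(text, 3))
```

With this sample, "два" and "три" stay on one line because only a single space separates them, while a break is placed after "раз" and after "три".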
Cyrillic and Latin text in the same line
Posted: Fri Jan 25, 2008 12:19 am
by dfhtextpipe
Coping with Cyrillic and Latin in the same line is much more difficult. The Cyrillic text could be encoded as Unicode, as Codepage 1251 (MS-Windows ANSI), or in one of various other encodings, such as KOI8-R (common on Unix systems and in email) or MacCyrillic (Apple Macintosh).
Since you are processing stuff found in emails, then presumably you can't be sure what platform the text originated from.
Do you already know how the Cyrillic text is encoded?
Do you anticipate coping with any other scripts apart from Latin and Cyrillic?
btw. TextPipe Standard can process Unicode files, but the special filters needed to convert between Unicode text encodings are in TextPipe Pro.
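Why the encoding matters for a byte-level range such as [а-я] can be sketched in Python. In Codepage 1251 the lowercase letters а to я occupy the contiguous bytes 0xE0 to 0xFF, whereas KOI8-R stores the same letters at different byte values and not in alphabetical order. The word used below is just a sample.

```python
import re

word = "текст"
as_cp1251 = word.encode("cp1251")   # Windows-1251 bytes
as_koi8r = word.encode("koi8_r")    # KOI8-R bytes for the same word

# Byte range that corresponds to а-я in Codepage 1251 only.
cp1251_lower = re.compile(rb"[\xe0-\xff]+")

matches_cp1251 = cp1251_lower.fullmatch(as_cp1251) is not None
matches_koi8r = cp1251_lower.fullmatch(as_koi8r) is not None

print(matches_cp1251)  # True
print(matches_koi8r)   # False
```

A range pattern written for one encoding therefore only finds Cyrillic words reliably when every input really is in that encoding.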
Posted: Fri Jan 25, 2008 12:33 am
by canis
All text is in windows-1251.
But is it really important? I think it is possible to identify Cyrillic words using the Perl pattern [а-я].
2. This script will be the last in a sequence of 3-5 scripts. I use TextPipe with Cyrillic often, and I have no need to convert text that uses Latin and Cyrillic in the same lines.