How to extract or remove bad words within a text.

Posted: Fri Aug 07, 2009 7:35 pm
by alexman
This is another task that I can't work out how to do, even partially. By "bad words" I mean words that, after scanner recognition, contain symbols or numbers mixed in with the letters.
For example:
----------------------------
The TextPipe dow^load contain8 over 200 filters and maps for TextPipe, 1isted below.
More filter files and examples can be fo*nd in the archiv% of the TextPipe discussion group.
You must sub$cribe in order to download the filter files and search the archives.
----------------------------
The task involves deleting numbers
and deleting all non-word characters, such as !@#$%^&*()-__=\|
---------------------------
All of these words have to be extracted or removed.
The main problem is how to define a single word as a set of characters between spaces...
Here is one Perl pattern:
(\w*[a-z][0-9]{1,}[a-z]\w*)
It removes some words with numbers, but not all of them, even when the number is obvious...
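
One likely reason the pattern misses words: it only matches digits that have a letter on both sides, so "contain8" (digit at the end) and "1isted" (digit at the start) slip through, and symbols are not covered at all. Below is a rough Python sketch of the "characters between spaces" idea, not a TextPipe filter. The names find_bad_words and SENTENCE_PUNCT are made up for illustration, and treating a leading or trailing .,;:!? as ordinary punctuation is an assumption.

```python
import re

SAMPLE = (
    "The TextPipe dow^load contain8 over 200 filters and maps for TextPipe, 1isted below.\n"
    "More filter files and examples can be fo*nd in the archiv% of the TextPipe discussion group.\n"
    "You must sub$cribe in order to download the filter files and search the archives."
)

# The pattern from the post: digits must be flanked by letters on both sides.
ORIGINAL = re.compile(r"\w*[a-z][0-9]{1,}[a-z]\w*")

# Assumption: these characters at the edges of a token are normal punctuation.
SENTENCE_PUNCT = ".,;:!?"


def find_bad_words(text):
    """Return whitespace-delimited tokens that mix letters with anything else."""
    bad = []
    for token in text.split():
        core = token.strip(SENTENCE_PUNCT)
        has_letter = any(ch.isalpha() for ch in core)
        has_other = any(not ch.isalpha() for ch in core)
        if has_letter and has_other:
            bad.append(core)
    return bad


print(ORIGINAL.findall(SAMPLE))
# []
print(find_bad_words(SAMPLE))
# ['dow^load', 'contain8', '1isted', 'fo*nd', 'archiv%', 'sub$cribe']
```

Pure numbers such as "200" are left alone here because they contain no letters; hyphenated words would be flagged, so the rule would need loosening for real text.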

Re: How to extract or remove bad words within a text.

Posted: Mon Aug 10, 2009 7:56 pm
by DataMystic Support
To delete all non-word characters, either use a Map (Filters\Maps\New map) or use an EasyPattern

Code:

[ <!@#$%^&*()-__=\|> ]
replace with nothing.
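
For comparison outside TextPipe, the same "replace these characters with nothing" step could be sketched in Python roughly as follows. The character list is copied from the question; the names UNWANTED and strip_symbols are hypothetical.

```python
# Characters listed in the question; the backslash is escaped for Python.
UNWANTED = "!@#$%^&*()-_=\\|"

# A translation table that maps every unwanted character to "deleted".
DELETE_TABLE = str.maketrans("", "", UNWANTED)


def strip_symbols(text):
    """Delete each listed symbol, leaving all other characters in place."""
    return text.translate(DELETE_TABLE)


print(strip_symbols("dow^load sub$cribe archiv%"))
# dowload subcribe archiv
```

Note that this only deletes the symbols themselves; the damaged word stays in the text with a letter missing.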

A word can be defined using the Perl pattern

Code:

\b[_0-9a-z-]+\b
Use a Perl pattern match to match this and send the result to a subfilter. There, you can use Filters\Remove\Remove lines\Remove matching lines with an EasyPattern of

Code:

[ <!@#$%^&*()-__=\|> ]
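
The match-then-filter idea can be sketched in Python along these lines. This is only an approximation of the approach, not how TextPipe runs it: the word match is swapped for \S+ (an assumption, since \b treats symbols such as ^ as word boundaries and would split "dow^load" in two), digits are added to the suspect set, the hyphen is left out so hyphenated words survive, and remove_bad_words is a hypothetical name.

```python
import re

# Stage 1: every run of non-space characters counts as one word.
TOKEN = re.compile(r"\S+")

# Stage 2: a word is suspect if it contains a digit or one of the listed symbols.
SUSPECT = re.compile(r"[0-9!@#$%^&*()_=\\|]")


def remove_bad_words(text):
    """Drop suspect words, then collapse the gaps they leave behind."""
    def keep_or_drop(match):
        token = match.group(0)
        return "" if SUSPECT.search(token) else token

    cleaned = TOKEN.sub(keep_or_drop, text)
    return re.sub(r"[ \t]{2,}", " ", cleaned).strip()


print(remove_bad_words("You must sub$cribe in order to download the filter files"))
# You must in order to download the filter files
```

Because digits are in the suspect set, standalone numbers such as "200" are removed as well, which matches the "deleting numbers" part of the question.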