How to extract or remove bad words within a text.

Moderators: DataMystic Support, Moderators

alexman
Posts: 1
Joined: Fri Aug 07, 2009 2:17 pm

Post by alexman »

This is another task that I can't work out how to do, even partially. By "bad words" I mean words that, after scanner recognition, contain stray symbols or numbers among the letters.
For example:
----------------------------
The TextPipe dow^load contain8 over 200 filters and maps for TextPipe, 1isted below.
More filter files and examples can be fo*nd in the archiv% of the TextPipe discussion group.
You must sub$cribe in order to download the filter files and search the archives.
----------------------------
I need to delete numbers,
and delete all non-word characters such as !@#$%^&*()-__=\|
---------------------------
I have to extract or remove all of these words.
The main problem is how to define a single word as a set of characters between spaces.
Here is the one Perl pattern I have tried:
(\w*[a-z][0-9]{1,}[a-z]\w*)
It removes some words that contain numbers, but not all of them, even when the number is obvious.
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: How to extract or remove bad words within a text.

Post by DataMystic Support »

To delete all non-word characters, either use a Map (Filters\Maps\New map) or use an EasyPattern

Code:

[ <!@#$%^&*()-__=\|> ]
and replace it with nothing.
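
For anyone who wants to try the same idea outside TextPipe, here is a minimal Python sketch of "match any of these characters and replace with nothing". It is not TextPipe or EasyPattern syntax; the name strip_unwanted is made up for illustration, and the character set is assumed to be exactly the list from the question.

```python
import re

# Assumption: the characters to delete are exactly those listed in the
# question: ! @ # $ % ^ & * ( ) - _ = \ |
UNWANTED = re.compile(r"[!@#$%^&*()\-_=\\|]")

def strip_unwanted(text: str) -> str:
    """Replace every listed character with nothing."""
    return UNWANTED.sub("", text)

print(strip_unwanted("sub$cribe fo*nd archiv%"))  # prints: subcribe fond archiv
```

Note that this only deletes the stray characters. The damaged word itself stays in the text, so it does not by itself extract or remove whole words.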

A word can be defined using the Perl pattern

Code:

\b[_0-9a-z-]+\b
Use a Perl pattern filter to match each word and send the result to a subfilter. In the subfilter, you can use Filters\Remove\Remove lines\Remove matching lines with an EasyPattern of

Code:

[ <!@#$%^&*()-__=\|> ]
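
The two-step idea above (pick out each word, then test it for unwanted characters) can also be sketched in Python. This is only a rough illustration under stated assumptions, not how TextPipe works internally: a word is taken to be any run of characters between whitespace, ordinary punctuation at the edges of a word is ignored, and any remaining character that is not a letter (including digits, so plain numbers such as 200 count too) marks the word as bad. The names is_bad and split_bad_words are hypothetical.

```python
import re

# Assumption: punctuation at the edges of a word (comma, full stop, quotes)
# is normal text and should not make the word "bad".
EDGE_PUNCTUATION = ".,;:?\"'"

# Assumption: anything that is not a plain ASCII letter is suspicious.
NOT_A_LETTER = re.compile(r"[^A-Za-z]")

def is_bad(token: str) -> bool:
    """True if the word still contains a non-letter after edge punctuation is stripped."""
    core = token.strip(EDGE_PUNCTUATION)
    return bool(core) and NOT_A_LETTER.search(core) is not None

def split_bad_words(text: str):
    """Return (list of bad words, text with the bad words removed)."""
    tokens = text.split()  # a word = characters between whitespace
    bad = [t for t in tokens if is_bad(t)]
    clean = " ".join(t for t in tokens if not is_bad(t))
    return bad, clean

sample = ("The TextPipe dow^load contain8 over 200 filters and maps "
          "for TextPipe, 1isted below.")
bad, clean = split_bad_words(sample)
print(bad)    # ['dow^load', 'contain8', '200', '1isted']
print(clean)  # The TextPipe over filters and maps for TextPipe, below.
```

Hyphenated words and contractions such as "don't" would also be flagged by this rule, so the letter test may need loosening for real text.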