How to extract or remove bad words within a text.
Posted: Fri Aug 07, 2009 7:35 pm
This is another task that I can't work out how to do, even partially. By "bad words" I mean words that, after scanner recognition, contain symbols or numbers mixed in with the letters.
For example:
----------------------------
The TextPipe dow^load contain8 over 200 filters and maps for TextPipe, 1isted below.
More filter files and examples can be fo*nd in the archiv% of the TextPipe discussion group.
You must sub$cribe in order to download the filter files and search the archives.
----------------------------
The task covers deleting numbers
and deleting all non-word characters, such as !@#$%^&*()-__=\|
---------------------------
I have to extract or remove all of these words.
The main problem is how to define a single word as a set of characters between spaces...
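One way to treat a word as "a set of characters between spaces" is to split the text on whitespace and test each token on its own. The following is only a rough sketch in Python, not a TextPipe filter. The names is_bad_word, split_bad_words and EDGE_PUNCT are made up for illustration, and the choice of which punctuation may legitimately sit at the edge of a word is an assumption.

```python
import re

# Punctuation allowed at the start or end of a clean word.
# This set is an assumption; adjust it to the text being processed.
EDGE_PUNCT = ".,;:!?\"'()"

def is_bad_word(token):
    """Return True if the token is not made of letters only."""
    core = token.strip(EDGE_PUNCT)
    # A token that is nothing but edge punctuation is left alone.
    if core == "":
        return False
    return re.fullmatch(r"[A-Za-z]+", core) is None

def split_bad_words(text):
    """Split on whitespace; return (bad tokens, text without them)."""
    tokens = text.split()
    bad = [t for t in tokens if is_bad_word(t)]
    kept = [t for t in tokens if not is_bad_word(t)]
    return bad, " ".join(kept)

line = ("The TextPipe dow^load contain8 over 200 filters "
        "and maps for TextPipe, 1isted below.")
bad, cleaned = split_bad_words(line)
print(bad)      # ['dow^load', 'contain8', '200', '1isted']
print(cleaned)  # The TextPipe over filters and maps for TextPipe, below.
```

Note that this sketch also flags a plain number such as 200, and it collapses runs of whitespace when joining the kept tokens.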
Here is one Perl pattern:
(\w*[a-z][0-9]{1,}[a-z]\w*)
It removes some words with numbers, but not all of them, even when the number is obvious...
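A likely reason: the pattern above only matches when the digits sit between two lowercase letters ([a-z] before and [a-z] after), so a word with a digit at the very start or end, like 1isted or contain8, is skipped. A rough check in Python's re module, which accepts the same syntax for this pattern, is sketched below. The second pattern, MIXED, is only a suggestion and not a tested TextPipe filter: it uses two lookaheads to require at least one letter and at least one digit anywhere in the word. It does not handle symbols such as ^ or $.

```python
import re

line = ("The TextPipe dow^load contain8 over 200 filters "
        "and maps for TextPipe, 1isted below.")

# The pattern from the post: digits must be sandwiched between letters.
ORIGINAL = re.compile(r"(\w*[a-z][0-9]{1,}[a-z]\w*)")

# Suggested alternative (an assumption, not from the post):
# a whole word containing at least one letter and at least one digit.
MIXED = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*[0-9])\w+\b")

original_hits = ORIGINAL.findall(line)
mixed_hits = MIXED.findall(line)

print(original_hits)  # []
print(mixed_hits)     # ['contain8', '1isted']

# The original pattern does work when the digit is in the middle:
print(ORIGINAL.findall("fi1ter"))  # ['fi1ter']
```

A plain number like 200 is not matched by either pattern, because neither accepts a word without letters.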