Here is another task that I can't work out how to do, even partially. By "bad words" I mean words that, after scanner recognition, contain symbols or numbers mixed in with the letters.
For example:
----------------------------
The TextPipe dow^load contain8 over 200 filters and maps for TextPipe, 1isted below.
More filter files and examples can be fo*nd in the archiv% of the TextPipe discussion group.
You must sub$cribe in order to download the filter files and search the archives.
----------------------------
I need to delete numbers,
and to delete all non-word characters such as !@#$%^&*()-__=\|
---------------------------
I have to extract or remove all of these words.
The main problem is how to define a single word as a set of characters between spaces...
Here is one Perl pattern I tried:
(\w*[a-z][0-9]{1,}[a-z]\w*)
It removes some words containing numbers, but not all of them, even when the number is obvious.
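A likely reason the pattern above misses obvious cases: it requires the digits to have a letter on both sides, so a digit at the very start or end of a word never matches. The Python sketch below checks this against words from the sample text. The word `te5t` is made up for contrast, and case-insensitive matching is an assumption.

```python
import re

# The pattern from the post; IGNORECASE is an assumption about how it was run.
pattern = re.compile(r"(\w*[a-z][0-9]{1,}[a-z]\w*)", re.IGNORECASE)

# "te5t" is a hypothetical word; the others come from the sample text.
samples = ["contain8", "1isted", "te5t", "sub$cribe"]

# Keep only the words in which the pattern finds a match.
matched = [word for word in samples if pattern.search(word)]
print(matched)
```

Only `te5t` is matched: `contain8` has no letter after the digit, `1isted` has no letter before it, and `sub$cribe` contains no digit at all.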
How to extract or remove bad words within a text.
- DataMystic Support
- Site Admin
Re: How to extract or remove bad words within a text.
To delete all non-word characters, either use a Map (Filters\Maps\New map) or use an EasyPattern
Code:
[ <!@#$%^&*()-__=\|> ]
and replace with nothing.
A word can be defined using the perl pattern
Code:
\b[_0-9a-z-]+\b
Use a perl pattern match to match this and send the result to a subfilter. There, you can use a Filters\Remove\Remove lines\Remove matching lines with an EasyPattern of
Code:
[ <!@#$%^&*()-__=\|> ]
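For readers who want to try the idea outside TextPipe, here is a minimal Python sketch of the same two steps: isolate each word, then test it for unwanted characters. It does not use TextPipe filters or EasyPatterns. It splits on whitespace, as the question suggests, and treats a word as bad when it contains anything other than letters. The function name `split_bad_words`, the set of allowed trailing punctuation, and the rule that hyphens and apostrophes may join letters are all assumptions.

```python
import re

# A clean word: letters only, optionally joined by single hyphens or apostrophes.
CLEAN_WORD = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)*")

# Sentence punctuation tolerated at the end of a word (an assumption).
TRAILING = ".,;:?"

def split_bad_words(text):
    """Return (bad_words, cleaned_text) for whitespace-separated words."""
    bad, kept = [], []
    for token in text.split():
        core = token.rstrip(TRAILING)  # ignore a trailing comma, period, etc.
        if core and not CLEAN_WORD.fullmatch(core):
            bad.append(core)           # extract the bad word
        else:
            kept.append(token)         # keep the word, punctuation included
    return bad, " ".join(kept)

sample = "The TextPipe dow^load contain8 over 200 filters, 1isted below."
bad, cleaned = split_bad_words(sample)
print(bad)
print(cleaned)
```

With this sample, `bad` is `['dow^load', 'contain8', '200', '1isted']` and `cleaned` is `The TextPipe over filters, below.` Note that joining with single spaces collapses the original line breaks, so a real job would need to process the text line by line.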