Word frequency list
Moderators: DataMystic Support, Moderators
Today, I wanted to create a word frequency list of the words used in job descriptions in my company. I didn't find a single ready-made filter for exactly this in TextPipe Standard, but it was easy to do in two steps: first use the "text to word list" filter to create a single text file of all the words in my text-format job description archive, then use that text file as input for the "count duplicate lines" filter.
The "text to word list" filter read all my job descriptions and put each word on its own line; the "count duplicate lines" filter then counted the words and produced a second text file listing each word with its count.
Just what I needed. I'm sharing this here in case others search for something similar.
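For readers without TextPipe to hand, the same two-step idea can be sketched in Python. This is only an illustration of the approach, not how TextPipe implements its filters: the function names are made up for the example, and splitting on whitespace and lower-casing are simplifying assumptions.

```python
from collections import Counter

def text_to_word_list(text):
    """Step 1: put each whitespace-separated word on its own line.

    Lower-casing is an assumption made here so that "The" and "the"
    are counted together.
    """
    return "\n".join(text.lower().split())

def count_duplicate_lines(lines_text):
    """Step 2: count identical lines, most frequent first."""
    return Counter(lines_text.splitlines()).most_common()

sample = "Manage the team and support the team lead"
word_list = text_to_word_list(sample)
frequencies = count_duplicate_lines(word_list)

for word, count in frequencies:
    print(f"{count}\t{word}")
```

Keeping the two steps separate mirrors the two filters: the intermediate word list can be saved and inspected before counting.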
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact:
Re: Word frequency list
Thanks Grant - you can upload filters too provided they are zipped.
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Word frequency list
Help for the Text to word list filter states:
"This filter takes all the incoming words and outputs them one per line, with a DOS line feed between them. This can be used to generate word lists for Indexes, encryption programs etc. Hyphenated words are recognised as single words, provided that they aren't broken across lines. To get around this limitation, use a Search and Replace filter to replace hyphens followed by line feeds with just a hyphen."
One obvious limitation is how the filter should deal with English possessives (or other abbreviations) ending with ’s.
Currently, such words would be stripped of the ’s, which may not be what the user requires.
To work around this, use a Search and Replace filter to replace ’ by an unused letter such as the small letter thorn þ.
Then restore the ’ afterwards by means of another Search and Replace filter.
Only the hyphen/minus is counted as a special case.
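The two workarounds (rejoining hyphenated words split across lines, and protecting ’ with þ) can be sketched in Python as follows. The word splitter here is a stand-in that only mimics the behaviour described in the help text (letters plus hyphen); it is not TextPipe's code, and the sketch assumes þ does not occur anywhere in the input.

```python
import re

PLACEHOLDER = "\u00fe"  # þ, assumed not to occur in the input text
APOSTROPHE = "\u2019"   # ’ right single quotation mark

def words_keeping_possessives(text):
    # Workaround 1: a hyphen followed by a line break becomes just a hyphen,
    # so words broken across lines are rejoined.
    text = re.sub(r"-\r?\n", "-", text)
    # Workaround 2: protect the apostrophe by swapping in a letter that the
    # splitter treats as part of a word.
    protected = text.replace(APOSTROPHE, PLACEHOLDER)
    # Stand-in splitter: runs of letters, with single hyphens allowed inside.
    words = re.findall(r"[A-Za-z\u00fe]+(?:-[A-Za-z\u00fe]+)*", protected)
    # Restore the apostrophe once the word list has been built.
    return [w.replace(PLACEHOLDER, APOSTROPHE) for w in words]

result = words_keeping_possessives("The manager\u2019s well-\nknown plan")
print(result)
```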
Last edited by dfhtextpipe on Tue Sep 19, 2017 3:38 am, edited 1 time in total.
David
Re: Word frequency list
To what extent is the Text to word list filter UTF-8 aware?
The help page does not even indicate whether it's limited to ANSI or ASCII letters.
David
-
- Posts: 1
- Joined: Mon Oct 02, 2017 4:37 pm
- Contact:
Re: Word frequency list
It's limited to ANSI or ASCII letters.
- DataMystic Support
Re: Word frequency list
Yes, it is limited to ASCII or ANSI characters, although we are adding support for English possessives in 10.4.1+.
You can also use a pattern match to find:
[[:alnum:]\-\']+?
and replace with
$0\r\n
- this works for UTF-8 data.
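A rough Python equivalent of this pattern-match approach is sketched below. Python's re module has no POSIX [[:alnum:]] class, so [^\W_] (any Unicode letter or digit) is used as an approximation; including U+2019 alongside the plain apostrophe is an extra assumption, not part of the pattern above. The quantifier is greedy because a lazy one in re.findall would return single characters.

```python
import re

# Letters/digits (Unicode-aware for str patterns), hyphen, ' and ’.
WORD = re.compile(r"(?:[^\W_]|[-'\u2019])+")

def one_word_per_line(text):
    """Return every match on its own line, separated by CR LF."""
    return "\r\n".join(WORD.findall(text))

output = one_word_per_line("L\u2019\u00e9t\u00e9, l'h\u00f4tel est-il pr\u00eat?")
print(output)
```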
Re: Word frequency list
The correct Unicode character that should be used in proper typography for possessives is not \x27 apostrophe but rather U+2019 right single quotation mark. That's true for both English and French as well as a few other languages based on the Latin script.
Users of the enhanced feature may not be aware of this.
David
Re: Word frequency list
Sticking to ANSI, many plural possessives do not end with 's but with s'.
However, the possessive for singular cockatrice is cockatrice' as in "a cockatrice' den".
Not many people know that, unless they are familiar with Isaiah 11:8 in the Authorised Version of the Bible.
It's not the only unusual singular possessive in the English language.
Best regards,
David
Re: Word frequency list
An alternative method is to use a Remove filter on patterns matching [[:punct:]], after first replacing any special punctuation marks that you want to keep as part of valid words.
After the words list has been made, the temporary replacements can be readily reverted.
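As a minimal Python sketch of this protect, strip, restore sequence: the set of characters to keep and the private-use placeholders are assumptions chosen for the example. Punctuation is replaced by a space rather than deleted outright so that neighbouring words do not run together, and Unicode category P is used in place of [[:punct:]], so symbols such as $ or + are left alone here.

```python
import unicodedata

# Characters to keep inside words, mapped to private-use placeholders
# (assumed absent from the input).
KEEP = {"'": "\ue000", "\u2019": "\ue001", "-": "\ue002"}

def strip_punctuation_except(text, keep=KEEP):
    # 1. Temporarily replace the punctuation marks we want to keep.
    for char, placeholder in keep.items():
        text = text.replace(char, placeholder)
    # 2. Replace every remaining punctuation character with a space.
    text = "".join(
        " " if unicodedata.category(c).startswith("P") else c for c in text
    )
    # 3. Revert the temporary replacements.
    for char, placeholder in keep.items():
        text = text.replace(placeholder, char)
    return text.split()

words = strip_punctuation_except("Wait, the cockatrice' den (see \"notes\")!")
print(words)
```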
David
Re: Word frequency list
Attached is an example filter for making a counted word list of the French words in a complete Bible.
The data was derived from https://github.com/MarjorieBurghart/VulgateGlaire
NB. The data for the input file had already been preprocessed by means of other bespoke TextPipe filters.
The output is uploaded in this issue:
https://github.com/MarjorieBurghart/VulgateGlaire/issues/16
David
- Attachments
-
- Extract and count words in French Vulgate Glaire.zip
- Zip file contains a TextPipe filter.
- (1.26 KiB) Downloaded 754 times
- DataMystic Support
Re: Word frequency list
Thanks David!