Listing character frequencies in a Unicode text file.
Moderators: DataMystic Support, Moderators
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Listing character frequencies in a Unicode text file.
I'd like to have a TextPipe Standard filter that will analyse a Unicode text file, and list the number of occurrences of each code-point found, preferably in sorted order of codes.
David
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact:
Re: Listing character frequencies in a Unicode text file.
TextPipe Standard now can make use of the Scripting filter - so you could design a script to do this.
There is no inbuilt filter to do this.
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Listing character frequencies in a Unicode text file.
I have accomplished (more or less) what I was after, yet without resorting to a script filter.
I thought it would be instructive to share it.
The breakthrough is the fact that the special filter "Count duplicate lines" does not require the input data to be sorted.
NB. The TextPipe sort filter is for ANSI only, whereas all the files I wish to analyse are UTF-8.
Code:
Comment...
| Special filter to count the characters that occur in Bible VPL verse text
|
|--Comment...
| | For verse text from VPL files only
| |
| +--Remove fields:Space-delimited field 1 .. field 2
| Delimiter Type: 4
| Custom delimiter:
| [ ] Has Header
|
|--Comment...
| | Join all lines to a single line
| |
| |--Remove blanks from End of Line
| |
| +--Replace [\r\n] with [ ]
| [ ] Match case
| [ ] Whole words only
| [ ] Case sensitive replace
| [ ] Prompt on replace
| [ ] Skip prompt if identical
| [ ] First only
| [ ] Extract matches
|
|--Comment...
| | Add new line after every character
| |
| |--Perl pattern [(.)] with [$1\r\n]
| | [ ] Match case
| | [ ] Whole words only
| | [ ] Case sensitive replace
| | [ ] Prompt on replace
| | [ ] Skip prompt if identical
| | [ ] First only
| | [ ] Extract matches
| | Maximum text buffer size 4096
| | [ ] Maximum match (greedy)
| | [ ] Allow comments
| | [ ] '.' matches newline
| | [X] UTF-8 Support
| |
| |--Remove blanks from Start of Line
| |
| +--Remove blank lines
|
|--Comment...
| | Count duplicate lines - input need not be sorted
| |
| +--Count duplicate lines
| [ ] Ignore case
| Start column 1
| Length 5
| [X] Include One
| format: %1:s\t%0:d
|
+--Comment...
Output is unsorted
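For readers without TextPipe, the steps in the listing above can be sketched in Python. This is only an approximation of the filter, not an export of it: the function name `count_characters` and the `strip_reference` switch are made up for illustration, whitespace is skipped (as the filter removes blanks), and a line with fewer than three space-delimited fields is assumed to contain no verse text.

```python
from collections import Counter

def count_characters(vpl_lines, strip_reference=True):
    """Count each non-whitespace character in Verse Per Line text."""
    counts = Counter()
    for line in vpl_lines:
        line = line.rstrip("\r\n")
        if strip_reference:
            # Drop the first two space-delimited fields, e.g. "Gen" and "1:1".
            parts = line.split(" ", 2)
            line = parts[2] if len(parts) == 3 else ""
        # Python strings are sequences of code points, so this is the
        # "one character per line" step without needing a UTF-8 switch.
        counts.update(ch for ch in line if not ch.isspace())
    # Keys stay in first-seen order, so the result is unsorted.
    return counts

sample = ["Gen 1:1 In the beginning, God created ..."]
for char, n in count_characters(sample).items():
    print(f"{char}\t{n}")
```

Passing `strip_reference=False` corresponds to disabling the first subfilter, so that plain text without verse references can be counted.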
David
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Listing character frequencies in a Unicode text file.
Though the UTF-8 output file is unsorted, it can easily be sorted using Notepad++ TextFX Tools.
Sorting lines of Unicode text is a huge topic in its own right, one which is too involved to take up further here.
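If plain code-point order is enough (which is what was originally asked for, and is not the same as a linguistic sort), a short Python sketch can do the ordering. The name `report_by_code_point` and the `U+XXXX` column are illustrative choices, and each key is assumed to be a single code point.

```python
from collections import Counter

def report_by_code_point(counts):
    """Return 'U+XXXX<TAB>char<TAB>count' lines ordered by code point."""
    return [
        f"U+{ord(ch):04X}\t{ch}\t{n}"
        for ch, n in sorted(counts.items(), key=lambda item: ord(item[0]))
    ]

# Example tally; in practice this would come from the character count.
counts = Counter(ch for ch in "Zoë, a zebra" if not ch.isspace())
for row in report_by_code_point(counts):
    print(row)
```

Ordering by `ord()` sidesteps locale and collation rules entirely, so accented letters land after all unaccented ones rather than next to their base letter.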
David
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Listing character frequencies in a Unicode text file.
Such a filter can be very useful for analyzing large text files, such as those encountered in Bible translations.
For example, it can help find any unexpected characters, such as those for a different language.
It can also show whether there are unmatched punctuation pairs, e.g. for braces, brackets, parentheses and quotation marks.
My filter assumes that the input text is already in Verse Per Line format, with the verse reference at the start of each line of text.
Gen 1:1 In the beginning, God created ...
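The punctuation check mentioned above could be done along these lines in Python, given a table of character counts. The `PAIRS` list and the function name `unbalanced_pairs` are hypothetical; extend the list to suit the punctuation used in your texts.

```python
from collections import Counter

# Hypothetical list of opening/closing pairs to compare.
PAIRS = [
    ("(", ")"),
    ("[", "]"),
    ("{", "}"),
    ("\u201c", "\u201d"),  # curly double quotation marks
    ("\u2018", "\u2019"),  # curly single quotation marks
]

def unbalanced_pairs(counts, pairs=PAIRS):
    """Return (open, close, open_count, close_count) where the totals differ."""
    return [
        (o, c, counts.get(o, 0), counts.get(c, 0))
        for o, c in pairs
        if counts.get(o, 0) != counts.get(c, 0)
    ]

counts = Counter("(a [b) \u201cc\u201d")
for o, c, n_open, n_close in unbalanced_pairs(counts):
    print(f"{o} occurs {n_open} times, {c} occurs {n_close} times")
```

Equal totals do not prove that the pairs are correctly nested, and U+2019 is also used as an apostrophe, so a mismatch for single quotation marks is often legitimate.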
David
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact:
Re: Listing character frequencies in a Unicode text file.
Nice work David!
Could you possibly zip and upload this filter for others?
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Listing character frequencies in a Unicode text file.
I suppose I could, yet my filter (as it stands) still points to a particular input file that I was working on.
For the time being, I don't wish to mess with that.
What's wrong with other readers just pasting the code from the box above?
David
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact:
Re: Listing character frequencies in a Unicode text file.
Thanks David,
Readers can't just paste your text above - that is the human readable form.
In order to easily re-create your filter, either the .fll file, or the File Menu\Export in VBScript or JScript form is needed.
An input file is not necessary.
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Listing character frequencies in a Unicode text file.
Here goes.
I've removed the input file specifier, and amplified the comments for the first subfilter.
This section can be disabled (ticked) to obtain a more general application.
I usually structure and comment my subfilters, to make them easier to maintain and understand.
This simple filter is made freely available for anyone to use.
btw. I have assumed that there is no requirement to count spaces!
David
- Attachments
-
- Special filter to count the characters that occur in Bible VPL verse text.zip
- ZIP file contains my TextPipe filter.
- (1.09 KiB) Downloaded 575 times
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact:
Re: Listing character frequencies in a Unicode text file.
Thank you David!
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Listing character frequencies in a Unicode text file.
Simon,
I think that there is a bug in the Special filter called Count duplicate lines.
I have "Ignore case" unticked in the form, but it is not counting uppercase and lowercase letters correctly.
After stripping the reference columns from an original VPL file, I have used Notepad++ to count occurrences manually letter by letter.
The TextPipe count results do not match the Notepad++ results.
Example:
Using Notepad++:
- A occurs 5995 times
- a occurs 335106 times
Using TextPipe:
- A not listed
- a occurs 341101 times
A few uppercase letters did have a tally using TextPipe, where the uppercase instance occurred before the lowercase.
Using Notepad++:
- D occurs 7088 times
- d occurs 155666 times
Using TextPipe:
- D occurs 162754 times
- d not listed
NB. I am using English (United Kingdom) in Windows Regional Settings.
If ignore case is checked, lines do not need to be cased identically to be considered duplicates. Two identical lines, one in upper case, and one in lower case, would be considered duplicates and counted together by this filter. If ignore case is unchecked, the lines must be absolutely identical to be considered duplicates. The case checking routines are ANSI aware, so their behaviour may change depending on your locale.
David
-
- Posts: 988
- Joined: Sun Dec 09, 2007 2:49 am
- Location: UK
Re: Listing character frequencies in a Unicode text file.
Simon,
I also tried ticking the box marked "Ignore case" to compare with the unticked results.
The two output files were identical!
Perhaps the programmer has 'remmed out' this part of the code (to test something else) and forgot to enable it again afterward?
Or maybe there is an 'off by one' error, such that lines as short as a single character are incorrectly processed?
Either way, there is a serious bug.
David
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact:
Re: Listing character frequencies in a Unicode text file.
Thanks David - will check into it.
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact:
Re: Listing character frequencies in a Unicode text file.
This also affects the Remove Duplicate Lines filter - Ignore Case option.
It will be fixed in the 8.6.2 release.
- DataMystic Support
- Site Admin
- Posts: 2227
- Joined: Mon Jun 30, 2003 12:32 pm
- Location: Melbourne, Australia
- Contact:
Re: Listing character frequencies in a Unicode text file.
This also affects the Remove Duplicate Lines filter - Ignore Case option.
It will be fixed in the 8.6.2 release.
Note - the letter shown in the output file depends on what is found first in the input file.
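To picture why the letter shown depends on input order, here is a small Python model of case-insensitive duplicate counting. It is not TextPipe's code, only a hypothetical sketch (the name `count_lines` is made up) in which each group of lines is labelled with the first spelling encountered.

```python
def count_lines(lines, ignore_case):
    """Count duplicate lines; each group is labelled with its first-seen spelling."""
    labels = {}  # comparison key -> first spelling seen
    counts = {}  # comparison key -> tally
    for line in lines:
        key = line.casefold() if ignore_case else line
        labels.setdefault(key, line)
        counts[key] = counts.get(key, 0) + 1
    return {labels[key]: n for key, n in counts.items()}

lines = ["D", "d", "d", "a", "A"]
print(count_lines(lines, ignore_case=True))
print(count_lines(lines, ignore_case=False))
```

With `ignore_case=True` the model reports a combined tally under "D" and under "a", with "d" and "A" not listed, which matches the pattern in the reported results; with `ignore_case=False` the four spellings are tallied separately.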