Listing character frequencies in a Unicode text file.

Moderators: DataMystic Support, Moderators

dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Listing character frequencies in a Unicode text file.

Post by dfhtextpipe »

I'd like to have a TextPipe Standard filter that will analyse a Unicode text file, and list the number of occurrences of each code-point found, preferably in sorted order of codes.
David
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Listing character frequencies in a Unicode text file.

Post by DataMystic Support »

TextPipe Standard can now make use of the Scripting filter, so you could design a script to do this.

There is no inbuilt filter to do this.
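
For anyone who just wants the numbers, here is a minimal standalone sketch in Python of the kind of script that could do this. It is not a TextPipe Scripting filter and does not use any TextPipe API; the function names are made up for illustration. It counts every code point in a string and lists the counts in code point order.

```python
from collections import Counter


def codepoint_frequencies(text):
    """Return (code point, count) pairs sorted by code point."""
    counts = Counter(text)  # a Python str iterates by Unicode code point
    return sorted((ord(ch), n) for ch, n in counts.items())


def format_report(text):
    """One line per code point: U+XXXX, a tab, then the count."""
    return "\n".join(f"U+{cp:04X}\t{n}" for cp, n in codepoint_frequencies(text))


if __name__ == "__main__":
    print(format_report("In the beginning"))
```

To run it on a UTF-8 file, read the file with open(path, encoding="utf-8") and pass the contents to format_report.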
dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Listing character frequencies in a Unicode text file.

Post by dfhtextpipe »

I have accomplished (more or less) what I was after, yet without resorting to a script filter.
I thought it would be instructive to share it.
The breakthrough was realising that the special filter "Count duplicate lines" does not require its input to be sorted.

Comment...
|  Special filter to count the characters that occur in Bible VPL verse text
|
|--Comment...
|  |  For verse text from VPL files only
|  |
|  +--Remove fields:Space-delimited field 1 .. field 2
|        Delimiter Type: 4
|        Custom delimiter: 
|        [ ] Has Header
|      
|--Comment...
|  |  Join all lines to a single line
|  |
|  |--Remove blanks from End of Line
|  |   
|  +--Replace [\r\n] with [ ]
|        [ ] Match case
|        [ ] Whole words only
|        [ ] Case sensitive replace
|        [ ] Prompt on replace
|        [ ] Skip prompt if identical
|        [ ] First only
|        [ ] Extract matches
|      
|--Comment...
|  |  Add new line after every character
|  |
|  |--Perl pattern [(.)] with [$1\r\n]
|  |     [ ] Match case
|  |     [ ] Whole words only
|  |     [ ] Case sensitive replace
|  |     [ ] Prompt on replace
|  |     [ ] Skip prompt if identical
|  |     [ ] First only
|  |     [ ] Extract matches
|  |     Maximum text buffer size 4096
|  |     [ ] Maximum match (greedy)
|  |     [ ] Allow comments
|  |     [ ] '.' matches newline
|  |     [X] UTF-8 Support
|  |   
|  |--Remove blanks from Start of Line
|  |   
|  +--Remove blank lines
|      
|--Comment...
|  |  Count duplicate lines - input need not be sorted
|  |
|  +--Count duplicate lines
|        [ ] Ignore case
|        Start column 1
|        Length 5
|        [X] Include One
|        format: %1:s\t%0:d
|      
+--Comment...
      Output is unsorted
    
NB. The TextPipe sort filter is for ANSI only, whereas all the files I wish to analyse are UTF-8.
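
For comparison, the same steps can be sketched outside TextPipe. The Python below is only a rough model of the filter list above, under a few assumptions: the function names are hypothetical, the first two space-delimited fields are the verse reference, and the format string puts the character first and the count second. Like the filter, it skips spaces and leaves the output in first-seen order.

```python
def count_verse_characters(vpl_lines):
    """Count non-space characters in the verse text of VPL lines."""
    counts = {}  # dicts keep insertion order, so results stay in first-seen order
    for line in vpl_lines:
        # Drop space-delimited fields 1 and 2 (e.g. "Gen" and "1:1")
        parts = line.rstrip().split(" ", 2)
        verse_text = parts[2] if len(parts) > 2 else ""
        for ch in verse_text:
            if ch.isspace():
                continue  # spaces are not counted
            counts[ch] = counts.get(ch, 0) + 1
    return counts


def format_counts(counts):
    """Character, a tab, then the count (assumed meaning of the format string)."""
    return "\n".join(f"{ch}\t{n}" for ch, n in counts.items())


if __name__ == "__main__":
    print(format_counts(count_verse_characters(["Gen 1:1 In the beginning"])))
```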
David
dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Listing character frequencies in a Unicode text file.

Post by dfhtextpipe »

Though the UTF-8 output file is unsorted, it can easily be sorted using Notepad++ TextFX Tools.

Sorting lines of Unicode text is a huge topic in its own right, one which is too involved to take up further here.
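
As a small illustration of the simplest possible ordering, the sketch below sorts character/count lines by the code point of the character. This is plain code point order, not linguistic collation, and the function name and tab-separated layout are assumptions based on the output format used earlier.

```python
def sort_by_codepoint(report_lines):
    """Sort 'character<TAB>count' lines by the character's code point."""
    # Python compares strings code point by code point, so the first
    # field can be used directly as the sort key.
    return sorted(report_lines, key=lambda line: line.split("\t", 1)[0])


if __name__ == "__main__":
    for line in sort_by_codepoint(["\u00e9\t3", "a\t5", "Z\t1"]):
        print(line)
```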
David
dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Listing character frequencies in a Unicode text file.

Post by dfhtextpipe »

Such a filter can be very useful for analyzing large text files, such as those encountered in Bible translations.

For example, it can help find any unexpected characters, such as those for a different language.

It can also help check whether there are unmatched punctuation pairs, e.g. braces, brackets, parentheses and quotation marks.
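
A rough sketch of the pair check, in Python, assuming the character counts are already available as a dictionary. The pair list and the function name are illustrative only. Single curly quotes are left out on purpose, because the closing one doubles as an apostrophe and would give false alarms.

```python
# Opening/closing pairs to compare (an illustrative selection)
PAIRS = [("(", ")"), ("[", "]"), ("{", "}"), ("\u201c", "\u201d")]


def unbalanced_pairs(counts, pairs=PAIRS):
    """Return (open, n_open, close, n_close) for each pair whose counts differ."""
    problems = []
    for open_ch, close_ch in pairs:
        n_open = counts.get(open_ch, 0)
        n_close = counts.get(close_ch, 0)
        if n_open != n_close:
            problems.append((open_ch, n_open, close_ch, n_close))
    return problems


if __name__ == "__main__":
    print(unbalanced_pairs({"(": 3, ")": 2, "[": 1, "]": 1}))
```

Equal totals do not prove that the pairs are correctly nested; a mismatch only shows that something is wrong somewhere in the file.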

My filter assumes that the input text is already in Verse Per Line format, with the verse reference at the start of each line of text.

Gen 1:1 In the beginning, God created ...
David
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Listing character frequencies in a Unicode text file.

Post by DataMystic Support »

Nice work David!

Could you possibly zip and upload this filter for others?
dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Listing character frequencies in a Unicode text file.

Post by dfhtextpipe »

I suppose I could, yet my filter (as it stands) still points to a particular input file that I was working on.
For the time being, I don't wish to mess with that.

What's wrong with other readers just pasting the code from the box above?
David
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Listing character frequencies in a Unicode text file.

Post by DataMystic Support »

Thanks David,

Readers can't just paste your text above - that is the human readable form.

To re-create your filter easily, readers need either the .fll file or an export in VBScript or JScript form (File Menu\Export).

An input file is not necessary.
dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Listing character frequencies in a Unicode text file.

Post by dfhtextpipe »

Here goes.

I've removed the input file specifier, and amplified the comments for the first subfilter.
This section can be disabled (ticked) to obtain a more general application.

I usually structure and comment my subfilters, to make them easier to maintain and understand.

This simple filter is made freely available for anyone to use.

btw. I have assumed that there is no requirement to count spaces!

David
Attachments
Special filter to count the characters that occur in Bible VPL verse text.zip
ZIP file contains my TextPipe filter.
(1.09 KiB) Downloaded 464 times
dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Listing character frequencies in a Unicode text file.

Post by dfhtextpipe »

Simon,

I think that there is a bug in the Special filter called Count duplicate lines.

I have "Ignore case" unticked in the form, but it is not counting uppercase and lowercase letters correctly.

After stripping the reference columns from an original VPL file, I have used Notepad++ to count occurrences manually letter by letter.

The TextPipe count results do not match the Notepad++ results.

Example:
Using Notepad++:
  • A occurs 5995 times
  • a occurs 335106 times
Using TextPipe:
  • A not listed
  • a occurs 341101 times
Now clearly 5995 + 335106 = 341101, so this implies that it is ignoring case, even though I told it not to do so.

A few uppercase letters did have a tally using TextPipe, where the uppercase instance occurred before the lowercase.

Using Notepad++:
  • D occurs 7088 times
  • d occurs 155666 times
Using TextPipe:
  • D occurs 162754 times
  • d not listed
The help for this filter states:
"If ignore case is checked, lines do not need to be cased identically to be considered duplicates. Two identical lines, one in upper case, and one in lower case, would be considered duplicates and counted together by this filter. If ignore case is unchecked, the lines must be absolutely identical to be considered duplicates. The case checking routines are ANSI aware, so their behaviour may change depending on your locale."
NB. I am using English (United Kingdom) in Windows Regional Settings.

David
dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Listing character frequencies in a Unicode text file.

Post by dfhtextpipe »

Simon,

I also tried ticking the box marked "Ignore case" to compare with the results when it is unticked.
The two output files were identical!

Perhaps the programmer 'remmed out' this part of the code (to test something else) and forgot to enable it again afterward?
Or maybe there is an 'off by one' error, such that lines as short as a single character are incorrectly processed?

Either way, there is a serious bug.

David
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Listing character frequencies in a Unicode text file.

Post by DataMystic Support »

Thanks David - will check into it.
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Listing character frequencies in a Unicode text file.

Post by DataMystic Support »

This also affects the Remove Duplicate Lines filter - Ignore Case option.

It will be fixed in the 8.6.2 release.

Note: when case is ignored, the letter shown in the output file depends on which case occurs first in the input file.
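
The reported symptoms can be modelled with a short Python sketch. This is not TextPipe's actual implementation, just an assumed model with a hypothetical function name: when case is ignored, lines are grouped by their lowercased form, and the label shown is whichever form was seen first.

```python
def count_lines(lines, ignore_case):
    """Count duplicate lines; the label is the first form encountered."""
    counts = {}
    first_seen = {}
    for line in lines:
        key = line.lower() if ignore_case else line
        first_seen.setdefault(key, line)  # remember the first spelling seen
        counts[key] = counts.get(key, 0) + 1
    return [(first_seen[key], n) for key, n in counts.items()]


if __name__ == "__main__":
    print(count_lines(["a", "A", "a"], ignore_case=True))
    print(count_lines(["D", "d", "d"], ignore_case=True))
    print(count_lines(["a", "A", "a"], ignore_case=False))
```

With ignore_case=True the upper and lower case tallies merge under the first form seen, which matches the "a" and "D" results reported above. With ignore_case=False the two cases are counted separately, which is what the unticked option is documented to do.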