Listing character frequencies in a Unicode text file.

Posted: Sun Jul 11, 2010 2:50 am
by dfhtextpipe
I'd like to have a TextPipe Standard filter that will analyse a Unicode text file, and list the number of occurrences of each code-point found, preferably in sorted order of codes.

Re: Listing character frequencies in a Unicode text file.

Posted: Mon Jul 12, 2010 10:51 am
by DataMystic Support
TextPipe Standard can now make use of the Scripting filter, so you could design a script to do this.

There is no inbuilt filter to do this.

Re: Listing character frequencies in a Unicode text file.

Posted: Tue Jul 13, 2010 5:48 am
by dfhtextpipe
I have accomplished (more or less) what I was after, yet without resorting to a script filter.
I thought it would be instructive to share it.
The breakthrough is the fact that the special filter "Count duplicate lines" does not require the input data to be sorted.

Code:

Comment...
|  Special filter to count the characters that occur in Bible VPL verse text
|
|--Comment...
|  |  For verse text from VPL files only
|  |
|  +--Remove fields:Space-delimited field 1 .. field 2
|        Delimiter Type: 4
|        Custom delimiter: 
|        [ ] Has Header
|      
|--Comment...
|  |  Join all lines to a single line
|  |
|  |--Remove blanks from End of Line
|  |   
|  +--Replace [\r\n] with [ ]
|        [ ] Match case
|        [ ] Whole words only
|        [ ] Case sensitive replace
|        [ ] Prompt on replace
|        [ ] Skip prompt if identical
|        [ ] First only
|        [ ] Extract matches
|      
|--Comment...
|  |  Add new line after every character
|  |
|  |--Perl pattern [(.)] with [$1\r\n]
|  |     [ ] Match case
|  |     [ ] Whole words only
|  |     [ ] Case sensitive replace
|  |     [ ] Prompt on replace
|  |     [ ] Skip prompt if identical
|  |     [ ] First only
|  |     [ ] Extract matches
|  |     Maximum text buffer size 4096
|  |     [ ] Maximum match (greedy)
|  |     [ ] Allow comments
|  |     [ ] '.' matches newline
|  |     [X] UTF-8 Support
|  |   
|  |--Remove blanks from Start of Line
|  |   
|  +--Remove blank lines
|      
|--Comment...
|  |  Count duplicate lines - input need not be sorted
|  |
|  +--Count duplicate lines
|        [ ] Ignore case
|        Start column 1
|        Length 5
|        [X] Include One
|        format: %1:s\t%0:d
|      
+--Comment...
      Output is unsorted
    
NB. The TextPipe sort filter is for ANSI only, whereas all the files I wish to analyse are UTF-8.
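For readers without TextPipe, the steps of the filter above can be approximated in a short Python sketch. This is not an export of the filter, only an illustration under a few assumptions: the input is UTF-8 Verse Per Line text whose first two space-delimited fields form the reference, whitespace is not counted (the filter strips blanks and removes blank lines), and "character" means a single Unicode code point. The names `count_characters` and `format_counts` are hypothetical.

```python
from collections import Counter


def count_characters(lines, reference_fields=2):
    """Count every non-whitespace code point in VPL verse text."""
    counts = Counter()
    for line in lines:
        # Drop the leading reference fields, e.g. "Gen" and "1:1".
        parts = line.rstrip("\r\n").split(" ", reference_fields)
        text = parts[reference_fields] if len(parts) > reference_fields else ""
        # Whitespace is skipped, as the filter removes blanks and blank lines.
        counts.update(ch for ch in text if not ch.isspace())
    return counts


def format_counts(counts):
    """Render one 'character<TAB>count' row per distinct character."""
    return [f"{ch}\t{n}" for ch, n in counts.items()]


sample = ["Gen 1:1 In the beginning, God created ..."]
for row in format_counts(count_characters(sample)):
    print(row)
```

To process a real file, the list could be replaced by a file object opened with `open(path, encoding="utf-8")`. Like the filter, the rows come out unsorted, in order of first occurrence.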

Re: Listing character frequencies in a Unicode text file.

Posted: Tue Jul 13, 2010 5:54 am
by dfhtextpipe
Though the UTF-8 output file is unsorted, it can easily be sorted using Notepad++ TextFX Tools.

Sorting lines of Unicode text is a huge topic in its own right, one which is too involved to take up further here.
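If plain code-point order is enough (as in the original request), the unsorted output can also be sorted with a few lines of Python. This is a minimal sketch, assuming each row has the form character, tab, count, with a single code point as the character. It does not attempt language-aware collation. The names `sort_by_code_point` and `describe` are hypothetical.

```python
def sort_by_code_point(rows):
    """Sort 'character<TAB>count' rows by the code point of the character."""
    def key(row):
        char = row.split("\t", 1)[0]
        return [ord(c) for c in char]
    return sorted(rows, key=key)


def describe(row):
    """Prefix a row with its code point in U+XXXX notation."""
    char, count = row.split("\t", 1)
    return f"U+{ord(char):04X}\t{char}\t{count}"


rows = ["a\t3", "\u00e9\t1", "A\t2", "\u05d0\t5"]
for row in sort_by_code_point(rows):
    print(describe(row))
```

Adding the `U+XXXX` column makes it easier to spot characters that look alike but have different codes.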

Re: Listing character frequencies in a Unicode text file.

Posted: Tue Jul 13, 2010 6:04 am
by dfhtextpipe
Such a filter can be very useful for analyzing large text files, such as those encountered in Bible translations.

For example, it can help find any unexpected characters, such as those for a different language.

It can also help check for unmatched punctuation pairs, e.g. braces, brackets, parentheses and quotation marks.
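The punctuation check can be sketched in Python by comparing the totals for each opening and closing character. This is only an illustration: the `PAIRS` table and the function name `unbalanced_pairs` are assumptions, and equal totals do not prove that the pairs are correctly nested or placed.

```python
from collections import Counter

# Opening character -> closing character (an illustrative, incomplete table).
PAIRS = {"(": ")", "[": "]", "{": "}", "\u201c": "\u201d"}


def unbalanced_pairs(counts, pairs=PAIRS):
    """Return {(open, close): (n_open, n_close)} for pairs whose totals differ."""
    return {
        (o, c): (counts.get(o, 0), counts.get(c, 0))
        for o, c in pairs.items()
        if counts.get(o, 0) != counts.get(c, 0)
    }


counts = Counter("(In the beginning [God created ...]")
print(unbalanced_pairs(counts))
```

Single quotation marks are left out of the table on purpose, because the closing mark often doubles as an apostrophe and the totals would differ in perfectly good text.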

My filter assumes that the input text is already in Verse Per Line format, with the verse reference at the start of each line of text.

Gen 1:1 In the beginning, God created ...

Re: Listing character frequencies in a Unicode text file.

Posted: Tue Jul 13, 2010 7:14 am
by DataMystic Support
Nice work David!

Could you possibly zip and upload this filter for others?

Re: Listing character frequencies in a Unicode text file.

Posted: Wed Jul 14, 2010 6:04 am
by dfhtextpipe
I suppose I could, yet my filter (as it stands) still points to a particular input file that I was working on.
For the time being, I don't wish to mess with that.

What's wrong with other readers just pasting the code from the box above?

Re: Listing character frequencies in a Unicode text file.

Posted: Wed Jul 14, 2010 9:03 am
by DataMystic Support
Thanks David,

Readers can't just paste your text above - that is the human readable form.

In order to easily re-create your filter, either the .fll file, or the File Menu\Export in VBScript or JScript form is needed.

An input file is not necessary.

Re: Listing character frequencies in a Unicode text file.

Posted: Thu Jul 15, 2010 2:26 am
by dfhtextpipe
Here goes.

I've removed the input file specifier, and amplified the comments for the first subfilter.
This section can be disabled (ticked) to obtain a more general application.

I usually structure and comment my subfilters, to make them easier to maintain and understand.

This simple filter is made freely available for anyone to use.

btw. I have assumed that there is no requirement to count spaces!

David

Re: Listing character frequencies in a Unicode text file.

Posted: Thu Jul 15, 2010 9:19 am
by DataMystic Support
Thank you David!

Re: Listing character frequencies in a Unicode text file.

Posted: Fri Jul 16, 2010 12:47 am
by dfhtextpipe
Simon,

I think that there is a bug in the Special filter called Count duplicate lines.

I have "Ignore case" unticked in the form, but it is not counting uppercase and lowercase letters correctly.

After stripping the reference columns from an original VPL file, I have used Notepad++ to count occurrences manually letter by letter.

The TextPipe count results do not match the Notepad++ results.

Example:
Using Notepad++:
  • A occurs 5995 times
  • a occurs 335106 times
Using TextPipe:
  • A not listed
  • a occurs 341101 times
Now clearly 5995 + 335106 = 341101, so this implies that it is ignoring case, even though I told it not to do so.

A few uppercase letters did have a tally using TextPipe, where the uppercase instance occurred before the lowercase.

Using Notepad++:
  • D occurs 7088 times
  • d occurs 155666 times
Using TextPipe:
  • D occurs 162754 times
  • d not listed
The help for this filter states:
"If ignore case is checked, lines do not need to be cased identically to be considered duplicates. Two identical lines, one in upper case, and one in lower case, would be considered duplicates and counted together by this filter. If ignore case is unchecked, the lines must be absolutely identical to be considered duplicates. The case checking routines are ANSI aware, so their behaviour may change depending on your locale."
NB. I am using English (United Kingdom) in Windows Regional Settings.

David

Re: Listing character frequencies in a Unicode text file.

Posted: Fri Jul 16, 2010 4:35 pm
by dfhtextpipe
Simon,

I also tried ticking the box marked "Ignore case" to compare with the unticked results.
The two output files were identical!

Perhaps the programmer has 'remmed out' this part of the code (to test something else) and forgot to enable it again afterward?
Or maybe there is an 'off by one' error, such that lines as short as a single character are incorrectly processed?

Either way, there is a serious bug.

David

Re: Listing character frequencies in a Unicode text file.

Posted: Mon Jul 19, 2010 1:09 pm
by DataMystic Support
Thanks David - will check into it.

Re: Listing character frequencies in a Unicode text file.

Posted: Tue Jul 20, 2010 11:16 pm
by DataMystic Support
This also affects the Remove Duplicate Lines filter - Ignore Case option.

It will be fixed in the 8.6.2 release.

Re: Listing character frequencies in a Unicode text file.

Posted: Tue Jul 20, 2010 11:17 pm
by DataMystic Support
Note - the letter shown in the output file depends on what is found first in the input file.
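The behaviour in this thread can be illustrated with a small Python sketch. It is not TextPipe's code, only an assumed model of a duplicate-line counter whose case-insensitive mode keeps the first spelling it encounters. The function name `count_lines` is hypothetical.

```python
def count_lines(lines, ignore_case=False):
    """Count duplicate lines; with ignore_case the first-seen spelling is kept."""
    counts = {}  # comparison key -> [text to display, count]
    for line in lines:
        key = line.casefold() if ignore_case else line
        if key not in counts:
            counts[key] = [line, 0]
        counts[key][1] += 1
    return {display: n for display, n in counts.values()}


lines = ["D", "d", "a", "A", "a"]
print(count_lines(lines))                    # {'D': 1, 'd': 1, 'a': 2, 'A': 1}
print(count_lines(lines, ignore_case=True))  # {'D': 2, 'a': 3}
```

The second result has the same shape as the reported TextPipe output: one combined tally per letter, listed under whichever case appeared first in the input.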