Listing character frequencies in a Unicode text file.

Post by **dfhtextpipe** » Sun Jul 11, 2010 2:50 am

I'd like to have a TextPipe Standard filter that will analyse a Unicode text file, and list the number of occurrences of each code-point found, preferably in sorted order of codes.

Post by **DataMystic Support** » Mon Jul 12, 2010 10:51 am

TextPipe Standard now can make use of the Scripting filter - so you could design a script to do this.

There is no inbuilt filter to do this.

Post by **dfhtextpipe** » Tue Jul 13, 2010 5:48 am

I have accomplished (more or less) what I was after, yet without resorting to a script filter.
I thought it would be instructive to share it.
The breakthrough was realising that the special filter "Count duplicate lines" does not require the input data to be sorted.

Comment...
|  Special filter to count the characters that occur in Bible VPL verse text
|
|--Comment...
|  |  For verse text from VPL files only
|  |
|  +--Remove fields:Space-delimited field 1 .. field 2
|        Delimiter Type: 4
|        Custom delimiter: 
|        [ ] Has Header
|      
|--Comment...
|  |  Join all lines to a single line
|  |
|  |--Remove blanks from End of Line
|  |   
|  +--Replace [\r\n] with [ ]
|        [ ] Match case
|        [ ] Whole words only
|        [ ] Case sensitive replace
|        [ ] Prompt on replace
|        [ ] Skip prompt if identical
|        [ ] First only
|        [ ] Extract matches
|      
|--Comment...
|  |  Add new line after every character
|  |
|  |--Perl pattern [(.)] with [$1\r\n]
|  |     [ ] Match case
|  |     [ ] Whole words only
|  |     [ ] Case sensitive replace
|  |     [ ] Prompt on replace
|  |     [ ] Skip prompt if identical
|  |     [ ] First only
|  |     [ ] Extract matches
|  |     Maximum text buffer size 4096
|  |     [ ] Maximum match (greedy)
|  |     [ ] Allow comments
|  |     [ ] '.' matches newline
|  |     [X] UTF-8 Support
|  |   
|  |--Remove blanks from Start of Line
|  |   
|  +--Remove blank lines
|      
|--Comment...
|  |  Count duplicate lines - input need not be sorted
|  |
|  +--Count duplicate lines
|        [ ] Ignore case
|        Start column 1
|        Length 5
|        [X] Include One
|        format: %1:s\t%0:d
|      
+--Comment...
      Output is unsorted

NB. The TextPipe sort filter is for ANSI only, whereas all the files I wish to analyse are UTF-8.
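
For readers without TextPipe, the same steps can be sketched in Python. This is only an illustration under stated assumptions, not an export of the filter above: the function names `count_code_points` and `format_report` are invented here, and the report simply imitates a tab-separated character/count layout. Unlike the filter, it sorts the result by code point.

```python
from collections import Counter


def count_code_points(lines, strip_reference=True):
    """Count every non-whitespace character in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        if strip_reference:
            # Drop the first two space-delimited fields (e.g. "Gen 1:1"),
            # mirroring the "Remove fields" step for VPL files.
            parts = line.split(" ", 2)
            line = parts[2] if len(parts) > 2 else ""
        # Spaces and line endings are not counted.
        counts.update(ch for ch in line if not ch.isspace())
    return counts


def format_report(counts):
    """Return 'character<TAB>count' lines, sorted by code point."""
    # Python compares strings by code point, so sorted() gives code order.
    return [f"{ch}\t{n}" for ch, n in sorted(counts.items())]


sample = ["Gen 1:1 In the beginning, God created ..."]
print("\n".join(format_report(count_code_points(sample))))
```

To run it on a real file, pass an open file object instead of the sample list, for example `open(path, encoding="utf-8")`, where `path` stands for whatever UTF-8 file is being analysed. Set `strip_reference=False` for text that has no verse reference at the start of each line.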

Post by **dfhtextpipe** » Tue Jul 13, 2010 5:54 am

Though the UTF-8 output file is unsorted, it can easily be sorted using Notepad++ TextFX Tools.

Sorting lines of Unicode text is a huge topic in its own right, one which is too involved to take up further here.

Post by **dfhtextpipe** » Tue Jul 13, 2010 6:04 am

Such a filter can be very useful for analyzing large text files, such as those encountered in Bible translations.

For example, it can help find any unexpected characters, such as those for a different language.

It can also help check for unmatched punctuation pairs, e.g. braces, brackets, parentheses and quotation marks.

My filter assumes that the input text is already in Verse Per Line format, with the verse reference at the start of each line of text.

Gen 1:1 In the beginning, God created ...
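
The punctuation check mentioned above could be sketched as follows, assuming the character tallies are already held in a Python `Counter`. The `PAIRS` list and the name `unbalanced_pairs` are hypothetical choices for this example. Equal totals do not prove that the marks are correctly nested; unequal totals do show that something is missing.

```python
from collections import Counter

# Hypothetical list of opening/closing pairs; extend it for the text at hand.
# Single quotation marks are left out because the closing mark also serves
# as an apostrophe, which would skew the totals.
PAIRS = [("(", ")"), ("[", "]"), ("{", "}"), ("\u201c", "\u201d")]


def unbalanced_pairs(counts):
    """Return (open, close, n_open, n_close) for each pair whose totals differ."""
    # A Counter returns 0 for characters that never occurred.
    return [
        (o, c, counts[o], counts[c])
        for o, c in PAIRS
        if counts[o] != counts[c]
    ]


counts = Counter("In the beginning (God created [the heavens) ...")
for o, c, n_open, n_close in unbalanced_pairs(counts):
    print(f"{o} occurs {n_open} times but {c} occurs {n_close} times")
```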

Post by **DataMystic Support** » Tue Jul 13, 2010 7:14 am

Nice work David!

Could you possibly zip and upload this filter for others?

Post by **dfhtextpipe** » Wed Jul 14, 2010 6:04 am

I suppose I could, yet my filter (as it stands) still points to a particular input file that I was working on.
For the time being, I don't wish to mess with that.

What's wrong with other readers just pasting the code from the box above?

Post by **DataMystic Support** » Wed Jul 14, 2010 9:03 am

Thanks David,

Readers can't just paste your text above - that is the human readable form.

In order to easily re-create your filter, either the .fll file, or the File Menu\Export in VBScript or JScript form is needed.

An input file is not necessary.

Post by **dfhtextpipe** » Thu Jul 15, 2010 2:26 am

Here goes.

I've removed the input file specifier, and amplified the comments for the first subfilter.
This section can be disabled (ticked) to obtain a more general application.

I usually structure and comment my subfilters, to make them easier to maintain and understand.

This simple filter is made freely available for anyone to use.

btw. I have assumed that there is no requirement to count spaces!

David

Post by **DataMystic Support** » Thu Jul 15, 2010 9:19 am

Thank you David!

Post by **dfhtextpipe** » Fri Jul 16, 2010 12:47 am

Simon,

I think that there is a bug in the Special filter called Count duplicate lines.

I have "Ignore case" unticked in the form, but it is not counting uppercase and lowercase letters correctly.

After stripping the reference columns from an original VPL file, I have used Notepad++ to count occurrences manually letter by letter.

The TextPipe count results do not match the Notepad++ results.

Example:
Using Notepad++:

A occurs 5995 times
a occurs 335106 times

Using TextPipe:

A not listed
a occurs 341101 times

Now clearly 5995 + 335106 = 341101, so this implies that it is ignoring case, even though I told it not to do so.

A few uppercase letters did have a tally using TextPipe, where the uppercase instance occurred before the lowercase.

Using Notepad++:

D occurs 7088 times
d occurs 155666 times

Using TextPipe:

D occurs 162754 times
d not listed

The help for this filter states,

If ignore case is checked, lines do not need to be cased identically to be considered duplicates. Two identical lines, one in upper case, and one in lower case, would be considered duplicates and counted together by this filter. If ignore case is unchecked, the lines must be absolutely identical to be considered duplicates. The case checking routines are ANSI aware, so their behaviour may change depending on your locale.

NB. I am using English (United Kingdom) in Windows Regional Settings.

David

Post by **dfhtextpipe** » Fri Jul 16, 2010 4:35 pm

Simon,

I also tried ticking the box marked "Ignore case" to compare with the unticked results.
The two output files were identical!

Perhaps the programmer 'remmed out' this part of the code (to test something else) and forgot to enable it again afterward?
Or maybe there is an 'off by one' error, such that lines as short as a single character are incorrectly processed?

Either way, there is a serious bug.

David

Post by **DataMystic Support** » Mon Jul 19, 2010 1:09 pm

Thanks David - will check into it.

Post by **DataMystic Support** » Tue Jul 20, 2010 11:16 pm

This also affects the Remove Duplicate Lines filter - Ignore Case option.

It will be fixed in the 8.6.2 release.

Post by **DataMystic Support** » Tue Jul 20, 2010 11:17 pm

Note - the letter shown in the output file depends on what is found first in the input file.
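
The symptom reported in this thread can be reproduced with a small Python sketch. This is not TextPipe's implementation, only an illustration of the behaviour described: when case is ignored, each group of lines is reported under whichever spelling appeared first in the input. The function name `count_lines` is invented for the example.

```python
def count_lines(lines, ignore_case):
    """Count duplicate lines; the first-seen spelling labels each group."""
    label = {}   # comparison key -> spelling first seen in the input
    totals = {}  # comparison key -> number of occurrences
    for line in lines:
        key = line.casefold() if ignore_case else line
        label.setdefault(key, line)
        totals[key] = totals.get(key, 0) + 1
    return {label[k]: n for k, n in totals.items()}


lines = ["D", "a", "A", "d", "a"]
print(count_lines(lines, ignore_case=False))  # upper and lower case kept apart
print(count_lines(lines, ignore_case=True))   # merged under the first-seen form
```

With case ignored, "D" and "d" are tallied together under "D" because the uppercase form came first, while "a" and "A" are tallied under "a". That matches the pattern in the counts reported above, where 5995 + 335106 = 341101 and 7088 + 155666 = 162754.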
