Count Duplicate Lines filter

Posted: Tue May 29, 2012 4:01 am
by dfhtextpipe
This new topic is copied from one of my comments in the thread headed: Please add Unicode support to the Text to Word List filter.
It's reposted here to focus attention on the Count Duplicate Lines filter.


My existing Text to Word List filter didn't cope properly with soft hyphens,
presumably because U+00AD is beyond ASCII, being part of Windows-1252 (aka ANSI).

There's no clue that characters U+00A0 to U+00FF are unsupported by the Count Duplicate Lines filter,
which follows the Text to Word List subfilter in my two stage filter.
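The character ranges involved can be checked with a few lines of Python. This is only a sketch of the encodings themselves, not of how TextPipe handles them internally; the helper names `is_ascii` and `ansi_byte` are made up for illustration, and `cp1252` is Python's codec name for Windows-1252.

```python
# Sketch only: illustrates the character ranges, not TextPipe's internals.
SOFT_HYPHEN = "\u00ad"

def is_ascii(ch: str) -> bool:
    """True when the character is in the 7-bit ASCII range (U+0000 to U+007F)."""
    return ord(ch) < 0x80

def ansi_byte(ch: str) -> bytes:
    """Encode one character as Windows-1252 (Python codec name 'cp1252')."""
    return ch.encode("cp1252")

# Every character from U+00A0 to U+00FF maps to the single byte of the same value.
upper_range = [chr(cp) for cp in range(0xA0, 0x100)]
all_single_byte = all(ansi_byte(ch) == bytes([ord(ch)]) for ch in upper_range)

print(is_ascii(SOFT_HYPHEN))   # False
print(ansi_byte(SOFT_HYPHEN))  # b'\xad'
print(all_single_byte)         # True
```

So a filter that only treats bytes below 0x80 as text would see the soft hyphen, and everything else in U+00A0 to U+00FF, as out of range.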

Contrast this with the Sort filter, whose help spells out the supported character sets:
Sort Type

The sort type controls the method by which items are sorted. The available options are:
· ANSI sort (case insensitive)
· ANSI sort (case sensitive) - faster than case insensitive as no case-mapping is performed
· ASCII sort (case insensitive)
· ASCII sort (case sensitive) - faster than case insensitive as no case-mapping is performed
...
So before extending the Text to Word List filter to cope with Unicode in general,
please could you first extend the Count Duplicate Lines filter to support ANSI.

Meanwhile, I'll tweak my two stage filter to investigate further.

David

Re: Count Duplicate Lines filter

Posted: Fri Jun 01, 2012 12:41 am
by dfhtextpipe
The help for the Count Duplicate Lines filter includes:
The file need NOT be sorted prior to this filter.
Yet it's clear that the filter does actually sort the results.
The sort seems to be
ASCII sort (case sensitive) - faster than case insensitive as no case-mapping is performed.
Here's an example of the first few output lines from my KJV NT Word List.

001911	a
000004	Aaron
000001	Aaron's
000001	Abaddon
000004	abased
000001	abasing
000003	Abba
000004	Abel
000001	Abhor
000001	abhorrest
000003	Abia
000001	Abiathar
000032	abide
000020	abideth
000004	abiding
000001	Abilene
000003	ability
000002	Abiud
000061	able
000001	aboard
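
For anyone wanting to reproduce this outside TextPipe, here is a minimal Python sketch of counting duplicate lines from unsorted input and emitting the result in sorted order, plus a check of which ordering a listing follows. It is not TextPipe's implementation; the function names and the `key` parameter are assumptions made for illustration, and `str.lower` is only a simple stand-in for a case-insensitive comparison.

```python
from collections import Counter

def count_duplicate_lines(lines, key=None):
    """Count identical lines and return 'NNNNNN<TAB>line' strings, sorted by line.

    The input does not need to be sorted. `key` is a hypothetical parameter
    that selects the comparison: None gives code-point (case-sensitive) order,
    str.lower gives a simple case-insensitive order.
    """
    counts = Counter(lines)
    return [f"{counts[line]:06d}\t{line}" for line in sorted(counts, key=key)]

def is_sorted(words, key=None):
    """True when `words` is already in non-decreasing order under `key`."""
    keyed = [key(w) if key else w for w in words]
    return all(a <= b for a, b in zip(keyed, keyed[1:]))

# Unsorted input with repeats.
words = ["abide", "a", "Abba", "abide", "Aaron", "a", "abased"]
for row in count_duplicate_lines(words, key=str.lower):
    print(row)

# First seven words of the listing above, in the order they appear.
sample = ["a", "Aaron", "Aaron's", "Abaddon", "abased", "abasing", "Abba"]
print(is_sorted(sample, key=str.lower))  # True
print(is_sorted(sample))                 # False
```

On those first seven words the case-insensitive check passes and the code-point check fails (`a` sorts after `Aaron` by code point), which suggests the ordering in the listing may be case insensitive rather than case sensitive.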

Re: Count Duplicate Lines filter

Posted: Wed Feb 24, 2016 8:58 am
by DataMystic Support
Just found this - the ANSI change was made some time ago.