Count Duplicate Lines filter

Get help with installation and running here.

Moderators: DataMystic Support, Moderators

dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Count Duplicate Lines filter

Post by dfhtextpipe »

This topic is copied from one of my comments in the thread "Please add Unicode support to the Text to Word List filter".
I've reposted it here to focus attention on the Count Duplicate Lines filter.


My existing Text to Word List filter didn't cope properly with soft hyphens,
presumably because U+00AD is beyond ASCII, being part of Windows-1252 (aka ANSI).

Nothing indicates that characters U+00A0 to U+00FF are unsupported by the Count Duplicate Lines filter,
which follows the Text to Word List subfilter in my two-stage filter.
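To make the soft-hyphen point concrete, here is a minimal Python sketch (not TextPipe code; the helper names are my own, purely for illustration). It shows that U+00AD lies outside 7-bit ASCII yet is the single byte 0xAD in Windows-1252, and that the whole U+00A0 to U+00FF range maps byte-for-byte in that code page.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN

def is_ascii(ch: str) -> bool:
    """True if the character fits in 7-bit ASCII (U+0000 to U+007F)."""
    return ord(ch) < 128

def cp1252_byte(ch: str) -> int:
    """Return the single byte that encodes ch in Windows-1252."""
    return ch.encode("cp1252")[0]

def strip_soft_hyphens(text: str) -> str:
    """Hypothetical pre-processing step: drop soft hyphens before word splitting."""
    return text.replace(SOFT_HYPHEN, "")

# In Windows-1252, every character from U+00A0 to U+00FF is encoded
# as the byte with the same numeric value.
upper_range_is_identity = all(
    chr(cp).encode("cp1252") == bytes([cp]) for cp in range(0xA0, 0x100)
)

print(is_ascii(SOFT_HYPHEN))           # False
print(hex(cp1252_byte(SOFT_HYPHEN)))   # 0xad
print(upper_range_is_identity)         # True
```

A filter that only recognises bytes below 0x80 as word characters would therefore treat 0xAD as something else, which would be consistent with the behaviour I saw.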

Contrast this with the Sort filter, which documents its options:
Sort Type

The sort type controls the method by which items are sorted. The available options are:
· ANSI sort (case insensitive)
· ANSI sort (case sensitive) - faster than case insensitive as no case-mapping is performed
· ASCII sort (case insensitive)
· ASCII sort (case sensitive) - faster than case insensitive as no case-mapping is performed
...
So before extending the Text to Word List filter to cope with Unicode in general,
please could you first extend the Count Duplicate Lines filter to support ANSI?
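For anyone unsure what the sort options mean in practice, here is a rough Python sketch of the difference. It is only an illustration under my own assumptions: I'm taking "ASCII" case mapping to mean folding A-Z only, and "ANSI" case mapping to mean also folding accented letters such as É. The function `ascii_lower` is a hypothetical helper, not anything from TextPipe.

```python
words = ["abased", "Aaron", "a", "Abba", "able"]

# Case-sensitive: compare raw code points, so every capital sorts before any lowercase.
case_sensitive = sorted(words)

# Case-insensitive: map to a common case first (the extra "case-mapping" step).
case_insensitive = sorted(words, key=str.lower)

def ascii_lower(s: str) -> str:
    """Fold only A-Z to lowercase; characters beyond ASCII are left alone."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)

print(case_sensitive)     # ['Aaron', 'Abba', 'a', 'abased', 'able']
print(case_insensitive)   # ['a', 'Aaron', 'abased', 'Abba', 'able']

# An ASCII-only fold leaves U+00C9 (É) untouched; a wider fold maps it to U+00E9 (é).
print(ascii_lower("\u00c9cole") == "\u00c9cole")    # True
print("\u00c9cole".lower() == "\u00e9cole")         # True
```

The case-sensitive variants skip the mapping step entirely, which is why the help describes them as faster.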

Meanwhile, I'll tweak my two-stage filter to investigate further.

David
dfhtextpipe
Posts: 986
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Count Duplicate Lines filter

Post by dfhtextpipe »

The help for the Count Duplicate Lines filter states:
"The file need NOT be sorted prior to this filter."
Yet it's clear that the filter does actually sort the results.
The sort seems to be:
"ASCII sort (case sensitive) - faster than case insensitive as no case-mapping is performed."
Here's an example of the first few output lines from my KJV NT Word List.

Code:

001911	a
000004	Aaron
000001	Aaron's
000001	Abaddon
000004	abased
000001	abasing
000003	Abba
000004	Abel
000001	Abhor
000001	abhorrest
000003	Abia
000001	Abiathar
000032	abide
000020	abideth
000004	abiding
000001	Abilene
000003	ability
000002	Abiud
000061	able
000001	aboard
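
For comparison, here is a minimal Python sketch of counting duplicate lines without pre-sorting the input. It is not how TextPipe implements the filter; the function name and the `sort_key` parameter are hypothetical, and the output format (six-digit zero-padded count, a tab, then the line) is inferred from the listing above.

```python
from collections import Counter

def count_duplicate_lines(lines, sort_key=None):
    """Count identical lines; the input does not need to be sorted.

    sort_key=None orders the distinct lines by code point (case-sensitive);
    sort_key=str.lower orders them case-insensitively.
    """
    counts = Counter(lines)  # hash-based tally, input order is irrelevant
    ordered = sorted(counts, key=sort_key)
    return [f"{counts[line]:06d}\t{line}" for line in ordered]

sample = ["abide", "a", "Aaron", "a", "abide", "a"]

for row in count_duplicate_lines(sample, sort_key=str.lower):
    print(row)
# 000003	a
# 000001	Aaron
# 000002	abide
```

With `sort_key=str.lower` the capitalised and lowercase words interleave, as they do in the listing; with plain code-point order (`sort_key=None`) every capitalised word would come before any lowercase one.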
David
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Count Duplicate Lines filter

Post by DataMystic Support »

Just found this - the ANSI change was made some time ago.