Improve the Sort filter

Get help with installation and running here.

Moderators: DataMystic Support, Moderators

dfhtextpipe
Posts: 987
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Improve the Sort filter

Post by dfhtextpipe »

The Sort filter is described as:
The sort type controls the method by which items are sorted. The available options are:
· ANSI sort (case insensitive)
· ANSI sort (case sensitive) - faster than case insensitive as no case-mapping is performed
· ASCII sort (case insensitive)
· ASCII sort (case sensitive) - faster than case insensitive as no case-mapping is performed
· Numeric sort
· Sort by length of line
It doesn't support sorting of UTF-8 text.

On the other hand, I regularly use the Count Duplicate Lines filter, and I find that it handles UTF-8 text quite happily and that its output is sorted.

So why not improve the Sort filter using the code underlying the Count Duplicate Lines filter?

Please!

David
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Improve the Sort filter

Post by DataMystic Support »

Hi David,

That is really strange, because they both use the same underlying list to do comparisons.

Do you have a set of test files and filters that you could share with me?

Thanks,

Simon
dfhtextpipe
Posts: 987
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Improve the Sort filter

Post by dfhtextpipe »

I'll get back to you with some test files.

Remind me in a week if I forget, please.

David
DataMystic Support
Site Admin
Posts: 2227
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Improve the Sort filter

Post by DataMystic Support »

Will do!
dfhtextpipe
Posts: 987
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Improve the Sort filter

Post by dfhtextpipe »

The Sort filter (with ANSI case sensitive selected) and the Count Duplicate Lines filter give identical results (once the count column has been removed).

However, neither filter is good at sorting Unicode text files.

Worse than that, the Help page for the Count Duplicate Lines filter doesn't tell users about its sort limitations.
At least the Sort filter shows its nine available options in the drop-down selector.

What's really needed (IMHO) is a Sort filter that provides the following further options:
  • UCA = Unicode Collation Algorithm
  • CLDR = Common Locale Data Repository
  • EOR = European Ordering Rules
In addition, it would be very useful to provide a custom sort method for some scripts, such as Unicode Hebrew with accents and points.
For another slant on this in particular, see https://github.com/ninjaaron/ivsort.py
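
To illustrate the gap between plain code-point ordering and a collation-style ordering, here is a minimal Python sketch. It is only a crude stand-in for the first comparison level of UCA; it does not implement UCA, CLDR tailoring or EOR, and it says nothing about how TextPipe works internally. The helper name `approx_collation_key` is made up for this example.

```python
import unicodedata

def approx_collation_key(text):
    """Hypothetical helper: a rough, UCA-like sort key (NOT real UCA).

    Primary part: base letters with combining marks removed, case-folded.
    Tie-breaker: the NFD-normalised string, so the order stays deterministic.
    """
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), decomposed)

words = ["zebra", "Äpfel", "apple", "écoute", "Zoo"]

# Plain code-point order: capitals first, accented letters after "z".
print(sorted(words))

# Approximate collation order: accents and case no longer dominate.
print(sorted(words, key=approx_collation_key))
```

Hebrew points and accents are combining marks, so a key like this ignores them at the first level and only uses them to break ties. A real implementation would more likely rely on an ICU-style collation library.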

Unicode text sorts should be applicable for both UTF-8 and UTF-16LE input data.
i.e. one shouldn't have to convert UTF-8 to UTF-16 before doing the sort.
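
As a rough sketch of what I mean, in Python: decode the input according to its encoding, sort the decoded lines, and write the result back in the same encoding. The function name `sort_lines` is hypothetical, the input is assumed to have no BOM and to use "\n" line endings, and the sort here is plain code-point order (any key function could be swapped in).

```python
def sort_lines(data, encoding):
    """Hypothetical helper: sort newline-separated text supplied as bytes.

    Assumes no BOM and "\n" line endings. The lines are compared as decoded
    text, so the result does not depend on the byte encoding of the input.
    """
    lines = data.decode(encoding).split("\n")
    lines.sort()  # plain code-point order; pass key=... for other orderings
    return "\n".join(lines).encode(encoding)

text = "zebra\nĀda\napple"

utf8_result = sort_lines(text.encode("utf-8"), "utf-8")
utf16_result = sort_lines(text.encode("utf-16-le"), "utf-16-le")

# Both encodings give the same line order once decoded.
assert utf8_result.decode("utf-8") == utf16_result.decode("utf-16-le")
```

Comparing the raw UTF-16LE bytes instead would put "Āda" first, because the low byte of U+0100 is 0x00. That is why the decoding step matters.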

Best regards,

David