
Improve the Sort filter

Posted: Sun Jan 10, 2016 2:53 am
by dfhtextpipe
The Sort filter is described as:
The sort type controls the method by which items are sorted. The available options are:
· ANSI sort (case insensitive)
· ANSI sort (case sensitive) - faster than case insensitive as no case-mapping is performed
· ASCII sort (case insensitive)
· ASCII sort (case sensitive) - faster than case insensitive as no case-mapping is performed
· Numeric sort
· Sort by length of line
It doesn't support sorting of UTF-8 text.

On the other hand, I regularly use the Count Duplicate Lines filter, and find that it handles UTF-8 text quite happily, and that its output is sorted.

So why not improve the Sort filter using the code underlying the Count Duplicate Lines filter?

Please!

David

Re: Improve the Sort filter

Posted: Tue Jan 19, 2016 4:39 pm
by DataMystic Support
Hi David,

That is really strange, because they both use the same underlying list to do comparisons.

Do you have a set of test files and filters that you could share with me?

Thanks,

Simon

Re: Improve the Sort filter

Posted: Thu Jan 21, 2016 10:12 pm
by dfhtextpipe
I'll get back to you with some test files.

Remind me in a week if I forget, please.

David

Re: Improve the Sort filter

Posted: Thu Jan 21, 2016 11:44 pm
by DataMystic Support
Will do!

Re: Improve the Sort filter

Posted: Fri Jan 22, 2016 3:21 am
by dfhtextpipe
The Sort filter (with ANSI case sensitive selected) and the Count Duplicate Lines filter give identical results
(once the count column has been removed).

However, neither filter is good at sorting Unicode text files.
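
To illustrate the kind of limitation meant here, below is a minimal Python sketch. It is not TextPipe code; the word list and the `rough_key` helper are made up for illustration. A plain code-point sort places every accented letter after "z", while a crude key that strips combining marks and folds case gets closer to dictionary order. It only approximates the idea and is not a real collation algorithm.

```python
import unicodedata

# Illustrative word list; escapes guarantee precomposed letters
# (U+00E9 = e with acute, U+00C9 = E with acute).
words = ["zebra", "\u00e9clair", "apple", "\u00c9mile", "eagle"]

# A plain sort compares code points, so the accented letters
# (U+00C9, U+00E9) land after "z" (U+007A).
codepoint_order = sorted(words)

def rough_key(s):
    """Hypothetical helper: decompose, drop combining marks, fold case."""
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # The original string is a tie-breaker so the order is deterministic.
    return (base.casefold(), s)

rough_order = sorted(words, key=rough_key)

print(codepoint_order)
print(rough_order)
```

A proper solution would use collation tables rather than this kind of mark stripping, which is what the options suggested further down are about.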

Worse than that, the Count Duplicate Lines filter Help page doesn't inform users of its sort limitations.
At least the Sort filter shows the nine available options in its drop-down selector.

What's really needed (IMHO) is a Sort filter that provides the following further options:
  • UCA = Unicode Collation Algorithm
  • CLDR = Common Locale Data Repository
  • EOR = European Ordering Rules
In addition, it would be very useful to provide a custom sort method for some scripts, such as Unicode Hebrew with accents and points.
For another slant on this in particular, see https://github.com/ninjaaron/ivsort.py
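
One possible shape for such a custom sort, sketched in Python under stated assumptions: the `consonant_key` helper is hypothetical, it is not taken from the ivsort.py project linked above, and it simply ignores every nonspacing mark (Unicode category Mn, which covers the Hebrew points and accents) when comparing, using the full string only to break ties.

```python
import unicodedata

def consonant_key(word):
    """Hypothetical key: base letters first, full pointed string second."""
    base = "".join(ch for ch in word if unicodedata.category(ch) != "Mn")
    return (base, word)

alef_bet = "\u05d0\u05b1"[:1] + "\u05d1"   # alef, bet (unpointed)
alef_qamats_tav = "\u05d0\u05b8\u05ea"     # alef + qamats (a point), tav

words = [alef_qamats_tav, alef_bet]

# Code-point order: the point U+05B8 sorts before the letter U+05D1,
# so the pointed word comes first even though tav follows bet.
by_codepoint = sorted(words)

# Consonant order: points are ignored, so alef-bet precedes alef-tav.
by_consonants = sorted(words, key=consonant_key)
```

With these two words, the code-point sort puts the pointed alef-tav word first, while the consonant-based key puts alef-bet first.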

Unicode text sorts should be applicable for both UTF-8 and UTF-16LE input data.
i.e. one shouldn't have to convert UTF-8 to UTF-16 before doing the sort.
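
As a rough illustration of why the input encoding need not matter, here is a small Python sketch. The `sort_lines` helper is hypothetical and says nothing about how TextPipe works internally. Both encodings decode to the same code points, so sorting after decoding gives the same line order for either input.

```python
def sort_lines(data: bytes, encoding: str) -> bytes:
    """Hypothetical helper: decode, sort lines by code point, re-encode."""
    lines = data.decode(encoding).split("\n")
    return "\n".join(sorted(lines)).encode(encoding)

# Same text, two encodings (U+00E9 = e with acute).
text = "pear\n\u00e9clair\napple"

utf8_sorted = sort_lines(text.encode("utf-8"), "utf-8")
utf16_sorted = sort_lines(text.encode("utf-16-le"), "utf-16-le")

# The decoded results are identical regardless of the input encoding.
same = utf8_sorted.decode("utf-8") == utf16_sorted.decode("utf-16-le")
print(same)
```

The sort here is still a simple code-point sort; a collation-aware key could be swapped in without changing the decode and re-encode steps.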

Best regards,

David