Strange sorting order - why?

#1 Post by PCXT »

Hello
I'm looking for an explanation of the sorting behavior in Debian, and probably in many other GNU tools too (unfortunately I don't have access to other Unix systems with the pl_PL.UTF-8 locale). Let's say we have a file tosort.txt containing:

Code:

atx001b.jpg
atx001l.jpg
atx001k.jpg
atx001j.jpg
atx001m.jpg
atx001h.jpg
atx001.jpg
atx001i.jpg
atx001чернее.jpg
atx001z.jpg

Now:

Code:

user@m4800:/tmp$ echo $LANG
pl_PL.UTF-8

user@m4800:/tmp$ sort tosort.txt
atx001b.jpg
atx001h.jpg
atx001i.jpg
atx001j.jpg
atx001.jpg
atx001k.jpg
atx001l.jpg
atx001m.jpg
atx001z.jpg
atx001чернее.jpg

user@m4800:/tmp$ LANG=C sort tosort.txt
atx001.jpg
atx001b.jpg
atx001h.jpg
atx001i.jpg
atx001j.jpg
atx001k.jpg
atx001l.jpg
atx001m.jpg
atx001z.jpg
atx001чернее.jpg

user@m4800:/tmp$ LANG=en_US.UTF-8 sort tosort.txt
atx001.jpg
atx001b.jpg
atx001h.jpg
atx001i.jpg
atx001j.jpg
atx001k.jpg
atx001l.jpg
atx001m.jpg
atx001z.jpg
atx001чернее.jpg

What causes atx001.jpg to land specifically between the j and k entries in the pl_PL.UTF-8 locale?
Is this a bug or specified behavior?

P.S. This is not specific to sort: it "leaks" into many other programs. It can be seen, e.g., in GTK file dialogs (especially the hints) and in some file managers, so it may be a characteristic of some library function and not of sort itself.

#2 Post by sunrat »

I can't explain it, but I was intrigued, so I checked. The only way I got a sensible sort was with LANG=C; even LANG=en_US.UTF-8 gave the strange result. I'm in Australia, so I checked en_AU.UTF-8 too.

Code:

$ LANG=en_US.UTF-8 sort tosort.txt
atx001b.jpg
atx001h.jpg
atx001i.jpg
atx001j.jpg
atx001.jpg
atx001k.jpg
atx001l.jpg
atx001m.jpg
atx001z.jpg
atx001чернее.jpg

$ LANG=en_AU.UTF-8 sort tosort.txt
atx001b.jpg
atx001h.jpg
atx001i.jpg
atx001j.jpg
atx001.jpg
atx001k.jpg
atx001l.jpg
atx001m.jpg
atx001z.jpg
atx001чернее.jpg

$ LANG=C sort tosort.txt
atx001.jpg
atx001b.jpg
atx001h.jpg
atx001i.jpg
atx001j.jpg
atx001k.jpg
atx001l.jpg
atx001m.jpg
atx001z.jpg
atx001чернее.jpg
“ computer users can be divided into 2 categories:
Those who have lost data
...and those who have not lost data YET ”
Remember to BACKUP!
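
The two en_US.UTF-8 runs in this thread disagree: post #1 shows the same order as LANG=C, while the run in post #2 shows the pl_PL.UTF-8 order. One possible explanation, which is an assumption and is not confirmed anywhere in the thread, is that en_US.UTF-8 was not generated on the first machine, in which case programs such as sort quietly stay in the C locale. A minimal sketch for checking that, where locale_available is a hypothetical helper name:

```shell
# Hypothetical helper: succeed if the named locale appears in `locale -a`.
# `locale -a` usually spells the codeset as "utf8", so both sides are
# lower-cased and "utf-8" is normalised to "utf8" before comparing.
locale_available() {
    want=$(printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/utf-8/utf8/')
    locale -a 2>/dev/null | tr '[:upper:]' '[:lower:]' | sed 's/utf-8/utf8/' |
        grep -qxF -- "$want"
}

# Report which of the locales used in this thread exist on this machine.
for name in pl_PL.UTF-8 en_US.UTF-8 en_AU.UTF-8; do
    if locale_available "$name"; then
        echo "$name: available"
    else
        echo "$name: missing (sort would likely behave as in the C locale)"
    fi
done
```

If a locale is reported missing, a run with that LANG value says nothing about how the locale really collates.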

#3 Post by trinidad »

Curious behaviour. It may be that after the double j (i.e. j.jpg) the single .jpg is placed as if beginning a new sequence; or j.j would be the last value in a sequence beginning with a.j, so the sort starts over with .j when the remaining files (i.e. k.j) are placed. I've sometimes found that sorted recursive image display with a framebuffer device is only made reliable by putting a timestamp directly in the file name.

TC
You can't believe your eyes if your imagination is out of focus.

#4 Post by p.H »

trinidad wrote: It may be that after the double j (i.e. j.jpg) the single .jpg is placed as if beginning a new sequence; or j.j would be the last value in a sequence beginning with a.j, so the sort starts over with .j when the remaining files (i.e. k.j) are placed.

That sounds overly complicated, but it gave me an idea: maybe the period "." is simply ignored?

#5 Post by PCXT »

p.H wrote: That sounds overly complicated but gave me an idea: maybe the period "." is just ignored?

Thank you, I checked it:

Code:

user@m4800:/tmp$ cat tosort.txt
atx001b.bat
atx001l.bat
atx001k.bat
atx001j.bat
atx001m.bat
atx001h.bat
atx001.bat
atx001i.bat
atx001чернее.bat
atx001z.bat
atx001a.bat
atx001c.bat

user@m4800:/tmp$ sort tosort.txt
atx001a.bat
atx001.bat
atx001b.bat
atx001c.bat
atx001h.bat
atx001i.bat
atx001j.bat
atx001k.bat
atx001l.bat
atx001m.bat
atx001z.bat
atx001чернее.bat

It looks like it does have something to do with the dot. Now the non-suffixed item is placed before "b", while in the previous example it was placed after "j".

EDIT: Yes, after trying "bpg" as well I can see it: the dot appears to be ignored entirely.
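
The effect can be reproduced without the pl_PL.UTF-8 locale. The sketch below is an emulation, not glibc's actual collation algorithm: it assumes the only relevant difference is that "." carries no weight, so it strips the dots to build a sort key and compares the keys bytewise. sort_ignoring_dots is a hypothetical helper name.

```shell
# Hypothetical helper: sort stdin as if "." did not exist.
# Decorate each line with a dot-free key, sort on that key bytewise,
# then strip the key again.
sort_ignoring_dots() {
    tab=$(printf '\t')
    awk '{ key = $0; gsub(/[.]/, "", key); print key "\t" $0 }' |
        LC_ALL=C sort -t "$tab" -k1,1 |
        cut -f2-
}

printf '%s\n' atx001b.jpg atx001k.jpg atx001j.jpg atx001.jpg |
    sort_ignoring_dots
```

With the dots removed, the keys are atx001jjpg, atx001jpg and atx001kjpg, so the un-suffixed name falls between j and k. With a .bat extension the key atx001bat falls between atx001abat and atx001bbat, which matches the second experiment.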

#6 Post by sunrat »

Curious behaviour, almost certainly not the desired one for most users. Now why does it sort correctly when LANG=C is set?
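
A likely answer, stated from general knowledge of the C locale and not taken from the thread: in the C (POSIX) locale, sort compares raw byte values with no language-specific weighting. The byte for "." is 0x2e, which is lower than any ASCII digit or letter, so atx001.jpg sorts before atx001b.jpg. A small sketch that prints the byte values involved (byte_hex is a hypothetical helper name):

```shell
# Hypothetical helper: print the bytes of a string as hex, without separators.
byte_hex() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

# Byte values of the characters that decide the order in the examples.
for ch in . 1 b j k; do
    printf '%s = 0x%s\n' "$ch" "$(byte_hex "$ch")"
done

# Bytewise comparison: the names first differ at the 7th character,
# "." versus "b", and "." has the lower byte value.
printf '%s\n' atx001b.jpg atx001.jpg | LC_ALL=C sort
```

LC_ALL is used here instead of LANG because it takes precedence over any LC_COLLATE or LANG value already present in the environment.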

#7 Post by Head_on_a_Stick »

http://unicode.org/reports/tr10/
deadbang

#8 Post by sunrat »

Head_on_a_Stick wrote: http://unicode.org/reports/tr10/

Sorry, I just woke up. Too much reading. Which section is relevant?

#9 Post by Head_on_a_Stick »

I have no idea, I'm just about to go to sleep :mrgreen:

But that's the documentation so...

#10 Post by sunrat »

Head_on_a_Stick wrote: I have no idea, I'm just about to go to sleep :mrgreen: But that's the documentation so...

Now your dreams will be in Unicode. Sleep tight. :P

#11 Post by drasar »

You can change the sorting behaviour by setting the LC_COLLATE variable:
https://wiki.archlinux.org/index.php/lo ... _collation
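
A minimal sketch of that approach, assuming a POSIX-style environment where the precedence is LC_ALL, then LC_COLLATE, then LANG. Setting only LC_COLLATE=C changes the comparison order while leaving the other locale categories, such as the language of messages, alone. It has no effect while LC_ALL is set, so the sketch empties LC_ALL for the one command. sort_c_collate is a hypothetical helper name.

```shell
# Hypothetical helper: run sort with bytewise collation only.
# LC_ALL is emptied for this invocation because a non-empty LC_ALL
# would take precedence over LC_COLLATE.
sort_c_collate() {
    LC_ALL= LC_COLLATE=C sort "$@"
}

printf '%s\n' atx001j.jpg atx001.jpg atx001k.jpg | sort_c_collate

# To make the choice persistent for a login session, a line such as
#   export LC_COLLATE=C
# could go into the shell profile (again assuming LC_ALL is not set).
```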

#12 Post by pylkko »

The sort man page lists several flags that change how lines are compared, for example one that considers only alphanumeric characters. See:
https://www.gnu.org/software/coreutils/ ... invocation
Alternatively, you can use a sorting library in a scripting language that behaves the same way on every platform, for example sort or natsort in Python.
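
One caveat about the flags: the option that restricts comparison to alphanumeric characters, -d (--dictionary-order), skips the dot as well, so it reproduces the puzzling order even in the C locale instead of avoiding it. A short sketch with ASCII-only names, assuming a sort that implements the POSIX -d option:

```shell
# -d makes only blanks and alphanumerics significant, so "." is skipped
# and atx001.jpg is compared as if it were atx001jpg.
printf '%s\n' atx001k.jpg atx001.jpg atx001b.jpg atx001j.jpg |
    LC_ALL=C sort -d
```

Here atx001.jpg again lands between the j and k entries.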

#13 Post by sunrat »

Thanks, pylkko. I looked at the man page yesterday and tried a few options to no avail. Another check today with a fresh brain showed that --version-sort works as one would expect:

Code:

$ sort -V tosort.txt
atx001.jpg
atx001b.jpg
atx001h.jpg
atx001i.jpg
atx001j.jpg
atx001k.jpg
atx001l.jpg
atx001m.jpg
atx001z.jpg
atx001чернее.jpg
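
Worth noting, assuming GNU sort: -V does more than give the dot its weight back. It also compares runs of digits by numeric value, so it can reorder names that a plain bytewise sort would order differently. A short sketch with made-up file names:

```shell
# Bytewise order: after "atx1", "." sorts before "0", and "1" before "2",
# so the result is atx1.jpg, atx10.jpg, atx2.jpg.
printf '%s\n' atx10.jpg atx2.jpg atx1.jpg | LC_ALL=C sort

# Version order: digit runs compare as numbers, so 1 < 2 < 10.
printf '%s\n' atx10.jpg atx2.jpg atx1.jpg | LC_ALL=C sort -V
```

If only the dot handling matters and numbers should keep their bytewise order, a C collation setting is the narrower change.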