[WORKAROUND] The Ls Command's Weird Sort Order

Postby Caitlin » 2016-11-20 06:00

By default, the ls command lists files in "sorted" order -- but it's not the order you would expect from the strcmp() or strcasecmp() functions. For example: "Iberia" will sort ahead of "I'm sure" even though apostrophe (hex 27) is clearly less than b (hex 62).

I need to match up two lists of files, one generated by ls, the other generated by my own program. Clearly, they have to be in the same order. I could sort them both in normal alphabetical order, but the first is quite large and is not sort friendly (path names and file names are segregated).

I wrote a function called weirdcmp() that attempts to compare in the same way ls does: it squeezes out spaces, punctuation, and everything else other than letters, digits, and slashes, and then does a case-insensitive compare on what's left. But it's not perfect -- a test shows I'm still not duplicating what ls is doing.

My questions:

(1) Is there a formal definition of the sort order somewhere?

(2) What do I do if the compare strings are identical after squeezing out the "irrelevant" stuff? Is there a comparison of last resort I should be using?

(3) Some file names have German letters in them -- Ä, Ö, Ü, ä, ö, ü, or ß. What do I do about those? (Especially ß, which is the equivalent of ss or sz.)

I would appreciate any help.

Caitlin
Last edited by Caitlin on 2017-01-18 18:30, edited 1 time in total.
Caitlin
 
Posts: 236
Joined: 2012-05-24 07:32

Re: The Ls Command's Weird Sort Order

Postby cpoakes » 2016-11-20 09:00

The "locale" sets the sorting order, and I would guess you have a locale of UTF-8 or a related international value which tends to ignore punctuation but does get the accented characters next to the unaccented ones. Sounds like you may want a "C" or "POSIX" locale. Check out environment variables LC_COLLATE and LC_ALL or possibly command update-locale.

My .profile (used by bash and dash) sets LC_COLLATE=POSIX because I like old-school strict ASCII sorting ordering (uppercase before all lowercase, every character considered).

Demonstrate the difference to yourself with:
Code:
LC_COLLATE=POSIX ls
cpoakes
 
Posts: 95
Joined: 2015-03-29 04:54

Re: The Ls Command's Weird Sort Order

Postby Caitlin » 2016-11-20 18:53

Is there a way to display which locale I have? When I installed Jessie, I just took the USA defaults.

How can I tell what the sort order for my locale is?

Caitlin

Re: The Ls Command's Weird Sort Order

Postby cpoakes » 2016-11-21 05:47

Try the Debian locale wiki and the Arch locale wiki for help with locale.

Re: The Ls Command's Weird Sort Order

Postby Caitlin » 2016-11-21 11:33

cpoakes wrote:The "locale" sets the sorting order, and I would guess you have a locale of UTF-8 or a related international value which tends to ignore punctuation but does get the accented characters next to the unaccented ones. Sounds like you may want a "C" or "POSIX" locale. Check out environment variables LC_COLLATE and LC_ALL or possibly command update-locale.

My .profile (used by bash and dash) sets LC_COLLATE=POSIX because I like old-school strict ASCII sorting ordering (uppercase before all lowercase, every character considered).

Demonstrate the difference to yourself with:
Code:
LC_COLLATE=POSIX ls

I'm not trying to get ls to sort in strcmp() order. I'm trying to write a function to compare two strings in the order ls is currently using.

cpoakes wrote:Try the Debian locale wiki and the Arch locale wiki for help with locale.

Didn't help. It was all about paper sizes and which day of the week comes first, and nothing about ordering.

Caitlin

Re: The Ls Command's Weird Sort Order

Postby cpoakes » 2016-11-21 13:47

Sorry about my misunderstanding. Try the man page for strcoll(3): it is a strcmp-style function that compares using the current locale's sort order (LC_COLLATE).

Re: The Ls Command's Weird Sort Order

Postby Caitlin » 2016-11-22 14:59

I found strcoll() and it seems to work. And that's right after I finished coding my own solution. Oh, well, back to the drawing board.

In theory, any order should do as long as it's consistent.

Thanks, cpoakes.

Caitlin

Re: The Ls Command's Weird Sort Order

Postby Caitlin » 2016-11-24 14:17

I have two files that are (allegedly) in strcoll() order, but when I try to match them, I come up with a few errors. I'm using a standard record-matching control-break algorithm, so it should work if both files really are in the same order.

After further investigation, I've found that ls does NOT output its listing in strcoll() order. Perhaps it's a combo like squeeze out all the spaces and some of the punctuation, then compare them with strcoll() (with case insensitivity thrown in for good measure).

So I think I'm going to have to go back to a variation of my old solution. Take the output of ls, and if any two adjacent records are out of sequence (according to my weirdcmp() function), switch them. This won't help if THREE adjacent records are mixed up, but that doesn't seem to be a problem.

Back to the drawing board again.

Caitlin

Re: The Ls Command's Weird Sort Order

Postby Caitlin » 2017-01-18 18:30

After some time fooling around with this, I've been unable to duplicate the order ls is using.

The collating sequence is only part of the story. There also seems to be some mucking around with parsing, deleting whitespace, and other **** which is not part of the locale specification. So I've resorted to C (strcmp) order.

This almost works well, except for the fact that I really need case insensitivity. So I've taken certain file names and put them in a table. I then compare every entry in the table with every other entry, and discover which files are only changed in case. A brute force solution, but it was the best I could come up with.

I consider what I did to be a workaround.

Caitlin

Re: [WORKAROUND] The Ls Command's Weird Sort Order

Postby luvr » 2017-01-21 18:45

If I remember correctly, a C program uses the C locale by default.

If you want the strcoll function to assume the same locale as the rest of your system, I believe you will have to add a call to the setlocale function somewhere near the start of your program, i.e., something like:
Code:
setlocale(LC_ALL, "");

That should, in my opinion, set all locale categories for the program equal to those for your system, including the string comparison and sort order.
luvr
 
Posts: 79
Joined: 2016-07-21 19:39
Location: Boom - The Home Town of Tomorrowland, Belgium

