[WORKAROUND] The Ls Command's Weird Sort Order

Caitlin
Posts: 329
Joined: 2012-05-24 07:32
Has thanked: 3 times
Been thanked: 2 times

#1 Post by Caitlin »

By default, the ls command lists files in "sorted" order -- but it's not the order you would expect from the strcmp() or strcasecmp() functions. For example: "Iberia" will sort ahead of "I'm sure" even though apostrophe (hex 27) is clearly less than b (hex 62).

I need to match up two lists of files, one generated by ls, the other generated by my own program. Clearly, they have to be in the same order. I could sort them both in normal alphabetical order, but the first is quite large and is not sort friendly (path names and file names are segregated).

I wrote a function called weirdcmp() that attempts to compare in the same way ls does: it squeezes out spaces, punctuation, and everything else other than letters, digits, and slashes, and then does a case-insensitive compare on what's left. But it's not perfect -- a test shows I'm still not duplicating what ls is doing.
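The comparison described above can be sketched roughly as follows. This is a hypothetical reconstruction, not the original weirdcmp() (which was never posted): the helper name skip_ignored() and the choice of isalnum() to decide what counts as a letter or digit are assumptions.

```c
#include <ctype.h>
#include <stddef.h>

/* Advance past bytes that the comparison ignores: anything other than
   letters, digits, and '/'. isalnum() follows the current locale; in the
   default "C" locale only ASCII letters and digits count, so the bytes of
   multi-byte characters such as German umlauts are skipped as well. */
static const char *skip_ignored(const char *s)
{
    while (*s != '\0' && !isalnum((unsigned char)*s) && *s != '/')
        s++;
    return s;
}

/* Hypothetical reconstruction of weirdcmp(): compare a and b without
   regard to case after squeezing out the ignored bytes.
   Returns <0, 0, or >0 like strcmp(). */
int weirdcmp(const char *a, const char *b)
{
    for (;;) {
        a = skip_ignored(a);
        b = skip_ignored(b);
        int ca = tolower((unsigned char)*a);
        int cb = tolower((unsigned char)*b);
        if (ca != cb)
            return ca - cb;
        if (ca == '\0')
            return 0;   /* both strings exhausted at the same point */
        a++;
        b++;
    }
}
```

With this sketch, weirdcmp("Iberia", "I'm sure") is negative, which matches the ls ordering in the example. When it returns 0 for two different names (for instance "a b" and "AB"), a caller would need some tie-breaker; falling back to strcmp() is one possibility, but that is a guess rather than something ls is known to do.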

My questions:

(1) Is there a formal definition of the sort order somewhere?

(2) What do I do if the compare strings are identical after squeezing out the "irrelevant" stuff? Is there a comparison of last resort I should be using?

(3) Some file names have German letters in them -- Ä, Ö, Ü, ä, ö, ü, or ß. What do I do about those? (Especially ß, which is the equivalent of ss or sz.)

I would appreciate any help.

Caitlin
Last edited by Caitlin on 2017-01-18 18:30, edited 1 time in total.

cpoakes
Posts: 99
Joined: 2015-03-29 04:54

Re: The Ls Command's Weird Sort Order

#2 Post by cpoakes »

The locale sets the sort order, and I would guess you have a UTF-8 locale (for example en_US.UTF-8) or a similar international value. Those tend to ignore punctuation but keep accented characters next to their unaccented counterparts. It sounds like you may want the "C" or "POSIX" locale. Check out the environment variables LC_COLLATE and LC_ALL, or possibly the update-locale command.

My .profile (used by bash and dash) sets LC_COLLATE=POSIX because I like old-school strict ASCII sort order (all uppercase before any lowercase, every character considered).

Demonstrate the difference to yourself with:

Code:

LC_COLLATE=POSIX ls

Caitlin

Re: The Ls Command's Weird Sort Order

#3 Post by Caitlin »

Is there a way to display which locale I have? When I installed Jessie, I just took the USA defaults.

How can I tell what the sort order for my locale is?

Caitlin

cpoakes

Re: The Ls Command's Weird Sort Order

#4 Post by cpoakes »

Try the Debian locale wiki and the Arch locale wiki for help with locale.
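For the question of how to display the locale in use, a minimal C sketch is given here. The function names are made up for illustration; setlocale() with a NULL second argument only queries the current setting, and with "" it adopts whatever the environment variables (LANG, LC_ALL, LC_COLLATE, and so on) specify.

```c
#include <locale.h>

/* Hypothetical helper: return the name of the locale currently used for
   collation. Passing NULL makes setlocale() a pure query. */
const char *current_collation_locale(void)
{
    return setlocale(LC_COLLATE, NULL);
}

/* Hypothetical helper: adopt the locale named by the environment, then
   report the collation locale. If the environment names a locale that
   is not available, setlocale() returns NULL and the previous locale
   stays in effect. */
const char *collation_locale_from_env(void)
{
    setlocale(LC_ALL, "");
    return current_collation_locale();
}
```

Printing the result of collation_locale_from_env() with printf("%s\n", ...) shows which collation rules a program would get. From a shell, the locale command prints the same settings.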

Caitlin

Re: The Ls Command's Weird Sort Order

#5 Post by Caitlin »

cpoakes wrote:
The locale sets the sort order, and I would guess you have a UTF-8 locale (for example en_US.UTF-8) or a similar international value. Those tend to ignore punctuation but keep accented characters next to their unaccented counterparts. It sounds like you may want the "C" or "POSIX" locale. Check out the environment variables LC_COLLATE and LC_ALL, or possibly the update-locale command.

My .profile (used by bash and dash) sets LC_COLLATE=POSIX because I like old-school strict ASCII sort order (all uppercase before any lowercase, every character considered).

Demonstrate the difference to yourself with:

Code:

LC_COLLATE=POSIX ls

I'm not trying to get ls to sort in strcmp() order. I'm trying to write a function that compares two strings in the order ls is currently using.

cpoakes wrote:
Try the Debian locale wiki and the Arch locale wiki for help with locale.

Didn't help. It was all about paper sizes and which day of the week comes first, and nothing about ordering.

Caitlin

cpoakes

Re: The Ls Command's Weird Sort Order

#6 Post by cpoakes »

Sorry about my misunderstanding. Try the man page for strcoll(3): it is a strcmp-style function that compares using the current locale's sort order (LC_COLLATE).
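A small sketch of the difference between the two functions, using the example strings from the first post. The helper names are made up for illustration, and the caller is assumed to have selected a locale with setlocale() before calling locale_order().

```c
#include <locale.h>
#include <string.h>

/* Reduce any comparison result to -1, 0, or +1 so results from different
   functions can be compared directly. */
static int sign(int x)
{
    return (x > 0) - (x < 0);
}

/* Byte order: what strcmp() reports, regardless of locale. */
int byte_order(const char *a, const char *b)
{
    return sign(strcmp(a, b));
}

/* Locale order: what strcoll() reports under the active LC_COLLATE.
   In the "C" locale this is the same as byte order. */
int locale_order(const char *a, const char *b)
{
    return sign(strcoll(a, b));
}
```

byte_order("I'm sure", "Iberia") is -1 because the apostrophe (hex 27) is less than b (hex 62). After setlocale(LC_COLLATE, "C"), locale_order() gives the same answer. Under a locale whose collation ignores punctuation at the first level, the result would be expected to flip, which is the behavior reported for ls in this thread.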

Caitlin

Re: The Ls Command's Weird Sort Order

#7 Post by Caitlin »

I found strcoll() and it seems to work. And that's right after I finished coding my own solution. Oh, well, back to the drawing board.

In theory, any order should do as long as it's consistent.

Thanks, cpoakes.

Caitlin

Caitlin

Re: The Ls Command's Weird Sort Order

#8 Post by Caitlin »

I have two files that are (allegedly) in strcoll() order, but when I try to match them I get a few errors. I'm using a standard record-matching control-break algorithm, so it should work.
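The kind of matching pass meant here can be sketched as below. It is not the actual program, which was never posted: the struct, the function name, and the use of in-memory arrays instead of files are all assumptions made for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Counts produced by one matching pass (hypothetical struct). */
struct match_counts {
    size_t both;       /* names present in both lists */
    size_t only_left;  /* names found only in the first list */
    size_t only_right; /* names found only in the second list */
};

/* Walk two lists in step. Both must already be sorted according to cmp;
   if either was sorted by a different rule, equal names can be passed
   over and reported as unmatched. */
struct match_counts match_sorted(const char *const *left, size_t nleft,
                                 const char *const *right, size_t nright,
                                 int (*cmp)(const char *, const char *))
{
    struct match_counts c = {0, 0, 0};
    size_t i = 0, j = 0;

    while (i < nleft && j < nright) {
        int r = cmp(left[i], right[j]);
        if (r == 0) {
            c.both++;
            i++;
            j++;
        } else if (r < 0) {
            c.only_left++;   /* left[i] has no partner on the right */
            i++;
        } else {
            c.only_right++;  /* right[j] has no partner on the left */
            j++;
        }
    }
    c.only_left += nleft - i;    /* leftovers once one list runs out */
    c.only_right += nright - j;
    return c;
}
```

The pass is only as good as the ordering assumption. If one list is {"B", "a"} (byte order) and the other is {"a", "B"} (case-insensitive order), matching them with strcmp() reports "B" as missing on both sides even though it is in both lists. A few stray errors like that are what a mismatch between the two sort orders would look like.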

After further investigation, I've found that ls does NOT output its listing in strcoll() order. Perhaps it's a combo like squeeze out all the spaces and some of the punctuation, then compare them with strcoll() (with case insensitivity thrown in for good measure).

So I think I'm going to have to go back to a variation of my old solution. Take the output of ls, and if any two adjacent records are out of sequence (according to my weirdcmp() function), switch them. This won't help if THREE adjacent records are mixed up, but that doesn't seem to be a problem.

Back to the drawing board again.

Caitlin

Caitlin

Re: The Ls Command's Weird Sort Order

#9 Post by Caitlin »

After some time fooling around with this, I've been unable to duplicate the order ls is using.

The collating sequence is only part of the story. There also seems to be some mucking around with parsing, deleting whitespace, and other **** which is not part of the locale specification. So I've resorted to C (strcmp) order.

This almost works, except that I really need case insensitivity. So I've taken certain file names and put them in a table. I then compare every entry in the table with every other entry to discover which names differ only in case. A brute-force solution, but it was the best I could come up with.
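The brute-force table check might look something like the sketch below. The function name and the decision to count pairs (rather than record them) are assumptions made for illustration, and strcasecmp() is a POSIX function, not part of ISO C.

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>  /* strcasecmp() is POSIX, not ISO C */

/* Compare every entry with every later entry and count the pairs that
   are equal when case is ignored but differ byte for byte. */
size_t count_case_only_pairs(const char *const *names, size_t n)
{
    size_t pairs = 0;

    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (strcasecmp(names[i], names[j]) == 0 &&
                strcmp(names[i], names[j]) != 0)
                pairs++;
    return pairs;
}
```

For a table holding "README", "readme", "Makefile", "notes", and "Notes" this counts 2 pairs. The work grows with the square of the table size, which is fine for a table of selected names. In the "C" locale strcasecmp() only folds ASCII letters, so names that differ in the case of Ä, Ö, or Ü would not be caught by this sketch.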

I consider what I did to be a workaround.

Caitlin

luvr
Posts: 85
Joined: 2016-07-21 19:39
Location: Boom - The Home Town of Tomorrowland, Belgium

Re: [WORKAROUND] The Ls Command's Weird Sort Order

#10 Post by luvr »

If I remember correctly, a C program uses the C locale by default.

If you want the strcoll function to assume the same locale as the rest of your system, I believe you will have to add a call to the setlocale function somewhere near the start of your program, i.e., something like:

Code:

setlocale(LC_ALL, "");  /* declared in <locale.h> */
That should, in my opinion, set all locale categories for the program equal to those for your system, including the string comparison and sort order.
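Putting the pieces together, a minimal sketch of sorting a list of names with strcoll() after selecting a locale. The function names are made up for illustration. Passing "" as the locale name adopts the environment's settings, as in the call above, while passing "C" gives plain byte order.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* qsort() comparator: each element is a pointer to a string, compared
   with strcoll() under the active LC_COLLATE. */
static int collate_cmp(const void *pa, const void *pb)
{
    const char *a = *(const char *const *)pa;
    const char *b = *(const char *const *)pb;
    return strcoll(a, b);
}

/* Sort names in the collation order of locale_name.
   Returns 0 on success, or -1 if the locale could not be selected
   (names are then left untouched). */
int sort_in_locale(const char **names, size_t n, const char *locale_name)
{
    if (setlocale(LC_ALL, locale_name) == NULL)
        return -1;
    qsort(names, n, sizeof names[0], collate_cmp);
    return 0;
}
```

With "C" as the locale, {"b", "A", "a"} sorts to {"A", "a", "b"}. With "" the result depends on the environment. Running the program with the same LANG and LC_* values that the ls command sees should be the closest a program can get to its ordering, though nothing in this thread confirms that the two match exactly.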
