[WORKAROUND] The Ls Command's Weird Sort Order
By default, the ls command lists files in "sorted" order -- but it's not the order you would expect from the strcmp() or strcasecmp() functions. For example: "Iberia" will sort ahead of "I'm sure" even though apostrophe (hex 27) is clearly less than b (hex 62).
I need to match up two lists of files, one generated by ls, the other generated by my own program. Clearly, they have to be in the same order. I could sort them both in normal alphabetical order, but the first is quite large and is not sort friendly (path names and file names are segregated).
I wrote a function called weirdcmp() that attempts to compare in the same way ls does: it squeezes out spaces, punctuation, and everything else other than letters, digits, and slashes, and then does a case-insensitive compare on what's left. But it's not perfect -- a test shows I'm still not duplicating what ls is doing.
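A minimal sketch of the comparison described above, assuming ASCII names and the default "C" locale for isalnum() and tolower(). The helper name next_significant() is made up for illustration. This only restates the squeeze-then-compare idea in C and is not what ls itself does. Bytes outside ASCII (for example UTF-8 encoded Ä or ß) are simply skipped here, so names with German letters are not handled.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: return the next letter, digit, or slash at or
 * after *p, lowercased, and advance *p past it. Returns 0 at the end
 * of the string. Everything else (spaces, punctuation) is skipped. */
static int next_significant(const char **p)
{
    while (**p != '\0') {
        unsigned char c = (unsigned char)**p;
        (*p)++;
        if (isalnum(c) || c == '/')
            return tolower(c);
    }
    return 0;
}

/* Compare two names case-insensitively, ignoring every character
 * other than letters, digits, and slashes. Returns <0, 0, or >0. */
int weirdcmp(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    for (;;) {
        int ca = next_significant(&a);
        int cb = next_significant(&b);
        if (ca != cb)
            return (ca > cb) - (ca < cb);
        if (ca == 0)
            return 0;   /* both strings exhausted: equal after squeezing */
    }
}
```

With this sketch, "Iberia" compares ahead of "I'm sure" because the apostrophe and space are dropped before comparing. Names that differ only in dropped characters compare equal, which is the tie that question (2) below asks about.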
My questions:
(1) Is there a formal definition of the sort order somewhere?
(2) What do I do if the compare strings are identical after squeezing out the "irrelevant" stuff? Is there a comparison of last resort I should be using?
(3) Some file names have German letters in them -- Ä, Ö, Ü, ä, ö, ü, or ß. What do I do about those? (Especially ß, which is the equivalent of ss or sz.)
I would appreciate any help.
Caitlin
Last edited by Caitlin on 2017-01-18 18:30, edited 1 time in total.
Re: The Ls Command's Weird Sort Order
The "locale" sets the sorting order, and I would guess you have a locale of UTF-8 or a related international value which tends to ignore punctuation but does get the accented characters next to the unaccented ones. Sounds like you may want a "C" or "POSIX" locale. Check out environment variables LC_COLLATE and LC_ALL or possibly command update-locale.
My .profile (used by bash and dash) sets LC_COLLATE=POSIX because I like old-school strict ASCII sorting ordering (uppercase before all lowercase, every character considered).
Demonstrate the difference to yourself with:
LC_COLLATE=POSIX ls
Re: The Ls Command's Weird Sort Order
Try the Debian locale wiki and the Arch locale wiki for help with locale.
Re: The Ls Command's Weird Sort Order
cpoakes wrote:
> The "locale" sets the sorting order, and I would guess you have a locale of UTF-8 or a related international value which tends to ignore punctuation but does get the accented characters next to the unaccented ones. Sounds like you may want a "C" or "POSIX" locale. Check out environment variables LC_COLLATE and LC_ALL or possibly command update-locale.
> My .profile (used by bash and dash) sets LC_COLLATE=POSIX because I like old-school strict ASCII sorting ordering (uppercase before all lowercase, every character considered).
> Demonstrate the difference to yourself with:
> LC_COLLATE=POSIX ls
I'm not trying to get ls to sort in strcmp() order. I'm trying to write a function to compare two strings in the order ls is currently using.
cpoakes wrote:
> Try the Debian locale wiki and the Arch locale wiki for help with locale.
Didn't help. It was all about paper sizes and which day of the week comes first, and nothing about ordering.
Caitlin
Re: The Ls Command's Weird Sort Order
Sorry about my misunderstanding. Try the man page for strcoll(3), a strcmp-like function that uses the current locale's sort order (LC_COLLATE).
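As a rough illustration of that suggestion, here is a sketch that sorts an array of names with qsort() and strcoll(). The names coll_cmp() and sort_names() are invented for this example. It assumes the program has already selected the locale it wants; a C program starts in the "C" locale, where strcoll() orders strings the same way strcmp() does.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort() comparator: each element is a pointer to a string, so the
 * arguments are pointers to those pointers. */
static int coll_cmp(const void *pa, const void *pb)
{
    const char *a = *(const char *const *)pa;
    const char *b = *(const char *const *)pb;
    return strcoll(a, b);   /* collation order of the current locale */
}

/* Sort n names in place using the current LC_COLLATE order. */
void sort_names(const char **names, size_t n)
{
    assert(names != NULL || n == 0);
    if (n > 1)
        qsort(names, n, sizeof names[0], coll_cmp);
}
```

Whether the result lines up with a given ls listing depends on both programs running under the same locale settings, which this sketch does not check.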
Re: The Ls Command's Weird Sort Order
I found strcoll() and it seems to work. And just after I got done coding my own solution. Oh, well, back to the drawing board.
In theory, any order should do as long as it's consistent.
Thanks, cpoakes.
Caitlin
Re: The Ls Command's Weird Sort Order
I have two files that are (allegedly) in strcoll() order, but when I try to match them, I get a few errors. I'm using a standard record-matching control-break algorithm, so it should work.
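For readers unfamiliar with the technique, this is roughly what such a matching pass looks like, as a sketch over in-memory arrays rather than files. The names match_lists(), match_counts, and name_cmp are assumptions for this example, not the code used in this thread. The pass is only correct if both lists really are sorted under the same comparison function that is passed in; if they are not, entries present in both lists get reported as unmatched.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>   /* strcmp / strcoll for callers */

/* Any strcmp-style comparison: strcmp, strcoll, or a custom one. */
typedef int (*name_cmp)(const char *, const char *);

struct match_counts {
    size_t matched;   /* names present in both lists */
    size_t only_a;    /* names present only in list a */
    size_t only_b;    /* names present only in list b */
};

/* Walk two sorted lists in step, always advancing the side that
 * holds the smaller name, and count matches and leftovers. */
struct match_counts match_lists(const char *const *a, size_t na,
                                const char *const *b, size_t nb,
                                name_cmp cmp)
{
    struct match_counts r = {0, 0, 0};
    size_t i = 0, j = 0;

    assert(cmp != NULL);
    while (i < na && j < nb) {
        int c = cmp(a[i], b[j]);
        if (c == 0) {
            r.matched++; i++; j++;
        } else if (c < 0) {
            r.only_a++; i++;
        } else {
            r.only_b++; j++;
        }
    }
    r.only_a += na - i;   /* whatever is left on either side is unmatched */
    r.only_b += nb - j;
    return r;
}
```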
After further investigation, I've found that ls does NOT output its listing in strcoll() order. Perhaps it's a combo like squeeze out all the spaces and some of the punctuation, then compare them with strcoll() (with case insensitivity thrown in for good measure).
So I think I'm going to have to go back to a variation of my old solution. Take the output of ls, and if any two adjacent records are out of sequence (according to my weirdcmp() function), switch them. This won't help if THREE adjacent records are mixed up, but that doesn't seem to be a problem.
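A sketch of that repair step, written as one bubble-sort pass over an array of names. The function names swap_adjacent_pass() and is_sorted() are invented here, and any strcmp-style function (such as a weirdcmp()) can be passed as cmp. The second function is there to detect the case where a single pass was not enough.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>   /* strcmp / strcoll for callers */

typedef int (*name_cmp)(const char *, const char *);

/* One left-to-right pass: swap each adjacent pair that is out of
 * order under cmp. Returns the number of swaps performed. */
size_t swap_adjacent_pass(const char **names, size_t n, name_cmp cmp)
{
    size_t swaps = 0;

    assert(cmp != NULL);
    for (size_t i = 0; i + 1 < n; i++) {
        if (cmp(names[i], names[i + 1]) > 0) {
            const char *tmp = names[i];
            names[i] = names[i + 1];
            names[i + 1] = tmp;
            swaps++;
        }
    }
    return swaps;
}

/* Return 1 if no adjacent pair is out of order under cmp, else 0. */
int is_sorted(const char *const *names, size_t n, name_cmp cmp)
{
    for (size_t i = 0; i + 1 < n; i++)
        if (cmp(names[i], names[i + 1]) > 0)
            return 0;
    return 1;
}
```

Calling is_sorted() after the pass shows whether a longer run of records was scrambled; a list in fully reversed order of three names, for instance, is still unsorted after one pass.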
Back to the drawing board again.
Caitlin
Re: The Ls Command's Weird Sort Order
After some time fooling around with this, I've been unable to duplicate the order ls is using.
The collating sequence is only part of the story. There also seems to be some mucking around with parsing, deleting whitespace, and other **** which is not part of the locale specification. So I've resorted to C (strcmp) order.
This almost works well, except for the fact that I really need case insensitivity. So I've taken certain file names and put them in a table. I then compare every entry in the table with every other entry, and discover which files are only changed in case. A brute force solution, but it was the best I could come up with.
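The brute-force table comparison might look something like the sketch below. The function name count_case_only_pairs() is an assumption for this example, and it uses the POSIX strcasecmp() function, which only folds case for ASCII letters in the "C" locale, so names that differ in the case of Ä/ä and similar letters would not be caught.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Compare every table entry with every later entry and count the
 * pairs that differ byte-for-byte but are equal when case is ignored. */
size_t count_case_only_pairs(const char *const *names, size_t n)
{
    size_t pairs = 0;

    assert(names != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            if (strcmp(names[i], names[j]) != 0 &&
                strcasecmp(names[i], names[j]) == 0)
                pairs++;
        }
    }
    return pairs;
}
```

The cost grows with the square of the table size, which is acceptable only because the table holds a selected subset of the file names.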
I consider what I did to be a workaround.
Caitlin
Re: [WORKAROUND] The Ls Command's Weird Sort Order
If I remember correctly, a C program uses the C locale by default.
If you want the strcoll function to assume the same locale as the rest of your system, I believe you will have to add a call to the setlocale function somewhere near the start of your program, i.e., something like:
setlocale(LC_ALL, "");
That should, in my opinion, set all locale categories for the program equal to those for your system, including the string comparison and sort order.
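A slightly fuller sketch of the same idea, with the return value checked. The wrapper name select_collation() is hypothetical. Passing "" asks setlocale() to use the environment (LC_ALL, LC_COLLATE, LANG), and a NULL return means the requested locale is not available, in which case the program stays in whatever locale it had before.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>

/* Select the locale called `name` for all categories ("" means
 * "take it from the environment"). Returns the name of the collation
 * locale now in effect, or NULL if the locale could not be set. */
const char *select_collation(const char *name)
{
    assert(name != NULL);
    if (setlocale(LC_ALL, name) == NULL)
        return NULL;                    /* locale unknown or not installed */
    return setlocale(LC_COLLATE, NULL); /* query only, changes nothing */
}
```

Calling select_collation("") once at the start of main(), before any strcoll() call, and printing the returned name is a quick way to confirm that the program and the shell running ls agree on the collation locale.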