bug#42986: sort: possible bug when sorting special characters


From: Eric Blake
Subject: bug#42986: sort: possible bug when sorting special characters
Date: Sat, 22 Aug 2020 10:51:23 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.11.0

tag 42986 notabug
thanks

On 8/22/20 6:46 AM, Wolter H. V. wrote:
> The following commands:
>
>      echo 'Pará,9\nParacito,0' | sort --field-separator=, -k1

Using echo with backslash escapes is non-portable; printf is the more portable choice.
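
As a minimal sketch of the portable form (the variable name `lines` is only for illustration): printf interprets \n in its format string in any POSIX shell, whereas whether echo does so depends on the shell and its options.

```shell
# printf expands \n in its format string regardless of which shell runs it,
# so the pipeline always receives two separate input lines.
lines=$(printf 'Pará,9\nParacito,0\n' | wc -l | tr -d ' ')
echo "$lines"    # prints 2
```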


> and
>
>      echo 'Pará,Z\nParacito,A' | sort --field-separator=, -k1

Using -k1 (rather than -k1,1) tells sort that the key starts at field 1 and runs through the end of the line. Furthermore, sorting is locale dependent, and some locales treat punctuation as insignificant during collation. You can see this yourself by using the --debug option:

$ printf 'Pará,9\nParacito,0\n' | sort --field-separator=, -k1 --debug
sort: text ordering performed using ‘en_US.UTF-8’ sorting rules
Pará,9
______
______
Paracito,0
__________
__________

In the en_US.UTF-8 locale, commas and accents are ignored, and since you did not end the field at the first comma, you end up getting the same sort as 'Para9' vs. 'Parac', where 9 sorts before c.


$ printf 'Pará,9\nParacito,0\n' | sort --field-separator=, -k1,1 --debug
sort: text ordering performed using ‘en_US.UTF-8’ sorting rules
Pará,9
____
______
Paracito,0
________
__________

In the same locale, but using a more limited field, the keys are now 'Pará' and 'Paracito'. Their common prefix 'Para' compares identically, so the shorter key sorts first.
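
The difference between -k1 and -k1,1 can also be seen without any locale effects. The following is a rough sketch, assuming a sort that supports -s (GNU and BSD sort do); the helper `gen` and the variable names are made up for illustration. LC_ALL=C is used only to keep the result reproducible.

```shell
# Two lines whose first field is identical; only the second field differs.
gen() { printf 'a,2\na,1\n'; }

# -k1,1 with -s: the key is field 1 only, and -s disables the last-resort
# whole-line comparison, so lines with equal keys keep their input order.
first_field_only=$(gen | LC_ALL=C sort -s -t, -k1,1)

# -k1: the key runs from field 1 to the end of the line, so ',1' versus ',2'
# takes part in the comparison and the lines are reordered.
to_end_of_line=$(gen | LC_ALL=C sort -s -t, -k1)

printf '%s\n' "$first_field_only"   # a,2 then a,1
printf '%s\n' "$to_end_of_line"     # a,1 then a,2
```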

$ printf 'Pará,9\nParacito,0\n' | LC_ALL=C sort --field-separator=, -k1 --debug
sort: text ordering performed using simple byte comparison
Paracito,0
__________
__________
Pará,9
_______
_______

In the C locale, comparison is done byte by byte, so accents become significant, and 'a' sorts before 'á'.
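
One consequence of byte comparison, sketched below with made-up sample input: every ASCII letter, even 'z', sorts before 'á', because the first byte of 'á' (0xC3 in UTF-8) is larger than any ASCII byte. The variable name `c_sorted` is only for illustration.

```shell
# In the C locale the comparison looks only at byte values:
# 'a' is 0x61, 'z' is 0x7A, and 'á' starts with 0xC3 when encoded as UTF-8.
c_sorted=$(printf 'á\nz\na\n' | LC_ALL=C sort)
printf '%s\n' "$c_sorted"   # a, then z, then á
```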


> give
>
>      Pará,9
>      Paracito,0
>
> and
>
>      Paracito,A
>      Pará,Z
>
> respectively.

$ printf 'Pará,Z\nParacito,A\n' | sort --field-separator=, -k1,1 --debug
sort: text ordering performed using ‘en_US.UTF-8’ sorting rules
Pará,Z
____
______
Paracito,A
________
__________

Forcing the shorter sort field by using -k1,1 gets the results you seem to be looking for.



> Sorting the string 'á\na' results in 'a\ná', so I would expect the commands
> above to put Paracito before Pará, but this is not the case for the first
> command. Why is that?

Rather, you were probably sorting in a locale where 'a' and 'á' collate identically, to the point where the tie was broken by a later point in the line.
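
To check which locale governs collation in a given shell, something like the following sketch may help. The function name `effective_collate` is hypothetical. It only mirrors the POSIX precedence of the environment variables (LC_ALL over LC_COLLATE over LANG, with empty values treated as unset), and the fallback to "C" when none is set is an assumption, since that default is implementation-defined.

```shell
# Print the locale name that applies to collation, following the
# precedence LC_ALL > LC_COLLATE > LANG. Empty values count as unset.
# The final fallback "C" is an assumption, not a guarantee.
effective_collate() {
    printf '%s\n' "${LC_ALL:-${LC_COLLATE:-${LANG:-C}}}"
}

effective_collate
```

Running `locale` by itself also reports the effective LC_COLLATE value, and prefixing a command with LC_ALL=C forces byte comparison for that one invocation.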

At any rate, since sort is behaving as required by POSIX by honoring your locale, and the --debug option lets you see what is going on, I see nothing to fix, so I'm marking this as not a bug. However, feel free to respond with further followups.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org





