bug-bash

square bracket vs. curly brace character ranges


From: Felix
Subject: square bracket vs. curly brace character ranges
Date: Thu, 13 Sep 2012 22:49:44 -0700

I believe I've found an inconsistency in bash or its documentation.

I know that things like [a-c] are highly locale-dependent in bash
(doesn't mean I have to like it, but there it is). Fine. I've
learned to live with it.

But the other day I was on a fresh install (hadn't set
LC_COLLATE=C yet, so I was in en_US.UTF-8), and this happened:

$ touch {a..c}
$ ls
a  b  c
$ touch {A..C}
$ ls
a  A  b  B  c  C
$ ls {a..c}
a  b  c
$ ls [a-c]
a  A  b  B  c

Curly brace range expressions behave differently from square-bracket
ranges. Is this intentional? This is under Arch Linux, bash version
"4.2.37(2)-release (i686-pc-linux-gnu)".
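For anyone who wants to reproduce the comparison without touching their working directory, here is a minimal sketch. It assumes bash and forces LC_ALL=C so the result does not depend on which locales are installed; the variable names (scratch, brace_result, glob_result) are only illustrative. Changing the export line to a locale such as en_US.UTF-8 (if it is installed) is how one would check for the divergence shown in the transcript above.

```shell
#!/usr/bin/env bash
# Sketch: compare a brace range with a bracket glob in a scratch directory.
# Assumption: LC_ALL=C is forced here so the outcome is reproducible.
export LC_ALL=C

scratch=$(mktemp -d)                        # throwaway directory
(cd "$scratch" && touch {a..c} {A..C})      # create a b c A B C

# Brace expansion is done by the shell alone; no file names are consulted.
brace_result=$(echo {a..c})

# The bracket glob is matched against file names using the current locale.
glob_result=$(cd "$scratch" && echo [a-c])

echo "brace: $brace_result"
echo "glob:  $glob_result"

rm -rf "$scratch"
```

In the C locale both lines should list only a b c, which is the consistent behaviour the message is asking for.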

The man page seems to imply that the curly brace behaviour above is a
bug:

"When characters are supplied, the expression expands to each character
lexicographically between x and y, inclusive."

...although this documentation suffers from the same problem as the
passage about character class ranges: it confuses lexicographic sort
order (character collation *weights*) with character collation
*sequence values*. They are not quite the same thing. If they were,
'c' and 'C' would *always always always* appear together in a range
expansion, because they sort next to each other:

$ touch aa B cd C
$ ls -1
aa
B
C
cd

The phrases "sorts between" and "lexicographically between" refer to
collation *weights*, but bash clearly uses sequence values.
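For contrast with the en_US.UTF-8 listing, here is a small sketch of how the same four names sort when the C locale is forced. It uses sort rather than ls, so no files are needed, and the variable name sorted_c is only illustrative.

```shell
#!/usr/bin/env bash
# Sketch: sort the same four names by byte value.
# Assumption: LC_ALL=C is forced so the order is reproducible.
export LC_ALL=C

# paste -s joins the sorted lines into one space-separated line.
sorted_c=$(printf '%s\n' aa B cd C | sort | paste -s -d ' ' -)

echo "$sorted_c"
```

With LC_ALL=C the uppercase names come first (B C aa cd), whereas the en_US.UTF-8 listing interleaves them by letter.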

It's a subtle distinction; I beat it to death in a thread
from 2011, subject "documentation bug re character range expressions",
but I don't think the documentation actually got changed.

It seems the thinking goes something like, "since no one is supposed to
use expressions like [a-c], we don't have to precisely
document, care, or even *know* what it means" -- a shame, because with
LC_COLLATE=C set, [a-c] is actually quite useful, and in all other
locales it isn't useful at all (it would be slightly useful if it used
weights like the documentation says because then it would be like a
case-insensitive range, but with it using sequence values instead, it's
useless).
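To illustrate the "quite useful" case, here is a minimal sketch of [a-c] used in a case pattern with the C locale forced. The function name in_range is hypothetical, and the behaviour described holds only for the C locale set in the script.

```shell
#!/usr/bin/env bash
# Sketch: with the C locale forced, [a-c] matches exactly a, b and c.
export LC_ALL=C

# in_range is a hypothetical helper: succeeds if $1 is one character in a-c.
in_range() {
    case $1 in
        [a-c]) return 0 ;;
        *)     return 1 ;;
    esac
}

for ch in a B c d; do
    if in_range "$ch"; then
        echo "$ch: in range"
    else
        echo "$ch: not in range"
    fi
done
```

Here a and c are reported as in range while B and d are not; under other locales the answer for B may differ, which is the complaint.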

The sheer number of threads we've got complaining about
locale-dependent [a-c] suggests to me that the software should be
changed to just do what people expect, especially since nothing is
really lost by doing so.

Oh well. Dead horses and all that -- but can we at least make the dead
horses consistent? :)

~Felix.


