bug-bash

documentation bug re character range expressions


From: Marcel (Felix) Giannelia
Subject: documentation bug re character range expressions
Date: Thu, 02 Jun 2011 18:12:23 -0700
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.9) Gecko/20100330 Shredder/3.0.4

Hello,

I realize the issue of character range expressions not working as expected (because of locale settings) has been done to death, but I thought I should point this out.

The bash man page says:

"A pair of characters separated by a hyphen denotes a range expression; any character that ***sorts between those two characters,*** inclusive, using the current locale's collating sequence and character set, is matched." (emphasis mine)

That is incorrect because, for instance, an uppercase 'C' sorts between lowercase 'a' and lowercase 'c' (sometimes), as in this example (locale is en_GB.UTF-8):

$ touch aa B cd C
$ ls -1
aa
B
C
cd

However, bash's behaviour does not reflect what the man page says. Observe:

$ touch aa B cd C
$ ls -1 [a-c]*
aa
B
cd

Now, I'm firmly of the opinion that character range expressions paying any attention at all to the locale collation settings, in any shape or form, is completely broken behaviour. I really wish that [a-c] meant [abc] and not [aAbBc].

But, it looks as if that's not going to change, so it is my hope that the documentation will at least be updated to reflect what really happens.

Previous posters who've complained about this character range issue have been directed to some comments made by Ulrich Drepper (who, I understand, is a maintainer of some underlying code that bash uses in its evaluation of range expressions?). Those comments include this:

"The strcoll result has nothing whatsoever to do with the range match. strcoll uses collation weights, ranges use collation sequence values, completely different concept."

I believe that same confusion is behind the problem in that paragraph from the man page and has led to the inappropriate use of the phrase "sorts between." The bit of man page text I quoted above should read:

"A pair of characters separated by a hyphen denotes a range expression; any character that ***occurs between those two characters in collation sequence value,*** inclusive, using the current locale's collating sequence and character set, is matched."

I believe it would also be helpful for the documentation to then go on to say something like this:

"This means that character ranges are neither case-sensitive nor case-insensitive in most locales. For instance (in the en_ locales), the range [a-c] is equivalent to [aAbBc] (note the absence of uppercase 'C'!). Thus, sub-ranges of the character class [[:alpha:]] must be used with great care, and probably should not be used at all, in locales other than C. It is not possible, for example, to specify a range of greater than one or fewer than 26 lowercase letters in the en_US.UTF-8 locale. If you desire to match [abcdefghij] in this locale, you must not use a range, but specify all of those characters explicitly, or use LC_COLLATE from the C locale."
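The "specify all of those characters explicitly" advice in the proposed text can be sketched as follows. This is only an illustration: the scratch directory and the file names (apple, Apple, jam, Jam, kiwi) are made up for the demo and do not come from the man page.

```shell
# Hypothetical scratch directory and file names, used only for this demo.
demo_dir=$(mktemp -d)
touch "$demo_dir"/apple "$demo_dir"/Apple "$demo_dir"/jam "$demo_dir"/Jam "$demo_dir"/kiwi

# An explicit list contains no range, so collation order plays no part:
# only names starting with one of the ten listed lowercase letters match.
explicit_matches=$(cd "$demo_dir" && echo [abcdefghij]*)
echo "$explicit_matches"

rm -r "$demo_dir"
```

Because no hyphen appears inside the brackets, the pattern should behave the same whatever LC_COLLATE is set to.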

In closing, it is my fervent hope that the insanity of that last paragraph will be recognized (when is [a-c] being equivalent to [aAbBc] ever useful?!), and that this will eventually lead to character ranges becoming useful again regardless of the current locale.

But in the meantime, I would settle for a documentation change, and will continue to "export LC_COLLATE=C"! :)
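For reference, a minimal sketch of that workaround, reusing the four file names from the example earlier in this message. The scratch directory is an assumption added for the demo, and LC_ALL is cleared first because, when set, it takes precedence over LC_COLLATE.

```shell
# Hypothetical scratch directory; the file names are the ones used above.
demo_dir=$(mktemp -d)
touch "$demo_dir"/aa "$demo_dir"/B "$demo_dir"/cd "$demo_dir"/C

# LC_ALL, if set, overrides LC_COLLATE, so clear it before the export.
unset LC_ALL
export LC_COLLATE=C

# With C collation the range [a-c] covers only the lowercase letters a, b, c,
# so neither B nor C is matched.
matches=$(cd "$demo_dir" && echo [a-c]*)
echo "$matches"

rm -r "$demo_dir"
```

The export affects every later glob and sort in the same shell session, so it is probably best placed in a script or startup file only when byte-order collation is wanted everywhere.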

~Felix.


