bug-bash

Re: documentation bug re character range expressions


From: Chet Ramey
Subject: Re: documentation bug re character range expressions
Date: Tue, 07 Jun 2011 16:45:58 -0400
User-agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.2.17) Gecko/20110414 Thunderbird/3.1.10

On 6/2/11 9:12 PM, Marcel (Felix) Giannelia wrote:
> Hello,
> 
> I realize the issue of character range expressions not working as expected
> (because of locale settings) has been done to death, but I thought I should
> point this out.
> 
> The bash man page says:
> 
> "A pair of characters separated by a hyphen denotes a range expression; any
> character that ***sorts between those two characters,*** inclusive, using
> the current locale's collating sequence and character set, is matched."
> (emphasis mine)
> 
> That is incorrect because, for instance, an uppercase 'C' sorts between
> lowercase 'a' and lowercase 'c' (sometimes), as in this example (locale is
> en_GB.UTF-8):

I'm not going to add much to this discussion except to note that I believe
`sorts' is correct.  Consider the following script:

unset LANG LC_ALL LC_COLLATE

export LC_COLLATE=de_DE.UTF-8
printf "%s\n" {A..Z} {a..z} | sort | tr $'\n' ' '
echo

export LC_COLLATE=en_GB.UTF-8
printf "%s\n" {A..Z} {a..z} | sort | tr $'\n' ' '
echo

export LC_COLLATE=C
printf "%s\n" {A..Z} {a..z} | sort | tr $'\n' ' '
echo

It uses the system `sort' to decide how things sort according to the
locale.  When I run it on a random Linux system, RHEL5 in this case,
I get

a A b B c C d D e E f F g G h H i I j J k K l L m M n N o O p P q Q r R s S t T u U v V w W x X y Y z Z
a A b B c C d D e E f F g G h H i I j J k K l L m M n N o O p P q Q r R s S t T u U v V w W x X y Y z Z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k l m n o p q r s t u v w x y z

That sure looks like `C' doesn't sort between `a' and `c' in de_DE.UTF-8
and en_GB.UTF-8.
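
The same question can be put to bash's own pattern matcher instead of
`sort'. What follows is only a minimal sketch: `matches_range' is a
hypothetical helper name, not something from bash or from this thread, and
it assumes the locale passed to it is installed. Only the C locale is
exercised, because results in other locales depend on the system's locale
data and, in later bash versions, on the `globasciiranges' shell option.

```shell
#!/usr/bin/env bash

# matches_range LOCALE PATTERN CHAR
# Hypothetical helper: succeeds if CHAR matches the bracket PATTERN when
# bash performs the match under LOCALE.
matches_range() {
    local locale=$1 pattern=$2 char=$3
    # Use a subshell so the locale change does not leak into the caller.
    (
        LC_ALL=$locale
        # The unquoted right-hand side is treated as a pattern.
        [[ $char == $pattern ]]
    )
}

for ch in a B C c; do
    if matches_range C '[a-c]' "$ch"; then
        echo "$ch: matches [a-c] in the C locale"
    else
        echo "$ch: does not match [a-c] in the C locale"
    fi
done
```

Substituting another installed locale name for `C' in the loop shows how
that system's bash treats the same range.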

> I believe it would also be helpful for the documentation to then go on to
> say something like this:
> 
> "This means that character ranges are neither case-sensitive nor
> case-insensitive in most locales. For instance (in the en_ locales), the
> range [a-c] is equivalent to [aAbBc] (note the absence of uppercase 'C'!).
> Thus, sub-ranges of the character class [[:alpha:]] must be used with great
> care, and probably should not be used at all, in locales other than C. It
> is not possible, for example, to specify a range of greater than one or
> fewer than 26 lowercase letters in the en_US.UTF-8 locale. If you desire to
> match [abcdefghij] in this locale, you must not use a range, but specify
> all of those characters explicitly, or use LC_COLLATE from the C locale."

You might like the text in item 13 of the COMPAT file included in the bash
distribution.  It doesn't take quite so cautionary a tone, but the basic
information is there.
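
For reference, the workarounds named in the quoted proposal can be sketched
as below. This is an illustration under assumptions, not the COMPAT text:
`is_a_to_j' and `is_lower' are hypothetical helper names, and the claim that
the explicit list is unaffected by collation is made only for these plain
ASCII letters.

```shell
#!/usr/bin/env bash

# is_a_to_j CHAR
# Hypothetical helper: succeeds only for one of the ten lowercase letters
# a..j. The letters are listed explicitly, so no range expression (and no
# collating sequence) is involved.
is_a_to_j() {
    [[ $1 == [abcdefghij] ]]
}

# is_lower CHAR
# Hypothetical helper: uses the POSIX character class [:lower:] where one
# might otherwise write the range [a-z].
is_lower() {
    [[ $1 == [[:lower:]] ]]
}

for ch in a j k B; do
    if is_a_to_j "$ch"; then
        echo "$ch: in a..j"
    else
        echo "$ch: not in a..j"
    fi
done
```

The third option from the proposal, setting LC_COLLATE (or LC_ALL) to C
before matching, keeps the range syntax but changes collation for
everything else in that shell as well.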

Chet
-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU    chet@case.edu    http://cnswww.cns.cwru.edu/~chet/


