Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z?


From: Eric Blake
Subject: Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z?
Date: Mon, 21 May 2012 14:39:03 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20120430 Thunderbird/12.0.1

On 05/21/2012 01:51 PM, Linda Walsh wrote:

> POSIX is not supposed to be prescriptive -- but **descriptive**...
> 
> I can't think of anywhere that a-z or A-Z would have included letters
> from the opposite case... so how did POSIX come to *prescribe* that this
> be the case... since I can't see that as being descriptive.

POSIX 1992 was the culprit that prescribed that [A-Z] must be in
collation order across all locales, but without giving good guidance on
how to write a collation sequence, and without defining a C function to
easily get at that collation ordering.  And remember, 20 years ago when
POSIX 1992 was written, there was very little implementation experience
with internationalization, compared to what has happened in the meantime
(that was back when Unicode was brand new, and most users still had
single-byte locales or used shift-lock encodings like Big5).  It is
possible to write a locale definition where [A-Z] gives only upper-case
letters while still providing case-insensitive sorting, but not all
locale writers know how to do this.  Even now in 2012, while most glibc
locales have been corrected in this manner, there still exist several
glibc locales that aren't written very well.  The complication stems
from the fact that your locale file becomes considerably harder to
write: instead of having a single upper and lower case rule, you have to
have one rule per letter, with rules intermixed in a different order.
As soon as people started obeying POSIX 1992 to the letter, and
realizing that range expressions had unusual semantics as a result of
the 1992 specification, POSIX 2001 quickly reverted things, but by then,
the cat was out of the bag.  POSIX 2001 had to continue to allow
existing implementations, by stating that range expressions in anything
but the C (POSIX) locale have unspecified behavior.
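
To make the effect of the 1992 rule concrete, here is a minimal sketch in
Python.  It does not call any real locale API; the list INTERLEAVED and
the helper range_members are hypothetical names that merely model a
collation sequence of the form a < A < b < B < ... < z < Z and then take
a range by position in that sequence.

```python
import string

# Hypothetical collation sequence for illustration only:
# a < A < b < B < ... < z < Z (not read from any real locale).
INTERLEAVED = [
    ch
    for lo, up in zip(string.ascii_lowercase, string.ascii_uppercase)
    for ch in (lo, up)
]


def range_members(start, end, sequence):
    """Return every character positioned between start and end
    (inclusive) in the given collation sequence."""
    i, j = sequence.index(start), sequence.index(end)
    return sequence[i:j + 1]


upper_range = range_members("A", "Z", INTERLEAVED)
lower_range = range_members("a", "z", INTERLEAVED)

# [A-Z] picks up lowercase b..z but not 'a'.
print("".join(upper_range))
print("a" in upper_range, "b" in upper_range)

# [a-z] picks up uppercase A..Y but not 'Z'.
print("".join(lower_range))
print("Z" in lower_range, "Y" in lower_range)
```

Under this model, [A-Z] covers 51 characters rather than 26, which is
the surprising behavior discussed in this thread.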

There is currently a movement under way to introduce 'Rational Range
Interpretation' (RRI), where [A-Z] means the 26 uppercase letters across
ALL locales, by omitting all accented letters and ignoring collation
ordering.  Since POSIX 2001 and later allow this behavior, it is gaining
traction - already, GNU sed, GNU grep, and GNU awk have had patches
applied or under consideration to introduce this consistent behavior.
Search those mailing list archives if you want more details.  Gnulib has
already had patches as part of this movement, and GNU coreutils and bash
should be picking up on these improvements in a future version; we also
hope to get glibc to agree to them.  In other words, we recognize that
this is an issue, and eventually, we _do_ want to reach the point where
all GNU tools use RRI, since POSIX 2001 already allows RRI as part of
its recognition that the decision made in POSIX 1992 causes pain when
coupled with poorly-written locale definitions.

For example, here is an RRI patch for gnulib:
https://lists.gnu.org/archive/html/bug-gnulib/2012-04/msg00185.html
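
As a rough illustration of the RRI idea (not the gnulib implementation),
the sketch below takes a range purely by character code, ignoring
collation.  The helper name rri_range is hypothetical, and comparing by
code point is an assumption about one way to realize the behavior
described above.

```python
def rri_range(start, end):
    """Characters from start to end by code point, ignoring any
    locale collation order (assumed model of RRI, for illustration)."""
    return [chr(cp) for cp in range(ord(start), ord(end) + 1)]


upper = rri_range("A", "Z")

# Exactly the 26 ASCII uppercase letters, regardless of locale.
print("".join(upper))

# No lowercase letters and no accented letters fall inside the range.
print("b" in upper, "\u00c9" in upper)
```

With this interpretation, [A-Z] yields the same 26 letters no matter how
the locale orders its collating elements.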

-- 
Eric Blake   eblake@redhat.com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
