Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z?
From: Paolo Bonzini
Subject: Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z?
Date: Fri, 28 Jun 2013 15:05:29 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130514 Thunderbird/17.0.6
On 28/06/2013 14:49, Eli Zaretskii wrote:
> > > When being consistent means being buggy, I don't want the consistency.
> > > I want the bug solved in all the programs I use, but if it takes time
> > > to do that, I will be glad in the meantime to use some programs that
> > > don't have that bug, i.e. are "inconsistent".
> >
> > I will be less glad to move a regex or piece of code from one to
> > another, and find inconsistency.
>
> You should report a bug in that case.
In the case of sed, I'll gladly direct the reporter to the "Non-bugs"
section of the manual, which also explains why you should use
LC_ALL=C anyway:
`[a-z]' is case insensitive
`s/.*//' does not clear pattern space
You are encountering problems with locales. POSIX mandates that `[a-z]'
uses the current locale's collation order -- in C parlance, that means
strcoll(3) instead of strcmp(3). Some locales have a case insensitive
strcoll, others don't.
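As a rough illustration of the point about collation order, here is a
minimal sketch that runs the same bracket expression under a locale
chosen by the caller. The helper name match_lower is hypothetical, and
the en_US.UTF-8 locale in the commented-out line is only an assumption
about what may be installed on your system.

```shell
# Hypothetical helper: print the input lines matching ^[a-z]$ under locale $1.
match_lower() {
    printf '%s\n' a B z | LC_ALL="$1" sed -n '/^[a-z]$/p'
}

# In the C locale the range is compared byte by byte, so only "a" and "z" print.
match_lower C

# In other locales the result may also include "B". Whether it does depends
# on the locale definition and on the regex matcher sed was built with.
# match_lower en_US.UTF-8
```

Only the C-locale result is predictable across systems, which is the
reason the manual recommends it for scripts.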
Another problem is that [a-z] tries to use collation symbols. This
only happens if you are on the GNU system, using GNU libc's regular
expression matcher instead of compiling the one supplied with GNU sed.
In a Danish locale, for example, the regular expression `^[a-z]$'
matches the string `aa', because `aa' is a single collating symbol that
comes after `a' and before `b'; `ll' behaves similarly in Spanish
locales, or `ij' in Dutch locales.
Another common localization-related problem happens if your input stream
includes invalid multibyte sequences. POSIX mandates that such
sequences are _not_ matched by `.', so that `s/.*//' will not clear
pattern space as you would expect. In fact, there is no way to clear
sed's buffers in the middle of the script in most multibyte locales
(including UTF-8 locales). For this reason, GNU sed provides a `z'
command (for `zap') as an extension.
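The effect of an invalid multibyte sequence can be sketched as follows.
The helper name bad_line is hypothetical; the commented-out line assumes
a GNU sed recent enough to have the `z' command, and what a UTF-8 locale
leaves behind is not shown because it depends on the matcher.

```shell
# Emit one line containing byte 0xFF, which is never valid in UTF-8.
bad_line() {
    printf 'ab\377cd\n'
}

# In the C locale every byte is a character, so `.*' covers the whole line.
# Only the trailing newline is left, so wc counts a single byte.
bad_line | LC_ALL=C sed 's/.*//' | wc -c

# GNU extension (assumption: your sed is GNU sed and supports `z'):
# bad_line | sed 'z' | wc -c
```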
However, to work around both of these problems, which may cause bugs
in shell scripts, you can set the LC_ALL environment variable to `C',
or set the locale on a more fine-grained basis with the other LC_*
environment variables.
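One way to apply this workaround in a script, without changing the
locale of everything else, is to scope the setting to the sed call. The
wrapper names csed and csed_fine below are hypothetical. The second
variant assumes that LC_COLLATE and LC_CTYPE are the categories relevant
to ranges and to multibyte decoding; LC_ALL has to be unset there
because it overrides the other LC_* variables.

```shell
# Hypothetical wrapper: run sed in the C locale for this one command only.
csed() {
    LC_ALL=C sed "$@"
}

# Finer-grained variant: change only collation and character classification.
# The subshell keeps the unset from leaking into the calling script.
csed_fine() {
    (
        unset LC_ALL
        LC_COLLATE=C LC_CTYPE=C sed "$@"
    )
}

# Both print "a" and "z" and skip "B".
printf '%s\n' a B z | csed -n '/^[a-z]$/p'
printf '%s\n' a B z | csed_fine -n '/^[a-z]$/p'
```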
Paolo