
Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z?


From: Eli Zaretskii
Subject: Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z?
Date: Fri, 28 Jun 2013 16:18:45 +0300

> Date: Fri, 28 Jun 2013 15:05:29 +0200
> From: Paolo Bonzini <bonzini@gnu.org>
> CC: jsmeix@suse.de, chet.ramey@case.edu, arnold@skeeve.com, 
>  bug-bash@gnu.org, bug-sed@gnu.org
> 
> Il 28/06/2013 14:49, Eli Zaretskii ha scritto:
> > > > When being consistent means being buggy, I don't want the consistency.
> > > > I want the bug solved in all the programs I use, but if it takes time
> > > > to do that, I will be glad in the meantime to use some programs that
> > > > don't have that bug, i.e. are "inconsistent".
> > > 
> > > I will be less glad to move a regex or piece of code from one to
> > > another, and find inconsistency.
> > 
> > You should report a bug in that case.
> 
> In the case of sed, I'll gladly direct the reporter to the "Non-bugs"
> section of the manual.  Which also explains why you should anyway use
> LC_ALL=C:

I meant the inconsistency.

> `[a-z]' is case insensitive
> `s/.*//' does not clear pattern space
> 
>   You are encountering problems with locales.  POSIX mandates that `[a-z]'
>   uses the current locale's collation order -- in C parlance, that means
>   strcoll(3) instead of strcmp(3).  Some locales have a case insensitive
>   strcoll, others don't.
> 
>   Another problem is that [a-z] tries to use collation symbols.  This
>   only happens if you are on the GNU system, using GNU libc's regular
>   expression matcher instead of compiling the one supplied with GNU sed.
>   In a Danish locale, for example, the regular expression `^[a-z]$'
>   matches the string `aa', because `aa' is a single collating symbol that
>   comes after `a' and before `b'; `ll' behaves similarly in Spanish
>   locales, or `ij' in Dutch locales.
> 
>   Another common localization-related problem happens if your input stream
>   includes invalid multibyte sequences.  POSIX mandates that such
>   sequences are _not_ matched by `.', so that `s/.*//' will not clear
>   pattern space as you would expect.  In fact, there is no way to clear
>   sed's buffers in the middle of the script in most multibyte locales
>   (including UTF-8 locales).  For this reason, GNU sed provides a `z'
>   command (for `zap') as an extension.
> 
>   However, to work around both of these problems, which may cause bugs
>   in shell scripts, you can set the LC_ALL environment variable to `C',
>   or set the locale on a more fine-grained basis with the other LC_*
>   environment variables.

Thanks for the lecture, I already know all that.

GNU projects used to treat Posix standards in a less dogmatic way at
some point, reserving strict Posix compliance to special options.  If
that changed lately, I'm sorry.  I think our tools need to make sense
first and only after that be Posix-compliant, not the other way
around.  The sheer amount of energy this discussion takes, let alone
the number of "bug reports" we all need to reply to, is already
evidence that the current behavior is simply wrong.
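
The strcoll(3) versus strcmp(3) distinction in the quoted manual text can be seen in a minimal sketch. It assumes Python's standard locale module as a stand-in for the C calls; the helper name sort_in_locale is hypothetical, and the en_US.UTF-8 locale may not be installed on a given system, in which case the helper returns None. Only the "C" locale result is predictable: there the comparison falls back to plain code-point order, so all uppercase letters sort before all lowercase ones. What an en_US locale produces depends on the libc's locale data.

```python
import functools
import locale


def sort_in_locale(words, locale_name):
    """Sort words with strcoll under locale_name.

    Returns None if the requested locale is not installed.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        # strcoll compares according to the active locale's collation order
        return sorted(words, key=functools.cmp_to_key(locale.strcoll))
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # restore the old setting


words = ["b", "A", "z", "Y", "a", "B"]

# In the "C" locale, collation is plain code-point order.
c_order = sort_in_locale(words, "C")
print("C:          ", c_order)

# Assumption: en_US.UTF-8 may or may not exist here; its order is libc-specific.
print("en_US.UTF-8:", sort_in_locale(words, "en_US.UTF-8"))
```

Setting LC_ALL=C for a sed or shell invocation has the same effect on `[a-z]` as the "C" branch here: the range covers exactly the code points from `a` to `z`.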


