Re: GNU sed version 4.2.1: on OS X, C locale gets aliased to UTF-8
From: Pádraig Brady
Subject: Re: GNU sed version 4.2.1: on OS X, C locale gets aliased to UTF-8
Date: Fri, 01 Jun 2012 22:09:51 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:6.0) Gecko/20110816 Thunderbird/6.0

On 06/01/2012 08:52 PM, Paolo Bonzini wrote:
> Here is a report from a GNU sed user.
>
> Paolo
>
>> SETUP:
>> $ sw_vers
>> ProductName: Mac OS X
>> ProductVersion: 10.7.4
>> BuildVersion: 11E53
>>
>> $ ~/gnu/bin/sed --version
>> GNU sed version 4.2.1
>>
>> PROBLEM: With UTF-8 input, but LANG and LC_ALL set to C, sed regular
>> expressions break on multibyte sequences. For example (constructed
>> from part of a git command):
>>
>> $ echo "Rémi Leblond" | LANG=C LC_ALL=C ~/gnu/bin/sed -ne \
>>     's/.*/GIT_AUTHOR_NAME='\''&'\''/p'
>>
>> EXPECTED: GIT_AUTHOR_NAME='Rémi Leblond'
>> ACTUAL: GIT_AUTHOR_NAME='R'émi Leblond
>>
>> DISCUSSION: The problem starts in sed/lib/localcharset.c,
>> locale_charset, line 334
>>
>> # if HAVE_LANGINFO_CODESET
>>
>> /* Most systems support nl_langinfo (CODESET) nowadays. */
>> codeset = nl_langinfo (CODESET);
>>
>> Since we set LC_ALL to C, we trigger this code in Libc:
>>
>> http://www.opensource.apple.com/source/Libc/Libc-763.13/locale/nl_langinfo-fbsd.c,
>> line 54:
>>
>> case CODESET:
>>     ret = "";
>>     if ((s = querylocale(LC_CTYPE_MASK, loc)) != NULL) {
>>         if ((cs = strchr(s, '.')) != NULL)
>>             ret = cs + 1;
>>         else if (strcmp(s, "C") == 0 ||
>>                  strcmp(s, "POSIX") == 0)
>>             ret = "US-ASCII";
>>         else if (strcmp(s, "UTF-8") == 0)
>>             ret = "UTF-8";
>>     }
>>     break;
>>
>> As you can see, querylocale() will return "C", and
>> nl_langinfo(CODESET) will return "US-ASCII". The other thing to
>> realize is that on OS X MB_CUR_MAX is a macro for ___mb_cur_max(),
>> which returns 1 when LC_ALL is C.
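
The two values described above can be checked with a few lines of C. This is only a minimal sketch: probe_c_locale is a hypothetical helper name, not something from sed or gnulib, and the codeset string it reports depends on the platform's libc.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: switch the process to the "C" locale and report
   what libc says about it.  Stores the codeset name from
   nl_langinfo (CODESET) in *codeset_out and returns MB_CUR_MAX.  */
static size_t
probe_c_locale (const char **codeset_out)
{
  setlocale (LC_ALL, "C");
  *codeset_out = nl_langinfo (CODESET);
  return MB_CUR_MAX;
}

/* Print both values so they can be compared across systems.  */
static void
print_c_locale_info (void)
{
  const char *codeset;
  size_t max = probe_c_locale (&codeset);
  printf ("CODESET=%s MB_CUR_MAX=%zu\n", codeset, max);
}
```

Calling print_c_locale_info () from main on the reporter's system should, going by the report, show US-ASCII and 1. Other systems may name the codeset differently, which is why the sketch prints it rather than comparing it against a fixed string.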
>>
>> Back to sed/lib/localcharset.c, we end up at locale_charset(), line 483:
>>
>> /* Resolve alias. */
>> for (aliases = get_charset_aliases ();
>>      *aliases != '\0';
>>      aliases += strlen (aliases) + 1, aliases += strlen (aliases) + 1)
>>   if (strcmp (codeset, aliases) == 0
>>       || (aliases[0] == '*' && aliases[1] == '\0'))
>>     {
>>       codeset = aliases + strlen (aliases) + 1;
>>       break;
>>     }
>>
>> This tries to alias our charset, "US-ASCII", to something sed
>> understands. get_charset_aliases() is at line 112 in the same file. On
>> OS X 10.7, DARWIN7 is defined (always for OS X 10.3 or newer), so we
>> end up at line 223:
>>
>> /* To avoid the trouble of installing a file that is shared by many
>> GNU packages -- many packaging systems have problems with this --,
>> simply inline the aliases here. */
>> cp = "ISO8859-1" "\0" "ISO-8859-1" "\0"
>> "ISO8859-2" "\0" "ISO-8859-2" "\0"
>> "ISO8859-4" "\0" "ISO-8859-4" "\0"
>> "ISO8859-5" "\0" "ISO-8859-5" "\0"
>> "ISO8859-7" "\0" "ISO-8859-7" "\0"
>> "ISO8859-9" "\0" "ISO-8859-9" "\0"
>> "ISO8859-13" "\0" "ISO-8859-13" "\0"
>> "ISO8859-15" "\0" "ISO-8859-15" "\0"
>> "KOI8-R" "\0" "KOI8-R" "\0"
>> "KOI8-U" "\0" "KOI8-U" "\0"
>> "CP866" "\0" "CP866" "\0"
>> "CP949" "\0" "CP949" "\0"
>> "CP1131" "\0" "CP1131" "\0"
>> "CP1251" "\0" "CP1251" "\0"
>> "eucCN" "\0" "GB2312" "\0"
>> "GB2312" "\0" "GB2312" "\0"
>> "eucJP" "\0" "EUC-JP" "\0"
>> "eucKR" "\0" "EUC-KR" "\0"
>> "Big5" "\0" "BIG5" "\0"
>> "Big5HKSCS" "\0" "BIG5-HKSCS" "\0"
>> "GBK" "\0" "GBK" "\0"
>> "GB18030" "\0" "GB18030" "\0"
>> "SJIS" "\0" "SHIFT_JIS" "\0"
>> "ARMSCII-8" "\0" "ARMSCII-8" "\0"
>> "PT154" "\0" "PT154" "\0"
>> /*"ISCII-DEV" "\0" "?" "\0"*/
>> "*" "\0" "UTF-8" "\0";
>>
>> And here is the root problem. This table does not have an entry for
>> US-ASCII, so the lookup falls through to the default entry, "*",
>> which maps everything to "UTF-8". That is what locale_charset()
>> returns, and sed then sets a UTF-8 flag that is used in many
>> places.
>>
>> But this is dangerous, because now UTF-8 is set but MB_CUR_MAX is 1
>> and various parts of sed interpret "Rémi Leblond" as an invalid
>> character sequence for a UTF-8 character set. This is why /.*/ in the
>> regular expression only matches the "R" before bailing on the "é".
>>
>> POSIX says that the "C" locale should treat text data as binary input,
>> but in this situation sed is trying to treat it as a multibyte
>> encoding.
>>
>> FIX: the DARWIN7 table in get_charset_aliases() should not contain a
>> default that maps everything not defined to "UTF-8". Or at the very
>> least, it should include an entry for "US-ASCII" that maps to "ASCII",
>> as a charset.aliases file might.
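
To make the lookup behaviour and the second suggestion concrete, here is a small self-contained sketch. The tables are abbreviated from the one quoted above, the names table_as_reported, table_with_ascii and resolve_alias are made up for illustration, and the loop is a copy of the quoted "Resolve alias" loop rather than the actual gnulib code. It shows what the reporter proposes, not a change that has been accepted.

```c
#include <string.h>

/* Abbreviated copy of the DARWIN7 table from the report: NUL-separated
   (alias, canonical name) pairs.  The implicit NUL that ends the string
   literal terminates the list.  */
static const char table_as_reported[] =
  "ISO8859-1" "\0" "ISO-8859-1" "\0"
  "eucJP" "\0" "EUC-JP" "\0"
  "*" "\0" "UTF-8" "\0";

/* Same table with the proposed entry added.  It must come before the
   "*" wildcard, because the first matching pair wins.  */
static const char table_with_ascii[] =
  "ISO8859-1" "\0" "ISO-8859-1" "\0"
  "eucJP" "\0" "EUC-JP" "\0"
  "US-ASCII" "\0" "ASCII" "\0"
  "*" "\0" "UTF-8" "\0";

/* Walk the pairs the same way the quoted loop does and return the
   canonical name for CODESET, or CODESET itself if nothing matches.  */
static const char *
resolve_alias (const char *codeset, const char *aliases)
{
  for (; *aliases != '\0';
       aliases += strlen (aliases) + 1, aliases += strlen (aliases) + 1)
    if (strcmp (codeset, aliases) == 0
        || (aliases[0] == '*' && aliases[1] == '\0'))
      return aliases + strlen (aliases) + 1;
  return codeset;
}
```

With table_as_reported, resolve_alias ("US-ASCII", ...) hits the wildcard and yields "UTF-8". With table_with_ascii it yields "ASCII", while names that are still unlisted continue to fall through to "UTF-8".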
So this is the third time this change has been proposed.
If you follow the previous thread:
http://lists.gnu.org/archive/html/bug-gnulib/2012-03/threads.html#00104
it refers to Bruno's argument for not changing this:
http://lists.gnu.org/archive/html/bug-gnulib/2012-01/msg00342.html
It's very unfortunate that US-ASCII doesn't reflect reality on Mac OS X.
I don't have such a system to test this myself unfortunately.
cheers,
Pádraig.