Re: GNU sed version 4.2.1: on OS X, C locale gets aliased to UTF-8

bug-gnulib

From:	Bruno Haible
Subject:	Re: GNU sed version 4.2.1: on OS X, C locale gets aliased to UTF-8
Date:	Thu, 07 Jun 2012 14:07:20 +0200
User-agent:	KMail/4.7.4 (Linux/3.1.10-1.9-desktop; KDE/4.7.4; x86_64; ; )

Stephen Butler wrote:
> POSIX says that the "C" locale should treat text data as binary input,

Can you please point to where this is written?

IMO [1] describes the behaviour of the "C" locale only for characters
that belong to what we know as "US-ASCII" (i.e. bytes 0x00..0x7F).
As soon as you pass the string "Rémi" to a program running in the "C" locale,
you are speculating on implementation-dependent behaviour.
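To make the US-ASCII boundary concrete, here is a minimal sketch that counts the bytes of a string falling outside 0x00..0x7F, i.e. the bytes whose treatment in the "C" locale is implementation-dependent. The function name count_non_ascii_bytes is hypothetical, and the example assumes "Rémi" is encoded as UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count the bytes of s that lie outside
   US-ASCII (0x00..0x7F).  These are the bytes whose handling in the
   "C" locale depends on the implementation. */
size_t count_non_ascii_bytes(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; p++) {
        if (*p > 0x7F)
            n++;
    }
    return n;
}
```

For the UTF-8 encoding of "Rémi", written in C as "R\xc3\xa9" "mi", the two bytes of the "é" are outside US-ASCII, so the function returns 2.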

> But this is dangerous, because now UTF-8 is set but MB_CUR_MAX is 1
> and various parts of sed interpret "Rémi Leblond" as an invalid
> character sequence for a UTF-8 character set.

Indeed, I can see how this inconsistency leads to bugs like the ones
described.
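The inconsistency can be expressed as a small check. This is only a sketch: the function name is hypothetical, and it covers just the one combination discussed here (a charset name of "UTF-8" together with MB_CUR_MAX equal to 1).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical check: return 0 if the reported charset name and the
   multibyte limit contradict each other, 1 otherwise.  UTF-8 has
   multi-byte characters, so a charset of "UTF-8" combined with
   MB_CUR_MAX == 1 is the contradictory case from this thread. */
int charset_matches_mb_cur_max(const char *charset, size_t mb_cur_max)
{
    assert(charset != NULL);
    if (strcmp(charset, "UTF-8") == 0 && mb_cur_max == 1)
        return 0;
    return 1;
}
```

A caller could pass the result of locale_charset() (or of nl_langinfo(CODESET)) together with MB_CUR_MAX after calling setlocale(); on the OS X setup described above, the check would presumably return 0.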

One fix could be to have two different locale_charset() functions:
one that returns "US-ASCII" and another that returns "UTF-8".
The first would be used wherever MB_CUR_MAX and mbrtowc() are also
used, the second by gettext(). But the dividing line between the
two cases is not yet clear to me. Any insights?

Bruno

[1] http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html#tag_07_02
