bug-bash

Re: BUG? RFE? printf lacking unicode support in multiple areas


From: Eric Blake
Subject: Re: BUG? RFE? printf lacking unicode support in multiple areas
Date: Fri, 20 May 2011 14:49:22 -0600
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.17) Gecko/20110428 Fedora/3.1.10-1.fc14 Lightning/1.0b3pre Mnenhy/0.8.3 Thunderbird/3.1.10

On 05/20/2011 02:30 PM, Linda Walsh wrote:
> i.e. it's showing me a 16-bit value: 0x203c, which I thought would be the
> wide-char value for the double-exclamation.  Going from the wchar
> definition on NT, it is a 16-bit value.  Perhaps it is different under
> POSIX? but 0x203c taken as 32 bits with 2 high bytes of zeros would seem
> to specify the same codepoint for the Dbl-EXcl.

POSIX allows wchar_t to be either 2-byte or 4-byte, although only a
4-byte wchar_t can properly represent all of Unicode (with a 2-byte
wchar_t, as on Windows or Cygwin, you are inherently restricted from
using any Unicode code point above 0xffff if you want to maintain
POSIX compliance).
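
To make the limit concrete, here is a minimal sketch in Python (chosen
only for illustration; the helper name code_units is made up, and the
UTF-16/UTF-32 codecs merely stand in for a 2-byte and a 4-byte wchar_t).
It counts how many fixed-size units one character needs.  A code point
above 0xffff needs two 16-bit units, which breaks the "exactly one
wchar_t per character" model:

```python
# Count how many fixed-size code units a single character occupies.
def code_units(ch: str, encoding: str, unit_bytes: int) -> int:
    return len(ch.encode(encoding)) // unit_bytes

bmp = "\u203c"         # DOUBLE EXCLAMATION MARK, fits in 16 bits
astral = "\U0001f600"  # an arbitrary code point above 0xffff

units16_bmp = code_units(bmp, "utf-16-le", 2)
units16_astral = code_units(astral, "utf-16-le", 2)
units32_astral = code_units(astral, "utf-32-le", 4)

print(units16_bmp, units16_astral, units32_astral)  # 1 2 1
```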

> 
>> Since there is no way to produce a word containing a NUL character it is
>> impossible to support %lc in any useful way.
> ----
>     That's annoying.   How can one print out unicode characters
> that are supposed to be 1 char long?

I think you are misunderstanding the difference between wide characters
(exactly one wchar_t per character) and multi-byte characters (1 or more
char [byte] per character).

Unicode can be represented in two different ways.  One way is with wide
characters (every wchar_t holds exactly one Unicode code point, and
code points < 0x100 have embedded NUL bytes if you view the memory
containing those wchar_t as an array of bytes).  The other way is with
multi-byte encodings, such as UTF-8 (every character occupies a variable
number of bytes, and the only character that can contain an embedded NUL
byte is the NUL character at code point 0).
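
The difference is easy to see by dumping the bytes.  The following is
only a sketch: it uses Python's "utf-32-le" codec as a stand-in for a
4-byte little-endian wchar_t (an assumption; real layouts depend on the
platform), and the names samples, results and contains_nul are
illustrative.

```python
# Compare the bytes of a wide representation with the UTF-8 bytes.
# Assumption: "utf-32-le" models a 4-byte little-endian wchar_t.
samples = {"A": "A", "DOUBLE EXCLAMATION MARK": "\u203c", "NUL": "\x00"}

def contains_nul(data: bytes) -> bool:
    return b"\x00" in data

results = {}
for name, ch in samples.items():
    wide = ch.encode("utf-32-le")   # fixed width: always 4 bytes here
    multi = ch.encode("utf-8")      # variable width: 1 to 4 bytes
    results[name] = (wide, multi)
    print(name, wide.hex(), contains_nul(wide),
          multi.hex(), contains_nul(multi))
```

In this model the wide form of both "A" and U+203C contains zero bytes,
while the UTF-8 form contains one only for the NUL character itself.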

Bash _only_ uses multi-byte characters for input and output.  %lc only
uses wchar_t.  Since wchar_t output is not useful for a shell that does
not do input in wchar_t, that explains why bash printf need not support
%lc.  POSIX doesn't require it, at any rate, but it also doesn't forbid
it as an extension.
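
In practice, then, printing a character such as U+203C from a shell
means emitting its multi-byte sequence.  As a hedged sketch (Python is
used only to compute the bytes; printf_escapes is a hypothetical helper,
and UTF-8 output is assumed), this builds the octal escapes that a
printf format string accepts:

```python
# Build octal escapes for the multi-byte form of one code point.
# Assumption: the terminal expects UTF-8.
def printf_escapes(codepoint: int, encoding: str = "utf-8") -> str:
    return "".join("\\%03o" % b for b in chr(codepoint).encode(encoding))

escapes = printf_escapes(0x203C)
print("printf '%s\\n'" % escapes)  # printf '\342\200\274\n'
```

The printed command contains no NUL bytes and no wchar_t, only the three
bytes of the UTF-8 encoding.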

> This isn't just a bash problem given how well most of the unix "character"
> utils work with unicode -- that's something that really needs to be solved
> if those character utils are going to continue to be _as useful_ in the
> future.
> Sure they will have their current functionality which is of use in many
> ways, but for anyone not processing ASCII text it becomes a problem, but
> this isn't really a bash issue.

Most utilities that work with Unicode work with UTF-8 (that is, with
multi-byte characters using a variable number of bytes), and NOT with
wide characters (that is, with all characters occupying a fixed width).
But you can switch between encodings using the iconv(1) utility, so
converting from one encoding type to another shouldn't really be a
problem in practice.
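
As a rough analogue of what such a conversion does, here is a short
Python sketch (the convert helper is made up for illustration, and it
is not how iconv itself is implemented).  On the command line the
equivalent would be along the lines of "iconv -f UTF-8 -t UTF-32LE",
assuming your iconv supports those encoding names.

```python
# Convert bytes from one encoding to another by decoding to code points
# and re-encoding, then check that the round trip is lossless.
def convert(data: bytes, src: str, dst: str) -> bytes:
    return data.decode(src).encode(dst)

utf8 = b"\xe2\x80\xbc"                       # U+203C as UTF-8
wide = convert(utf8, "utf-8", "utf-32-le")   # fixed-width form
back = convert(wide, "utf-32-le", "utf-8")   # and back again

print(wide.hex(), back == utf8)  # 3c200000 True
```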

>     That said, it was my impression that a wchar was 16-bits (at least it
> is on MS).  Is it different under POSIX?

POSIX allows 16-bit wchar_t, but if you have a 16-bit wchar_t, you
cannot support all of Unicode.

-- 
Eric Blake   eblake@redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org

