Re: Fix u32toutf8 so it encodes values > 0xFFFF correctly.

From: Eric Blake
Subject: Re: Fix u32toutf8 so it encodes values > 0xFFFF correctly.
Date: Wed, 22 Feb 2012 20:54:00 -0700

On 02/22/2012 07:43 PM, John Kearney wrote:
> ^ caveat: you can represent the full 0x10ffff in UTF-16, you just need 2
> UTF-16 characters. Check out the latest version of unicode.c for an
> example of how.

Yes, and Cygwin actually does this.

A strict reading of POSIX states that wchar_t must be wide enough to
hold every supported character, which technically limits a
POSIX-compliant app with 16-bit wchar_t to the Basic Multilingual
Plane.  But Cygwin has exploited a loophole in the POSIX wording: POSIX
does not require that all bit patterns be valid characters.  So on
paper, rather than treating all 65536 patterns as valid characters,
Cygwin lists the values used in surrogate halves (0xd800 to 0xdfff) as
non-characters, so that using them triggers undefined behavior per
POSIX.  In practice, it treats them as surrogate pairs.  That gives you
the full Unicode character set, but it reintroduces the headaches that
multibyte characters caused with 'char', now with wchar_t: you are back
to dealing with variable-sized character elements.
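
To make the surrogate-pair arithmetic concrete, here is a minimal
sketch in C.  The helper names utf16_encode and utf16_decode_pair are
made up for illustration; they are not taken from Cygwin or from bash's
unicode.c, and the sketch assumes you simply want to reject lone
surrogates rather than pass them through.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-16 code units.
   Returns the number of units written (1 or 2), or 0 if cp is not a
   valid scalar value (a surrogate, or above 0x10FFFF). */
static size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* basic plane: one unit */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}

/* Hypothetical helper: combine a high/low surrogate pair back into a
   code point.  The caller must have checked that hi is in
   0xD800..0xDBFF and lo is in 0xDC00..0xDFFF. */
static uint32_t utf16_decode_pair(uint16_t hi, uint16_t lo)
{
    return 0x10000 + (((uint32_t)(hi & 0x3FF) << 10) | (lo & 0x3FF));
}
```

With this sketch, U+1F600 comes out as the two units 0xD83D 0xDE00,
which is exactly the variable-sized handling that code using a 16-bit
wchar_t has to be prepared for.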

Furthermore, the mess of 16-bit vs. 32-bit wchar_t is one of the reasons
why C11 has introduced two new character types, char16_t and char32_t,
designed to cover the full Unicode set regardless of what size wchar_t
is.  It will be interesting to see how the next version of POSIX takes
the additions of C11 and retrofits the wide-character functions that
are in POSIX but not in C99 to handle the new character types.
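
As a rough illustration of what the new types buy you, here is a sketch
of a UTF-8 encoder that takes a char32_t, so it behaves the same
whether wchar_t is 16 or 32 bits.  The name encode_utf8 is hypothetical
and this is not bash's u32toutf8; it also assumes the platform ships
the C11 <uchar.h> header, which not every libc does.

```c
#include <stddef.h>
#include <uchar.h>   /* C11: char16_t, char32_t */

/* Hypothetical helper: encode one code point as UTF-8.
   Returns the number of bytes written (1 to 4), or 0 if c is a
   surrogate or lies above 0x10FFFF. */
static size_t encode_utf8(char32_t c, unsigned char out[4])
{
    if (c < 0x80) {
        out[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {
        out[0] = (unsigned char)(0xC0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c < 0x10000) {
        if (c >= 0xD800 && c <= 0xDFFF)
            return 0;                   /* lone surrogate: reject */
        out[0] = (unsigned char)(0xE0 | (c >> 12));
        out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (c & 0x3F));
        return 3;
    }
    if (c <= 0x10FFFF) {                /* values > 0xFFFF need 4 bytes */
        out[0] = (unsigned char)(0xF0 | (c >> 18));
        out[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (c & 0x3F));
        return 4;
    }
    return 0;
}
```

Here U+1F600 encodes to the four bytes F0 9F 98 80, and U+20AC to the
three bytes E2 82 AC.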

Eric Blake   address@hidden    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
