16-bit wchar_t on Windows and Cygwin
From: Bruno Haible
Subject: 16-bit wchar_t on Windows and Cygwin
Date: Mon, 31 Jan 2011 03:04:42 +0100
User-agent: KMail/1.9.9
Hi,
It has been known for a long time that on native Windows, the wchar_t[]
encoding of strings is UTF-16. [1] Now, Corinna Vinschen has confirmed that it
is the same for Cygwin >= 1.7. [2]
Other platforms have either a 32-bit wchar_t (such as glibc, Solaris, *BSD,
and many others), or have a 16-bit wchar_t that, in UTF-8 locales, uses the
UCS-2 encoding (namely AIX).[3]
What consequences does this have?
1) All code that uses the functions from <wctype.h> (wide character
classification and mapping) or wcwidth() malfunctions on strings that
contain Unicode characters outside the BMP, i.e. outside the range
U+0000..U+FFFF.
2) Code that uses mbrtowc() or wcrtomb() is also likely to malfunction.
On Cygwin >= 1.7 mbrtowc() and wcrtomb() are implemented in an intelligent
but somewhat surprising way: wcrtomb() may return 0, that is, produce no
output bytes when it consumes a wchar_t.
On native Windows, I could not test it (I could not enable any UTF-8 or
GB18030 locale on Windows XP), but due to the behaviour of the functions
MultiByteToWideChar and WideCharToMultiByte [4] it looks like the
implementations of mbrtowc() and wcrtomb() will not be able to cope
with characters outside the BMP.
Examples of such code are:
- In gnulib, the files
    file                uses
    exclude.c           towlower
    fnmatch.c           towlower
    mbchar.h            isw*
    mbmemcasecoll.c     towlower
    mbscasestr.c        towlower
    mbswidth.c          iswcntrl, wcwidth
    quotearg.c          iswprint
    regcomp.c           towlower
    regex_internal.h    iswalnum, iswlower
    regex_internal.c    towupper
    strftime.c          towlower, towupper
    strtol.c            iswalpha, iswspace, towupper
- In coreutils, the program 'wc':
The correct behaviour is:
$ echo 'a b' | wc -w -m
2 4
Now with a U+2002 space:
$ printf 'a\xe2\x80\x82b\n' | wc -w -m
2 4
Now with a Chinese character from the BMP:
$ printf 'a\xe3\x91\x96b\n' | wc -w -m
1 4
$ printf 'a \xe3\x91\x96 b\n' | wc -w -m
3 6
Now with a Chinese character outside the BMP:
$ printf 'a\xf0\xa1\x88\xb4b\n' | wc -w -m
1 4
$ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
3 6
On Cygwin 1.7.5 (with LANG=C.UTF-8 and 'wc' from GNU coreutils 8.5):
$ printf 'a\xf0\xa1\x88\xb4b\n' | wc -w -m
1 5
$ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
2 7
So both the number of characters and the number of words are counted
incorrectly as soon as non-BMP characters occur.
What can we do about it?
Adding lots of conditional code to the above listed gnulib, coreutils, gettext
etc. source files? That would be an endless amount of work.
I'm more in favour of overriding wchar_t and all functions that depend on it -
like we did successfully for the socket functions.
In practice, this would mean that on Windows (both native Windows and
Cygwin >= 1.7) the use of a 'wchar_t' module will
- override wchar_t to be 32 bits, like in glibc,
- cause functions from mbrtowc() to wcwidth() to be overridden. Since the
corresponding system functions are unusable, the replacements will use the
modules from libunistring (such as unictype/ctype-alnum and uniwidth/width).
It also means that we will have separate modules for 'iswalnum', ...,
'towupper', which are currently all in the module 'wctype'.
How does that sound? Other thoughts?
Bruno
[1] http://msdn.microsoft.com/en-us/library/dd319072%28v=vs.85%29.aspx
[2] http://cygwin.com/ml/cygwin/2011-01/msg00410.html
[3] Found by running the attached program multibyte-utf16-unix.c
[4] See the attached program multibyte-utf16-win32.c
Attachment: multibyte-utf16-unix.c (text)
Attachment: multibyte-utf16-win32.c (text)