[bug-gnulib] Re: ISSLASH on Woe32
Wed, 27 Apr 2005 19:19:28 +0100
Mozilla Thunderbird 1.0.2 (Windows/20050317)
On 27/04/2005 15:56, Bruno Haible wrote:
Tor Lillqvist <address@hidden> brought this up:
The technique of searching for directory separators in strings through the
ISSLASH macro does not, on Woe32, support non-ASCII pathnames in most CJK
locale encodings. Why? ISSLASH looks for a _byte_ with value 0x5C. However,
0x5C -or- 0x2F on Woe32.
in these locale encodings
Japanese: CP932 SHIFT-JIS
Chinese: GBK GB18030 BIG5 BIG5-HKSCS CP950
the byte 0x5C occurs as second byte of some multibyte characters. If such a
character is used inside a directory name, code that uses ISSLASH does not
work correctly. All gnulib modules that use ISSLASH are affected.
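To make the failure concrete, here is a minimal sketch of a byte-oriented scan tripping over a CP932 pathname. SKETCH_ISSLASH and last_slash_bytewise are hypothetical names made up for this illustration; the macro only approximates what ISSLASH does on Woe32 and is not gnulib's actual definition. The sample path contains U+8868, which CP932 encodes as the byte pair 0x95 0x5C.

```c
#include <stddef.h>

/* Byte-oriented check in the style of ISSLASH on Woe32.  This is an
   approximation written for this sketch, not gnulib's definition.  */
#define SKETCH_ISSLASH(c) ((c) == '/' || (c) == '\\')

/* Return the offset of the last byte accepted by SKETCH_ISSLASH,
   or -1 if no byte is accepted.  */
static long
last_slash_bytewise (const char *path)
{
  long last = -1;
  for (size_t i = 0; path[i] != '\0'; i++)
    if (SKETCH_ISSLASH (path[i]))
      last = (long) i;
  return last;
}

/* "C:\" followed by U+8868 and ".txt", encoded in CP932.
   U+8868 is the two bytes 0x95 0x5C, so its second byte looks
   like a backslash to a byte-wise scan.  */
static const char cp932_path[] = "C:\\\x95\x5c.txt";
```

The only real separator in cp932_path is at offset 2, but last_slash_bytewise reports offset 4, the trail byte of the two-byte character, so code that splits the path there would cut the character in half.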
Could this also be a problem on Unix systems using multibyte encoded
(UTF-8) filesystems, if not now then in the future? Maybe some (future)
Unix systems support multi-byte encoded filenames containing 0x2F in the
second+ byte of a multi-byte character.
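For UTF-8 specifically, the encoding rules give every byte of a multi-byte character its high bit set, so neither 0x2F nor 0x5C can appear inside one. The sketch below illustrates that property; utf8_encode and utf8_has_ascii_byte are hypothetical helpers written for this note, and they assume the input is a valid code point (no surrogates, at most 0x10FFFF). Other multibyte encodings make no such guarantee.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode code point CP as UTF-8 into OUT and return the number of
   bytes written (1 to 4).  CP is assumed to be a valid code point.  */
static size_t
utf8_encode (uint32_t cp, unsigned char out[4])
{
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp < 0x10000)
    {
      out[0] = (unsigned char) (0xE0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  out[0] = (unsigned char) (0xF0 | (cp >> 18));
  out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
  out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
  out[3] = (unsigned char) (0x80 | (cp & 0x3F));
  return 4;
}

/* Return 1 if the UTF-8 encoding of CP contains a byte below 0x80,
   0 otherwise.  Only single-byte (ASCII) characters should do so.  */
static int
utf8_has_ascii_byte (uint32_t cp)
{
  unsigned char buf[4];
  size_t n = utf8_encode (cp, buf);
  for (size_t i = 0; i < n; i++)
    if (buf[i] < 0x80)
      return 1;
  return 0;
}
```

For example, U+8868 encodes as the three bytes E8 A1 A8, none of which can be mistaken for '/' or '\'.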
What can we do?
1) On Woe32, use 'wchar_t*' instead of 'char*' to denote pathnames.
Use conditional macros like _TCHAR, _TEXT(), _tcslen() etc. that
make it possible to accommodate these platform differences without too many #ifs.
2) On Woe32, expect UTF-8 encoded 'char*' strings to denote pathnames.
3) Use mbtowc() to step through pathnames while looking for a backslash.
4) Document this as a limitation. The workaround for the user is to
switch to a UTF-8 locale.
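As a rough illustration of option 3, the following sketch walks a pathname one multibyte character at a time and only treats a slash or backslash as a separator when it is a complete single-byte character. It uses the standard mbrlen() rather than mbtowc() so that the shift state is explicit; last_slash_mb is a hypothetical name, not an existing gnulib function. It assumes the program has already called setlocale (LC_ALL, "") so that the user's encoding is in effect; in the "C" locale it behaves like a plain byte scan.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Return a pointer to the last '/' or '\\' in PATH that forms a complete
   single-byte character in the current locale's encoding, or NULL if
   there is none.  Trail bytes of multibyte characters are skipped.  */
static const char *
last_slash_mb (const char *path)
{
  const char *last = NULL;
  size_t remaining = strlen (path);
  mbstate_t state;
  memset (&state, 0, sizeof state);

  while (remaining > 0)
    {
      size_t n = mbrlen (path, remaining, &state);
      if (n == (size_t) -1 || n == (size_t) -2 || n == 0)
        {
          /* Invalid or truncated sequence: skip one byte and restart
             from the initial conversion state.  */
          n = 1;
          memset (&state, 0, sizeof state);
        }
      else if (n == 1 && (*path == '/' || *path == '\\'))
        last = path;
      path += n;
      remaining -= n;
    }
  return last;
}
```

Every place that currently scans with ISSLASH would need a loop of this shape, which is the cost listed for this option among the drawbacks.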
It's probably best to choose one internal representation of pathnames
and stick to it, but any representation other than single 'char' is a
lot of work, as you say!
The drawbacks are:
1) Tons of code that deals with pathnames has to be changed to use
typedef'ed types. Also, support for WindowsME and older is dropped.
Wouldn't it be possible to link against unicows.dll to support
Win95/98/ME? Are there licensing problems with this?
2) Extra code must be added for every system call to convert pathname
arguments from UTF-8 to UTF-16, and pathname results from UTF-16
to UTF-8. Also, the user of the gnulib modules must be aware of the
semantic difference. Also, support for WindowsME and older is dropped.
It sounds like replacing the system calls with some wrapper functions
with lots of conditional code. Maybe the wrapper functions could avoid
converting to and from UTF-16 if they are running on WinME and earlier.
Strictly speaking, UTF-16 is a multi-16-bit-word encoding, but I don't
know what support Woe32 systems have for characters outside the Unicode
BMP (requiring more than one 16-bit word in their encoding).
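For reference, this is how a character outside the BMP becomes two 16-bit words (a surrogate pair) in UTF-16. The function name utf16_encode is made up for this sketch, and it assumes the input is a valid code point (not itself a surrogate, at most 0x10FFFF); it says nothing about how well any particular Woe32 API handles such pairs.

```c
#include <stdint.h>

/* Encode code point CP as UTF-16 into OUT and return the number of
   16-bit units written: 1 for BMP characters, 2 (a surrogate pair)
   for characters at or above U+10000.  CP is assumed to be valid.  */
static int
utf16_encode (uint32_t cp, uint16_t out[2])
{
  if (cp < 0x10000)
    {
      out[0] = (uint16_t) cp;
      return 1;
    }
  cp -= 0x10000;                                /* leaves a 20-bit value */
  out[0] = (uint16_t) (0xD800 | (cp >> 10));    /* high surrogate: top 10 bits */
  out[1] = (uint16_t) (0xDC00 | (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
  return 2;
}
```

For example, the CJK character U+20BB7 comes out as the pair D842 DFB7, so code that assumes one wchar_t per character on Woe32 has the same kind of problem as byte-wise code has with CP932, only more rarely.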
I think (2) also implies (3). If you use UTF-8 internally, any parsing
of pathnames needs to change, e.g. the IS_PATH_WITH_DIR macro in pathname.h.
3) Tons of code that deals with pathnames has to be changed to use
mbtowc(), _mbschr(), _mbsrchr() etc.
4) For users in CJK locales on Woe32, the contents of directories with
some non-ASCII pathnames are inaccessible to GNU tools.
Microsoft recommends approach 1. GNOME has chosen approach 2. I would
favour answer 4.
What do you think?
My gut instinct would be to use UTF-8 internally, but I'm not doing the
-=( Ian Abbott @ MEV Ltd. E-mail: <address@hidden> )=-
-=( Tel: +44 (0)161 477 1898 FAX: +44 (0)161 718 3587 )=-