Re: Multibyte and unibyte file names
From: Eli Zaretskii
Subject: Re: Multibyte and unibyte file names
Date: Wed, 23 Jan 2013 21:04:46 +0200

> Date: Wed, 23 Jan 2013 10:08:25 -0800
> From: Paul Eggert <address@hidden>
> CC: address@hidden, Kazuhiro Ito <address@hidden>,
> Michael Albinus <address@hidden>
>
> On 01/23/13 09:45, Eli Zaretskii wrote:
>
> >   if (srclen > 1
> >       && IS_DIRECTORY_SEP (dst[srclen - 1]))
> >     {
> >       dst[srclen - 1] = 0;
> >       srclen--;
> >     }
> >
> > If dst[] is an encoded string that uses a multibyte encoding, it is
> > wrong to look at just the last byte of the string, because it could be
> > a trailing byte of some multibyte sequence, right?
>
> If memory serves, the answer to that question is different for
> GNU / POSIX / etc (GNUish) systems than for MS-Windows systems.
> On GNUish systems, the kernel doesn't know about encodings,
> so the above code is correct for the file system even if
> it produces a byte string that is not properly encoded for
> the file name coding system.

I understand that, but what it means is that encoding a file name,
then removing its last "slash" as above, then decoding it again will
yield a wrong or even an invalid string, right? IOW, Emacs will still
have a bug, even though from the OS point of view that slash would
have been regarded as a directory separator.

> On MS-Windows systems, as I understand it, the operating system is
> cognizant of which file name encoding you're using, so the above is
> indeed an error.

The OS uses UTF-16 for file names, but the APIs Emacs uses accept
single-byte or DBCS encoded file names, which are converted to UTF-16
internally, before handing them to the filesystem layer. It is this
conversion that must support the original encoding, or else the UTF-16
result will be incorrect, or in extreme cases the API itself will fail
and reject the file name.
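
To make the DBCS hazard concrete, here is a minimal sketch (not the
actual Emacs code) of a separator check that walks the string from the
start instead of looking only at the last byte. It assumes the file name
is encoded in CP932 (Shift-JIS), where a trail byte can have the value
0x5C, the same as a backslash. The helper names is_cp932_lead_byte,
is_directory_sep and strip_trailing_sep are hypothetical and exist only
for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Assumption: CP932 (Shift-JIS) lead bytes are 0x81..0x9F and
   0xE0..0xFC; the byte that follows a lead byte belongs to the same
   character, even when its value is 0x5C.  */
static int
is_cp932_lead_byte (unsigned char c)
{
  return (0x81 <= c && c <= 0x9F) || (0xE0 <= c && c <= 0xFC);
}

/* Hypothetical stand-in for IS_DIRECTORY_SEP on MS-Windows.  */
static int
is_directory_sep (unsigned char c)
{
  return c == '/' || c == '\\';
}

/* Remove one trailing directory separator from DST, a NUL-terminated
   string of SRCLEN bytes, but only when the last byte is a character
   in its own right and not the trail byte of a double-byte character.
   Return the new length.  */
static size_t
strip_trailing_sep (char *dst, size_t srclen)
{
  size_t i = 0;
  size_t last = 0;	/* offset where the last character starts */

  if (srclen <= 1)
    return srclen;

  /* Walk character by character so that we know where the last
     character begins.  */
  while (i < srclen)
    {
      last = i;
      if (is_cp932_lead_byte ((unsigned char) dst[i]) && i + 1 < srclen)
	i += 2;
      else
	i += 1;
    }

  if (last == srclen - 1 && is_directory_sep ((unsigned char) dst[last]))
    {
      dst[last] = 0;
      srclen--;
    }
  return srclen;
}
```

With this sketch, "dir\" loses its trailing backslash, while the two-byte
string 0x95 0x5C (a single CP932 character) is left alone; the byte-wise
test quoted above would have truncated it to the lone lead byte 0x95.
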
> In practice nobody in the GNUish world uses encodings that
> are unsafe for '/', so to some extent this is just a theoretical
> issue in the GNUish world -- it just doesn't come up.

Yes, that part is quite clear. Likewise, since UTF-8 is almost always
the file-name encoding, bugs whereby un-encoded file names are passed
to system APIs can easily go unnoticed.