Re: Multibyte and unibyte file names

emacs-devel

From:	Eli Zaretskii
Subject:	Re: Multibyte and unibyte file names
Date:	Wed, 23 Jan 2013 21:04:46 +0200

> Date: Wed, 23 Jan 2013 10:08:25 -0800
> From: Paul Eggert <address@hidden>
> CC: address@hidden, Kazuhiro Ito <address@hidden>, 
>  Michael Albinus <address@hidden>
> 
> On 01/23/13 09:45, Eli Zaretskii wrote:
> 
> >   if (srclen > 1
> >       && IS_DIRECTORY_SEP (dst[srclen - 1]))
> >     {
> >       dst[srclen - 1] = 0;
> >       srclen--;
> >     }
> > 
> > If dst[] is an encoded string that uses a multibyte encoding, it is
> > wrong to look at just the last byte of the string, because it could be
> > a trailing byte of some multibyte sequence, right?
> 
> If memory serves, the answer to that question is different for
> GNU / POSIX / etc (GNUish) systems than for MS-Windows systems.
> On GNUish systems, the kernel doesn't know about encodings,
> so the above code is correct for the file system even if
> it produces a byte string that is not properly encoded for
> the file name coding system.

I understand that, but what it means is that encoding a file name,
then removing its last "slash" as above, then decoding it again will
yield a wrong or even an invalid string, right?  IOW, Emacs will still
have a bug, even though from the OS point of view that slash would
have been regarded as a directory separator.
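To make the concern concrete, here is a minimal sketch of a check that only
strips the trailing separator when it is a character of its own.  It assumes
a Shift-JIS style DBCS encoding (lead bytes 0x81..0x9F and 0xE0..0xFC), where
e.g. the two-byte sequence 0x83 0x5C ends in the same byte as '\'.  The names
is_sjis_lead_byte, is_dir_sep, last_byte_is_whole_char and strip_trailing_sep
are hypothetical, made up for this illustration; they are not existing Emacs
functions, and is_dir_sep merely stands in for IS_DIRECTORY_SEP.

```c
#include <stddef.h>

/* Assumption: Shift-JIS / cp932 lead-byte ranges.  Other DBCS
   encodings use different ranges.  */
static int
is_sjis_lead_byte (unsigned char c)
{
  return (0x81 <= c && c <= 0x9F) || (0xE0 <= c && c <= 0xFC);
}

/* Stand-in for IS_DIRECTORY_SEP on a system that accepts both
   '/' and '\' as separators.  */
static int
is_dir_sep (unsigned char c)
{
  return c == '/' || c == '\\';
}

/* Return nonzero if the last byte of S (LEN bytes long) is a complete
   single-byte character, zero if it is the trailing byte of a
   two-byte sequence.  The string must be scanned from the start,
   because a trailing byte cannot be recognized in isolation.  */
static int
last_byte_is_whole_char (const char *s, size_t len)
{
  size_t i = 0;

  while (i < len)
    {
      /* A lead byte consumes the following byte as well, if any.  */
      size_t step
        = (is_sjis_lead_byte ((unsigned char) s[i]) && i + 1 < len) ? 2 : 1;

      if (i + step == len)
        return step == 1;
      i += step;
    }
  return 0;
}

/* Like the quoted fragment, but leave the last byte alone when it
   belongs to a multibyte sequence.  Return the new length.  */
size_t
strip_trailing_sep (char *dst, size_t srclen)
{
  if (srclen > 1
      && is_dir_sep ((unsigned char) dst[srclen - 1])
      && last_byte_is_whole_char (dst, srclen))
    {
      dst[srclen - 1] = 0;
      srclen--;
    }
  return srclen;
}
```

With this, "dir\" is shortened to "dir", while a name whose last two bytes are
0x83 0x5C is left untouched.  The price is a scan from the beginning of the
string, and the lead-byte test has to match whatever encoding the file name
was actually encoded in.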

> On MS-Windows systems, as I understand it, the operating system is
> cognizant of which file name encoding you're using, so the above is
> indeed an error.

The OS uses UTF-16 for file names, but the APIs Emacs uses accept file
names in single-byte or DBCS encodings and convert them to UTF-16
internally, before handing them to the filesystem layer.  It is this
conversion that must support the original encoding, or else the UTF-16
result will be incorrect, or in extreme cases the API itself will fail
and reject the file name.

> In practice nobody in the GNUish world uses encodings that
> are unsafe for '/', so to some extent this is just a theoretical
> issue in the GNUish world -- it just doesn't come up.

Yes, that part is quite clear.  Likewise, since UTF-8 is almost always
the file-name encoding, bugs whereby un-encoded file names are passed
to system APIs can easily go unnoticed.
