emacs-devel

Re: Multibyte and unibyte file names


From: Eli Zaretskii
Subject: Re: Multibyte and unibyte file names
Date: Sat, 26 Jan 2013 13:27:12 +0200

> From: "Stephen J. Turnbull" <address@hidden>
> Cc: Stefan Monnier <address@hidden>,
>     address@hidden,
>     address@hidden,
>     address@hidden
> Date: Sat, 26 Jan 2013 12:04:50 +0900
> 
> "Unibyte" as implemented in Emacs is a premature optimization, and a
> disaster in search of places to happen.  Remove it, and you'll never
> notice it's gone.  The consequence of that removal would be to fix
> this problem, permanently.

I don't think you are entirely correct.  We still need to send encoded
(unibyte) strings to the outside world.  IOW, file names are not the
only user of unibyte strings.

> As Stefan says, there would remain a more general problem: with
> the exception of Windows Unicode APIs, there is no absolutely
> reliable way of determining the user's intended encoding.

That's a non-issue: we treat unibyte file names as encoded in
file-name-coding-system.  Nothing else is supported, or needed.

> However, the only important cases where this interferes with usual
> filename parsing needs are Shift JIS and Big 5 on Windows, where you
> *do* have that absolutely reliable alternative.

Again, detecting the encoding is a non-issue.  When I see an encoded
file name, I always _know_ how it was encoded, and I can decode it by
using DECODE_FILE.
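
To make the Shift JIS parsing hazard concrete, here is a minimal sketch in plain Python rather than Emacs code. It assumes the Windows code page `cp932` as the known file-name encoding and uses a made-up path; the helper names are hypothetical. It shows why decoding with the known coding system has to come before any parsing: the second byte of some double-byte characters equals the backslash byte 0x5C.

```python
# Illustration only: plain Python, not Emacs internals.
# Assumption: the file name is known to be encoded in cp932 (Windows Shift JIS).

# U+8868 encodes to the two bytes 0x95 0x5C; the trail byte equals "\".
path = "C:\\\u8868\\file.txt"
encoded = path.encode("cp932")


def split_encoded_naively(raw: bytes) -> list:
    """Split on the byte 0x5C without decoding first (the unsafe way)."""
    return raw.split(b"\\")


def split_after_decoding(raw: bytes, coding: str) -> list:
    """Decode with the known coding system, then split on characters."""
    return raw.decode(coding).split("\\")


naive = split_encoded_naively(encoded)          # trail byte is taken for a separator
safe = split_after_decoding(encoded, "cp932")   # separators found on characters
```

The naive split sees a spurious separator inside the double-byte character, while the decode-first split recovers the three real components.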

> The right thing to do in some sense is to have an "external file name
> type" which stores both the Emacs string name and (if the name was
> received as bytes from outside) a representation of those bytes.
> Rather than change the Lisp_String structure, I would recommend
> putting a property (`text-as-received', `externally-coded-text', or
> whatever) on the string.  The content of the property would be the
> filename decoded as 'binary (or perhaps using Emacs's
> undecodable-bytes representation).
> 
> Although Emacs doesn't seem to have string properties (ie, on the
> object), one can put a text property on the string (or use an overlay,
> which might work for the degenerate case of a 0-length string).  This
> would allow callers (and sufficiently Type A users) to retry decoding
> with a different encoding.
> 
> Of course this requires rather smart callers if they slice-n-dice the
> file name.

Exactly.  Moreover, what you suggest is a large project that won't
happen without a motivated individual.  Given the overall "cannot
happen on POSIX, so it's SEP" reaction I got to this thread, what do
you think are the chances of such a project materializing any time
soon?

And that is even before we start to talk about the details of your
proposal and consider its downsides (what to do when
file-name-coding-system is changed, the adverse performance impact of
too many overlays, ...).
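
The trade-offs in the quoted proposal can be sketched outside Emacs. The following is a rough Python analogy, not how Emacs strings or text properties work: `ExternalFileName`, `from_bytes`, and `redecode` are hypothetical names, and `cp932` and `latin-1` are arbitrary example encodings. It keeps the bytes as received next to the decoded name so that decoding can be retried, and it shows how slicing the decoded text drops that link.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExternalFileName:
    """Hypothetical pairing of a decoded name with the bytes as received."""
    raw: bytes    # bytes exactly as received from outside
    coding: str   # coding system used for the current decoding
    text: str     # decoded form handed to callers

    @classmethod
    def from_bytes(cls, raw: bytes, coding: str) -> "ExternalFileName":
        return cls(raw, coding, raw.decode(coding, errors="replace"))

    def redecode(self, coding: str) -> "ExternalFileName":
        """Retry decoding the original bytes with a different coding system."""
        return ExternalFileName.from_bytes(self.raw, coding)


raw = "\u8868.txt".encode("cp932")                 # bytes as they arrive from outside
wrong = ExternalFileName.from_bytes(raw, "latin-1")  # decoded with a wrong guess
right = wrong.redecode("cp932")                    # same bytes, decoded again

# Slicing the decoded text yields a plain str; the raw bytes do not follow it.
first_char = right.text[:1]
```

Two of the downsides show up even in this toy: a change of coding system leaves every stored `text` stale until it is decoded again, and any caller that slices the name must carry the raw bytes along by hand.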


