Re: fun with dired sorting

From: Andrew Innes
Subject: Re: fun with dired sorting
Date: 08 Dec 2000 11:51:59 +0000
User-agent: Gnus/5.0803 (Gnus v5.8.3) Emacs/20.7

On Fri, 24 Nov 2000 13:42:51 +0000, "Dr Francis J. Wright" <address@hidden> wrote:
>If I understand the position correctly, Robert generated some garbage
>filenames erroneously from within a program which Windows nevertheless
>accepted and Explorer was able to display (and delete), but Emacs 20.7
>complained.  The complaint arises when ls-lisp tries to sort by name, to
>be precise when it tries to upcase this:

Do you know what the actual name was?  In Robert's original mail, it
came through to me as "œ a" (i.e. "\x9c\x8f1\x61") which passes
through `upcase' okay on 20.7.  However, it may well have been mangled
by our Exchange server.

Robert, if necessary I can write a Windows program that will capture the
full Unicode file names, unless you happen to know what the characters
were.

>(error "Invalid character: 013361, 5873, 0x16f1")
>Anyway, I get this error if I just try to evaluate the string "\x16f1"
>in Emacs 20.7.  0x16f1 is indeed invalid as a multibyte character.  It
>does not satisfy the rules set out in the Elisp manual: the first byte
>value is too small.
>It seems to me that a filename containing this "character" must have
>been passed into ls-lisp from directory-files.  I think that 0x16f1 is a
>valid unicode character code (and apparently Windows thinks so too) but
>directory-files is not converting it into a valid Emacs multibyte
>character.  So, is Emacs (20 or 21) supposed to be able to handle
>arbitrary unicode filenames?  Is there anything that ls-lisp could do to
>handle this problem, by way of setting coding systems etc?

NT-Emacs doesn't ever see the actual Unicode names for files, since we
use the multibyte OS interface for directory enumeration.
`directory-files' then decodes all file names using
`file-name-coding-system' if set.

Actually, it looks like some implicit interpretation of the bytes
returned by the OS takes place first, when Emacs converts the filename
returned by readdir to a lisp string.  In particular, that conversion to
a string may decide some of the bytes should be interpreted as multibyte
characters (using the internal Emacs encoding).

I think that is wrong.  I believe the raw bytes returned by readdir
should be directly decoded using `file-name-coding-system'.
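
To illustrate at the Lisp level what I mean (the real change would be in
the C code around readdir, so this is only a sketch, and
`my-decode-file-name' is a made-up name, not an existing function): keep
the bytes as a unibyte string and decode them explicitly, instead of
letting string construction guess at multibyte characters.

```elisp
;; Hypothetical helper, for illustration only; not existing Emacs code.
(defun my-decode-file-name (raw-bytes)
  "Decode the unibyte string RAW-BYTES as a file name.
If `file-name-coding-system' is set, decode with it; otherwise
return RAW-BYTES unchanged."
  (if file-name-coding-system
      (decode-coding-string raw-bytes file-name-coding-system)
    raw-bytes))
```

For example, with `file-name-coding-system' bound to `iso-latin-1',
calling (my-decode-file-name "\351t\351") decodes the two 0xE9 bytes as
Latin-1 characters rather than treating them as part of some internal
multibyte sequence.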

I don't know for sure whether this is how an invalid multibyte character
got into a lisp string though.  As far as I can make out, make_string
will return a unibyte string if it comes across any byte sequences in
the file name that aren't valid characters in the internal Mule
encoding.

It will help to know exactly what readdir returned in this case.

>I would welcome any comments from the developers on this.

I've cc'd this to address@hidden.  This is an issue affecting 21.1.

>For the record, I am in the process of changing my version of ls-lisp to
>use compare-strings instead of string-lessp and upcase.  But I doubt
>that this change will solve the present problem although it should avoid
>problems that might arise if directory-files returns both unibyte and
>multibyte strings.  (Could that happen?)

Yes, that could happen even when the above problem is taken care of.
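
For reference, a predicate built on `compare-strings' might look roughly
like this.  It is only a sketch: the name `my-file-name-lessp' is made
up for illustration and is not what ls-lisp uses.  As far as I know,
`compare-strings' converts unibyte strings to multibyte before
comparing, which is why it copes better with a mixed list.

```elisp
;; Hypothetical predicate, for illustration only.
(defun my-file-name-lessp (a b)
  "Return non-nil if file name A sorts before B, ignoring case."
  (let ((result (compare-strings a 0 nil b 0 nil t)))
    ;; `compare-strings' returns t when the strings match; otherwise a
    ;; negative number if A is less than B, or a positive number if A
    ;; is greater.
    (and (integerp result) (< result 0))))
```

It can be passed straight to `sort', e.g.
(sort (copy-sequence names) 'my-file-name-lessp), where `names' stands
for whatever list `directory-files' returned.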

>I also plan to trap sorting
>errors, so that at least an unsorted dired listing should always be
>available.  I don't think there is much more that ls-lisp can do if
>directory-files returns invalid character codes.  (Is there?)

Well, I suppose you could try to refine the ordering function to do
numerical comparison when lexicographic ordering breaks down
(eg. because of invalid characters).  That might be quite fiddly though.
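
Something along these lines, perhaps.  This is an untested-in-ls-lisp
sketch, and both function names are invented for the example: try the
usual case-insensitive comparison, and if that signals an error, fall
back to comparing the character codes numerically.

```elisp
;; Hypothetical helpers, for illustration only.
(defun my-code-lessp (a b)
  "Return non-nil if string A precedes B by numeric character codes."
  (let ((ca (append a nil))             ; string -> list of codes
        (cb (append b nil)))
    ;; Skip the common prefix.
    (while (and ca cb (= (car ca) (car cb)))
      (setq ca (cdr ca)
            cb (cdr cb)))
    (cond ((null cb) nil)               ; B exhausted: A is not less
          ((null ca) t)                 ; A is a proper prefix of B
          (t (< (car ca) (car cb))))))

(defun my-safe-lessp (a b)
  "Compare A and B ignoring case; use code order if that fails."
  (condition-case nil
      (string-lessp (upcase a) (upcase b))
    (error (my-code-lessp a b))))
```

The fiddly part is that mixing two orderings inside one sort is not
guaranteed to give a consistent result, so trapping errors around the
whole sort and returning the unsorted list may still be the more robust
option.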

It seems to me that `directory-files' is always in danger of coming
across files that don't decode properly using `file-name-coding-system'
- at least on Unix (*) where file names are really just a sequence of
bytes.  But I would expect it to be the case that the results of
decoding are always safe to "handle".  However, I don't really
understand how Emacs copes with invalid byte sequences in multibyte
strings.

(*) On Windows, file names _are_ composed of characters, stored as
Unicode on disk.  On NT and Windows 2000, we can read them as Unicode
(and will do eventually when Emacs gets a Unicode-based character
representation internally), but otherwise the OS will convert them to a
specific codepage for us.

