guile-user
Re: guile can't find a chinese named file


From: tomas
Subject: Re: guile can't find a chinese named file
Date: Wed, 15 Feb 2017 22:15:52 +0100
User-agent: Mutt/1.5.21 (2010-09-15)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wed, Feb 15, 2017 at 10:32:57PM +0200, Eli Zaretskii wrote:
> > Date: Wed, 15 Feb 2017 21:20:56 +0100
> > From: address@hidden
> > Cc: address@hidden
> > 
> > > > Most notably, the whole path might cross several mount points, thus
> > > > the whole path can well have fragments coming from several file systems.
> > > 
> > > A possible solution would be to decode each mount point's part as it
> > > is being resolved.
> > 
> > ...which can only be based on guesswork: there's no reliable info on
> > the encoding used for that file system (if it's consistent at all).
> 
> You could maintain a database of encodings per file system, perhaps
> user-defined, or derived by some other means.  E.g., for volumes that
> physically reside on Windows or macOS the encoding is pretty much
> known in advance.

This is what I mean by "voodoo". We don't even know the encoding to be
consistent within one file system. An example would be the home dirs
of different users running under different locales (an extreme example:
they may have different 8 bit locales!).
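
To make that concrete, here is a minimal Python sketch. The file name bytes and the two locales are made up for illustration; the point is only that the same bytes decode without error under either 8-bit encoding, so nothing signals a wrong guess.

```python
# Hypothetical file name as raw bytes, the way a Linux file system stores it.
raw_name = b"caf\xe9.txt"

# Two users with different 8-bit locales read different names from it.
as_latin1 = raw_name.decode("iso-8859-1")  # Western European locale
as_koi8r = raw_name.decode("koi8-r")       # Cyrillic locale

# Both decodings succeed, yet they disagree on what the name is.
same_text = as_latin1 == as_koi8r
print(same_text)
```

Neither result is "wrong" from the file system's point of view: it only ever saw the bytes.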

[...]

> > I feel queasy doing some voodoo without the application having
> > a word on it. In the Emacs context it's a bit easier, because in
> > the "normal" case things are pretty quickly deferred to the user
> > (usually).
> 
> Not really, there are a lot of internal operations that access files
> and directories, and would wreak major havoc if they don't succeed,
> silently, in the absolute majority of uses.

That was the "a bit" part :-)

Anyway, having an encoding à la Emacs eases things a lot, since a
string can at least survive unharmed a plain round trip. The problem
of properly displaying that remains unsolved. Plus operations on that
string (concatenation, e.g.).
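
As an illustration of the same idea outside Emacs, Python's `surrogateescape` error handler behaves in a comparable way. This is a sketch under that assumption, not a description of how Emacs or Guile do it, and the file name is hypothetical. It shows the safe round trip, the display problem, and what concatenation produces.

```python
# Bytes that are not valid UTF-8 (a hypothetical Latin-1 encoded file name).
raw_name = b"caf\xe9.txt"

# Lenient decode: the undecodable byte 0xE9 becomes the lone surrogate U+DCE9.
name = raw_name.decode("utf-8", errors="surrogateescape")

# Plain round trip: the original bytes come back unharmed.
round_trip = name.encode("utf-8", errors="surrogateescape")

# Display remains unsolved: a strict encode, which is what writing to a
# UTF-8 terminal amounts to, rejects the lone surrogate.
try:
    name.encode("utf-8")
    displayable = True
except UnicodeEncodeError:
    displayable = False

# Concatenation with a properly decoded string works on the string level,
# but the resulting bytes mix UTF-8 ("\xc3\xa9") and Latin-1 ("\xe9").
mixed = ("r\u00e9sum\u00e9-" + name).encode("utf-8", errors="surrogateescape")
```

The round trip is safe only as long as the string is passed through untouched; once it is combined with other text, the result is a byte sequence in no single encoding.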

[...]

> > I guess (I don't *know*) Windows stores information about the encoding
> > at file system level (and keeps that consistent).
> 
> No.  At the file system level (for NTFS volumes at least) Windows file
> names are always UTF-16 encoded, and Windows just "knows" that.
> Windows converts that to the locale's codepage when you access files
> via an API that communicates file names encoded in that codepage.  (If
> the conversion fails, you get question marks instead of the characters
> that couldn't be converted.)

I see. That means that Windows has to use surrogates for everything
beyond the BMP, right? The heritage from the times Unicode was "just"
16 bit...
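
For reference, a small Python sketch of what that means in UTF-16. The code point U+20000 is just an example picked from outside the BMP; it takes two 16-bit code units, a high and a low surrogate.

```python
# U+20000, a CJK ideograph outside the Basic Multilingual Plane.
ch = "\U00020000"

# Encode as UTF-16 (big-endian, no BOM) and split into 16-bit code units.
utf16 = ch.encode("utf-16-be")
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]

# The same pair computed by hand from the code point.
offset = ord(ch) - 0x10000
high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset

print([hex(u) for u in units])
```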

> > Linux hasn't that, it just keeps out of it. It hasn't even a place
> > to state the encoding used.
> 
> Exactly.  Which is why forcing a single file-name encoding on
> Linux/Unix filesystems is IMO a bad idea.

Agreed, that can't be done. It'd be nice to have one encoding per file
system, but we don't even have that :-(

regards
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEARECAAYFAlikxQgACgkQBcgs9XrR2kbCpQCfcLLffP3e3JdW1gg4DVylHQeo
cjAAnRwVgtZR0qIce7IkU73vUHpLSvMG
=jl5p
-----END PGP SIGNATURE-----


