guile-user

Re: Running script from directory with UTF-8 characters


From: Eli Zaretskii
Subject: Re: Running script from directory with UTF-8 characters
Date: Wed, 23 Dec 2015 20:28:38 +0200

> From: Marko Rauhamaa <address@hidden>
> Date: Tue, 22 Dec 2015 23:39:28 +0200
> Cc: address@hidden
> 
> > No, they aren't, not as file names. E.g., you cannot meaningfully
> > downcase or upcase such "characters", you cannot count characters (as
> > opposed to bytes), you cannot calculate how much screen estate will be
> > needed to display them, with some Far Eastern encodings you cannot
> > correctly search them for some specific ASCII characters (because they
> > can be part of a multibyte sequence), etc. etc. IOW, you cannot work
> > with file names as human-readable text, which is something many
> > programs need to do.
> 
> You can, in a roundabout way. You do the low-level file I/O in Latin-1.
> Then, you reencode into UTF-8

IOW, from the application-level perspective, the file names are
encoded in UTF-8 (in this example).  The low-level reading as byte
stream (NOT Latin-1!) is out of scope as long as you consider a Guile
Scheme program that needs to manipulate the file names.

So we are in violent agreement.
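
The two hazards named in the quoted text (counting characters as opposed to bytes, and searching a Far Eastern encoding for an ASCII character) can be shown in a few lines. This is only a sketch in Python, not Guile, and the file name "表.txt" is an invented example; Shift-JIS is used because the second byte of "表" happens to be 0x5C, the ASCII backslash.

```python
# Illustration only: "表.txt" is a made-up file name.
name = "表.txt"
raw = name.encode("shift_jis")      # the bytes a Shift-JIS system would store

# Byte view: 0x5C (ASCII backslash) appears as the second byte of "表".
false_match = b"\\" in raw          # True, although the name has no backslash
byte_count = len(raw)               # counts bytes, not characters

# Character view: decode first, then treat the name as text.
text = raw.decode("shift_jis")
real_match = "\\" in text           # False
char_count = len(text)

print(raw, false_match, byte_count)
print(text, real_match, char_count)
```

A program that splits such a name on backslash or slash at the byte level would cut a character in half, which is why the decoding step has to happen before any text-level manipulation.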

> Otherwise, you may not even be able to remove a file with a non-UTF-8
> name.

What do you mean by a non-UTF-8 file name?  A file name that includes
byte sequences that are not valid UTF-8?  For that, Guile needs to
acquire a capability of representing raw bytes, similar to what Emacs
does.  This capability is an add-on; it should not replace the ability
to interpret file names as character strings encoded in some
recognizable encoding, either forced by the application or deduced
from some meta-data, user preferences, locale's defaults, etc.
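
For a concrete picture of raw bytes living inside an otherwise ordinary character string, here is a minimal Python sketch. It uses Python's `surrogateescape` error handler, which is Python's own mechanism and is shown only as an analogy; it is not how Emacs or Guile represent such bytes internally. The file name is invented for the example.

```python
# A made-up name: ASCII text with one stray Latin-1 byte (0xE9), so it is
# not valid UTF-8.
raw = b"caf\xe9.txt"

# Strict decoding rejects the name outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape keeps each undecodable byte 0xNN as the code point U+DCNN,
# so the rest of the name is still ordinary text.
name = raw.decode("utf-8", errors="surrogateescape")
is_text_file = name.endswith(".txt")

# Encoding with the same handler restores the original bytes exactly,
# so the file can still be opened, renamed, or removed.
round_trip = name.encode("utf-8", errors="surrogateescape")

print(strict_ok, repr(name), is_text_file, round_trip == raw)
```

The point of the sketch is that the string stays usable as text (the extension test works) while no byte of the original name is lost.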

> > They are strings because _people_ name files and give them meaningful
> > names and extensions.
> 
> The Linux kernel just doesn't care, and shouldn't.

Guile is not an OS kernel.  Guile is an environment for writing
applications.  On the application level, you _should_ care, or else
you won't be able to manipulate file names in meaningful ways.

> It's acceptable for Guile to create a higher-level illusion, but it
> shouldn't sacrifice completeness while doing so. You should be able to
> manipulate every conceivable filename from Guile code.

We are again in violent agreement about the goal.  But the means
towards that goal is NOT to abandon interpretation of file names as
strings of characters; the means is to be able to represent raw bytes
on top of a meaningful character representation.

> > If Guile cannot easily work with file names encoded in a codeset other
> > than the current locale's one, then Guile should be extended to allow
> > a program to tell it in which encoding to interpret a particular name.
> 
> A program usually has no clue how a pathname has been encoded.

The programmer does, or should.  The user does, sometimes (e.g.,
the capability presented in many browsers and editors to force text
encoding).  Some encodings can be deduced by analyzing the byte stream.
And there are locale defaults if nothing else works.  If none of that
is done, the program cannot manipulate these file names in any
meaningful way.  The kernel can duck that problem because it's the
kernel: it doesn't interact with users, and its filesystem layer is
not required to understand the meaning of, say, the file-name
extensions.  We have no such luxury on the application level.  So we
cannot simply copy the kernel techniques into Guile; it won't work.
It also won't work to expect applications to do that, as it is too
complex and subtle (and tedious) for an application to get right every
time.
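
The order of attempts described above (an encoding forced by the programmer or user, then deduction from the bytes, then the locale default) might look roughly like this if the infrastructure did it on the application's behalf. This is a hedged Python sketch: `decode_file_name` and its parameters are hypothetical names invented here, the "deduction" step is reduced to a UTF-8 validity check, and the final fallback to escaped raw bytes is an assumption borrowed from Python, not a description of Emacs or Guile.

```python
import locale


def decode_file_name(raw, forced=None, locale_encoding=None):
    """Hypothetical helper: return (text, how) for raw file-name bytes.

    Attempts, in order: an encoding forced by the caller, a UTF-8
    validity check, the locale's default codeset.  If all fail, the
    undecodable bytes are kept as escaped raw bytes, not dropped.
    """
    if locale_encoding is None:
        locale_encoding = locale.getpreferredencoding(False)

    if forced is not None:
        # The caller claims to know better; a wrong claim raises an error.
        return raw.decode(forced), forced

    for candidate in ("utf-8", locale_encoding):
        try:
            return raw.decode(candidate), candidate
        except UnicodeDecodeError:
            continue

    # Last resort: keep the bytes recoverable inside the string.
    return raw.decode("utf-8", errors="surrogateescape"), "raw"


print(decode_file_name(b"r\xe9sum\xe9.txt", locale_encoding="latin-1"))
print(decode_file_name(b"\xff.txt", locale_encoding="ascii"))
```

An application calling such a helper always gets a string it can manipulate as text, and learns from the second value how much to trust it.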

Once again, I suggest studying how Emacs solves this very problem.
The solution used there is satisfactory, and fits all of your
requirements above.  It's not without some subtleties in rare cases,
but the problem is complex and there's no way around that complexity.

> > (I think Guile already supports that, but maybe I misremember.) But
> > lobbying for treating file names as byte streams, let alone Latin-1
> > characters, is a large step backwards, to 1990s when we didn't know
> > better. We've come a long way since then and learned a lot on the way.
> 
> At least our backwardness allowed Linux to jump directly to UTF-8 and
> not be afflicted by UCS-2 like Windows and Java.

Once again, Guile is not an OS kernel.  It cannot simply adopt kernel
solutions.

> I'm not saying bytevectors are elegant, but we should not replace them
> with wishful thinking.

No need for wishful thinking.  Study what Emacs does and do something
similar.

> I'm a bit sorry that Guile repeated Python 3's mistake and brought
> (Unicode) strings to the center.

Everybody makes that mistake.  Emacs did it as well, but that was years
ago, and since then the mistakes were identified and corrected.  The
basis must be Unicode, the trick is to build additions on top of that
which allow raw bytes and Unicode text strings to coexist, more or
less transparently to the application level.  ("More or less" because
handling raw bytes as part of strings requires some care; fortunately,
such use cases are rare.)
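
To illustrate the kind of care meant here, the sketch below again uses Python's `surrogateescape` convention as a stand-in (an assumption made for illustration, not the Emacs representation). A string carrying escaped raw bytes cannot be encoded strictly, so code that prints or stores it needs to check for them first. The helper names `has_raw_bytes` and `for_display` are hypothetical.

```python
def has_raw_bytes(name):
    """True if the string carries bytes escaped as U+DC80..U+DCFF."""
    return any("\udc80" <= ch <= "\udcff" for ch in name)


def for_display(name):
    """Return printable text, showing escaped raw bytes as \\xNN."""
    raw = name.encode("utf-8", errors="surrogateescape")
    return raw.decode("utf-8", errors="backslashreplace")


# A made-up name containing one byte that is not valid UTF-8.
name = b"caf\xe9.txt".decode("utf-8", errors="surrogateescape")

# Strict encoding (what writing to a UTF-8 terminal or file does) fails.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(has_raw_bytes(name), strict_ok, for_display(name))
```

Names that are fully decodable never take the special path, which matches the observation that such cases are rare.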

> Strings are a highly special-purpose data structure; I really never
> had a real need for them in my decades of programming. Also, I
> suspect strings are much too simplistic for any serious typesetting
> or GUI work. It seems the sweet spot of strings are text/plain mail
> messages and Usenet postings.

My experience indicates otherwise (in particular, processing and
displaying plain text strings is what the Unicode Standard is all
about), but I think that issue is tangential to this discussion.

> Guile 1.x's and Python 2.x's bytevector/string confusion was actually
> a very happy medium. Neither the OS nor the programming language placed
> any interpretation to the byte sequences. That was left to the
> application.

And that is wrong.  Applications cannot handle that, they need some
heavy help from the infrastructure.  Applications actually love to
have normal human-readable text strings, after the infrastructure
has decoded the byte stream into characters for them.  Most file names
are encoded in the locale's codeset (otherwise file browsers and other
interactive programs that accept and display file names won't be able
to handle them), so at least this popular and very important use case
should "just work" without requiring each application to reinvent the
wheel of decoding byte sequences into characters, dealing with EILSEQ,
etc.  An environment that doesn't provide at least that won't fly.


