guile-user

Re: guile can't find a chinese named file


From: David Kastrup
Subject: Re: guile can't find a chinese named file
Date: Wed, 15 Feb 2017 00:58:41 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/26.0.50 (gnu/linux)

Mike Gran <address@hidden> writes:

> But, for what it is worth, the Latin-1/UTF-32 design decision came
> from a couple of conflicting requirements.  The switch happened in the
> 1.9.x series.
>
> There were several examples of legacy C code using Guile as an
> extension language that accessed the bytes of a string directly, using
> SCM_STRING_CHARS or scm_i_string_chars.  To keep from breaking legacy
> code, we needed to retain this (then already deprecated) capability
> for C programs to access 8-bit-locale string internals directly.

But if you don't know whether the strings are Latin-1 or UTF-32, that's
sort of academic.

> Also, in R6RS, there was the requirement that functions like
> "string-ref" act in "constant time". This suggested either a
> codepoint-array representation for strings, or a UTF-8 array
> representation with some indexing to allow for constant-time access.

The problem is not that Guile has an idiosyncratic internal string
representation.  As you note, other programs have that.

The problem is that Guile does not have an API for passing/processing
strings in that representation.  That means that passing strings in and
out of Guile is expensive.  And when working with string ports, even
keeping data purely inside of Guile requires conversion processes, and
string port positions are calculated in UTF-8-encoded byte offsets,
whereas strings are indexed in characters.
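
To make the offset mismatch concrete, here is a minimal sketch.  It is
written in Python rather than Guile, purely as an illustration: the
sample string and the helper `utf8_offset` are made up for this example
and are not part of any Guile API.  It only shows that a character
index and a UTF-8 byte offset into the same text diverge as soon as a
non-ASCII character appears.

```python
# Illustration only: Python stands in for Guile here.
# The sample text mixes ASCII, one Latin-1 character and two CJK characters.
text = "na\u00efve \u6587\u4ef6.txt"


def utf8_offset(s: str, char_index: int) -> int:
    """Return the UTF-8 byte offset at which character `char_index` starts."""
    return len(s[:char_index].encode("utf-8"))


# Position of the first CJK character, counted in characters.
char_index = text.index("\u6587")

# The same position, counted in UTF-8 bytes (what a byte-based port reports).
byte_offset = utf8_offset(text, char_index)

print(char_index, byte_offset)
print(len(text), len(text.encode("utf-8")))
```

Any code that takes a position from one side and uses it on the other
has to perform this kind of conversion, which is linear in the length
of the prefix unless an index is maintained.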

The problem is that Guile is _constantly_ required to recode strings it
is processing.  And to add insult to injury, it cannot do this without
data loss when its string encoding assumptions are wrong.

PostScript files are usually encoded in Latin-1 with occasional UTF-16
passages.  Reading and writing and copying such files byte-correctly
while trying to actually parse their contents is not feasible with
Guile.
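
The following sketch illustrates the kind of data loss meant here.
Again it is Python, not Guile, and the byte sequence is an invented
stand-in for a file that is mostly Latin-1 with an embedded UTF-16
passage; it is not taken from a real PostScript file.  It assumes the
reader decodes with one fixed encoding and then writes the text back
out.

```python
# Illustration only: Python codecs stand in for a runtime's port encodings.
latin1_part = "caf\u00e9 (".encode("latin-1")          # Latin-1 text
utf16_part = "\ufeff\u6587\u4ef6".encode("utf-16-be")  # BOM plus two CJK chars
raw = latin1_part + utf16_part + b")"

# Wrong assumption: treat everything as UTF-8 and replace undecodable bytes.
as_utf8 = raw.decode("utf-8", errors="replace")
lossy = as_utf8.encode("utf-8")

# Byte-transparent assumption: Latin-1 maps every byte to one code point.
as_latin1 = raw.decode("latin-1")
exact = as_latin1.encode("latin-1")

# Parsing the embedded passage still needs its own, separate decode step.
start = len(latin1_part)
passage = raw[start:start + len(utf16_part)].decode("utf-16")

print(lossy == raw)   # round trip through the wrong encoding
print(exact == raw)   # round trip through Latin-1
```

With the wrong assumption the original bytes cannot be recovered from
the decoded string.  A byte-transparent decode keeps the bytes intact,
but then the embedded passage is not readable as text until it is
decoded separately.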

> I still maintain that this design decision was a good one based on the
> simplicity of implementation.

As I said: the problem is not the chosen internal representation.  The
problem is that there is no API to access it, and it does not even map
to string ports.

> The great difficulty with the UTF-8 Guile prototype was the need to
> interrogate every string access or index to decide if it was a
> codepoint index or a byte index. I abandoned that effort because it
> was doing my head in.

Emacs tried this in version 20.2, and got rid of it in version 20.4 or
so, obliterating byte-based indexing completely.  Anything else would
not have worked in the long run.  That was when, 16 years ago?

> Had we chosen that route, the result would likely have been a long,
> long process of squashing difficult bugs related to byte vs codepoint
> index confusion.
>
> But, for what it is worth, we've had a few years of the internal
> representation of strings being private, so any modification of the
> internal representation of strings would be easier in 2017 than it
> was in 2007, when the guts of strings were exposed to the C API.

> (N.B. dak at gnu is on my block list, so I won't see any such
> response.)

Not just on yours.  LilyPond is probably the largest application using
Guile as its extension language, with pretty much the worst impacts of
Guile-2 design decisions.  So obviously nobody wants to hear from its
most active developer.  This is even more important now that LilyPond is
getting removed from Debian and other distributions because it is still
hopeless to get it to run under Guile-2 (the experimental support has
encoding and stability problems and runs about a factor of 5 slower than
Guile-1).  The less one hears of that, the better for morale.

-- 
David Kastrup



