[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

String internals sketch

From: David Kastrup
Subject: String internals sketch
Date: Fri, 10 Mar 2017 16:31:38 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/26.0.50 (gnu/linux)


I've been mulling a bit over enabling UTF-8 as an internal string
representation.  There are several interesting points to consider:

a) Guile already uses two different internal representations: basically
UCS-8 and UCS-32.  Adding more internal representations could be done
using a number of tables indexed by the internal representation type,
making string representations sort of a "plugin".

b) Scheme, at least older dialects, have several O(1) guarantees.  If
the chosen internal representations can be configured with some fluid
(like it is done with port encodings), possibly as a list in priority
order, one can retain an O(1) guarantee as the default behavior while
allowing different mechanisms for different requirements.

c) Indexing is the most important thing one wants to be fast.  For an
utf-8 internal representation, a lot is achieved if one caches both last
index and last byte offset, preferably also string length as index and
byte length.  For each indexing operation, one can then first check the
last string-middle cache, and then scan forwards or backwards to do the
indexing (if string start or string end are closer, one can scan from
there), assuming reasonably sanitized behavior.  That will make
sequential processing close to O(1) behavior.

d) a bad complication is write access to strings, for example with

 -- Scheme Procedure: string-map! proc s [start [end]]
 -- C Function: scm_string_map_x (proc, s, start, end)
     PROC is a char->char procedure, it is mapped over S.  The order in
     which the procedure is applied to the string elements is not
     specified.  The string S is modified in-place, the return value is
     not specified.

The current string character can gain a longer or a shorter byte length
in this process.  This requires being able to deal with
insertion/deletion of bytes at the current position.  One way to do that
is to have a _gap_ for the purpose of string-map! and similar functions
and copy material from the end of the gap to its start while iterating.

Now this starts looking so much like the memory management Emacs buffers
that it isn't funny.  It also is _the_ natural string representation to
use for things like with-output-to-string-buffer and

Basically, one valid string representation would coincide with an utf-8
based string buffer.

This would seem like it would also give a boost to some Guile operations
and would offer possibilities for making "string representation plugins"
for special purposes.  This should offer a lot of leeway to gradually
suck up the best parts of things like Emacs' string representation and
code conversion facilities into a Guile kernel without being tied down
by the current Guile facilities and without having stuff like Emacs
strings black boxes mostly inaccessible to Guile programming.

It should also allow to bounce stuff like sanitized utf-8 through Guile
without incurring constant conversion costs.

So it should provide a _large_ opportunity for the sakes of applications
with profiles akin to Emacs or LilyPond.

David Kastrup

reply via email to

[Prev in Thread] Current Thread [Next in Thread]