[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: String internals sketch

From: David Kastrup
Subject: Re: String internals sketch
Date: Fri, 10 Mar 2017 17:51:43 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/26.0.50 (gnu/linux)

Andy Wingo <address@hidden> writes:

> Hi :)
> On Fri 10 Mar 2017 16:31, David Kastrup <address@hidden> writes:
>> a) Guile already uses two different internal representations: basically
>> UCS-8 and UCS-32.  Adding more internal representations could be done
>> using a number of tables indexed by the internal representation type,
>> making string representations sort of a "plugin".
> I think we probably want to avoid this if we can.  We gain a number of
> efficiencies if we can be concrete.

Well, the crucial question is where to hook the specialization for the
best compromise between performance, versatilty and effort.

> Of course there are counterexamples in which specialization can help,
> like the 20-some string kinds in V8, for example: cons strings, latin1
> strings, utf-16 strings, external strings, slices, and the product of
> all of those; but I am hesitant to take on this cost.  If we switched
> to UTF-8 strings, I would like to use it as our only string
> representation.
> Sure would be nice to have cons strings though!  (That would give O(1)
> string-append.)

String buffers should have O(1) string-append.

>> b) Scheme, at least older dialects, have several O(1) guarantees.
> R7RS seems to have relaxed this FWIW.  O(1) is great of course but there
> are reasonable cases to be made for O(log N) being a good tradeoff if
> you get other benefits.

A guaranteed O(log N) is not exactly trivial to come by for utf-8
encoded strings either.

>> c) Indexing is the most important thing one wants to be fast.  For an
>> utf-8 internal representation, a lot is achieved if one caches both
>> last index and last byte offset, preferably also string length as
>> index and byte length.
> Consider threads though :/ Caches get a bit complicated there.

If there is actual access of several threads on the same string, a
single-element cache is going to be mostly useless.  Don't know about
typical implementation costs and implementations of per-thread
variables, so I can't make a useful proposal here.

>> d) a bad complication is write access to strings, for example with
>>  -- Scheme Procedure: string-map! proc s [start [end]]
>>  -- C Function: scm_string_map_x (proc, s, start, end)
> TBH I wouldn't worry too much about this function in particular;

It's certainly not an important function regarding its use frequency,
but it's a good worst case study and is nicely covered by the
gapped-string model.

> That said, this point:
>> The current string character can gain a longer or a shorter byte
>> length in this process.
> Is especially gnarly in the threaded case; string updates are no longer
> atomic.  One thread mutating a string might actually corrupt another
> thread.  Right now on x86, updates are entirely atomic; on other
> processors that need barriers, the worst that could happen is that
> another thread could fail to read an update.  We'd have to re-add the
> string write mutex, which would be a bit sad :)

A gapped string does not have a lot of contention, but of course if two
threads wanted to write in different places, there would need to be a
way to either lock the gap until the end of the current operation or to
support multiple gaps in progress.

>> So it should provide a _large_ opportunity for the sakes of
>> applications with profiles akin to Emacs or LilyPond.
> I'm sympathetic :) Lots of details to get right (or wrong!) though.

Sure.  Not all implementations need to get along without locks, by the
way.  I am not a fan of providing choice between several bad options
instead of getting one version right, but "right" is so large a field
here, particularly when considering embedding Guile into other systems,
that it might be a worthwhile goal to at least have the "right for any
purpose" goal met at the framework layer, and then try improving the
implementations from there.

> WDYT about the threading aspects?

Good for headaches.  But it might make sense diversifying the headaches
into the several implementations.

David Kastrup

reply via email to

[Prev in Thread] Current Thread [Next in Thread]