[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: String internals sketch
From: |
David Kastrup |
Subject: |
Re: String internals sketch |
Date: |
Fri, 10 Mar 2017 17:51:43 +0100 |
User-agent: |
Gnus/5.13 (Gnus v5.13) Emacs/26.0.50 (gnu/linux) |
Andy Wingo <address@hidden> writes:
> Hi :)
>
> On Fri 10 Mar 2017 16:31, David Kastrup <address@hidden> writes:
>
>> a) Guile already uses two different internal representations: basically
>> UCS-8 and UCS-32. Adding more internal representations could be done
>> using a number of tables indexed by the internal representation type,
>> making string representations sort of a "plugin".
>
> I think we probably want to avoid this if we can. We gain a number of
> efficiencies if we can be concrete.
Well, the crucial question is where to hook the specialization for the
best compromise between performance, versatilty and effort.
> Of course there are counterexamples in which specialization can help,
> like the 20-some string kinds in V8, for example: cons strings, latin1
> strings, utf-16 strings, external strings, slices, and the product of
> all of those; but I am hesitant to take on this cost. If we switched
> to UTF-8 strings, I would like to use it as our only string
> representation.
>
> Sure would be nice to have cons strings though! (That would give O(1)
> string-append.)
String buffers should have O(1) string-append.
>> b) Scheme, at least older dialects, have several O(1) guarantees.
>
> R7RS seems to have relaxed this FWIW. O(1) is great of course but there
> are reasonable cases to be made for O(log N) being a good tradeoff if
> you get other benefits.
A guaranteed O(log N) is not exactly trivial to come by for utf-8
encoded strings either.
>> c) Indexing is the most important thing one wants to be fast. For an
>> utf-8 internal representation, a lot is achieved if one caches both
>> last index and last byte offset, preferably also string length as
>> index and byte length.
>
> Consider threads though :/ Caches get a bit complicated there.
If there is actual access of several threads on the same string, a
single-element cache is going to be mostly useless. Don't know about
typical implementation costs and implementations of per-thread
variables, so I can't make a useful proposal here.
>> d) a bad complication is write access to strings, for example with
>>
>> -- Scheme Procedure: string-map! proc s [start [end]]
>> -- C Function: scm_string_map_x (proc, s, start, end)
>
> TBH I wouldn't worry too much about this function in particular;
It's certainly not an important function regarding its use frequency,
but it's a good worst case study and is nicely covered by the
gapped-string model.
> That said, this point:
>
>> The current string character can gain a longer or a shorter byte
>> length in this process.
>
> Is especially gnarly in the threaded case; string updates are no longer
> atomic. One thread mutating a string might actually corrupt another
> thread. Right now on x86, updates are entirely atomic; on other
> processors that need barriers, the worst that could happen is that
> another thread could fail to read an update. We'd have to re-add the
> string write mutex, which would be a bit sad :)
A gapped string does not have a lot of contention, but of course if two
threads wanted to write in different places, there would need to be a
way to either lock the gap until the end of the current operation or to
support multiple gaps in progress.
>> So it should provide a _large_ opportunity for the sakes of
>> applications with profiles akin to Emacs or LilyPond.
>
> I'm sympathetic :) Lots of details to get right (or wrong!) though.
Sure. Not all implementations need to get along without locks, by the
way. I am not a fan of providing choice between several bad options
instead of getting one version right, but "right" is so large a field
here, particularly when considering embedding Guile into other systems,
that it might be a worthwhile goal to at least have the "right for any
purpose" goal met at the framework layer, and then try improving the
implementations from there.
> WDYT about the threading aspects?
Good for headaches. But it might make sense diversifying the headaches
into the several implementations.
--
David Kastrup