Re: Unicode and Guile
From: Marius Vollmer
Subject: Re: Unicode and Guile
Date: Wed, 12 Nov 2003 01:06:39 +0100
User-agent: Gnus/5.1002 (Gnus v5.10.2) Emacs/21.3 (gnu/linux)
Please allow me to randomly dump my thoughts on Guile and Unicode:
- The principal tension that I see is between having a
memory-efficient representation (UTF-8) and one that is simple and
concept-compatible with the old way (fixed-width, maybe UTF-32).
- But is there a fixed-width Unicode representation? I.e., is UTF-32
just like ASCII only with more bits or is there more to it? Are
there combining characters in UTF-32? If there are, then there is
no reason to go looking for a fixed-width, old-style text
representation.
- If we go with a variable width encoding, we can just as well use
UTF-8 and replace strings/chars with something new, like Tom's
texts/graphemes.
- What kind of data type are strings anyway? Vectors or lists?
Traditionally, they have been mutable vectors, but variable-width
encoding of 'characters' might force us to rethink this, in general.
People expect constant time accesses for vector-like things, but we
will probably not want to guarantee them for a variable-width
encoding (with integers as indices).
- So the text/grapheme API should maybe be more abstract, and not be
using integers to refer to graphemes contained in texts but some
opaque 'iterator', 'subtext' or 'grapheme range' thing.
- Shared subtexts or grapheme ranges are easy to do for read-only
texts, but harder for mutable text. So texts should maybe be
immutable by default. Mutable texts and pointers into them might use
a more expensive data structure, like a gap buffer.
- For Guile specifically, the problematic thing is the C API. Right
now, strings are pretty much fixed to be vectors of unsigned bytes.
We can't do much about this without breaking code. So from that
point of view, a new API for Unicode stuff looks like a good thing
as well, provided we can convince ourselves that people are willing to
move over to that new API.
- The representation of texts would be determined by what is most
natural for existing C code. E.g., I think that Gtk+ uses UTF-8 and
if we find that most libraries that we want to access from Guile
use UTF-8 as well, we should make our text representation UTF-8.
- Old code can be supported by allowing string-*, char-*, etc. to work
on UTF-8 encoded texts that use only ASCII code points. That will
cause problems for the 8-bit users (like latin-1, etc.), tho. C
code must avoid storing non-ASCII characters into such strings, and
I'm not sure right now whether we can keep it from doing that in a
compatible way.
- ... :)
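
On the UTF-32 question above: as far as I can tell, UTF-32 is
fixed-width per code point, but combining characters are still there,
so one user-visible character can take more than one 32-bit unit. A
minimal sketch in Python (not Guile code; the helper name utf32_units
is made up for illustration):

```python
import unicodedata

# "e with acute accent" written two ways: one precomposed code point,
# or a plain "e" followed by U+0301 COMBINING ACUTE ACCENT.
precomposed = "\u00e9"
decomposed = "e\u0301"


def utf32_units(s):
    """Count the 32-bit code units of s when encoded as UTF-32 (no BOM)."""
    return len(s.encode("utf-32-le")) // 4


# Same visible character, different number of fixed-width units.
print(utf32_units(precomposed))   # 1
print(utf32_units(decomposed))    # 2

# A nonzero combining class marks U+0301 as a combining character.
print(unicodedata.combining("\u0301") != 0)   # True

# NFC normalization composes this particular pair, but not every
# base + combining sequence has a precomposed form.
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
```

So indexing by 32-bit unit gives constant-time access to code points,
not to what a user would call a character.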
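
To make the index-versus-iterator point concrete, here is a rough
sketch, again in Python rather than Guile's C API. The names Text and
Cursor are hypothetical, the text is immutable, and the cursor steps
over code points only; a real grapheme API would also have to group
combining sequences. It assumes the bytes are well-formed UTF-8.

```python
class Text:
    """Immutable text stored as UTF-8 bytes (hypothetical type)."""

    def __init__(self, s):
        self._bytes = s.encode("utf-8")

    def start(self):
        """Return an opaque cursor at the first code point."""
        return Cursor(self, 0)

    def code_point_at_index(self, n):
        """Integer indexing: has to walk from the start, so it is O(n)."""
        cur = self.start()
        for _ in range(n):
            cur = cur.next()
        return cur.get()


class Cursor:
    """Opaque position in a Text; callers never see the byte offset."""

    def __init__(self, text, offset):
        self._text = text
        self._offset = offset

    def at_end(self):
        return self._offset >= len(self._text._bytes)

    def _width(self):
        # The lead byte tells how many bytes this code point occupies.
        if self.at_end():
            raise IndexError("cursor is at end of text")
        lead = self._text._bytes[self._offset]
        if lead < 0x80:
            return 1
        if lead >> 5 == 0b110:
            return 2
        if lead >> 4 == 0b1110:
            return 3
        return 4

    def get(self):
        """Return the code point under the cursor as a one-element str."""
        w = self._width()
        return self._text._bytes[self._offset:self._offset + w].decode("utf-8")

    def next(self):
        """Return a new cursor one code point further on, in O(1)."""
        return Cursor(self._text, self._offset + self._width())


def code_points(text):
    """Collect all code points by stepping a cursor through the text."""
    out = []
    cur = text.start()
    while not cur.at_end():
        out.append(cur.get())
        cur = cur.next()
    return out
```

Stepping a cursor costs constant time per code point, while
code_point_at_index has to rescan from the start on every call, which
is the cost that integer indices would hide.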
--
GPG: D5D4E405 - 2F9B BCCC 8527 692A 04E3 331E FAF8 226A D5D4 E405