Re: Unicode and Guile
From: Tom Lord
Subject: Re: Unicode and Guile
Date: Tue, 11 Nov 2003 11:02:55 -0800 (PST)
> From: Andy Wingo <address@hidden>
[...long thing...]
Thanks for the pointer to the Python type (on which I won't comment
:-). Thanks for the excuse to think about this more.
At the end of this proposal, I've addressed your "use case".
-t
Towards Standard Scheme Unicode Support
* The Problems
There are two major obstacles to providing nice,
non-culturally-biased Unicode support in standard Scheme. First,
the required standard character and string procedures are
fundamentally inconsistent with the structure of unicode. Second,
attempts to ignore that fact and "force fit" unicode into them
anyway inevitably result in a set of text-manipulation primitives
that are too low level -- that require even very simple text
manipulation programs to be far more "aware" of the details of
unicode encodings and structure than they ought to be.
** CHAR? Makes No Sense In Unicode
Consider the unicode character U+00DF "LATIN SMALL LETTER SHARP S"
(aka Eszett).
Clearly it should behave this way:
(char-alphabetic? eszett) => #t
(char-lower-case? eszett) => #t
and it is required that:
(char-ci=? eszett (char-upcase eszett)) => #t
(char-upper-case? (char-upcase eszett)) => #t
but now what exactly does:
(char-upcase eszett)
return? The upper case mapping of eszett is a two character
sequence, "SS". It's not even a Unicode base character plus
combining characters -- it's two base characters, a string.
Eszett is not an isolated anomaly (though, admittedly, it is not the
common case).  Here is a pointer to the data file of similarly
problematic case mappings:
http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt
So, something has to give, somewhere :-)
[Case mappings are a particularly clear example but I suspect
that there are other "character manipulation" operators that
make sense in Unicode but, similarly, don't map onto a
standard CHAR? type.]
** Other Approaches are Too Low Level
Consider the example of attempting to write a procedure,
in portable scheme, which performs "studly capitalization".
It should accept a string like:
a studly capitalizer
and return a string like:
a StUDly CaPItalIZer
In the simple world of the scheme CHAR and STRING types, such a
procedure is quite simple to write _and_get_completely_correct_.
It would make a good exercise for a new programming student.
Let's assume that the student solves the problem in a reasonable
way: by iterating over the string and, at random positions,
replacing a character with its upper case equivalent. Simple
enough.
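For concreteness, here is a minimal sketch of that solution in
standard Scheme (RANDOM-BOOLEAN? is a hypothetical coin-flip
helper, not part of any standard):

    (define (studly-capitalize! s)
      (do ((i 0 (+ i 1)))
          ((= i (string-length s)) s)          ; return S when done
        (if (random-boolean?)                  ; flip a coin per position
            (string-set! s i (char-upcase (string-ref s i))))))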
Unfortunately, there does not (and cannot) exist a mapping of
Unicode onto the standard character and string types that would not
break our student's program.  His program can still _often_ give
a correct result, but to produce a completely correct program,
he will have to take a far different and, as things stand, more
complicated approach.
** One Approach Comes Close
Ray Dillinger has recently proposed on comp.lang.scheme a
treatment of Unicode in which a CHAR? value may be:
~ a unicode base character
~ a unicode base character plus a sequence of 1 or
more unicode combining characters
That goes a very long way towards solving the problem. For example,
if I had asked our student to write an anagram generator instead of
a studly capitalizer, Ray's solution would preserve the correctness
of the student's program.
Unfortunately, Ray's approach still has problems.  It cannot handle
case mappings correctly, as noted above.  In Ray's system, there are
an infinite number of non-EQV? CHAR? values and therefore
CHAR->INTEGER may return a bignum (in the Indic, Tibetan, and Hangul
Jamo alphabets, it would apparently return a bignum frequently).
With an infinite set of characters, libraries (such as SRFI-14
"Character Sets"), which are designed with a finite character set in
mind, cannot be ported.  The issue of multi-character case mappings
aside, it is difficult to see how to preserve the required ordering
isomorphism between characters and their integer representations.
Nevertheless, Ray's idea that a "conceptual character" is part of
an infinite set of values and a "conceptual string" a sequence of
those is the basis of this proposal.
* The Proposal
The proposal has two parts. Part 1 introduces a new type, TEXT?,
which is a string-like type that is compatible with Unicode, and
a subtype of TEXT?, GRAPHEME?, to represent "conceptual
characters".
Part 2 discusses what can become of the STRING? and CHAR? types in
this context.
** The TEXT? and GRAPHEME? Types
[This is a sketch of a specification -- not yet even a first
draft of a specification.]
~ (text? obj) => <boolean>
True if OBJ is a text object, false otherwise.
A text object represents a string of printed graphemes.
~ (utf8->text string) => <text>
~ (utf16->text string) => <text>
~ (utf16be->text string) => <text>
~ (utf16le->text string) => <text>
[...]
~ (text->utf8 text) => <string>
[...]
The usual conversions from strings (presumed to be
sequences of octets) to text.
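Under this sketch, a UTF-8 encoded string should round-trip
through the text type unchanged:

    (text->utf8 (utf8->text "hello"))  =>  "hello"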
A subset of text objects are distinguished as graphemes:
~ (grapheme? obj) => <boolean>
True if OBJ is a text object which is a grapheme,
false otherwise.
The set of graphemes is defined to be isomorphic to the set of
all unicode base characters and well-formed unicode combining
character sequences (and is thus an infinite set).
~ (grapheme=? g1 g2 [locale]) => <boolean>
~ (grapheme<? g1 g2 [locale])
~ (grapheme>? g1 g2 [locale])
[...]
~ (grapheme-ci=? g1 g2 [locale])
~ (grapheme-ci<? g1 g2 [locale])
~ (grapheme-ci>? g1 g2 [locale])
The usual orderings.
Here and elsewhere I've left the optional parameter LOCALE there
as a kind of place-holder. There are many possible collation
orders for text and programs need a way to distinguish which
they mean (as well as have a reasonable default).
It is important to note that, in general, EQV? and EQUAL? do _not_
test for grapheme equality. GRAPHEME=? must be used instead.
Also note that this proposal does not include GRAPHEME->INTEGER or
INTEGER->GRAPHEME. I have not included, but probably should
include, a hash value procedure which hashes GRAPHEME=? values
equally.
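Such a procedure might be specified along these lines (the
optional BOUND parameter is just one plausible interface, not
part of the proposal above):

    ~ (grapheme-hash g [bound]) => <integer>

        Return a non-negative integer (less than BOUND, if given),
        returning equal values for any two graphemes which are
        GRAPHEME=?.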
~ (grapheme-upcase g) => <text>
~ (grapheme-downcase g) => <text>
~ (grapheme-titlecase g) => <text>
Note that these return texts, not necessarily graphemes.
For example, GRAPHEME-UPCASE of eszett would return a
text representation of "SS".
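In terms of the conversions above, that behavior would look like
this (where ESZETT is the grapheme for U+00DF):

    (text-length (grapheme-upcase eszett))   =>  2
    (text->utf8 (grapheme-upcase eszett))    =>  "SS"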
All texts, including graphemes, behave like (conceptual) strings:
~ (text-length text) => <integer>
Return the number of graphemes in TEXT.
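Note that this counts graphemes, not octets.  For example, a
text holding "naïve" has five graphemes even though its UTF-8
encoding occupies six octets:

    (text-length (utf8->text "naïve"))                  =>  5
    (string-length (text->utf8 (utf8->text "naïve")))   =>  6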
~ (subtext text start end) => <text>
Return a subtext of TEXT containing the graphemes beginning at
index START (inclusive) and ending at END (exclusive).
~ (text=? t1 t2 [locale]) => <boolean>
~ (text<? t1 t2 [locale]) => <boolean>
[...]
The usual ordering predicates.
~ (text-append text ...) => <text>
~ (list->text list-of-graphemes) => <text>
Various constructors for text ....
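For instance:

    (text->utf8 (text-append (utf8->text "Gu") (utf8->text "ile")))
       =>  "Guile"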
However, instead of TEXT-SET!, we have:
~ (text-replace! text start end replacement)
Replace the graphemes at [START, END) in TEXT with
the graphemes in text object REPLACEMENT. Passing
#t for END is equivalent to passing an index 1
position beyond START.
TEXT must be a mutable text object (see below).
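To illustrate, given a mutable text object TXT holding the
graphemes of "scheme":

    (text-replace! txt 0 #t (utf8->text "S"))  ; TXT now holds "Scheme"
    (text-replace! txt 0 1 (utf8->text ""))    ; TXT now holds "cheme"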
Implementations are permitted to make _some_ graphemes immutable.
In particular:
~ (text-ref text index) => <grapheme>
Return the grapheme at position INDEX in TEXT.
The grapheme returned may be immutable.
~ (text->list text) => <list of graphemes>
The graphemes returned may be immutable.
~ (char->grapheme char) => <grapheme>
~ (utf8->grapheme string) => <grapheme>
[....]
Conversions to possibly immutable graphemes.
And some simple I/O extensions:
~ (read-grapheme [port]) => <grapheme>
~ (peek-grapheme [port]) => <grapheme>
[etc.]
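These compose in the obvious way.  For instance, counting the
graphemes on a port might look like this (assuming, as with
READ-CHAR, that READ-GRAPHEME returns an EOF object at end of
input):

    (define (count-graphemes port)
      (let loop ((n 0))
        (if (eof-object? (read-grapheme port))
            n                        ; end of input: return the count
            (loop (+ n 1)))))        ; consumed one grapheme; keep going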
There is still an awkwardness, however.  Consider writing the "StUDly
CaPItalIZer" procedure.  It's tempting to write it as a loop that
uses an integer grapheme index to iterate over the text, randomly
picking graphemes to change the case of. That wouldn't work though:
changing the case of one character can change the length of text,
right at the point being indexed, and invalidate the indexes. So,
texts really need markers that work like those in Emacs:
~ (make-text-marker text index) => <marker>
~ (text-marker? obj) => <boolean>
~ (marker-text marker) => <text>
~ (marker-index marker) => <index>
~ (set-marker-index! marker index)
~ (set-marker! marker text index)
etc.
Changes (by TEXT-REPLACE!) to the region of a text object to
the left of a marker leave the marker in the same position
relative to the right end of the text, and vice versa.
Changes to a region which _includes_ a marker leave the
marker at the last grapheme index of the replacement
text that was inserted, or, if the replacement was empty,
at its old index position minus the number of graphemes
deleted to the marker's left.
The procedures SUBTEXT, TEXT-REPLACE!, TEXT-REF,
and others that accept indexes can accept markers as those
indexes.
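With markers in hand, the "StUDly CaPItalIZer" becomes
straightforward.  Here is a sketch against the draft interface
above (RANDOM-BOOLEAN? is again a hypothetical coin-flip helper):

    (define (studly-capitalize-text! text)
      (let ((m (make-text-marker text 0)))
        (let loop ()
          (if (< (marker-index m) (text-length text))
              (begin
                (if (random-boolean?)
                    ;; replace the grapheme at M with its upcased text;
                    ;; #t for END means "one past START"
                    (text-replace! text m #t
                                   (grapheme-upcase (text-ref text m))))
                (set-marker-index! m (+ 1 (marker-index m)))
                (loop))))
        text))

When GRAPHEME-UPCASE returns a multi-grapheme text (as for
eszett), the marker is left at the last inserted grapheme, so
incrementing it steps cleanly past the whole replacement.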
Unlike markers, text properties and overlays aren't strictly needed to
make TEXT? useful -- but they would make a good addition. The issue
is that mutating procedures (like TEXT-REPLACE!) should be aware of
properties in order to update them properly. If properties and
overlays are left out, and people have to implement them in a higher
layer, then their "attributed text" data type can't be passed to a
procedure that just expects a text object.
* Optional Changes to CHAR? and STRING?
The above specification of the TEXT? and GRAPHEME? types is useful on
its own, but it might be considerably more convenient in implementations
which also adopt the following ideas:
~ CHAR? is an octet, STRING? a sequence of octets
~ STRING? values are resizable
~ STRING? values contain an "encoding" attribute which may be
any of
utf8
utf16be
utf16le
utf32
or an implementation-defined value.  Note however that
procedures such as STRING-REF ignore this attribute and
view strings as sequences of octets.
STRING-APPEND implicitly converts its second and subsequent
arguments to the same encoding as its first.
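For example, under these rules:

    (string-append a-utf8-string a-utf16le-string)
       =>  a string whose encoding attribute is utf8

(the UTF-16LE octets having been transcoded to UTF-8 along the way).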
~ (text? "a string") => #t
~ (grapheme? #\a) => #t
In other words, all character values are graphemes, and all strings
are text values.
These ideas _could_ be taken even a step further with the addition
of:
~ TEXT? values contain an "encoding" attribute, just as strings
do (utf-8, etc.)
~ (string? a-text-value) => #t
~ (char? a-grapheme) => <boolean>
All text values can be strings; some graphemes can be characters.
* Summary
The new TEXT? and GRAPHEME? types present a simple and traditional
interface to "conceptual strings" and "conceptual characters".
They make it easy to express simple algorithms simply and without
reference to the internal structure of Unicode.
Reflecting the realities of global text processing, there is
no bias in the interfaces suggesting that the set of graphemes
is finite.
Also reflecting the realities of global text processing: the length
of a text object may change over time; a sequence replacement
operator is supplied instead of an element replacement operator;
and markers (similar to those in text editors) are provided for
iteration and other examples of keeping track of "a position within
a text value".
There is no essential difference between a grapheme and a text
object of length 1, and thus the proposal makes GRAPHEME? a
subtype of TEXT?.
If STRING? is suitably extended, then it may be equal to or a subset
of TEXT?.  Conversely, if TEXT? is suitably extended, it may be
equal to or a subset of STRING?.  It may be sensible to unify the
two types (although even analogous string procedures and text
procedures will still behave differently from one another).
CHAR? may be safely viewed as a subtype of GRAPHEME?, but the
converse is not, and can not, be true.
--------------------------------
> Hm. Let's consider some use cases.
> Let's say an app wants to ask the user her name, she might want to write
> her name in her native Arabic. Or perhaps her address, or anything
> "local". If the app then wants to communicate this information to her
> Chinese friend (who also knows Arabic), the need for Unicode is
> fundamental. We can probably agree there.
Absolutely.  What's more, if I'm sitting in California and write a
portable Scheme program that generates anagrams of a name, it'd be
awfully swell if (a) my code doesn't have to "know" anything special
about unicode internals; (b) my code works when passed her name as input.
> The question becomes, is the user's name logically a simple string (can
> we read it in with standard procedures), or must we use this
> text-buffer, complete with marks, multiple backends, et al? It seems
> more natural, to me, for this to be provided via simple strings,
> although I could be wrong here.
Scheme's requirements of the CHAR? and STRING? types simply don't map
onto unicode. The case problem I illustrated above is one example
and I _suspect_ that there are others, even if you do something like
Ray's trying to do and make an infinitely large character set.
I _think_ the TEXT? and GRAPHEME? stuff above is about as natural as
"simple strings" -- it just doesn't try to give those types behavior
that makes no sense in Unicode.
> I was looking at what Python did
> (http://www.python.org/peps/pep-100.html), and they did make a
> distinction. They have a separate unicode string representation which,
> like strings, is a subclass of SequenceObject. So they maintained
> separate representation while being relatively transparent to the
> programmer. Pretty slick, it seems.
That URL is slightly wrong. It's:
http://www.python.org/peps/pep-0100.html
It sounds _ok_. It's got some problems.
The genericity of it (that these are still sequences) is
winning.... I'll discuss that below.
Mostly it's a little too low level.  They're only (initially?)
supporting the 1-1 case conversions. They are exposing unicode code
points and just handing users property tables for those. They don't
include a "marker" concept. These are all symptoms of starting off
with an implementation limited to the 16-bit code points -- they
haven't thought through how to do full unicode support (and once they
do, I'll bet they wind up with something close to TEXT? and
GRAPHEME?).
> C#, Java, and ECMAScript (JavaScript) all apparently use UTF-16
> as their native format, although I've never coded in the first
> two. Networking was probably the most important consideration
> there.
Streaming conversions (e.g., for networking) are cheap and easy. I
think they made those choices to simplify implementations, and then
made the mistake of exposing that implementation detail in the
interfaces.
> Perhaps the way forward would be to leave what we've been
> calling "simple strings" alone, and somehow (perhaps with GOOPS,
> but haven't thought too much about this) pull the Python trick
> of having a unicode string that can be used everywhere simple
> strings can. Thoughts on that idea?
The proposal above makes it possible to pass text everywhere that
simple strings can be used. However, in that part of the proposal,
string-ref, string-set! and so forth are still specified to operate
on octets.
The proposal also makes it possible to pass strings everywhere that
text can be used. I think that's the more interesting direction:
just use text- and grapheme- procedures from now on except where you
_really_ want to refer to octets.
-t