Re: [Gnu-arch-users] [OT] Unicode meets Scheme strings draft


From: Tom Lord
Subject: Re: [Gnu-arch-users] [OT] Unicode meets Scheme strings draft
Date: Sat, 24 Jan 2004 16:19:16 -0800 (PST)


    > From: Florian Weimer <address@hidden>

    > Tom Lord wrote:

    > >     > Let's take one step back a bit.  What is a "character" in the
    > >     > context of this thread (i.e. Pika)?

    > > A unicode codepoint, plus buckybits.   

    > I don't think these buckybits are a good idea.  On almost any system,
    > the relationship between "byte", "Unicode codepoint" and "key sequence"
    > is non-trivial.  Gluing things together is prone to confusion and
    > future problems.

A codepoint with buckybits is not a "key sequence" -- in event-typing,
it serves as what a chording, button-box style of input device (such as
a keyboard) can transmit for a single (logical) gesture.

I'm not sure why you mention `byte' in this context.  It is true (in
Pika) that I want 0..255 to be valid codepoints, with the side effect
that an octet stream happens to also be a valid codepoint stream.  And
I happen to want those Pika codepoints to be their numerically
equivalent Unicode codepoints.  In R6RS I want something compatible
with that but slightly less specific.
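
The "octet stream is also a codepoint stream" property can be illustrated outside Pika. The sketch below uses Python's `latin-1` codec purely as a stand-in (nothing here is Pika's API): decoding maps each octet to the Unicode codepoint with the same numeric value.

```python
# Illustration only: Python's latin-1 codec, not Pika.
octets = bytes(range(256))             # every possible octet value, 0..255
text = octets.decode("latin-1")        # one codepoint per octet
codepoints = [ord(ch) for ch in text]

# Each octet became the numerically equal Unicode codepoint.
assert codepoints == list(octets)
# The mapping is lossless in the other direction as well.
assert text.encode("latin-1") == octets
```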

    > Just decide how many ISO 10646 planes you want to support, and use the
    > appropriate number of bits (21 is fine).  Use an additional bit to
    > squeeze in 256 code positions you might want to use to represent invalid
    > UTF-8 input data (so you have round-trip capability even for binary
    > files accidentally interpreted as UTF-8).
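
The round-trip property in the quoted suggestion can be illustrated with Python's `surrogateescape` error handler. This is a different mechanism from the one proposed: it maps each undecodable byte 0x80..0xFF to a lone surrogate U+DC80..U+DCFF instead of spending an extra bit per character. Treat it as an analogy, not as a sketch of Pika.

```python
# Analogy only: Python's surrogateescape handler, not the extra-bit scheme.
data = b"ok \xff\xfe caf\xc3\xa9"      # two invalid bytes, then valid UTF-8

# Invalid bytes are smuggled through as lone surrogates instead of raising.
text = data.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", errors="surrogateescape")
assert restored == data
```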

I'm not giving UTF-8 that kind of privileged role in Pika.

However, it's a fascinating idea and I thank you for it.   It solves a
nasty little problem I was facing.

Let's suppose that I use up two buckybits purely internally to
represent "ill-formed-characters".   That is to say: users would have
6 buckybits, not 8, and there are two bits per character for internal
use.

I don't actually need 2 bits --- I just need a bit more than 1.5 and
current hw isn't too good at fractional (let alone irrationally
fractional) bits yet.
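
The arithmetic behind "a bit more than 1.5" can be checked quickly. The sketch assumes the count being encoded is three states per character (one well-formed and two "bogus" tag values, matching the 00, 01 and 10 tags used in this message); that reading is an inference, not something the message states outright.

```python
import math

# Assumption: three internal states per character (tags 00, 01 and 10).
states = 3
bits_needed = math.log2(states)   # information needed to pick one of 3 states
whole_bits = math.ceil(bits_needed)

print(f"{bits_needed:.3f} bits needed, {whole_bits} whole bits used")
```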

Now I can have a string like:


  <00 codepoint><00 codepoint><01 bogus><10 bogus><10 bogus><00 codepoint>
                                        ^
                                        |
                                        X

in which <01 bogus> and <10 bogus><10 bogus> are ill-formed combining
character sequences that should be treated as distinct graphemes by
procedures like GRAPHEME-LENGTH and GRAPHEME-REF.

Now if I insert a string of the form:

        <01 bogus>

at point X in that string, then the result is:


  <00 cp><00 cp><01 bogus><10 bogus><01 bogus><01 bogus><00 cp>
                \        /\        /\                  /
                 \      /  \      /  \                /
                  modified  insertion   modified by
                  by                     insertion
                  insertion

In other words, such an insertion has to change adjacent characters to 
preserve the "bogus grapheme" boundaries.
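
One way to read the two diagrams is that a maximal run of cells carrying the same non-zero tag forms one bogus grapheme, so adjacent bogus graphemes must carry different tags. The sketch below follows that reading. The names (`graphemes`, `retag`, `insert_at_grapheme`), the `(tag, value)` cell layout and the flip-on-conflict rule are all assumptions made for illustration; the message does not specify an algorithm, and real combining sequences among well-formed codepoints are ignored here.

```python
WELL_FORMED, BOGUS_A, BOGUS_B = 0b00, 0b01, 0b10   # hypothetical tag values

def graphemes(cells):
    """Group (tag, value) cells: each well-formed cell stands alone,
    a maximal run of equal bogus tags is one ill-formed grapheme."""
    groups = []
    for cell in cells:
        tag = cell[0]
        if tag != WELL_FORMED and groups and groups[-1][-1][0] == tag:
            groups[-1].append(cell)
        else:
            groups.append([cell])
    return groups

def retag(groups):
    """Flatten groups, flipping a bogus tag whenever it would otherwise
    merge with the bogus grapheme immediately before it."""
    cells, prev = [], WELL_FORMED
    for group in groups:
        tag = group[0][0]
        if tag != WELL_FORMED and tag == prev:
            tag = BOGUS_B if tag == BOGUS_A else BOGUS_A
        cells.extend((tag, value) for _, value in group)
        prev = tag
    return cells

def insert_at_grapheme(cells, index, new_cells):
    """Insert new_cells before grapheme number `index`, keeping boundaries."""
    groups = graphemes(cells)
    groups[index:index] = graphemes(new_cells)
    return retag(groups)

# The string from the first diagram; X is the boundary before grapheme 3.
s = [(0, 0x61), (0, 0x62), (1, 0xC3), (2, 0xE2), (2, 0x82), (0, 0x63)]
result = insert_at_grapheme(s, 3, [(1, 0xFF)])
tags = [tag for tag, _ in result]
values = [value for _, value in result]
```

With these inputs the tag sequence comes out as 00 00 01 10 01 01 00, the same pattern as the second diagram, and the payload values are left untouched; only the tags of the inserted cell and of the run after it change.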

The upshot of this is that I can pun a single string as both a
sequence of codepoints and a sequence of (possibly ill-formed)
combining sequences -- and that is, btw, sufficient to provide the
round-tripping ability you were after not only for UTF-8 but for
UTF-16 and UTF-32 as well.   Total win -- thanks again.

-t



