Re: [Gnu-arch-users] [OT] Unicode meets Scheme strings draft
From: Tom Lord
Subject: Re: [Gnu-arch-users] [OT] Unicode meets Scheme strings draft
Date: Sat, 24 Jan 2004 16:19:16 -0800 (PST)
> From: Florian Weimer <address@hidden>
> Tom Lord wrote:
> > > Let's take one step back a bit. What is a "character" in the
> > > context of this thread (i.e. Pika)?
> > A unicode codepoint, plus buckybits.
> I don't think these buckybits are a good idea. On almost any system,
> the relationship between "byte", "Unicode codepoint" and "key sequence"
> is non-trivial. Gluing things together is prone to confusion and
> future problems.
A codepoint with buckybits is not a "key sequence" -- in event-typing,
it serves as what a chording, button-box style of input device (such as
a keyboard) can transmit for a single (logical) gesture.
I'm not sure why you mention `byte' in this context. It is true (in
Pika) that I want 0..255 to be valid codepoints with the side effect
that an octet stream happens to also be a valid codepoint stream. And
I happen to want those Pika codepoints to be their numerically
equivalent Unicode codepoints. In R6RS I want something compatible
with that but slightly less specific.
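A minimal sketch of the "octet stream is also a codepoint stream" property, written in Python rather than Pika. The function name `octets_to_codepoints` is hypothetical; the only claim illustrated is that each octet 0..255 is read as the codepoint with the same number, which is also what Python's built-in `latin-1` codec does.

```python
def octets_to_codepoints(data: bytes) -> list[int]:
    """Read each octet as the codepoint with the same numeric value."""
    return list(data)


data = bytes([0x48, 0x69, 0xE9, 0xFF])
codepoints = octets_to_codepoints(data)

# The latin-1 codec maps byte N to U+00NN, so both views agree.
assert codepoints == [ord(ch) for ch in data.decode("latin-1")]

# The mapping is trivially reversible.
assert bytes(codepoints) == data
```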
> Just decide how many ISO 10646 planes you want to support, and use the
> appropriate number of bits (21 is fine). Use an additional bit to
> squeeze in 256 code positions you might want to use to represent invalid
> UTF-8 input data (so you have round-trip capability even for binary
> files accidentally interpreted as UTF-8).
I'm not giving UTF-8 that kind of privileged role in Pika.
However, it's a fascinating idea and I thank you for it. It solves a
nasty little problem I was facing.
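For illustration, Florian's suggestion (one extra bit above the 21 codepoint bits, giving 256 positions for bytes that are not valid UTF-8) might look roughly like the Python sketch below. This is not Pika code. The names `ESCAPE_BIT`, `decode_lossless` and `encode_lossless` are made up for the example, and the sketch leans on Python's standard `surrogateescape` error handler to find the invalid bytes.

```python
ESCAPE_BIT = 1 << 21  # hypothetical: first bit above the 21-bit codepoint range


def decode_lossless(data: bytes) -> list[int]:
    """Decode UTF-8; keep each undecodable byte as ESCAPE_BIT | byte."""
    out = []
    for ch in data.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # surrogateescape encodes an invalid byte B as U+DC00 + B
            out.append(ESCAPE_BIT | (cp - 0xDC00))
        else:
            out.append(cp)
    return out


def encode_lossless(codepoints: list[int]) -> bytes:
    """Inverse of decode_lossless: escaped positions become raw bytes again."""
    out = bytearray()
    for cp in codepoints:
        if cp & ESCAPE_BIT:
            out.append(cp & 0xFF)
        else:
            out += chr(cp).encode("utf-8")
    return bytes(out)


raw = b"ok \xff\xfe \xc3\xa9 \xc3"   # valid text mixed with stray bytes
assert encode_lossless(decode_lossless(raw)) == raw
```

Because the escaped positions lie outside the 21-bit range, they cannot collide with any real codepoint, which is what makes the round trip safe even for binary data.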
Let's suppose that I use up two buckybits purely internally to
represent "ill-formed-characters". That is to say: users would have
6 buckybits, not 8, and there are two bits per character for internal
use.
I don't actually need 2 bits --- I just need a bit more than 1.5 and
current hw isn't too good at fractional (let alone irrationally
fractional) bits yet.
Now I can have a string like:
<00 codepoint><00 codepoint><01 bogus><10 bogus><10 bogus><00 codepoint>
^
|
X
in which <01 bogus> and <10 bogus><10 bogus> are ill-formed combining
character sequences that should be treated as distinct graphemes by
procedures like GRAPHEME-LENGTH and GRAPHEME-REF.
Now if I insert a string of the form:
<01 bogus>
at point X in that string, then the result is:
<00 cp><00 cp><01 bogus><10 bogus><01 bogus><01 bogus><00 cp>
\ /\ /\ /
\ / \ / \ /
modified insertion modified by
by insertion
insertion
In other words, such an insertion has to change adjacent characters to
preserve the "bogus grapheme" boundaries.
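One possible reading of this, sketched in Python, is that the two internal bits take three values (00 for an ordinary codepoint, 01 and 10 for ill-formed data) and that adjacent ill-formed sequences are told apart by alternating between 01 and 10. That reading is an assumption, as are the names `insert_bogus` and `other` and the choice of insertion point (taken here to fall between `<01 bogus>` and the `<10 bogus>` pair); the sketch only shows that such a rule reproduces the tag pattern of the result string above.

```python
CP, BOGUS_A, BOGUS_B = 0b00, 0b01, 0b10   # assumed meaning of the two internal bits


def other(tag: int) -> int:
    """Return the other 'bogus' tag."""
    return BOGUS_B if tag == BOGUS_A else BOGUS_A


def insert_bogus(s, pos, values):
    """Insert one ill-formed sequence before index pos of s.

    s is a list of (tag, value) pairs. A run of equal non-zero tags is
    one ill-formed sequence, so neighbouring runs must differ in tag.
    """
    left, right = list(s[:pos]), list(s[pos:])
    prev = left[-1][0] if left else CP
    tag = other(prev) if prev != CP else BOGUS_A
    left += [(tag, v) for v in values]
    prev = tag
    i = 0
    # Flip following runs for as long as a run would merge with its left neighbour.
    while i < len(right) and right[i][0] == prev:
        old, new = right[i][0], other(right[i][0])
        while i < len(right) and right[i][0] == old:
            right[i] = (new, right[i][1])
            i += 1
        prev = new
    return left + right


before = [(CP, "a"), (CP, "b"), (BOGUS_A, 0xFF),
          (BOGUS_B, 0xFE), (BOGUS_B, 0xFD), (CP, "c")]
after = insert_bogus(before, 3, [0xC3])

# Tags now read 00 00 01 10 01 01 00, matching the result string above.
assert [t for t, _ in after] == [CP, CP, BOGUS_A, BOGUS_B, BOGUS_A, BOGUS_A, CP]
```

Under this rule the values themselves never change; only the tags of the inserted sequence and of the runs to its right are adjusted.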
The upshot of this is that I can pun a single string as both a
sequence of codepoints and a sequence of (possibly ill-formed)
combining sequences -- and that is, btw, sufficient to provide the
round-tripping ability you were after not only for UTF-8 but for
UTF-16 and UTF-32 as well. Total win -- thanks again.
-t