From: Tom Lord
Subject: Re: [Gnu-arch-users] [OT] Unicode vs. legacy character sets
Date: Tue, 3 Feb 2004 11:05:17 -0800 (PST)

    > From: Aaron Bentley <address@hidden>

    > On Tue, 2004-02-03 at 13:16, Tom Lord wrote:

    >> So what I am (tentatively) willing to do is this: if there's enough
    >> programmers who both (a) want to help with my software and (b) are
    >> against unification -- I'm willing to have libhackerlab (hence Pika
    >> and arch) use an _extended_ Unicode.  Standardizing, within those
    >> libraries and programs on assigning-by-convention some private-use
    >> codepoints to un-unified characters.

    > I'm an ignorant English speaker, but would it be possible to make the
    > private characters be combining characters?  That is, you'd have a
    > combining character to indicate that the next character is Chinese,
    > Japanese, or Korean, then use the unified Han character.

Not cleanly, no.   You have that backwards.  The combining character
would _follow_ the unified character.

        <unified><language-tag><unified><language-tag>.....

That is, indeed, perfectly workable logically -- but not, I think, the
best way to do it.   It's a _plausible_ approach in that I don't think
there's enough private-use code-space in the basic multilingual plane
(codepoints that fit into 16 bits) to squeeze in the un-unified
ideographs -- so in UTF-8 and UTF-16 you'd be paying a comparable
space penalty either way.  But:

First: I happen to believe [long explanation of why elided] that,
internally to applications, UTF-32 representations are actually
important.  The combining-character approach doubles the length of a
string in UTF-32.
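
To make the arithmetic concrete, here's a minimal sketch in C.  The
codepoint values (TAG_JA, PUA_HAN_JA) are invented for illustration --
they aren't assignments anyone has actually proposed:

    #include <stdint.h>
    #include <stdio.h>

    /* Imagined codepoint assignments -- illustrative values only. */
    #define TAG_JA      0xF0001u  /* hypothetical combining "Japanese" tag */
    #define UNIFIED_HAN 0x4E2Du   /* U+4E2D, a real unified CJK ideograph */
    #define PUA_HAN_JA  0xF4E2Du  /* hypothetical de-unified variant (PUA) */

    int
    main (void)
    {
      /* Combining-tag approach: every ideograph carries a trailing
         tag, so two ideographs cost four UTF-32 code units... */
      uint32_t tagged[] = { UNIFIED_HAN, TAG_JA, UNIFIED_HAN, TAG_JA };

      /* ...while the dedicated-codepoint approach costs two. */
      uint32_t dedicated[] = { PUA_HAN_JA, PUA_HAN_JA };

      printf ("tagged:    %zu code units\n",
              sizeof tagged / sizeof tagged[0]);
      printf ("dedicated: %zu code units\n",
              sizeof dedicated / sizeof dedicated[0]);
      return 0;
    }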

Second: the unified characters are single codepoints (at least mostly
so -- I'm not certain) -- and thus admit some useful algorithms that
operate on (fixed-width) codepoints rather than (unbounded-width)
combining-character sequences.   If you really want to refute
unification by demonstration, then the demonstrated alternative
should propose alternative codepoints, not combining-character
sequences.   For example: if you wrote an Emacs based on libhackerlab,
then compiling a version that works on "non-unified Unicode" should be
little more than a matter of replacing the raw-data Unicode databases
with some extended databases.
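
To make the fixed-width point concrete, here's a sketch in C.  The
codepoint assignments (TAG_JA, DEUNIFIED_JA_*) are invented for
illustration; a real version would drive both tests from the Unicode
databases rather than hard-coded ranges:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Imagined assignments -- illustrative only. */
    #define TAG_JA             0xF0001u
    #define DEUNIFIED_JA_FIRST 0xF1000u
    #define DEUNIFIED_JA_LAST  0xF7FFFu

    /* Dedicated codepoints: the test is a fixed-width comparison
       (or table lookup) on a single code unit. */
    static int
    is_deunified_ja (uint32_t c)
    {
      return c >= DEUNIFIED_JA_FIRST && c <= DEUNIFIED_JA_LAST;
    }

    /* Stub standing in for a real combining-class lookup in the
       Unicode character database. */
    static int
    is_combining (uint32_t c)
    {
      return c == TAG_JA;
    }

    /* Combining tags: the same question means scanning past the
       base character through an unbounded run of combining marks. */
    static int
    is_tagged_ja (const uint32_t *s, size_t len, size_t i)
    {
      for (i = i + 1; i < len && is_combining (s[i]); ++i)
        if (s[i] == TAG_JA)
          return 1;
      return 0;
    }

    int
    main (void)
    {
      uint32_t tagged[] = { 0x4E2Du, TAG_JA };
      printf ("dedicated: %d  tagged: %d\n",
              is_deunified_ja (0xF1234u), is_tagged_ja (tagged, 2, 0));
      return 0;
    }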

That said: it's "not my problem" -- it's not my place to pick one of
these two approaches over the other, and neither of my arguments is
absolute.


    >> That wouldn't provide interoperability with everything in the world --
    >> far from it.   For example, it would be (at best) a long time before
    >> browsers would recognize the non-standard characters.  

    > That way, the raw output would be legible (though ugly) for non-savvy
    > programs, and conversion to standard Unicode would be a matter of
    > deleting the combining characters.

(Lossy) conversion to standard Unicode is trivial with either
approach.
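
For instance -- a sketch of that lossy conversion in C, again with
invented codepoint assignments; a real de-unification would fold
codepoints back through a table rather than a fixed offset:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Imagined assignments, as above -- illustrative only. */
    #define TAG_JA             0xF0001u
    #define DEUNIFIED_JA_FIRST 0xF1000u
    #define DEUNIFIED_JA_LAST  0xF7FFFu
    #define UNIFIED_JA_BASE    0x4E00u  /* pretend fold-back offset */

    /* Lossy conversion to standard Unicode, in place: drop the
       combining tags, or fold de-unified codepoints back onto
       their unified characters.  Returns the new length. */
    static size_t
    to_standard_unicode (uint32_t *s, size_t len)
    {
      size_t out = 0, i;

      for (i = 0; i < len; ++i)
        {
          if (s[i] == TAG_JA)
            continue;                   /* delete the tag */
          else if (s[i] >= DEUNIFIED_JA_FIRST && s[i] <= DEUNIFIED_JA_LAST)
            s[out++] = UNIFIED_JA_BASE + (s[i] - DEUNIFIED_JA_FIRST);
          else
            s[out++] = s[i];
        }
      return out;
    }

    int
    main (void)
    {
      uint32_t s[] = { 0xF1234u, 0x4E2Du, TAG_JA };
      printf ("converted length: %zu\n", to_standard_unicode (s, 3));
      return 0;
    }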

Display -- hmm.... y[]o[]u[] d[]o[] h[]a[]v[]e[] a[] g[]o[]o[]d[]
p[]o[]i[]n[]t[] t[]h[]e[]r[]e[].

-t