
From: Tom Lord
Subject: [Gnu-arch-users] [semi-OT] Unicode / han unification (was Re: Spaces ...)
Date: Wed, 21 Jan 2004 17:20:29 -0800 (PST)


    > From: Andrew Suffield <address@hidden>

    >> Sorry.  I meant all characters that have been electronically encoded
    >> in a standard character set.  As far as I know unicode does do
    >> that.

    > Nope, not at all. See the previous message, 'han unification'.

Let's be pedantic.   What you two are really disagreeing about is the
meaning of the word "character".

My understanding is that there are certain characters (in one sense of
the word) which are common to Chinese, Japanese, and Korean.   There
are, broadly speaking, four different styles of rendering these
characters as glyphs -- two for Chinese (traditional and simplified),
and one each for Japanese and Korean.   That is to say, there are four
different ways of drawing these characters.

A single font can render each of these characters in a way such that all
users will be able to recognize and read them.  Linguists would (so I
hear) generally agree that, though they may be written in four
different styles, these are each a single character.

No single font can render each of these characters in a way that will
seem "natural" to all users -- a single font can only make them
legible.  For "natural" rendering, you would want to use one font for
Japanese text, another for simplified Chinese, and so forth.

Adding to the complexity of the situation, printed materials
traditionally use distinct fonts for these characters depending on
context.  The same character, occurring once in a Japanese sentence
and again in a poem quoted in the original Chinese within the same
document, would be printed in two different styles.

The Unicode consortium decided not to encode differences in font as
differences in characters.  For example, with a few exceptions used as
mathematical symbols, Fraktur renderings of the Latin-based characters
used to write German are not given a separate encoding in Unicode.  In
a Unicode document analyzing some writings of Goethe, the main text
and the quotes would use the same Unicode characters -- even if in an
ideal presentation, the quotes might be rendered in Fraktur.
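
To see that folding concretely -- a minimal sketch, assuming only that
you have a Python 3 interpreter and its standard unicodedata module at
hand -- note that the Fraktur letters Unicode _does_ encode separately
are the mathematical ones, and that compatibility normalization (NFKC)
maps them straight back onto the ordinary Latin letters:

    import unicodedata

    frak_a = "\U0001D504"            # MATHEMATICAL FRAKTUR CAPITAL A
    print(unicodedata.name(frak_a))  # -> MATHEMATICAL FRAKTUR CAPITAL A

    # NFKC treats the Fraktur shape as a font variant and folds it away:
    print(unicodedata.normalize("NFKC", frak_a) == "A")   # -> True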

Some (dare we say "legacy"?) CJK character sets made that decision
differently.   They encode the differently rendered versions of the
"same character" as distinct codepoints -- they give them different
numeric values in the character sets.   They record what some would
call a distinction of _font_ as a distinction of _codepoint_.

Consequently, there exist systems and data sets for which "round-trip
conversion" -- conversion to Unicode and back again -- is at best
problematic and in the general case impossible.   There are data and systems
out there that make distinctions between characters that Unicode does
not recognize because from the Unicode perspective, they are merely
distinctions in the rendering of a single character.
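
A concrete illustration of that, again as a small Python sketch (the
codepoints are real Unicode data; the script itself is just mine):
U+F900 is a "CJK compatibility ideograph", a codepoint that exists
only so that round-trip conversion from a legacy character set has
somewhere to land.  It is canonically equivalent to the unified
ideograph U+8C48, so even the mildest normalization (NFC) erases the
distinction the legacy set drew:

    import unicodedata

    compat  = "\uF900"    # CJK COMPATIBILITY IDEOGRAPH-F900
    unified = "\u8C48"    # the unified ideograph it decomposes to

    print(compat == unified)                                # False: two codepoints
    print(unicodedata.normalize("NFC", compat) == unified)  # True:  one "character"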

So you're both right.  Unicode does indeed contain (essentially, not
quite literally) all of the characters that have ever been encoded for
computing.  And, at the same time, Unicode does indeed _not_ contain
all of the _codepoints_ (which in other contexts we might call
"characters") that have ever been encoded for computing.



    > > It isn't perfect and it certainly is not complete when you
    > > consider all forms of writing humans have ever used, but it is
    > > maintained, it works at least as well as anything else out there.

    > Doesn't do that either, if you happen to be Chinese, Japanese, or
    > Korean.

As nearly as I can tell, opinions vary about that.  That is to say,
there are some Chinese, Japanese, and Korean users, and certainly
plenty of others, who would disagree with asuffield here.

If the "legacy" CJK character sets had never existed, then I think it
quite plausible that nobody would have any complaints at all about the
Unicode unification approach.   Where they wanted to distinguish the
renderings of these characters, they would use constructs outside of
the character set itself.   _Even_in_the_narrow_domain_of_English_
and, of all things, the *ASCII* character set -- well, as you can see
from this sentence, the character set isn't expected to live up to
traditional typography all by itself;  /c'est la vie, non?/

Since the legacy character sets _do_ exist, it becomes more of a
question of the wisdom of Unicode's judgement, and a question of
whether the legacy systems will be displaced.

My personal opinion is that the Unicode consortium is probably right.
While I can't personally evaluate the CJK issue on the basis of my own
knowledge, in those areas (both linguistic and computational) where I
_am_ qualified to judge their arguments and decisions, they have been
unfailingly wise.  Their position on Han Unification is at least very
plausible from my perspective -- though my ability to detect error in
that area is very slight.

As software developers, we would do well, imo, to embrace Unicode
and to concentrate on developing software that will ultimately make a
compelling replacement for legacy systems and a compelling target for
legacy data sets.


-t




