From: Andrew Suffield
Subject: Re: [Gnu-arch-users] Re: [semi-OT] Unicode / han unification (was Re: Spaces ...)
Date: Thu, 22 Jan 2004 04:58:31 +0000
User-agent: Mutt/1.5.5.1+cvs20040105i

On Wed, Jan 21, 2004 at 07:29:05PM -0800, Tom Lord wrote:
>     > From: Andrew Suffield <address@hidden>
> 
>     > On Wed, Jan 21, 2004 at 05:20:29PM -0800, Tom Lord wrote:
> 
>     >> My understanding is that there are certain characters (in one
>     >> sense of the word) which are common to Chinese, Japanese, and
>     >> Korean.  There are, broadly speaking, four different styles of
>     >> rendering these characters as glyphs -- two for Chinese
>     >> (traditional and simplified), and one each for Japanese and
>     >> Korean.  That is to say, there are four different ways of
>     >> drawing these characters.
> 
>     > It's not quite that simple - there are multiple, similar-looking
>     > ways of writing the same character within the same language in
>     > some cases. Usually it doesn't matter, but for some things it
>     > does - names are a good example. For a person's given name,
>     > writing the character differently is akin to spelling it
>     > differently in English, and Han unification is akin to declaring
>     > that from now on, all people with names like "Tom", "Thom", or
>     > other derivatives of "Thomas" will henceforth be called "Thom".
> 
> I'm having trouble seeing the analogy.  As far as I can tell you are
> comparing related words (names in this case), spelled in a phonetic
> alphabet, all in one language -- to related words, written in an
> ideographic script, from different languages or from different regions
> where (roughly) the same language but different typography is used.

Yes, pretty much. There are several ways to spell some names in
English, which usually look pretty similar and are all pronounced the
same. In the CJK character sets (Japanese in particular), there are
several ways to write the character(s) that form a name - they look
similar and are pronounced the same, but they are not the same.

Just like English people usually care how you spell their name, CJK
people (again, Japanese in particular) care how you write theirs. They
will not be amused by people telling them that they have to use a
different writing, and they will be annoyed or insulted if the people
doing this are foreign (that matters).

> Can you try to explain it more precisely?
> 
> It is true, as far as I know, that Unicode does not include all
> ideographs used for personal names but I don't think that that's the
> issue you are talking about.

Unicode picks one way to write the character, and says that all people
who use different variants should use this particular one from now
on. (There is a distinct issue that some people have names that aren't
represented at all, but that's not a problem specific to unicode).
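
To make the unification point concrete, here is a minimal Python sketch.
It assumes the `euc_jp`, `gb2312` and `big5` codecs that ship with
CPython, and it uses U+76F4 as the sample purely for illustration (it is
a character often cited as being drawn differently in Japanese and
Chinese typography). The helper name `decode_everywhere` is made up for
this example. The same character taken from three language-specific
encodings comes back as one code point, so nothing in the decoded text
records which style of glyph was intended.

```python
# Legacy encodings, each of which implies a language.
# The pairing below is an assumption made for this sketch.
LEGACY_CODECS = {
    "euc_jp": "Japanese",
    "gb2312": "Simplified Chinese",
    "big5": "Traditional Chinese",
}

# Sample character, chosen only as an illustration.
SAMPLE = "\u76f4"


def decode_everywhere(ch):
    """Round-trip one character through each legacy codec.

    Returns a dict mapping codec name -> decoded text. The language
    implied by the codec is not part of the decoded result.
    """
    results = {}
    for codec in LEGACY_CODECS:
        raw = ch.encode(codec)          # bytes as a legacy document would store them
        results[codec] = raw.decode(codec)
    return results


if __name__ == "__main__":
    for codec, text in decode_everywhere(SAMPLE).items():
        print(f"{LEGACY_CODECS[codec]:20} {codec:8} -> U+{ord(text):04X}")
```

Every row prints the same code point; the language label exists only
in the table that the sketch itself supplies.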

>     > The Eastern countries are pretty serious about etiquette, and using
>     > the wrong writing for somebody's name could easily tip the balance
>     > between a contract going to you, or to your next competitor.
> 
> With due respect, and acknowledging that I take your meaning and that
> you meant no harm, I think that that statement at least borders on
> harmful cultural stereotyping.  We have no shortage of complex, sometimes
> codified, and sometimes quite irrational rules of etiquette here in
> the "Western countries".

Mmm. I think you underestimate the importance which people place on
etiquette over there (except possibly Korea, they're comparatively
cosmopolitan).

They're *really serious*. Bad manners are considered an insult. If
anything, I am understating it.

>     >> A single font can render each of these characters in a way such that all
>     >> users will be able to recognize and read them.  Linguists would (so I
>     >> hear) generally agree that, though they may be written in four
>     >> different styles, these are each a single character.
> 
>     >> No single font can render each of these characters in a way that will
>     >> seem "natural" to all users -- a single font can only make them
>     >> legible.  For "natural" rendering, you would want to use one font for
>     >> Japanese text, another for simplified Chinese, and so forth.
> 
>     > If you don't code them as the same character, then having a font that
>     > uses the proper writing for them all is easy. Mozilla under X, for
>     > example, does it pretty well so long as you don't use unicode and have
>     > enough fonts installed - it'll pick a font that matches the character
>     > set of the web page.
> 
>     > If you use unicode, there is no way to tell which font is the right
>     > one to use. Sometimes the application is going to pick the wrong one,
>     > and the result is an awful ugly mess. FroM an aEsthEtic pErspEctive, a
>     > docuMEnt whErE soME of thE charactErs usE the ChinEse style and the
>     > rEst usE the JapanEsE is fairly siMilar to a docuMEnt where randoM
>     > charactErs have had their casE flippEd. You can parse it, but you
>     > don't *want* to.
> 
> Why isn't that a problem to be solved with markup?

Presumably for the same reasons that character coding is not a problem
to be solved with markup. If you say that a CJK document needs to be
tagged in order to be displayed correctly, the response goes as
follows:

 - every document must be displayed correctly; the imperfect form is
   just too bad
 - therefore tagging is compulsory
 - we're going to tag it as EUC-JP (or whatever); it codes more characters

>     > The unicode "solution" to this is for Chinese users to use Chinese
>     > fonts, Japanese users to use Japanese fonts, and neither to interact
>     > with the other, which quite neatly defeats the point of unicode.
> 
> I thought that the solution was to use a sub-optimal but readable font
> where markup is unavailable (miles' "README test") and to use things
> like markup elsewhere.

It's worse than sub-optimal.

There are about 40,000 kanji in total; most are rarely used. Very few
native Japanese people know them all. If you see something you don't
recognise, you have to go consult a dictionary - you can't guess at
something that looks similar, because lots of kanji look similar (and
if the character is not in the dictionary, you have a problem).

If written in the Chinese or Korean styles, the characters vary from
looking visibly wrong, to looking like a different kanji entirely (it
is relatively clear, once you understand how kanji are formed, whether
or not two glyphs are the same character).

A fairly lousy comparison, but it's not entirely unlike running
everything through b1ff (here's your next two paragraphs):

> 1F I R1TE ABOUT MATH OR PROGRAMMING 1N ASCII. 2 BE CLEAR EITHUR I
> USE TYPOGRAPH1CAL CONVENSHUNZ LIKE `VARIABLE OR I RELY ON A MARKUP
> SYSTEM 2 SET "VAR1ABLE" IN A DIST1NCT1VE FONT.
> 
> WHY IZ CJK DIFFURENT???!   (IT"Z AN OPEN-MINDED QUESSHUN. NOT A
> RHE2RICAL 1.)

You can understand that, probably. You're not going to be remotely
happy if you have to read very much like that; it slows you down and
it's annoying.

> If I write about math or programming in ASCII, to be clear either I
> use typographical conventions like `variable' or I rely on a markup
> system to set "variable" in a distinctive font.
> 
> Why is CJK different?   (It's an open-minded question, not a
> rhetorical one.)

Why is character coding different at all? What's the *point* of
unicode? We already have effective markup systems here - the
Content-Type field in an HTTP session, that specifies the character
encoding used, for example.

This is the problem that unicode was supposed to solve. Saying that
you need to solve it externally raises the question of why you're
bothering with unicode.
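
As a rough illustration of that external labelling, the following Python
sketch reads the charset out of a Content-Type value and guesses a
language from it. The parsing uses the standard library's
`email.message.Message`; the `CHARSET_LANGUAGE` table and the function
names `charset_of` and `language_hint` are assumptions invented for this
example, not anything a browser is known to use. A legacy charset label
yields a language hint, whereas a UTF-8 label yields none.

```python
from email.message import Message

# Hypothetical table: which language a legacy charset label implies.
# UTF-8 is deliberately absent, since it implies no language.
CHARSET_LANGUAGE = {
    "euc-jp": "ja",
    "shift_jis": "ja",
    "gb2312": "zh-Hans",
    "big5": "zh-Hant",
    "euc-kr": "ko",
}


def charset_of(content_type, default="iso-8859-1"):
    """Return the lowercased charset parameter of a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)


def language_hint(content_type):
    """Return a language tag implied by the charset, or None if unknown."""
    return CHARSET_LANGUAGE.get(charset_of(content_type))


if __name__ == "__main__":
    for header in ("text/html; charset=EUC-JP",
                   "text/html; charset=Big5",
                   "text/html; charset=UTF-8"):
        print(header, "->", language_hint(header))
```

A font chooser built on `language_hint` would have something to go on
for the first two headers and nothing for the third, which would have to
get its language from some other tag.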

Turn it around and apply it to the original problem again - how do you
propose marking filenames stored in tla such that the language being
used can be determined and the appropriate glyphs generated? Why is
this better than just marking them with a character set in the first
place?

I don't even want to think about what it would take to make ls display
the right glyphs in a terminal. That's the sort of thing that unicode
was supposed to fix.

-- 
  .''`.  ** Debian GNU/Linux ** | Andrew Suffield
 : :' :  http://www.debian.org/ |
 `. `'                          |
   `-             -><-          |


