Re: utf-8 cjk translation bug?

From: Kenichi Handa
Subject: Re: utf-8 cjk translation bug?
Date: Tue, 30 Sep 2003 21:59:42 +0900 (JST)
User-agent: SEMI/1.14.3 (Ushinoya) FLIM/1.14.2 (Yagi-Nishiguchi) APEL/10.2 Emacs/21.2.92 (sparc-sun-solaris2.6) MULE/5.0 (SAKAKI)

In article <address@hidden>, Miles Bader <address@hidden> writes:
> I have `utf-translate-cjk-mode' enabled.
> I have the following string in a buffer:

>         NECエレクトロニクス(株)

> If I write it using say `euc-jp' coding system, no problem.  According
> to `C-u C-x =', all the japanese characters are in the charset
> japanese-jisx0208.

> However, if I save it using utf-8, I get no complaints, but when I read
> it back in, the first 3 characters show up as little boxes.  `C-u C-x ='
> shows the boxes as being in charset mule-unicode-e000-ffff; the rest of
> the characters are still listed as being in japanese-jisx0208.

> I presume this is representable utf-8, because unicode is supposed to be
> able to represent all characters in any component character set
> simultaneously, so it would seem to be a bug in utf-translate-cjk-mode.

The first three letters are "FULL WIDTH LATIN ?? LETTER"
(U+FF??).  Yes, they are representable in utf-8.  But, in
subst-jis.el, we have this code:

 (lambda (pair)
   (let ((unicode (car pair))
         (char (cadr pair)))
     ;; exclude non-CJK components from decode table
     (if (and (>= unicode #x2e80) (<= unicode #xd7a3))
         (puthash unicode  char ucs-unicode-to-mule-cjk))
     (puthash char unicode ucs-mule-cjk-to-unicode)))

So, #xFF?? are excluded from ucs-unicode-to-mule-cjk, thus
they are not translated to japanese-jisx0208 on decoding.
If you have an ISO10646-1 font that contains full width
glyphs for those characters, you can see correct glyphs.
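
To make the range test concrete, here is a minimal sketch. It assumes the
three characters are U+FF2E, U+FF25 and U+FF23 (full-width N, E, C), and
`my-in-cjk-decode-range-p' is a hypothetical helper written for this
illustration, not a function in subst-jis.el:

```elisp
;; Hypothetical helper that mirrors the range test quoted above.
(defun my-in-cjk-decode-range-p (unicode)
  "Return non-nil if UNICODE passes the quoted decode-table range test."
  (and (>= unicode #x2e80) (<= unicode #xd7a3)))

;; Assumed code points: full-width N, E, C, then KATAKANA LETTER E (U+30A8).
(mapcar #'my-in-cjk-decode-range-p '(#xff2e #xff25 #xff23 #x30a8))
;; => (nil nil nil t)
```

The full-width letters lie above #xd7a3, so they never enter the decode
table, while the katakana falls inside the range.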

I think the reason why they are excluded from the
translation is that they are representable by the charset
mule-unicode-e000-ffff, so there is no need for translation.
It seems to be a reasonable decision, but considering that
most users don't have an ISO10646-1 font containing those
glyphs, and that those characters can also be regarded as
CJK components (only CJK users use them), I think we had
better not exclude them from the translation.

So, I suggest changing the above line (and similar lines in
the other subst-XXX.el) to:

     (if (>= unicode #x2e80)
         (puthash unicode  char ucs-unicode-to-mule-cjk))

and modifying ccl-decode-mule-utf-8 so that it also checks
the translation for those characters.
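
As a rough sketch of the effect of the relaxed test, the following builds
two toy tables from a short list of pairs. Everything here is an
assumption made for illustration: `my-build-cjk-tables' is a hypothetical
function, the tables are local stand-ins for ucs-unicode-to-mule-cjk and
ucs-mule-cjk-to-unicode, and plain JIS X 0208 code numbers stand in for
real Mule characters:

```elisp
;; Hypothetical sketch: build toy decode/encode tables with the proposed test.
(defun my-build-cjk-tables (pairs)
  "Return (DECODE . ENCODE) hash tables built from PAIRS.
Each element of PAIRS is a list (UNICODE CHAR)."
  (let ((decode (make-hash-table))
        (encode (make-hash-table)))
    (mapc (lambda (pair)
            (let ((unicode (car pair))
                  (char (cadr pair)))
              ;; Proposed test: only the lower bound remains.
              (if (>= unicode #x2e80)
                  (puthash unicode char decode))
              ;; The encode direction is unconditional, as before.
              (puthash char unicode encode)))
          pairs)
    (cons decode encode)))

;; Toy data: CHAR is a JIS X 0208 code, not a real Mule character.
(let ((tables (my-build-cjk-tables
               '((#xff2e #x234e)       ; full-width N
                 (#x30a8 #x2528)       ; katakana E
                 (#x00d7 #x215f)))))   ; multiplication sign
  (list (gethash #xff2e (car tables))   ; now present in the decode table
        (gethash #x00d7 (car tables))   ; still excluded (below #x2e80)
        (gethash #x215f (cdr tables)))) ; encode direction unaffected
```

With the upper bound removed, the full-width entry appears in the decode
table, while non-CJK entries below #x2e80 are still left out.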

Dave, what do you think?  Would such a change lead to any
problems?  Is there anything else we should change?

Ken'ichi HANDA
