bug-texinfo

Re: XeTeX encoding problem


From: Masamichi HOSODA
Subject: Re: XeTeX encoding problem
Date: Sat, 16 Jan 2016 02:15:21 +0900 (JST)

>>>>     (something like ``Table of Contents'' broken etc.)
>>>>
>>>> That can be fixed in other ways, without resorting to native UTF-8.
>>>
>>> I agree.
>>
>> In the case of LuaTex, exactly, it can be fixed.
>> In the case of XeTeX, unfortunately,
>> it cannot be fixed if I understand correctly.
> 
> I think it could be done by changing the active definitions of bytes
> 128-256 when writing to an auxiliary file to read a single Unicode
> character and write out an ASCII sequence that represents that
> character, probably involving the @U command. Do you know how to do
> this?

If I understand correctly, the active definitions are unrelated to this problem.
When native Unicode is enabled:

"Für" in UTF-8 ".tex":
    letter -> ".tex"
    F      -> 0x66
    ü      -> 0xC3, 0xBC
    r      -> 0x72

XeTeX reads ".tex" files as native Unicode:
    letter -> ".tex"     -> inner XeTeX
    F      -> 0x66       -> U+0066
    ü      -> 0xC3, 0xBC -> U+00FC
    r      -> 0x72       -> U+0072

XeTeX writes ".toc" files in UTF-8:
    letter -> ".tex"     -> inner XeTeX -> ".toc"
    F      -> 0x66       -> U+0066      -> 0x66
    ü      -> 0xC3, 0xBC -> U+00FC      -> 0xC3, 0xBC
    r      -> 0x72       -> U+0072      -> 0x72

As a result, ".tex" and ".toc" contain the same bytes.
Therefore, the table of contents is not broken.
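
The round trip above can be reproduced outside TeX. The following is only
a rough model in Python, not XeTeX itself: I assume that decoding as UTF-8
stands in for XeTeX reading the file in native Unicode mode, and that
encoding as UTF-8 stands in for XeTeX writing the ".toc" file. The variable
names are made up for this illustration.

```python
# Rough model of native Unicode mode (an assumption, not XeTeX code):
# read the ".tex" bytes as UTF-8, then write the same code points as UTF-8.
tex_bytes = "Für".encode("utf-8")         # bytes stored in the ".tex" file
inner = tex_bytes.decode("utf-8")         # code points held inside XeTeX
toc_bytes = inner.encode("utf-8")         # bytes written to the ".toc" file

print([hex(b) for b in tex_bytes])        # bytes of ".tex"
print([f"U+{ord(c):04X}" for c in inner]) # inner code points
print([hex(b) for b in toc_bytes])        # bytes of ".toc"
print(toc_bytes == tex_bytes)             # the two files agree
```

In this model the bytes written out are identical to the bytes read in.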


On the other hand, with the "bytes" encoding,

XeTeX reads the file as follows:
    letter -> ".tex"     -> inner XeTeX
    F      -> 0x66       -> U+0066
    ü      -> 0xC3, 0xBC -> U+00C3, U+00BC
    r      -> 0x72       -> U+0072

XeTeX *always* writes ".toc" files in UTF-8.
This cannot be changed without something like a \XeTeXoutputencoding primitive:
    letter -> ".tex"     -> inner XeTeX    -> ".toc"
    F      -> 0x66       -> U+0066         -> 0x66
    ü      -> 0xC3, 0xBC -> U+00C3, U+00BC -> 0xC3, 0x83, 0xC2, 0xBC
    r      -> 0x72       -> U+0072         -> 0x72

As a result, ".tex" and ".toc" are different.
Moreover, ".toc" is broken and cannot be repaired.

"0xC3, 0xBC" is replaced with \"u by \DeclareUnicodeCharacter etc.
This is correctly "ü".

However, "0xC3, 0x83" is replaced with \~A and
"0xC2, 0xBC" is replaced with $1\over4$.
This is not "ü".

Therefore, the table of contents is broken.
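
The double encoding can also be reproduced outside TeX. Again this is only
a rough model in Python, not XeTeX itself: I assume that decoding as
Latin-1 stands in for the "bytes" input encoding (each byte becomes the
code point with the same value), and that encoding as UTF-8 stands in for
XeTeX writing the ".toc" file. The variable names are made up for this
illustration.

```python
# Rough model of the "bytes" encoding (an assumption, not XeTeX code):
# every input byte becomes the code point with the same value,
# but the output is still written as UTF-8.
tex_bytes = "Für".encode("utf-8")         # 0x46, 0xC3, 0xBC, 0x72
inner = tex_bytes.decode("latin-1")       # U+0046, U+00C3, U+00BC, U+0072
toc_bytes = inner.encode("utf-8")         # each of U+00C3, U+00BC takes 2 bytes

print([hex(b) for b in toc_bytes])        # bytes of the broken ".toc"

# Interpreting the ".toc" bytes as UTF-8 sequences (as the
# \DeclareUnicodeCharacter definitions would) no longer yields "ü".
seen = toc_bytes.decode("utf-8")
print([f"U+{ord(c):04X}" for c in seen])  # code points found in ".toc"
print(seen == "Für")                      # the original text is lost
```

In this model "ü" comes back as U+00C3 followed by U+00BC, which corresponds
to the \~A and $1\over4$ replacements described above.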

I've posted a feature request for \XeTeXoutputencoding etc.
http://sourceforge.net/p/xetex/feature-requests/22/

>> Yes, CJK fonts are required.
>> For example, if you want to use Japanese characters,
>> I think that it is possible to set the Japanese font in txi-ja.tex.
>> However, if the native Unicode support is disabled,
>> the Japanese characters cannot be used in this way.
> 
> Good idea to put the font loading in the translation files.

Thank you.

Alternatively, it might be good to have font configuration files
such as txi-font-latinmodern.tex, txi-font-computermodern.tex, etc.


