Re: XeTeX encoding problem

bug-texinfo

From:	Masamichi HOSODA
Subject:	Re: XeTeX encoding problem
Date:	Sat, 23 Jan 2016 12:06:58 +0900 (JST)

>> Thank you for your comments.
>> I've updated the patch.
>>
>> I want the following.
>>   UTF-8 auxiliary file.
>>   Handling Unicode filename (image files and include files).
>>   Handling Unicode PDF bookmark strings.
> 
> Thanks for working on this. I've had a look at the most recent patch,
> which resolves the category code fixing problem. I see you are using
> native UTF-8 input throughout, but I can't see how this could support
> "@documentencoding ISO-8859-1" (or any other single-byte encoding). I
> think the things you mention above could be supported without using
> native UTF-8 support.

Thank you for reviewing.

Is support for "@documentencoding ISO-8859-1" required in XeTeX and LuaTeX?
If so, I'll improve the patch so that it uses byte-wise input
when "@documentencoding ISO-8859-1" is used.

However, in my humble opinion, if you want ISO-8859-1,
you can use pdfTeX instead of XeTeX/LuaTeX,
or you can convert the document to UTF-8.

I want full Unicode, including CJK characters, not only ISO-8859-1.
With byte-wise input, CJK characters cannot be used.
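
To illustrate the CJK point outside TeX, here is a small Python sketch.
It is not what XeTeX/LuaTeX does internally; it only assumes that
byte-wise input treats each byte as one ISO-8859-1 character. The sample
string is my own choice for illustration.

```python
text = "日本語"                      # three CJK characters (sample string)

utf8 = text.encode("utf-8")          # each of these characters takes 3 bytes
bytewise = utf8.decode("latin-1")    # assumption: one byte becomes one character

print(len(text), len(utf8), len(bytewise))

# None of the byte-wise "characters" is above U+00FF,
# so no CJK code point survives the byte-wise reading.
print(max(ord(c) for c in bytewise) <= 0xFF)

# ISO-8859-1 itself cannot represent the original characters at all.
try:
    text.encode("iso-8859-1")
    representable = True
except UnicodeEncodeError:
    representable = False
print(representable)
```

With byte-wise reading, the three characters turn into nine unrelated
Latin-1 characters, so the engine never sees a CJK code point.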

> I don't see the problem with Unicode filenames: files are named with a
> series of bytes; does this mean that XeTeX (or LuaTeX?) has problems
> accessing files with names which aren't in UTF-8?
> 
> Are PDF bookmarks written out incorrectly also?

If I understand correctly,
the internal encoding of XeTeX/LuaTeX is UTF-16 rather than UTF-8.
XeTeX/LuaTeX converts UTF-8 input to UTF-16 by default.

For example,

"Für" in UTF-8      -> XeTeX/LuaTeX internal UTF-16
0x46 0xC3 0xBC 0x72 -> 0x0046 0x00FC 0x0072

If byte-wise input is used,

"Für" in UTF-8      -> XeTeX/LuaTeX internal UTF-16???
0x46 0xC3 0xBC 0x72 -> 0x0046 0x00C3 0x00BC 0x0072
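
The two cases can be reproduced outside TeX. The following Python sketch
is only an illustration, not the engine's actual code path. It assumes
that byte-wise input behaves like decoding every byte as one ISO-8859-1
character, and the helper name utf16_units is made up for this example.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

raw = "Für".encode("utf-8")          # the bytes as stored in a UTF-8 file

native = raw.decode("utf-8")         # native UTF-8 input
bytewise = raw.decode("latin-1")     # assumption: one byte becomes one character

print("input bytes :", [hex(b) for b in raw])
print("native      :", [hex(u) for u in utf16_units(native)])
print("byte-wise   :", [hex(u) for u in utf16_units(bytewise)])
```

The native reading yields three code units, the byte-wise reading four,
because the two bytes of "ü" are treated as two separate characters.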

On Windows, the native filesystem encoding is UTF-16 rather than UTF-8.
That is, the internal UTF-16 word sequence of XeTeX/LuaTeX
is passed through to Windows as it is.

With native Unicode input, the word sequence 0x0046 0x00FC 0x0072
means "Für", so the filename "Für" can be handled.
With byte-wise input, the word sequence 0x0046 0x00C3 0x00BC 0x0072
does not mean "Für", so the filename "Für" cannot be handled.

Also, PDF bookmarks require UTF-16 for Unicode support.
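
As a rough illustration of the bookmark case: as far as I know, a Unicode
PDF text string is written as UTF-16BE preceded by the byte order mark
0xFE 0xFF. The Python sketch below builds such a string in PDF hex-string
notation. The function name pdf_text_string_hex is hypothetical, and this
is not how texinfo.tex produces bookmarks.

```python
def pdf_text_string_hex(text):
    """Encode text as a UTF-16BE PDF hex string with a leading BOM (FE FF)."""
    data = b"\xfe\xff" + text.encode("utf-16-be")
    return "<" + data.hex().upper() + ">"

native = b"F\xc3\xbcr".decode("utf-8")       # "Für" read as native UTF-8
bytewise = b"F\xc3\xbcr".decode("latin-1")   # assumption: byte-wise reading

print(pdf_text_string_hex(native))
print(pdf_text_string_hex(bytewise))
```

The string built from the byte-wise reading contains four code units
instead of three, so a viewer would show two wrong characters for "ü".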

On the other hand, on Linux, the filesystem encoding may be UTF-8.
In this case, the internal UTF-16 word sequence of XeTeX/LuaTeX
is converted to UTF-8 and passed to the system call.

With native Unicode input, the word sequence 0x0046 0x00FC 0x0072
is converted to the UTF-8 byte sequence 0x46 0xC3 0xBC 0x72.
It means "Für", so the filename "Für" can be handled.

With byte-wise input, the word sequence 0x0046 0x00C3 0x00BC 0x0072
is converted to the byte sequence 0x46 0xC3 0x83 0xC2 0xBC 0x72.
It does not mean "Für", so the filename "Für" cannot be handled.
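
The Linux case is a round trip, and it can be sketched in Python as
follows. Again this only imitates the conversion; it assumes that
byte-wise input equals an ISO-8859-1 reading and that the engine
re-encodes its internal string as UTF-8 before the system call.

```python
on_disk = "Für".encode("utf-8")      # filename bytes as stored on a UTF-8 filesystem

# Native UTF-8 input: decode, then re-encode for the system call.
native_call = on_disk.decode("utf-8").encode("utf-8")

# Byte-wise input (assumed ISO-8859-1 reading), then the same re-encoding.
bytewise_call = on_disk.decode("latin-1").encode("utf-8")

print(native_call == on_disk)        # the name passed matches the file
print(bytewise_call == on_disk)      # the name passed does not match
print([hex(b) for b in bytewise_call])
```

Only the native path gives back the original bytes; the byte-wise path
encodes "ü" twice, so the file is not found.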

> It's useful to give a ChangeLog entry when posting patches to this
> list, because this gives a summary behind what was changed. One thing
> I wondered about was whether \DeclareUnicodeCharacterNativeAtU and
> \DeclareUnicodeCharacterNative needed to be separate macros.

I'll write a ChangeLog entry.

\DeclareUnicodeCharacterNativeAtU is always required,
even when the encoding is not UTF-8.
Even in a US-ASCII document, @U{00FC} etc. can be used.

\DeclareUnicodeCharacterNative is required only when the encoding is UTF-8.
