Re: XeTeX encoding problem

bug-texinfo

From:	Masamichi HOSODA
Subject:	Re: XeTeX encoding problem
Date:	Sun, 24 Jan 2016 03:15:07 +0900 (JST)

>> In XeTeX and LuaTeX, is "@documentencoding ISO-8859-1" support required?
>> If so, I'll improve the patch.
>> It will use byte-wise input when "@documentencoding ISO-8859-1" is used.
>>
>> However, if you want ISO-8859-1,
>> you can use pdfTeX instead of XeTeX/LuaTeX or you can convert to UTF-8,
>> in my humble opinion.
> 
> It would be inconvenient to remember to use pdfTeX whenever you had to
> process a Texinfo document in ISO-8859-1. We should process
> byte-by-byte for an encoding like that, using the existing code in
> texinfo.tex to do so. It isn't perfect, as you say: for example, it
> looks like we couldn't include another Texinfo file the filename of
> which was in a single-byte encoding, but that's better than breaking
> it altogether.

Thank you for your comments.
I've improved the patch so that ISO-8859-1 can be used with XeTeX/LuaTeX.
The ChangeLog is below.

>> I want Unicode which contains CJK characters. Not only ISO-8859-1.
>> In byte-wise input, CJK characters can not be used.
> 
> Have you ever got the CJK characters to work in a Texinfo file with
> XeTeX or LuaTeX? If so, maybe we should conditionally load the fonts
> that you got to work. Can you satisfactorily typeset Japanese text
> with XeTeX without the use of LaTeX packages? If not, it very likely
> won't be practical to implement special rules for typesetting Japanese
> in Texinfo itself.

Not yet. I want to try it.
I'm going to use LuaTeX-ja.

https://osdn.jp/projects/luatex-ja/wiki/FrontPage%28en%29

It can set Japanese fonts separately from alphabetic font settings.
It also has special rules for typesetting Japanese.
It does not require LaTeX.

With XeTeX, on the other hand, it is difficult.
However, plain XeTeX can at least handle Japanese characters and fonts.

>>> I don't see the problem with Unicode filenames: files are named with a
>>> series of bytes; does this mean that XeTeX (or LuaTeX?) has problems
>>> accessing files with names which aren't in UTF-8?
>>
>> In native Unicode, the character sequence 0x0066 0x00FC 0x0072
>> is converted to the UTF-8 byte sequence 0x66 0xC3 0xBC 0x72.
>> This means "für", so the filename "für" can be handled.
>>
>> In byte-wise input, the character sequence 0x0066 0x00C3 0x00BC 0x0072
>> is converted to the byte sequence 0x66 0xC3 0x83 0xC2 0xBC 0x72.
>> This does not mean "für", so the filename "für" cannot be handled.
> 
> Thank you for the thorough explanation; it appears that the native
> support for reading files by UTF-8 sequence (instead of by byte) needs
> to be used for opening files with non-ASCII filenames.

Exactly.
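
To make the byte sequences quoted above concrete, here is a minimal
Python sketch. It is not part of the patch, and the variable names are
only for illustration. It assumes that byte-wise input behaves like
reading each byte as the code point of the same value (that is, like a
Latin-1 decode).

```python
# Illustration only: not part of texinfo.tex.
name = "f\u00fcr"  # "für"

# Native Unicode: the code points U+0066 U+00FC U+0072 are written as UTF-8.
native = name.encode("utf-8")

# Byte-wise input: each UTF-8 byte is read as its own code point
# (U+0066 U+00C3 U+00BC U+0072), and those are written as UTF-8 again.
bytewise = native.decode("latin-1").encode("utf-8")

print(native.hex(" "))    # 66 c3 bc 72
print(bytewise.hex(" "))  # 66 c3 83 c2 bc 72
```

The second sequence no longer names the file "für", which is why the
file cannot be opened in byte-wise mode.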


ChangeLog:

Add native Unicode support for XeTeX and LuaTeX

2016-01-XX  Masamichi Hosoda  <address@hidden>

        * doc/texinfo.tex:
        Add native Unicode support for XeTeX and LuaTeX.

        (\iftxinativeunicodecapable): New switch.
        (\iftxiusebytewiseio): New switch.

        (\setbytewiseio): Set I/O by bytes instead of by UTF-8 sequence
        for non-UTF-8 (byte-wise) encodings in XeTeX and LuaTeX.

        (\documentencoding): Remove input by bytes settings for XeTeX.
        Add I/O by bytes settings for single-byte encodings.
        Add native Unicode settings for UTF-8 encoding.

        (\U): With native Unicode, accept any Unicode character.

        (\DeclareUnicodeCharacterUTFviii): Rename from
        \DeclareUnicodeCharacter.
        (\DeclareUnicodeCharacterNative): New macro for native Unicode;
        defines a replacement for the Unicode character.
        (\DeclareUnicodeCharacterNativeThru): New macro for native Unicode;
        passes the Unicode character through without replacing it.
        (\DeclareUnicodeCharacterNativeAtU): New macro for native Unicode;
        defines the character for the @U command.

        (\unicodechardefs): Rename from \utfeightchardefs.
        (\utfeightchardefs): UTF-8 byte sequence definitions (replacement
        and @U command).  Set up replacement of UTF-8 byte sequences.
        (\nativeunicodechardefs): Native Unicode replacement definitions.
        Set up replacement of Unicode characters.
        (\nativeunicodechardefsthru): Native Unicode ``through''
        definitions.  Set up Unicode characters to pass through
        without replacement.
        (\nativeunicodechardefsatu): Native Unicode @U command definitions.

        (\throughcharactersdefs): Character ``through'' definitions.
        Set up characters to pass through without replacement.
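
As a rough summary of how the two new switches interact, the following
Python sketch models the decision. It is a simplification and not part
of the patch: the function name io_mode and the string values are made
up for illustration, and the set of single-byte encoding names assumes
that \latone, \lattwo and \latnine stand for ISO-8859-1, ISO-8859-2 and
ISO-8859-15.

```python
# Illustration only: a simplified model of the decision logic in the patch.
NATIVE_ENGINES = {"xetex", "luatex"}                       # hypothetical labels
SINGLE_BYTE = {"ISO-8859-1", "ISO-8859-2", "ISO-8859-15"}  # assumed names

def io_mode(engine: str, encoding: str) -> str:
    """Return "native" or "bytewise" for an engine and @documentencoding."""
    native_capable = engine.lower() in NATIVE_ENGINES
    if not native_capable:
        # pdfTeX and other engines always read and write bytes.
        return "bytewise"
    if encoding in SINGLE_BYTE:
        # Corresponds to calling \setbytewiseio.
        return "bytewise"
    # UTF-8 (and US-ASCII) keep the engine's native Unicode I/O.
    return "native"

print(io_mode("pdftex", "UTF-8"))
print(io_mode("xetex", "ISO-8859-1"))
print(io_mode("luatex", "UTF-8"))
```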

--- texinfo.tex.org     2016-01-21 23:04:22.405562200 +0900
+++ texinfo.tex 2016-01-24 02:20:37.523179700 +0900
@@ -9433,43 +9433,68 @@
   \global\righthyphenmin = #3\relax
 }
 
-% Get input by bytes instead of by UTF-8 codepoints for XeTeX and LuaTeX, 
-% otherwise the encoding support is completely broken.
+% XeTeX and LuaTeX can handle native Unicode.
+% Their default I/O is UTF-8 sequence instead of byte-wise.
+% Other TeX engine (pdfTeX etc.) I/O is byte-wise.
+%
+\newif\iftxinativeunicodecapable
+\newif\iftxiusebytewiseio
+
 \ifx\XeTeXrevision\thisisundefined
+  \ifx\luatexversion\thisisundefined
+    \txinativeunicodecapablefalse
+    \txiusebytewiseiotrue
+  \else
+    \txinativeunicodecapabletrue
+    \txiusebytewiseiofalse
+  \fi
 \else
-\XeTeXdefaultencoding "bytes"  % For subsequent files to be read
-\XeTeXinputencoding "bytes"  % Effective in texinfo.tex only
-% Unfortunately, there seems to be no corresponding XeTeX command for
-% output encoding.  This is a problem for auxiliary index and TOC files.
-% The only solution would be perhaps to write out @U{...} sequences in
-% place of UTF-8 characters.
+  \txinativeunicodecapabletrue
+  \txiusebytewiseiofalse
 \fi
 
-\ifx\luatexversion\thisisundefined
-\else
-\directlua{
-local utf8_char, byte, gsub = unicode.utf8.char, string.byte, string.gsub
-local function convert_char (char)
-  return utf8_char(byte(char))
-end
-
-local function convert_line (line)
-  return gsub(line, ".", convert_char)
-end
-
-callback.register("process_input_buffer", convert_line)
-
-local function convert_line_out (line)
-  local line_out = ""
-  for c in string.utfvalues(line) do
-     line_out = line_out .. string.char(c)
-  end
-  return line_out
-end
+% Set I/O by bytes instead of UTF-8 sequence for XeTeX and LuaTeX
+% for non-UTF-8 (byte-wise) encodings.
+%
+\def\setbytewiseio{%
+  \ifx\XeTeXrevision\thisisundefined
+  \else
+    \XeTeXdefaultencoding "bytes"  % For subsequent files to be read
+    \XeTeXinputencoding "bytes"  % For document root file
+    % Unfortunately, there seems to be no corresponding XeTeX command for
+    % output encoding.  This is a problem for auxiliary index and TOC files.
+    % The only solution would be perhaps to write out @U{...} sequences in
+    % place of non-ASCII characters.
+  \fi
 
-callback.register("process_output_buffer", convert_line_out)
+  \ifx\luatexversion\thisisundefined
+  \else
+    \directlua{
+    local utf8_char, byte, gsub = unicode.utf8.char, string.byte, string.gsub
+    local function convert_char (char)
+      return utf8_char(byte(char))
+    end
+
+    local function convert_line (line)
+      return gsub(line, ".", convert_char)
+    end
+
+    callback.register("process_input_buffer", convert_line)
+
+    local function convert_line_out (line)
+      local line_out = ""
+      for c in string.utfvalues(line) do
+         line_out = line_out .. string.char(c)
+      end
+      return line_out
+    end
+
+    callback.register("process_output_buffer", convert_line_out)
+    }
+  \fi
+
+  \txiusebytewiseiotrue
 }
-\fi
 
 
 % Helpers for encodings.
@@ -9496,13 +9521,6 @@
 %
 \def\documentencoding{\parseargusing\filenamecatcodes\documentencodingzzz}
 \def\documentencodingzzz#1{%
-  % Get input by bytes instead of by UTF-8 codepoints for XeTeX,
-  % otherwise the encoding support is completely broken.
-  % This settings is for the document root file.
-  \ifx\XeTeXrevision\thisisundefined
-  \else
-    \XeTeXinputencoding "bytes"
-  \fi
   %
   % Encoding being declared for the document.
   \def\declaredencoding{\csname #1.enc\endcsname}%
@@ -9519,22 +9537,37 @@
      \asciichardefs
   %
   \else \ifx \declaredencoding \lattwo
+     \iftxinativeunicodecapable
+       \setbytewiseio
+     \fi
      \setnonasciicharscatcode\active
      \lattwochardefs
   %
   \else \ifx \declaredencoding \latone
+     \iftxinativeunicodecapable
+       \setbytewiseio
+     \fi
      \setnonasciicharscatcode\active
      \latonechardefs
   %
   \else \ifx \declaredencoding \latnine
+     \iftxinativeunicodecapable
+       \setbytewiseio
+     \fi
      \setnonasciicharscatcode\active
      \latninechardefs
   %
   \else \ifx \declaredencoding \utfeight
-     \setnonasciicharscatcode\active
-     % since we already invoked \utfeightchardefs at the top level
-     % (below), do not re-invoke it, then our check for duplicated
-     % definitions triggers.  Making non-ascii chars active is enough.
+     \iftxinativeunicodecapable
+       % For native Unicode (XeTeX and LuaTeX)
+       \nativeunicodechardefs
+     \else
+       % For UTF-8 byte sequence (pdfTeX)
+       \setnonasciicharscatcode\active
+       % since we already invoked \utfeightchardefs at the top level
+       % (below), do not re-invoke it, then our check for duplicated
+       % definitions triggers.  Making non-ascii chars active is enough.
+     \fi
   %
   \else
     \message{Ignoring unknown document encoding: #1.}%
@@ -9849,13 +9882,26 @@
 % @U{xxxx} to produce U+xxxx, if we support it.
 \def\U#1{%
   \expandafter\ifx\csname uni:#1\endcsname \relax
-    \errhelp = \EMsimple       
-    \errmessage{Unicode character U+#1 not supported, sorry}%
+    \iftxinativeunicodecapable
+      % With native Unicode, any Unicode character can be used.
+      % However, if the font lacks the glyph, the character will be missing.
+      \begingroup
+        \uccode`\.="#1\relax
+        \uppercase{.}
+      \endgroup
+    \else
+      \errhelp = \EMsimple     
+      \errmessage{Unicode character U+#1 not supported, sorry}%
+    \fi
   \else
     \csname uni:#1\endcsname
   \fi
 }
 
+% For UTF-8 byte sequence (pdfTeX)
+% Definition macro to replace the Unicode character
+% Definition macro that is used by @U command
+%
 \begingroup
   \catcode`\"=12
   \catcode`\<=12
@@ -9864,7 +9910,7 @@
   \catcode`\;=12
   \catcode`\!=12
   \catcode`\~=13
-  \gdef\DeclareUnicodeCharacter#1#2{%
+  \gdef\DeclareUnicodeCharacterUTFviii#1#2{%
     \countUTFz = "#1\relax
     %\wlog{\space\space defining Unicode char U+#1 (decimal \the\countUTFz)}%
     \begingroup
@@ -9922,6 +9968,37 @@
     \uppercase{\gdef\UTFviiiTmp{#2#3#4}}}
 \endgroup
 
+% For native Unicode (XeTeX and LuaTeX)
+% Definition macro to replace the Unicode character
+%
+\def\DeclareUnicodeCharacterNative#1#2{%
+  \catcode"#1=\active
+  \begingroup
+    \uccode`\~="#1\relax
+    \uppercase{\gdef~}{#2}%
+  \endgroup}
+
+% For native Unicode (XeTeX and LuaTeX)
+% Definition macro not to replace (through) the Unicode character
+%
+\def\DeclareUnicodeCharacterNativeThru#1#2{%
+  \catcode"#1=\active
+  \begingroup
+    \uccode`\.="#1\relax
+    \uppercase{\endgroup \def\UTFNativeTmp{.}}%
+  \begingroup
+    \uccode`\~="#1\relax
+    \uppercase{\endgroup \edef~}{\UTFNativeTmp}%
+}
+
+% For native Unicode (XeTeX and LuaTeX)
+% Definition macro that is used by @U command
+%
+\def\DeclareUnicodeCharacterNativeAtU#1#2{%
+  \def\UTFAtUTmp{#2}
+  \expandafter\globallet\csname uni:#1\endcsname \UTFAtUTmp
+}
+
 % https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_M
 % U+0000..U+007F = https://en.wikipedia.org/wiki/Basic_Latin_(Unicode_block)
 % U+0080..U+00FF = https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block)
@@ -9936,7 +10013,7 @@
 % We won't be doing that here in this simple file.  But we can try to at
 % least make most of the characters not bomb out.
 %
-\def\utfeightchardefs{%
+\def\unicodechardefs{%
   \DeclareUnicodeCharacter{00A0}{\tie}
   \DeclareUnicodeCharacter{00A1}{\exclamdown}
   \DeclareUnicodeCharacter{00A2}{{\tcfont \char162}}% 0242=cent
@@ -10606,14 +10683,42 @@
 
   \global\mathchardef\checkmark="1370 % actually the square root sign
   \DeclareUnicodeCharacter{2713}{\ensuremath\checkmark}
-}% end of \utfeightchardefs
+}% end of \unicodechardefs
+
+% UTF-8 byte sequence (pdfTeX) definitions (replacing and @U command)
+% It makes the setting that replace UTF-8 byte sequence.
+\def\utfeightchardefs{%
+  \let\DeclareUnicodeCharacter\DeclareUnicodeCharacterUTFviii
+  \unicodechardefs
+}
+
+% Native Unicode (XeTeX and LuaTeX) character replacing definitions
+% It makes the setting that replace the Unicode characters.
+\def\nativeunicodechardefs{%
+  \let\DeclareUnicodeCharacter\DeclareUnicodeCharacterNative
+  \unicodechardefs
+}
+
+% Native Unicode (XeTeX and LuaTeX) character ``through'' definitions
+% It makes the setting that does not replace the Unicode characters.
+\def\nativeunicodechardefsthru{%
+  \let\DeclareUnicodeCharacter\DeclareUnicodeCharacterNativeThru
+  \unicodechardefs
+}
+
+% Native Unicode (XeTeX and LuaTeX) @U command definitions
+\def\nativeunicodechardefsatu{%
+  \let\DeclareUnicodeCharacter\DeclareUnicodeCharacterNativeAtU
+  \unicodechardefs
+}
 
 % US-ASCII character definitions.
 \def\asciichardefs{% nothing need be done
    \relax
 }
 
-% Latin1 (ISO-8859-1) character definitions.
+% Non-ASCII bytes ``through'' definitions.
+% It makes the setting that does not replace the non-ASCII byte.
 \def\nonasciistringdefs{%
   \setnonasciicharscatcode\active
   \def\defstringchar##1{\def##1{\string##1}}%
@@ -10659,9 +10764,23 @@
   \defstringchar^^fc\defstringchar^^fd\defstringchar^^fe\defstringchar^^ff%
 }
 
+% Character ``through'' definitions.
+% It makes the setting that does not replace the characters.
+\def\throughcharactersdefs{%
+  \iftxiusebytewiseio
+    \nonasciistringdefs
+  \else
+    \nativeunicodechardefsthru
+  \fi
+}
+
 
 % define all the unicode characters we know about, for the sake of @U.
-\utfeightchardefs
+\iftxinativeunicodecapable
+  \nativeunicodechardefsatu
+\else
+  \utfeightchardefs
+\fi
 
 
 % Make non-ASCII characters printable again for compatibility with
@@ -11010,7 +11129,7 @@
 %
 address@hidden = @active
  @address@hidden
-   @nonasciistringdefs
+   @throughcharactersdefs
    @address@hidden
    @let"address@hidden
    @address@hidden %$ font-lock fix
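
For readers who do not use Lua, the two LuaTeX callbacks registered in
\setbytewiseio in the patch above can be approximated in Python as
follows. This is only a sketch for explanation: the function names
mirror the Lua ones, but the code is not part of the patch and ignores
everything LuaTeX does around the callbacks.

```python
# Illustration only: Python approximations of the two Lua callbacks.

def convert_line(raw: bytes) -> str:
    """Input side: treat every input byte as the code point of the same value."""
    return raw.decode("latin-1")

def convert_line_out(line: str) -> bytes:
    """Output side: write every code point back as a single byte."""
    # Raises ValueError for code points above U+00FF, which cannot
    # be represented as one byte.
    return bytes(ord(ch) for ch in line)

latin1_line = b"f\xfcr"  # "für" in ISO-8859-1
assert convert_line(latin1_line) == "f\u00fcr"
assert convert_line_out(convert_line(latin1_line)) == latin1_line
```

The round trip shows why auxiliary files written in this mode keep the
same single-byte encoding as the input.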

% -*- coding: utf-8 -*-

\input texinfo.tex

@documentencoding UTF-8

@contents

@chapter für

für

@U{00FC}: U+00FC supported by XeTeX, LuaTeX and pdfTeX

@U{0132}: U+0132 supported by XeTeX, LuaTeX and pdfTeX

@U{0041}: U+0041 supported by XeTeX and LuaTeX only

@bye

% -*- coding: us-ascii -*-

\input texinfo.tex

@documentencoding US-ASCII

@contents

@chapter address@hidden

address@hidden

@U{00FC}: U+00FC supported by XeTeX, LuaTeX and pdfTeX

@U{0132}: U+0132 supported by XeTeX, LuaTeX and pdfTeX

@U{0041}: U+0041 supported by XeTeX and LuaTeX only

@bye

% -*- coding: iso-8859-1 -*-

\input texinfo.tex

@documentencoding ISO-8859-1

@contents

@chapter für

für

@U{00FC}: U+00FC supported by XeTeX, LuaTeX and pdfTeX

@U{0132}: U+0132 supported by XeTeX, LuaTeX and pdfTeX

@U{0041}: U+0041 supported by XeTeX and LuaTeX only

@bye
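
The three @U lines in each test file exercise the three paths through
the patched \U macro. The Python sketch below is a simplified model of
that lookup, not the macro itself: the name at_u is made up, and the
DECLARED table only stands in for the \DeclareUnicodeCharacter list
with placeholder values.

```python
# Illustration only: a simplified model of how @U{xxxx} is resolved.
# The table stands in for the \DeclareUnicodeCharacter list; its two
# entries are placeholders, not the real definitions.
DECLARED = {"00FC": "<declared U+00FC>", "0132": "<declared U+0132>"}

def at_u(hex_code: str, native_capable: bool) -> str:
    if hex_code in DECLARED:
        # A declared character is available on every engine.
        return DECLARED[hex_code]
    if native_capable:
        # XeTeX/LuaTeX: fall back to the raw Unicode character.
        return chr(int(hex_code, 16))
    # pdfTeX: no definition and no native fallback.
    raise ValueError(f"Unicode character U+{hex_code} not supported, sorry")

print(at_u("00FC", native_capable=False))
print(at_u("0041", native_capable=True))
```

With native_capable=False, at_u("0041", ...) raises an error, which
matches the "XeTeX and LuaTeX only" line in the test files.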
