Re: Texinfo Windows patch: Non-ASCII text output

bug-texinfo

From:	Eli Zaretskii
Subject:	Re: Texinfo Windows patch: Non-ASCII text output
Date:	Thu, 25 Dec 2014 18:00:18 +0200

> Date: Fri, 07 Nov 2014 17:55:36 +1000
> From: Jason Hood <address@hidden>
> 
> * improves screen output (faster, correctly displays both UTF-8 and
>   latin1 files).

This part of the patch was the most complex to deal with.  Its main
part deals with displaying non-ASCII characters on the Windows
console.  The original patch solved this by converting the text to
UTF-16 encoded Unicode characters, leaving it to the Windows console
to cope with any unsupported characters as best it could under the
current console codepage.

However, since the patch was written (for Texinfo 5.2), there has been
significant development in this area on the trunk.  As a result, the
Info reader now uses libiconv to convert the file's encoding to the
screen output encoding, and automatically replaces any characters
unsupported by the output encoding with their ASCII equivalents.

Given these changes, IMO it no longer makes sense to rely on Windows
to convert unsupported characters.  Therefore, the conversion to
UTF-16 and subsequent use of "wide" APIs to write to the console is no
longer needed -- with one notable exception: when the output encoding,
as determined by the current console codepage, is UTF-8 or UTF-7.  For
these 2 encodings, the _only_ way of delivering text to the Windows
console is by using the "wide" APIs, which accept UTF-16 encoded text.

The patch I propose below implements this logic.  If it detects one of
these two UTF codepages, it converts the text to UTF-16 and sends it
to the screen using the WriteConsoleW API; otherwise it uses the
normal console I/O routines to write the text produced by libiconv
without changes.  The patch also handles the calls to
terminal_write_chars, which Jason's patch left out.
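
As an illustration only, the dispatch described above can be reduced
to a small classification step.  This is a sketch, not code from the
patch: the names choose_writer, WRITER_CONIO and WRITER_WIDE_API are
hypothetical, and the two constants spell out the numeric values of
the Windows CP_UTF7 and CP_UTF8 macros so that the fragment compiles
on any platform.

```c
/* Numeric values of the Windows CP_UTF7 and CP_UTF8 macros, repeated
   here so that the sketch does not need <windows.h>.  */
#define SKETCH_CP_UTF7 65000u
#define SKETCH_CP_UTF8 65001u

/* Hypothetical names for the two output paths.  */
enum console_writer
{
  WRITER_CONIO,     /* pass the bytes to the normal console routines */
  WRITER_WIDE_API   /* convert to UTF-16 and use the "wide" API */
};

/* Pick the output path for the given console output codepage.  Only
   the two UTF codepages need the wide API; every other codepage gets
   the text produced by the encoding conversion unchanged.  */
enum console_writer
choose_writer (unsigned codepage)
{
  if (codepage == SKETCH_CP_UTF8 || codepage == SKETCH_CP_UTF7)
    return WRITER_WIDE_API;
  return WRITER_CONIO;
}
```

A caller would evaluate choose_writer once per write, using the
codepage recorded at terminal initialization, and branch on the
result.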

Let me now describe two important points about this patch:

The patch needs an accurate method of determining the codepage for
console output.  Info uses 'nl_langinfo (CODESET)' to get the target
encoding, but that is not good enough on Windows, because Windows uses
3 codepages at the same time: one for input to GUI programs, which is
also used for encoding file names, and 2 more for console input and
output.  Gnulib's nl_langinfo returns the first of these 3, whereas we
need specifically the console output codepage.  Therefore, the patch
below wraps nl_langinfo in a short Windows-specific function that only
knows about CODESET, and which calls the appropriate Windows API to
return the codepage used for console output.
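
To make the distinction between the three codepages concrete, here is
a minimal sketch of a function that reports the encoding of screen
output.  The name screen_codeset is hypothetical; on Windows the
sketch calls GetConsoleOutputCP, and elsewhere it assumes that plain
nl_langinfo (CODESET) is adequate.

```c
#include <stdio.h>
#ifdef _WIN32
# include <windows.h>
#else
# include <langinfo.h>
#endif

/* Return the name of the encoding to use for screen output.  */
const char *
screen_codeset (void)
{
#ifdef _WIN32
  static char buf[32];

  /* Windows keeps three codepages at once:
       GetACP ()              ANSI codepage (GUI input, file names)
       GetConsoleCP ()        console input
       GetConsoleOutputCP ()  console output
     Only the last one describes what the screen can display.  */
  snprintf (buf, sizeof buf, "CP%u", GetConsoleOutputCP ());
  return buf;
#else
  /* Assumption: on other systems the locale's codeset is also the
     terminal's encoding.  */
  return nl_langinfo (CODESET);
#endif
}
```

The result is a string such as "CP437" on Windows, which is the form
of name that can be handed to an encoding converter.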

In addition, since support of Unicode characters by the fonts
available for the Windows console is limited, the patch below appends
"//TRANSLIT" to the output encoding produced by nl_langinfo.  (I think
this feature can be useful on platforms other than Windows, but it's a
GNU libiconv extension, and I don't know how to detect whether we are
using GNU libiconv.)
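
One possible way around the detection problem, offered only as a
sketch, is to probe at run time: ask iconv_open for the target with
"//TRANSLIT" appended, and retry with the plain name if that fails.
The function names open_with_translit and convert_text are
hypothetical.  Note that an implementation may accept the suffix and
then ignore it, so a successful open does not prove that
transliteration will actually happen.

```c
#include <iconv.h>
#include <stdio.h>
#include <string.h>

/* Open a converter to TO_CODE, preferring transliteration when the
   iconv implementation accepts the "//TRANSLIT" suffix.  */
iconv_t
open_with_translit (const char *to_code, const char *from_code)
{
  char target[128];
  iconv_t cd;

  snprintf (target, sizeof target, "%s//TRANSLIT", to_code);
  cd = iconv_open (target, from_code);
  if (cd == (iconv_t) -1)
    cd = iconv_open (to_code, from_code);  /* plain fallback */
  return cd;
}

/* Convert the NUL-terminated string IN into OUT (OUT_SIZE bytes).
   Return 0 on success, -1 on failure.  */
int
convert_text (const char *to_code, const char *from_code,
              const char *in, char *out, size_t out_size)
{
  iconv_t cd;
  char *inp = (char *) in;      /* iconv does not modify the input */
  size_t inleft = strlen (in);
  char *outp = out;
  size_t outleft, rc;

  if (out_size == 0)
    return -1;
  cd = open_with_translit (to_code, from_code);
  if (cd == (iconv_t) -1)
    return -1;

  outleft = out_size - 1;       /* keep room for the terminator */
  rc = iconv (cd, &inp, &inleft, &outp, &outleft);
  iconv_close (cd);
  if (rc == (size_t) -1)
    return -1;
  *outp = '\0';
  return 0;
}
```

What a non-ASCII character turns into depends on the iconv
implementation, so callers should not rely on any particular
replacement text.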

While at it, I also added a few more "degrading" ASCII replacements
for several Unicode characters which are widely used in GNU
documentation.  This part of the patch is not Windows-specific.
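
The effect of such replacement entries can be shown with a small
stand-alone lookup.  This is a sketch only: the table below copies
the three new entries and two existing ones from the patch, but the
struct and the function lookup_degraded are hypothetical and do not
reproduce how degrade_utf8 in info-utils.c is written.

```c
#include <stddef.h>
#include <string.h>

struct degrade_entry
{
  const char *utf8;   /* UTF-8 byte sequence of the character */
  const char *ascii;  /* ASCII text shown in its place */
};

static const struct degrade_entry degrade_table[] = {
  {"\xE2\x86\x92", "->"},   /* U+2192, right arrow */
  {"\xE2\x87\x92", "=>"},   /* U+21D2, right double arrow */
  {"\xE2\x8A\xA3", "-|"},   /* U+22A3, print symbol */
  {"\xE2\x98\x85", "-!-"},  /* U+2605, point symbol */
  {"\xE2\x86\xA6", "==>"},  /* U+21A6, expansion symbol */
};

/* If TEXT starts with a sequence from the table, return its ASCII
   replacement and store the number of input bytes it covers in
   *CONSUMED.  Otherwise return NULL and leave *CONSUMED alone.  */
const char *
lookup_degraded (const char *text, size_t *consumed)
{
  size_t i;

  for (i = 0; i < sizeof degrade_table / sizeof degrade_table[0]; i++)
    {
      size_t len = strlen (degrade_table[i].utf8);

      if (strncmp (text, degrade_table[i].utf8, len) == 0)
        {
          *consumed = len;
          return degrade_table[i].ascii;
        }
    }
  return NULL;
}
```

A caller would copy the returned replacement to the output and
advance the input by *CONSUMED bytes; text that matches no entry is
left for other handling.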

I viewed several Info manuals with this patch, and was generally
pleased with the results: what formerly was utterly illegible is now
perfectly readable, even when UTF-8 is not supported by the console.

Here's the patch; OK to commit?

--- info/info-utils.c~0 2014-12-23 21:51:58 +0200
+++ info/info-utils.c   2014-12-24 16:22:30 +0200
@@ -27,6 +27,10 @@
 #include <langinfo.h>
 #if HAVE_ICONV
 # include <iconv.h>
+#ifdef __MINGW32__
+# define nl_langinfo rpl_nl_langinfo
+extern char * rpl_nl_langinfo (nl_item);
+#endif
 #endif
 #include <wchar.h>
 
@@ -758,6 +764,9 @@ degrade_utf8 (char **from, size_t *from_
 
     {"\xE2\x86\x92","->"},/* Right arrow */
     {"\xE2\x87\x92","=>"},/* Right double arrow */
+    {"\xE2\x8A\xA3","-|"},/* Print symbol */
+    {"\xE2\x98\x85","-!-"}, /* Point symbol */
+    {"\xE2\x86\xA6","==>"}, /* Expansion symbol */
 
     {"\xE2\x80\x90","-"},  /* Hyphen */
     {"\xE2\x80\x91","-"},  /* Non-breaking hyphen */


--- info/pcterm.c~0     2014-12-23 21:51:59 +0200
+++ info/pcterm.c       2014-12-24 16:11:06 +0200
@@ -39,6 +39,8 @@
 #include <io.h>
 #include <conio.h>
 #include <process.h>
+#include <malloc.h>    /* for alloca */
+#define WIN32_LEAN_AND_MEAN
 #include <windows.h>
 
 struct text_info {
@@ -587,6 +611,57 @@ w32_read (int fd, void *buf, size_t n)
     return _read (fd, buf, n);
 }
 
+/* Write to the console a string of text encoded in UTF-8 or UTF-7.  */
+static void
+write_utf (DWORD cp, const char *text, int nbytes)
+{
+  /* MSDN says UTF-7 requires zero in flags.  */
+  DWORD flags = (cp == CP_UTF7) ? 0 : MB_ERR_INVALID_CHARS;
+  /* How much space do we need for wide characters?  */
+  int wlen = MultiByteToWideChar (cp, flags, text, nbytes, NULL, 0);
+
+  if (wlen)
+    {
+      WCHAR *text_w = alloca (wlen * sizeof (WCHAR));
+      DWORD written;
+
+      if (MultiByteToWideChar (cp, flags, text, nbytes, text_w, wlen) > 0)
+       {
+         WriteConsoleW (hscreen, text_w, nbytes < 0 ? wlen - 1 : wlen, &written, NULL);
+         return;
+       }
+    }
+  /* Fall back on conio.  */
+  if (nbytes < 0)
+    cputs (text);
+  else
+    cprintf ("%.*s", nbytes, text);
+}
+
+/* A replacement for nl_langinfo which does a more accurate job for
+   the console output codeset.  Windows can use 3 different encodings
+   at the same time, and the Posix-compliant nl_langinfo simply
+   doesn't know enough to decide which one is needed when CODESET is
+   requested.  */
+#undef nl_langinfo
+#include <langinfo.h>
+
+char *
+rpl_nl_langinfo (nl_item item)
+{
+  if (item == CODESET)
+    {
+      static char buf[100];
+
+      /* We need all the help we can get from GNU libiconv, so we
+        request transliteration as well.  */
+      sprintf (buf, "CP%u//TRANSLIT", GetConsoleOutputCP ());
+      return buf;
+    }
+  else
+    return nl_langinfo (item);
+}
+
 #endif /* _WIN32 */
 
 /* Turn on reverse video. */
@@ -669,6 +744,10 @@ pc_put_text (string)
 {
   if (speech_friendly)
     fputs (string, stdout);
+#ifdef __MINGW32__
+  else if (output_cp == CP_UTF8 || output_cp == CP_UTF7)
+    write_utf (output_cp, string, -1);
+#endif
   else
     cputs (string);
 }
@@ -697,9 +776,13 @@ pc_write_chars (string, nchars)
     return;
 
   if (speech_friendly)
-    printf ("%.*s",nchars, string);
+    printf ("%.*s", nchars, string);
+#ifdef __MINGW32__
+  else if (output_cp == CP_UTF8 || output_cp == CP_UTF7)
+    write_utf (output_cp, string, nchars);
+#endif
   else
-    cprintf ("%..*s",nchars, string);
+    cprintf ("%.*s", nchars, string);
 }
 
 /* Scroll an area of the terminal from START to (and excluding) END,
@@ -870,6 +953,11 @@ pc_initialize_terminal (term_name)
 
   pc_get_screen_size ();
 
+#ifdef __MINGW32__
+  /* Record the screen output codepage.  */
+  output_cp = GetConsoleOutputCP ();
+#endif
+
 #ifdef __MSDOS__
   /* Store the arrow keys.  */
   term_ku = (char *)find_sequence (K_Up);
