Re: Non-ASCII characters in @include search path


From: Gavin Smith
Subject: Re: Non-ASCII characters in @include search path
Date: Wed, 23 Feb 2022 19:31:52 +0000

On Wed, Feb 23, 2022 at 03:39:23PM +0100, Patrice Dumas wrote:
> On Tue, Feb 22, 2022 at 08:52:56PM +0000, Gavin Smith wrote:
> > 
> > I've done more tonight but I still have more to do.  There will have to
> > be some decoding of filenames when they are being put into
> > an error message (e.g. "@include: could not find..." and possibly others).
> > It would make sense to use the document encoding for this.
> > 
> > That's for the error messages.  As for actually finding the files, that's
> > a different question.  I'll read through what you wrote again and reply
> > another time.
> 
> I have constructed an example that fails for @image (non_ascii_test_epub
> in tests/formatting/list-of-tests), but I'll wait for you to read
> my messages to come to a decision on the encoding to encode to before
> doing some code.

Probably the same is needed for @image as for @include (see changes
below).

Regarding this:

> > I've done more tonight but I still have more to do.  There will have to
> > be some decoding of filenames when they are being put into
> > an error message (e.g. "@include: could not find..." and possibly others).
> > It would make sense to use the document encoding for this.
> 
> I don't think that the document encoding is the best bet here, the
> locale encoding would be a better default, in my opinion.

I think there is some misunderstanding here.  The filenames are decoded
when read from the file according to the document encoding, and when the
error messages are printed, the locale encoding is used.  All this is
separate from the question of how to find the files on the filesystem.
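Roughly, for illustration (the encoding names here are just examples,
not what the parser actually stores):

    use Encode qw(decode encode);

    my $bytes_from_source = "ann\xE9e.texi";  # bytes read from the Texinfo file

    # Decoded according to the document encoding when the file is read...
    my $filename = decode('iso-8859-1', $bytes_from_source);

    # ...and encoded to the locale encoding when an error message is printed.
    warn encode('utf-8', "\@include: could not find $filename\n");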

> I am building tests with accented characters everywhere to be sure that
> we test and handle most if not all cases.

Great, that will help a lot.

> I also checked that in an 8-bit locale an @include file with an accent in
> the name is not found (because the file name is encoded to utf-8).

Agreed.  I made the following fix for this...

diff --git a/tp/Texinfo/Common.pm b/tp/Texinfo/Common.pm
index d3f69efd87..5e6fcfb597 100644
--- a/tp/Texinfo/Common.pm
+++ b/tp/Texinfo/Common.pm
@@ -1511,9 +1511,9 @@ sub locate_include_file($$)
   my $text = shift;
   my $file;
 
-  # Reverse the decoding of the file name from UTF-8.  When dealing
-  # with file names, we want Perl strings representing sequences of bytes,
-  # not UTF-8 codepoints.
+  # Reverse the decoding of the file name from the input encoding.  When
+  # dealing with file names, we want Perl strings representing sequences of
+  # bytes, not Unicode codepoints.
   #     This is necessary even if the name of the included file is purely
   # ASCII, as the name of the directory it is located within may contain
   # non-ASCII characters.
@@ -1522,8 +1522,12 @@ sub locate_include_file($$)
   if ($configuration_information) {
     my $info = Texinfo::Parser::global_information($configuration_information);
     my $encoding = $info->{'input_perl_encoding'};
-    if ($encoding and ($encoding eq 'utf-8' or $encoding eq 'utf-8-strict')) {
-      utf8::encode($text);
+    if ($encoding) {
+      if ($encoding eq 'utf-8' or $encoding eq 'utf-8-strict') {
+        utf8::encode($text);
+      } else {
+        $text = Encode::encode($encoding, $text);
+      }
     }
   }


I haven't had time to properly install and test a non-UTF-8 locale yet,
so please test this (I've committed this change).

I understand that this would be for a Texinfo file encoded in an 8-bit
encoding which includes a file whose name is stored in the same
encoding on the filesystem.
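For illustration, the round trip looks something like this (untested
sketch; the file name is made up):

    use Encode qw(decode encode);

    # A Latin-1 document naming a Latin-1 file on disk (illustrative only).
    my $raw = "caf\xE9.texi";               # bytes as read from the source file
    my $name = decode('iso-8859-1', $raw);  # what the parser stores internally

    # locate_include_file() reverses the decoding with the same input
    # encoding, so the filesystem test sees the original bytes again.
    my $on_disk = encode('iso-8859-1', $name);
    print "found\n" if -e $on_disk;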

You wrote:
> I think that your commit
> e11835b62d8f3d43c608013d21683c72e9a54cc3 "@include file name encoding"
> would still need to be modified in order to use a specific encoding to
> encode the file name to and not simply use utf8::encode as the file
> names encoding may not be utf8.  Using the locale encoding as the
> default seems better to me, with a possibility to modify the value on
> the command line, and FILE_NAMES_ENCODING_NAME could be used for that.
> To be checked, but it seems to me that in the XS parser this information
> should also be used where the include file name string (and maybe other
> file names) should be converted to that encoding from utf-8 if that
> encoding is different from utf-8.

Whatever we do, it should be concordant with TeX's filename handling.
I imagine that TeX (except possibly on MS-Windows) would just use the
bytes, and so should we.

In any case, the cases we are dealing with here are very rare.  I just
don't see that the situation is very common where somebody works in
a non-UTF-8 locale, has all their filenames in this encoding, and
recodes any files they download from the Internet or extract from a tar
file into that encoding.  I've no insight into what use case we would be
supporting by using the locale encoding to interpret filenames.
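
(For reference, defaulting to the locale encoding as you suggest would
presumably look something like the sketch below; the customization
variable lookup is made up, only langinfo(CODESET) is real.)

    use I18N::Langinfo qw(langinfo CODESET);
    use Encode qw(encode);

    # Sketch: fall back to the locale codeset when FILE_NAMES_ENCODING_NAME
    # is not set.  %customization is illustrative, not the real interface.
    my %customization;
    my $encoding = $customization{'FILE_NAMES_ENCODING_NAME'}
                     || langinfo(CODESET);
    my $filename = "r\x{e9}sum\x{e9}.texi";
    my $on_disk = encode($encoding, $filename);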

It seems much more likely to me that somebody would be using a
non-UTF-8 locale for whatever reason, and would download Texinfo
files with UTF-8 names without recoding the names, and still
expect to be able to build them.  (Even if they can't type the
names in, the files may still get built via Makefile rules.)

Some filtering with a customization variable may be necessary for
unusual operating systems and/or filesystems.

I couldn't find information on how filenames are handled in TeX on
MS-Windows (I tried looking at documentation for both MiKTeX and TeX Live
but didn't find anything).  It would be easy to get wrong, so it is best
tested before implementing anything.  What exactly is the case that
we need to support?

E.g. a UTF-8 Texinfo file, processed under a KOI-8 locale on Windows,
accessing files whose names are stored as UTF-16 on the Windows filesystem.
Then the UTF-8 filenames would be encoded to KOI-8, and some file
access layer would convert the KOI-8 to UTF-16 and find the files.
Is that how it works, or am I way off?

> > With the current code, non-ASCII bytes are output incorrectly in the
> > filename parts of errors from the XS parser.  I intend to fix this
> > by replacing the code in tp/Texinfo/XS/parsetexi/errors.c that outputs
> > errors as a dump of Perl code that Perl part of the module has to 'eval'.
> > Instead, I intend to create the error message data structures more
> > directly.  This has long been a desideratum for this module.
> 
> I committed a temporary 'fix' by encoding to utf8 to have the same result
> for the XS and NonXS parsers; it should be ok until you do a better fix
> with a better interface.

I've done this now.  It could be improved to match the data structure of
Texinfo::Report more directly; then the array could simply be copied across
in one go, rather than with individual calls to line_error.
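
Sketched out, that improvement might look like the following
(get_parser_errors and the hash keys are assumptions about the eventual
interface, not existing code):

    # Hypothetical: the XS parser hands back an array of hashes already
    # shaped like Texinfo::Report entries, appended in one go instead of
    # replayed through line_error() one message at a time.
    sub copy_xs_errors {
      my ($report, $parser) = @_;
      my $xs_errors = Texinfo::Parser::get_parser_errors($parser);  # assumed
      push @{$report->{'errors_warnings'}}, @$xs_errors;
      $report->{'error_nrs'} += grep { $_->{'type'} eq 'error' } @$xs_errors;
    }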


