bug-texinfo

Re: Non-ASCII characters in @include search path


From: Gavin Smith
Subject: Re: Non-ASCII characters in @include search path
Date: Sun, 20 Feb 2022 11:54:08 +0000

On Sun, Feb 20, 2022 at 10:11:09AM +0000, Gavin Smith wrote:
> On Sun, Feb 20, 2022 at 09:11:54AM +0000, Gavin Smith wrote:
> > On Sat, Feb 19, 2022 at 11:00:33PM +0100, Patrice Dumas wrote:
> > > I think that there is some wrong encoding/decoding somewhere,
> > > but I don't know where.  It is particularly strange that I cannot
> > > reproduce with 6.8 but Gaël can.
> > 
> > I reproduced with 6.8 but only with TEXINFO_XS=omit.  I am going to
> > investigate.
> 
> I reproduced with the development version.  I found that the -f and -r
> operators in Perl would not find a file named with an identical string
> (showing equal with the eq operator) but encoded internally with UTF-8,
> so that utf8::is_utf8 returns true.  The File::Spec functions return
> such a string.  The following fixed it for me:
> 
> diff --git a/tp/Texinfo/Common.pm b/tp/Texinfo/Common.pm
> index 29dbf3c8c3..8219534984 100644
> --- a/tp/Texinfo/Common.pm
> +++ b/tp/Texinfo/Common.pm
> @@ -1548,6 +1548,9 @@ sub locate_include_file($$)
>          File::Spec->catdir(File::Spec->splitdir($include_directories),
>                             @directories), $filename);
>        #$file = "$include_dir/$text" if (-e "$include_dir/$text" and -r "$include_dir/$text");
> +
> +      utf8::downgrade ($possible_file);
> +
>        $file = "$possible_file" if (-e "$possible_file" and -r "$possible_file");
>        last if (defined($file));
>      }
> 
> 
> This is obviously a mess.  We should decide exactly where the bug is: in
> the -e operator itself, in File::Spec, or in the way that we use it.
> 
> It might be simpler to eschew File::Spec and just get the filenames with
> simple string operators.

I found it was the last argument to File::Spec->catdir that led to the
utf8 flag being on: $filename.  This came from the argument to
locate_include_file, which came from the Texinfo source file.  The following
also fixes it:

diff --git a/tp/Texinfo/Common.pm b/tp/Texinfo/Common.pm
index 29dbf3c8c3..36be8c5b59 100644
--- a/tp/Texinfo/Common.pm
+++ b/tp/Texinfo/Common.pm
@@ -1507,6 +1507,8 @@ sub locate_include_file($$)
   my $text = shift;
   my $file;
 
+  utf8::downgrade($text);
+
   my $ignore_include_directories = 0;
 
   my ($volume, $directories, $filename) = File::Spec->splitpath($text);


This may be surprising, as the non-ASCII characters were not in $text itself:
$text was just "include.texi".  The non-ASCII characters in the include path
reached this function without the utf8 flag set; it was $text that carried
the flag.
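
The effect can be reproduced outside Texinfo.  The following is a minimal
sketch, assuming the include directory arrives as raw UTF-8 bytes and the
file name as a flagged string; the directory name "café" and the variable
names are invented for illustration and are not taken from Common.pm.

```perl
use strict;
use warnings;

# Hypothetical include directory "café" as raw UTF-8 bytes (no utf8 flag),
# e.g. as it might come from the command line.
my $include_dir = "caf\xC3\xA9";

# ASCII-only name, flagged as if it had been decoded from the Texinfo source.
my $text = "include.texi";
utf8::upgrade($text);

# Joining a flagged and an unflagged string upgrades the result.
my $joined = "$include_dir/$text";

# Copy of the UTF-8 octets Perl now holds internally for $joined: each of
# the two high bytes has become a two-byte sequence.
my $internal = $joined;
utf8::encode($internal);

printf "flagged: dir=%d text=%d joined=%d\n",
  utf8::is_utf8($include_dir) ? 1 : 0,
  utf8::is_utf8($text) ? 1 : 0,
  utf8::is_utf8($joined) ? 1 : 0;
printf "eq to byte string: %d\n",
  ($joined eq "caf\xC3\xA9/include.texi") ? 1 : 0;
printf "length: joined=%d internal=%d\n", length($joined), length($internal);
```

The joined string still compares equal to the plain byte string, but its
internal form is two octets longer.  It appears to be this longer form that
the file tests hand to the filesystem, which would explain why the file is
not found.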

Strings coming from the Texinfo source file have to be assumed to represent
characters, not bytes, as the Texinfo source is read with a certain encoding.
File names, however, are a sequence of bytes (on GNU/Linux at least; on
MS-Windows it may be different).  I believe it's this conflict
that is responsible.

I propose the following fix, which doesn't touch Perl's internal string
representation directly:

diff --git a/tp/Texinfo/Common.pm b/tp/Texinfo/Common.pm
index 29dbf3c8c3..7babba016c 100644
--- a/tp/Texinfo/Common.pm
+++ b/tp/Texinfo/Common.pm
@@ -1507,6 +1507,8 @@ sub locate_include_file($$)
   my $text = shift;
   my $file;
 
+  utf8::encode($text);
+
   my $ignore_include_directories = 0;
 
   my ($volume, $directories, $filename) = File::Spec->splitpath($text);

This means that a file name containing non-ASCII characters in a Texinfo
source file is looked up in the filesystem using the corresponding UTF-8
byte sequence.
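
To illustrate what utf8::encode does to such a string, here is a small
sketch.  The file names are invented for the example; only the utf8::
functions themselves are real.

```perl
use strict;
use warnings;

# ASCII-only name carrying the utf8 flag, as in the reported case.
my $ascii = "include.texi";
utf8::upgrade($ascii);
utf8::encode($ascii);        # same octets, flag now off

# Hypothetical name with one non-ASCII character, U+00E9.
my $name = "caf\x{E9}.texi";
my $char_count = length($name);
utf8::encode($name);         # in place: characters -> UTF-8 octets
my $byte_count = length($name);

printf "ascii flagged after encode: %d\n", utf8::is_utf8($ascii) ? 1 : 0;
printf "characters=%d bytes=%d\n", $char_count, $byte_count;
printf "bytes: %s\n",
  join(' ', map { sprintf('%02X', ord($_)) } split(//, $name));
```

Note that utf8::encode is not idempotent: applying it a second time to the
same string would encode the octets again, so it has to be called exactly
once on a string that holds characters.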

A more thorough fix would obey @documentencoding and convert back to the
original encoding, to recover the bytes that were present in the source
file in case the file was not in UTF-8.  I think it would be most correct
to always use the exact bytes that were in the source file as the name of
the file (I assume that is what TeX would do).
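
A rough sketch of that more thorough approach, using the Encode module, might
look like the following.  The helper name file_name_bytes and the fallback to
UTF-8 are assumptions made for this example; nothing like it exists in
Common.pm, and it assumes the @documentencoding value is a name that Encode
recognizes.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: turn a character string taken from the Texinfo
# source back into the bytes of the given document encoding.
sub file_name_bytes {
  my ($text, $document_encoding) = @_;

  my $enc;
  $enc = Encode::find_encoding($document_encoding)
    if (defined($document_encoding));
  # Assumption: fall back to UTF-8 if the encoding is missing or unknown.
  $enc = Encode::find_encoding('UTF-8') if (!defined($enc));

  my $copy = $text;            # leave the caller's string untouched
  return $enc->encode($copy);
}

my $name = "caf\x{E9}.texi";   # hypothetical name, one non-ASCII character
my $latin1 = file_name_bytes($name, 'ISO-8859-1');
my $utf8 = file_name_bytes($name, 'UTF-8');

printf "ISO-8859-1: %d bytes, UTF-8: %d bytes\n",
  length($latin1), length($utf8);
```

With a Latin-1 document the name is then looked up as the single byte 0xE9
rather than as the two-byte UTF-8 sequence, which matches the bytes that
were in the source file.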



