
Re: Non-ASCII characters in @include search path


From: Patrice Dumas
Subject: Re: Non-ASCII characters in @include search path
Date: Sun, 20 Feb 2022 13:10:16 +0100

On Sun, Feb 20, 2022 at 11:54:08AM +0000, Gavin Smith wrote:
> I found it was the last argument to File::Spec->catdir that led to the
> utf8 flag being on: $filename.  This came from the argument to
> locate_include_file, which came from the Texinfo source file.  The following
> also fixes it:

I do not think that the fact that the utf8 flag is set is important in
itself; that is an internal design choice in Perl.  What matters is
whether the string is in Perl's internal Unicode (character)
representation or is a byte string.
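
For illustration (this is not code from the Texinfo sources, and the
file name is made up), the same name can exist either as a byte string
or as a character string in the internal representation:

  use strict;
  use warnings;
  use Encode qw(decode);

  my $bytes = "caf\xc3\xa9.texi";        # UTF-8 bytes, as read from disk
  my $chars = decode('UTF-8', $bytes);   # Perl character (internal Unicode) string

  printf("bytes: is_utf8=%d length=%d\n",
         utf8::is_utf8($bytes) ? 1 : 0, length($bytes));   # is_utf8=0 length=10
  printf("chars: is_utf8=%d length=%d\n",
         utf8::is_utf8($chars) ? 1 : 0, length($chars));   # is_utf8=1 length=9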

> diff --git a/tp/Texinfo/Common.pm b/tp/Texinfo/Common.pm
> index 29dbf3c8c3..36be8c5b59 100644
> --- a/tp/Texinfo/Common.pm
> +++ b/tp/Texinfo/Common.pm
> @@ -1507,6 +1507,8 @@ sub locate_include_file($$)
>    my $text = shift;
>    my $file;
>  
> +  utf8::downgrade($text);
> +
>    my $ignore_include_directories = 0;
>  
>    my ($volume, $directories, $filename) = File::Spec->splitpath($text);
> 
> 
> This may be surprising as the non-ASCII characters were not in $text itself:
> $text was just "include.texi".  The non-ASCII characters in the include path
> got to this function without the utf8 flag going on.

Again, I do not think that we should rely on the specific internal
encoding of a string.  We should only track whether it is an internal
Perl Unicode (character) string or a byte string.

> Strings coming from the Texinfo source file have to be assumed to represent
> characters, not bytes, as the Texinfo source is read with a certain encoding.
> File names, however, are a sequence of bytes (on GNU/Linux at least; on
> MS-Windows it may be different).  I believe it's this conflict
> that is responsible.

I agree, that's also my interpretation.  It is the same on MS-Windows.
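
To make the conflict concrete (a made-up example, not the actual code
path in Texinfo): combining an include directory taken as bytes from
the command line with a file name that went through the decoded
Texinfo source gives a path whose bytes no longer match the file
system:

  use File::Spec;

  my $include_dir = "/tmp/caf\xc3\xa9";   # bytes, e.g. from -I on the command line
  my $filename = "include.texi";
  utf8::upgrade($filename);               # mimic a string coming out of the decoded source

  my $path = File::Spec->catfile($include_dir, $filename);
  # Joining a character string with a byte string upgrades the bytes as if
  # they were Latin-1, so the bytes actually passed to -e are double-encoded
  # and the directory would not be found even if it existed on disk.
  print(((-e $path) ? "found" : "not found"), "\n");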

> I propose the following fix, which doesn't touch Perl's internal string
> representation directly:
> 
> diff --git a/tp/Texinfo/Common.pm b/tp/Texinfo/Common.pm
> index 29dbf3c8c3..7babba016c 100644
> --- a/tp/Texinfo/Common.pm
> +++ b/tp/Texinfo/Common.pm
> @@ -1507,6 +1507,8 @@ sub locate_include_file($$)
>    my $text = shift;
>    my $file;
>  
> +  utf8::encode($text);
> +
>    my $ignore_include_directories = 0;
>  
>    my ($volume, $directories, $filename) = File::Spec->splitpath($text);
> 
> This means that any non-ASCII characters in a filename in a Texinfo source
> file are sought in the filesystem as the corresponding UTF-8 sequences.

I think that the correct way to do that is to use
Encode::encode('UTF-8', $text);
Also, I think that it should be done as late as possible, so it would
be better to do it on $possible_file.
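
Roughly like this, just before the file tests (only a sketch: the
surrounding loop in locate_include_file is paraphrased and the actual
variable names may differ):

  use Encode ();
  use File::Spec;

  # For each include directory, build the candidate path from the
  # @include argument ($text) as character strings...
  my $possible_file = File::Spec->catfile($include_directory, $text);
  # ... and encode to UTF-8 bytes only at the last moment, just before
  # touching the file system.
  my $encoded_file = Encode::encode('UTF-8', $possible_file);
  return $encoded_file if (-e $encoded_file and -r $encoded_file);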

> A more thorough fix would obey @documentencoding and convert back to the
> original encoding, to retrieve the bytes that were present in the source
> file in case the file was not in UTF-8.  I think it would be the most
> correct to always use the exact bytes that were in the source file as the
> name of the file (I assume that is what TeX would do).

I do not think so, at least not on Linux, as on Linux file names are
always encoded as UTF-8.  So encoding to UTF-8 seems to always be
better.  It also matches the XS parser, which converts to UTF-8.
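
For instance (made-up file name), encoding gives the bytes that a file
system with UTF-8 file names actually contains, while downgrading the
string gives Latin-1 bytes:

  use Encode ();

  my $name = Encode::decode('UTF-8', "caf\xc3\xa9.texi");  # character string

  my $utf8_bytes = Encode::encode('UTF-8', $name);   # bytes "caf\xc3\xa9.texi"

  my $latin1_bytes = $name;
  utf8::downgrade($latin1_bytes);                     # bytes "caf\xe9.texi",
                                                      # not what is on disk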

Encoding to UTF-8 may be incorrect on other platforms, such as Windows
or macOS, however.

-- 
Pat


