bug-texinfo

Re: Non-ASCII characters in @include search path


From: Eli Zaretskii
Subject: Re: Non-ASCII characters in @include search path
Date: Sun, 20 Feb 2022 15:06:57 +0200

> From: Gavin Smith <gavinsmith0123@gmail.com>
> Date: Sun, 20 Feb 2022 11:54:08 +0000
> 
> Strings coming from the Texinfo source file have to be assumed to represent
> characters, not bytes, as the Texinfo source is read with a certain encoding.
> File names, however, are a sequence of bytes (on GNU/Linux at least; on
> MS-Windows it may be different).  I believe it's this conflict
> that is responsible.

File names are not bytes, they are characters as well, at least in
most cases relevant to this discussion.  That some filesystems are
agnostic to the characters in the bytestream that is the file name
doesn't change the basic fact that file names are created and viewed
by humans, and humans need to see characters there.

> I propose the following fix, which doesn't touch Perl's internal string
> representation directly:
> 
> diff --git a/tp/Texinfo/Common.pm b/tp/Texinfo/Common.pm
> index 29dbf3c8c3..7babba016c 100644
> --- a/tp/Texinfo/Common.pm
> +++ b/tp/Texinfo/Common.pm
> @@ -1507,6 +1507,8 @@ sub locate_include_file($$)
>    my $text = shift;
>    my $file;
>  
> +  utf8::encode($text);
> +
>    my $ignore_include_directories = 0;
>  
>    my ($volume, $directories, $filename) = File::Spec->splitpath($text);
> 
> This means that any non-ASCII characters in a filename in a Texinfo source
> file are sought in the filesystem as the corresponding UTF-8 sequences.

This will not work on Windows.
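
For reference, a minimal sketch of what the proposed utf8::encode call
does to the string. The file name below is made up for illustration.
The call replaces the character string, in place, with its UTF-8 byte
sequence, so the lookup can only succeed where the filesystem stores
the name as exactly those bytes. As far as I know, that is not the
case with the byte-oriented file APIs Perl uses on Windows, which
interpret the bytes in the system codepage.

```perl
use strict;
use warnings;

# Hypothetical file name as decoded from the Texinfo source:
# 9 characters, one of them non-ASCII (U+00E9).
my $text = "caf\x{e9}.texi";

# utf8::encode works in place, so encode a copy to keep the original.
my $bytes = $text;
utf8::encode($bytes);

# U+00E9 becomes the two bytes C3 A9; every other character is ASCII
# and stays one byte each.
printf "%d characters -> %d bytes\n", length($text), length($bytes);
```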

> A more thorough fix would obey @documentencoding and convert back to the
> original encoding, to retrieve the bytes that were present in the source
> file in case the file was not in UTF-8.  I think it would be the most
> correct to always use the exact bytes that were in the source file as the
> name of the file (I assume that is what TeX would do).

This assumes that the file name is encoded the same as the Texinfo
source.  But that assumption is only true on the system where the
Texinfo file was written, and even there it could be false.

The only thorough solution, IMO, is to assume that file names are
encoded in the filesystem according to the locale's codeset.  That
assumption, too, can be false, but it will hold in the vast majority
of use cases.  The only better solution is to let the user specify
the file-name encoding.
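
A rough sketch of that approach, not a patch against Common.pm.  The
helper name file_name_bytes and its optional second argument (standing
in for a user-specified file-name encoding) are hypothetical, and the
fallback to UTF-8 when the codeset cannot be determined is my own
assumption.  It uses only Encode and I18N::Langinfo, both of which
ship with Perl.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);
use I18N::Langinfo qw(langinfo CODESET);

# Hypothetical helper: convert a file name given as a character string
# (as decoded from the Texinfo source) into the bytes to look up on disk.
sub file_name_bytes {
  my ($name, $user_encoding) = @_;

  # A user-specified encoding takes precedence over the locale.
  my $encoding = $user_encoding;

  # Otherwise ask the locale for its codeset; langinfo may not be
  # usable on every platform, hence the eval.
  if (!defined $encoding) {
    $encoding = eval { langinfo(CODESET) };
  }

  # Assumed fallback: use UTF-8 if the codeset is missing or is not a
  # name that Encode recognizes.
  if (!defined $encoding or !find_encoding($encoding)) {
    $encoding = 'UTF-8';
  }

  return encode($encoding, $name);
}

# The same name yields different bytes under different encodings.
my $name = "caf\x{e9}.texi";
my $utf8_bytes   = file_name_bytes($name, 'UTF-8');
my $latin1_bytes = file_name_bytes($name, 'ISO-8859-1');
printf "UTF-8: %d bytes, ISO-8859-1: %d bytes\n",
       length($utf8_bytes), length($latin1_bytes);
```

Called with one argument, the helper follows the locale; a
command-line option or customization variable could supply the second
argument for users whose file names do not match their locale.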


