bug-texinfo

Re: Non-ASCII characters in @include search path


From: Gavin Smith
Subject: Re: Non-ASCII characters in @include search path
Date: Sat, 26 Feb 2022 18:57:34 +0000

On Sat, Feb 26, 2022 at 06:50:10PM +0100, Patrice Dumas wrote:
> For example, in the following there are only ASCII strings, except for
> -o encodé/, which is not decoded, and the result is that the é in encodé
> ends up not being output correctly:
> 
> $ cat test_smth.texi 
> \input texinfo
> 
> @setfilename test_smth.info
> 
> @top top
> @node Top
> 
> @bye
> 
> $ ./texi2any.pl -o encodé/ test_smth.texi
> 
> $ ls -d encod*
> encodé

I fixed that with the following:

diff --git a/tp/Texinfo/Convert/Converter.pm b/tp/Texinfo/Convert/Converter.pm
index 4ca8a64835..3225420010 100644
--- a/tp/Texinfo/Convert/Converter.pm
+++ b/tp/Texinfo/Convert/Converter.pm
@@ -554,6 +554,16 @@ sub determine_files_and_directory($;$)
        = $self->{'global_commands'}->{'setfilename'}->{'extra'}->{'text_arg'};
   }
 
+  if ($setfilename) {
+    my $document_encoding;
+    my $ignored;
+    $document_encoding = $self->{'parser_info'}->{'input_perl_encoding'}
+      if ($self->{'parser_info'}
+            and defined($self->{'parser_info'}->{'input_perl_encoding'}));
+    ($setfilename, $ignored) = Texinfo::Common::encode_file_name(
+      $self, $setfilename, $document_encoding);
+  }
+
   my $input_basename_for_outfile = $input_basename;
   my $setfilename_for_outfile = $setfilename;
   # PREFIX overrides both setfilename and the input file base name


The problem was that the $setfilename variable had the UTF-8 flag on while
the directory name from the SUBDIR variable had the UTF-8 flag off.
Concatenating these two strings upgraded the whole string to UTF-8.  The
bytes from SUBDIR, which were already UTF-8, were treated as Latin-1
characters and converted to UTF-8 again, leading to "double UTF-8"
internally.
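
To make the mechanism concrete, here is a minimal standalone sketch of the
same effect outside texi2any.  The variable names and the use of
utf8::upgrade to stand in for a decoded @setfilename value are assumptions
for illustration, and encode() is used only to approximate the bytes that
would reach the file system:

```perl
use strict;
use warnings;
use Encode qw(encode);

# "encodé/" as raw UTF-8 bytes, as it might arrive from the command line
# (UTF-8 flag off): é is the two bytes 0xC3 0xA9.
my $subdir = "encod\xC3\xA9/";

# Stand-in for a decoded @setfilename value.  utf8::upgrade forces the
# UTF-8 flag on; this is an assumption made to mimic the parser's output.
my $setfilename = "test_smth.info";
utf8::upgrade($setfilename);

# Mixing the two: each byte of $subdir is treated as a Latin-1 character.
my $mixed = $subdir . $setfilename;
my $mixed_bytes = encode('UTF-8', $mixed);   # é is now encoded twice

# Encoding the character string first keeps everything as bytes.
my $fixed_bytes = $subdir . encode('UTF-8', $setfilename);

# Render a byte string as space-separated hex for inspection.
sub hex_dump { join ' ', map { sprintf '%02X', ord($_) } split //, $_[0] }

print "mixed: ", hex_dump(substr($mixed_bytes, 0, 10)), "\n";
print "fixed: ", hex_dump(substr($fixed_bytes, 0, 8)), "\n";
```

In the "mixed" line the é shows up as C3 83 C2 A9 instead of C3 A9, which
is the double encoding described above.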

I had already tested this patch to get @setfilename to work properly with
an ISO-8859-1 encoded file (attached), so it was a change I would likely
have made anyway.  However, I doubt that supporting ISO-8859-1 filenames
in @setfilename is very important.

I've committed it but am happy for it to be reverted if we decide on a
different approach.  Of course it's very likely there are other issues.


> It may be possible to fix this issue by looking at all the places where
> the strings associated with the SUBDIR or OUTPUT customization variables
> interact, encoding all the strings they interact with, and re-decoding
> them if needed for error messages or inclusion in output documents.
> However, the other option, decoding everything and encoding only when we
> need to interact with the outside of the code, seems to me to be much
> simpler, to require much less time and thinking, and to be much less
> error prone.
> 
> > > * many strings are used both in file names and in texts.  For example
> > >   the customization variable 'EXTENSION'.  Even strings that are almost
> > >   only used as bytes can appear in error messages, which means that we
> > >   need to keep the information somewhere on how to decode them.
> > 
> > It is no problem as long as the EXTENSION string is purely ASCII.
> 
> I do not think so.  I think that it needs to be encoded if mixed with
> non-ASCII strings.  (Also, it could be set to something non-ASCII as a
> customization, but this should be pretty rare).

Yes, you're right: if the EXTENSION string has the UTF-8 flag on and
it is concatenated with a string with the UTF-8 flag off but which is
encoded in UTF-8, then the same "double UTF-8" problem will occur.
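
As a rough sketch of the "decode everything, encode at the boundary"
option quoted earlier, the pieces could be joined as character strings and
encoded once at the end.  The helper name output_file_bytes, its
parameters, and the sample values are hypothetical and not taken from the
texi2any code:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: build an output file name from a directory given
# as bytes plus a base name and extension given as characters.
sub output_file_bytes {
  my ($subdir_bytes, $basename, $extension, $encoding) = @_;
  # bytes -> characters, so that all pieces are of the same kind
  my $subdir = decode($encoding, $subdir_bytes);
  # characters only: nothing is reinterpreted on concatenation
  my $path = $subdir . $basename . '.' . $extension;
  # characters -> bytes, once, just before touching the file system
  return encode($encoding, $path);
}

my $bytes = output_file_bytes("encod\xC3\xA9/", "r\x{E9}sum\x{E9}",
                              "html", 'UTF-8');
```

With this arrangement it does not matter whether EXTENSION or the base
name happens to carry the UTF-8 flag, since they are only ever combined
with other character strings.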

> 
> > > * many strings can come from documents, as character strings or from
> > >   command line, possibly kept encoded.  For example document file name
> > >   can come from @setfilename or the command line (or customization
> > >   variable).
> > 
> > This is a bigger problem as the filename could be non-ASCII, unlike
> > the extension.
> > 
> > I will try to understand the code and run some tests after I install
> > a non-UTF-8 locale.
> 
> You don't need a non-UTF-8 locale for the issue above, or for the issue
> that prompted me to try to look seriously at the issue, which is
> tests/formatting/list-of-tests non_ascii_test_epub. Having an accented
> letter in the document name makes it very hard to determine what should
> be encoded/decoded in init/epub3.pm and upstream code, in particular in
> Texinfo/Convert/Converter.pm determine_files_and_directory(), but
> although I thought previously that it could be solved in that function
> only, it is not so simple, strings come from everywhere in
> init/epub3.pm.

I'll look at it.

Attachment: texinfoFvcBQSL1dh.texinfo
Description: TeXInfo document

