Encoding error when reading file with ISO-8859-1 filename

bug-texinfo

From:	Gavin Smith
Subject:	Encoding error when reading file with ISO-8859-1 filename
Date:	Sat, 5 Mar 2022 21:00:03 +0000

Here's something that came up when I was testing filename encodings
and a proposed fix to silence a warning message.

Suppose you have a file the name of which is in ISO-8859-1, and which
is itself encoded in ISO-8859-1.  It's not easy to process this with
texi2any from the command line in a UTF-8 locale, but it's possible
with a command like

 ../texi2any.pl  [^a-z]*.texi

This leads to a warning like

"\x{fffd}" does not map to iso-8859-1 at ../../tp/Texinfo/Convert/Info.pm line 282.

being printed.

No such error is printed in an ISO-8859-1 terminal.

The reason for this is that in the output Info file there is a string
printed like

This is ü.info, produced by texi2any version 6.8dev+dev from é.texi.

The problem was with the encoding of the input filename,
é.texi.  It couldn't be decoded as it was assumed to be in UTF-8.
The replacement character U+FFFD was used for the non-ASCII byte,
and when this string was written to the output file, Perl tried to
encode U+FFFD as ISO-8859-1, which it couldn't do.
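
To illustrate the mechanism outside texi2any, here is a minimal sketch
using only the Encode module.  The byte string and variable names are
made up for the example, and this is not the code path texi2any itself
takes; it only shows how one ISO-8859-1 byte turns into U+FFFD and why
that character cannot be written back out as ISO-8859-1.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical file name: "é.texi" as raw ISO-8859-1 bytes (é is 0xE9).
my $name_bytes = "\xE9.texi";

# A lone 0xE9 is malformed UTF-8, so by default Encode substitutes U+FFFD.
my $as_utf8   = decode('UTF-8', $name_bytes);
# Decoding with the encoding the bytes are really in gives U+00E9.
my $as_latin1 = decode('ISO-8859-1', $name_bytes);

printf "decoded as UTF-8:      first character U+%04X\n", ord($as_utf8);
printf "decoded as ISO-8859-1: first character U+%04X\n", ord($as_latin1);

# U+FFFD has no ISO-8859-1 equivalent, so a strict encode fails.
my $encoded_ok = eval {
    encode('ISO-8859-1', $as_utf8,
           Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
};
print "encoding U+FFFD as ISO-8859-1 ",
      ($encoded_ok ? "succeeded" : "failed"), "\n";
```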

With the current code, doing

 ../texi2any.pl  [^a-z]*.texi -c DATA_INPUT_ENCODING_NAME=ISO-8859-1

avoids the error being printed, as the file name given on the command
line is then decoded as ISO-8859-1 rather than UTF-8.

A one-line fix is the following:

diff --git a/tp/Texinfo/Convert/Converter.pm b/tp/Texinfo/Convert/Converter.pm
index df9d68d701..30eaea1e13 100644
--- a/tp/Texinfo/Convert/Converter.pm
+++ b/tp/Texinfo/Convert/Converter.pm
@@ -546,7 +546,8 @@ sub determine_files_and_directory($;$)
     my $input_file_name = $self->{'parser_info'}->{'input_file_name'};
     my $encoding = $self->get_conf('DATA_INPUT_ENCODING_NAME');
     if (defined($encoding)) {
-      $input_file_name = decode($encoding, $input_file_name);
+      $input_file_name = decode($encoding, $input_file_name,
+                                sub { '?' });
     }
     my ($directories, $suffix);
     ($input_basefile, $directories, $suffix) = fileparse($input_file_name);


This eliminates the problematic U+FFFD character at the point of reading
the filename.  In the output Info file, a question mark will harmlessly
appear in the filename, like:

This is ü.info, produced by texi2any version 6.8dev+dev from ?.texi.
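
The effect of the extra argument to decode can be seen in isolation
with a small sketch like the one below.  The byte string and variable
names are hypothetical and stand in for what Converter.pm receives;
'UTF-8' is assumed here as the encoding the filename is wrongly taken
to be in.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: "é.texi" as ISO-8859-1 bytes, wrongly treated as UTF-8.
my $file_name_bytes = "\xE9.texi";

# The code reference is called for each byte that cannot be decoded and
# returns the text to substitute, here '?' instead of the default U+FFFD.
my $file_name = decode('UTF-8', $file_name_bytes, sub { '?' });
print "$file_name\n";

# The result is plain ASCII, so even a strict encode to ISO-8859-1 works.
my $latin1_bytes = encode('ISO-8859-1', $file_name,
                          Encode::FB_CROAK() | Encode::LEAVE_SRC());
print length($latin1_bytes), " bytes written without complaint\n";
```

The original byte is lost either way; the only difference is that the
substitute is a character every output encoding can represent.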

Patrice, do you think it's ok to commit the above change?

Before I found this fix, I tried to fix it using the
$PerlIO::encoding::fallback variable (as briefly documented on the
PerlIO::encoding man page) in Texinfo::Common::output_files_open_out
to use a different replacement character, by doing

local $PerlIO::encoding::fallback = sub { '?'; };

before setting the encoding filter on the output file, but this led
to warnings

Close with partial character at ../../tp/Texinfo/Convert/Info.pm line 284.
Close with partial character.

being printed.

Trying to research this I found unresolved threads and bug reports from
over 10 years ago:

https://www.perlmonks.org/?node_id=675248
https://www.perlmonks.org/?node_id=840344
https://rt.cpan.org/Public/Bug/Display.html?id=67065

There appears to be a workaround using an undocumented Perl feature
(Encode::STOP_AT_PARTIAL) but that's best avoided as it could easily
break and it would be very hard to understand in the future.

I assume that $PerlIO::encoding::fallback doesn't work and may
never work.
