Re: Skip filename recoding tests on MS-Windows

From: pertusus
Subject: Re: Skip filename recoding tests on MS-Windows
Date: Wed, 26 Oct 2022 11:03:53 +0200

On Wed, Oct 26, 2022 at 05:27:53AM +0300, Eli Zaretskii wrote:
> > Date: Tue, 25 Oct 2022 21:49:04 +0200
> > From: pertusus@free.fr
> > Cc: GavinSmith0123@gmail.com, bug-texinfo@gnu.org
> > 
> > > The only part that is I think different on Windows is the encoding of
> > > file names, because Windows doesn't treat file names as opaque
> > > bytestreams.  But anything that comes from a Texinfo source, even the
> > > name of an included file, should be interpreted according to
> > > @documentencoding.  When accessing included files on Windows, we
> > > should re-encode the file names to the locale's encoding, because
> > > nothing else will work reliably.  Is that what we do?
> > 
> > Yes, but it does not work reliably either, as shown by the tests
> > results.  The test which uses the locale's encoding fails (formatting
> > manual_include_accented_file_name_latin1), while the test in which the
> > document encoding is used, (formatting
> > manual_include_accented_file_name_latin1_explicit_encoding) does not
> > fail.  As analysed just before, it works because both Windows and Perl
> > are consistently wrong, but still it seems to work better.
> Perhaps the logic of these tests fails on Windows?  Can you perhaps
> describe the logic of each of these tests?  In general, I see no
> reason why encoding file names using the locale's encoding should fail
> on Windows if done correctly.  The idea of maintaining file names in
> UTF-8 internally and encoding them to the locale's encoding before
> using in file I/O calls is correct, and should work on Windows.

Here is what happens for formatting manual_include_accented_file_name_latin1
which is the test that fails:

Let's call your locale LOC.  The setup is a manual encoded in Latin1 and
an include file named included_latîn1.texi.  On your computer, the î in the
include file name is stored as 0x05DE, which is what 0xEE converts to in
the LOC codepage.  This is not î (which is 0x00EE), and the file name shows
that other character instead of î when viewed in the Explorer.  However, the
character is presented as 0xEE to Perl when accessing the file, which is
what Perl expects for î in Latin1.
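
To make the byte-level picture concrete, here is a minimal Python sketch of
how the same byte reads under two codepages.  It assumes LOC behaves like
Python's cp1255 codec; that codec name is my assumption for illustration, as
the only thing stated above is that 0xEE maps to 0x05DE in LOC.

```python
# Assumption: "cp1255" stands in for the LOC codepage, chosen because it
# maps the byte 0xEE to U+05DE as described above.
raw = b"\xee"  # the byte used for the accented letter in the Latin1 file name

as_loc = raw.decode("cp1255")      # how the LOC codepage reads the byte
as_latin1 = raw.decode("latin-1")  # how Latin1 reads the same byte

print(f"LOC:    U+{ord(as_loc):04X}")
print(f"Latin1: U+{ord(as_latin1):04X}")
```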

On Windows, we set DOC_ENCODING_FOR_INPUT_FILE_NAME to 0 (it is set to 1
in other cases).  In the XS parser, the î in the @include line is converted
from the Latin1 encoding of the Texinfo file to UTF-8, so 0xEE gets
converted to 0x00EE (UTF-8 encoded).  Then, when the time comes to include
the file, encode_file_name from input.c is called.  Neither
input_file_name_encoding nor doc_encoding_for_input_file_name is set, so
the locale encoding, LOC here, is used to recode the file name from UTF-8
to LOC.  The 0x00EE character (UTF-8 encoded) cannot be converted to LOC,
so either the conversion fails or a replacement character is used.  In
either case, 0x00EE (UTF-8 encoded) never ends up recoded to 0xEE, which
is what would allow the file to be found.

In that case, recoding to the locale encoding leads to not finding the file.
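
The selection logic described above can be modelled in a few lines of
Python.  This is only a sketch of the behaviour as I describe it, not the
actual C code in input.c: the function signature is hypothetical, and
cp1255 is again an assumed stand-in for LOC.

```python
def encode_file_name(name, input_file_name_encoding,
                     doc_encoding_for_input_file_name,
                     document_encoding, locale_encoding):
    """Hypothetical model of the encoding choice for an include file name."""
    if input_file_name_encoding:
        encoding = input_file_name_encoding    # explicit setting wins
    elif doc_encoding_for_input_file_name and document_encoding:
        encoding = document_encoding           # use @documentencoding
    else:
        encoding = locale_encoding             # fall back to the locale
    try:
        return name.encode(encoding)
    except UnicodeEncodeError:
        return None                            # the conversion fails


name = "included_lat\u00een1.texi"   # internal (decoded) file name
on_disk = name.encode("latin-1")     # bytes of the file name as stored

# Windows default: locale encoding used, conversion fails.
windows_default = encode_file_name(name, None, False, "iso-8859-1", "cp1255")
# Document encoding used: the bytes match the file name.
with_doc_encoding = encode_file_name(name, None, True, "iso-8859-1", "cp1255")

print(windows_default == on_disk, with_doc_encoding == on_disk)
```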

If DOC_ENCODING_FOR_INPUT_FILE_NAME is set to 1, then the document
encoding, Latin1, is used to convert the 0x00EE character (UTF-8
encoded), which leads to 0xEE, and the file is found.  Since
DOC_ENCODING_FOR_INPUT_FILE_NAME is set to 1 by default on platforms
other than Windows, the file is found on those platforms.

Note that my point is that the same happens on GNU/Linux.  In my UTF-8
locale, if I set -c DOC_ENCODING_FOR_INPUT_FILE_NAME=0 explicitly, which
is the default on Windows, I get the same result as on Windows: the
include file is not found, because the file name remains UTF-8 encoded in
the parser while the file name on the filesystem is Latin1 encoded.
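
The mismatch in a UTF-8 locale can be seen directly by comparing the two
byte sequences.  This is an illustrative Python sketch only; the variable
names are hypothetical.

```python
name = "included_lat\u00een1.texi"

in_parser = name.encode("utf-8")   # what is looked up when the locale is UTF-8
on_disk = name.encode("latin-1")   # how the file name is stored on the filesystem

print(in_parser)
print(on_disk)
print(in_parser == on_disk)
```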

In the manual_include_accented_file_name_latin1_explicit_encoding test,
INPUT_FILE_NAME_ENCODING is set to ISO-8859-1, which leads to Latin1 being
used to encode 0x00EE (UTF-8 encoded), and thus to 0xEE.  On Windows, it
emulates setting DOC_ENCODING_FOR_INPUT_FILE_NAME to 1.

In the manual_include_accented_file_name_latin1_use_locale_encoding
test, INPUT_FILE_NAME_ENCODING is set to UTF-8, which leads 0x00EE
(UTF-8 encoded) to remain UTF-8 encoded, such that the input file is
not found.  It emulates setting DOC_ENCODING_FOR_INPUT_FILE_NAME to 0
in a UTF-8 encoded locale.

