bug-texinfo

Re: Skip filename recoding tests on MS-Windows


From: Eli Zaretskii
Subject: Re: Skip filename recoding tests on MS-Windows
Date: Sun, 23 Oct 2022 21:46:48 +0300

> From: Gavin Smith <gavinsmith0123@gmail.com>
> Date: Sun, 23 Oct 2022 18:41:34 +0100
> Cc: pertusus@free.fr, bug-texinfo@gnu.org
> 
> On Sun, Oct 23, 2022 at 08:13:24PM +0300, Eli Zaretskii wrote:
> > > I am pretty sure that this file is correctly generated, I guess that
> > > מ corresponds to the same octet as î in latin1, which is 0xEE unless I
> > > missed something, and your codepage would be Windows-1255, maybe?
> > 
> > Yes, it is.
> > 
> > > So, there seems to be no trouble creating a correctly encoded file name
> > > which, if interpreted as ISO-8859-1 gives the correct binary string.
> > 
> > Yes.  I think the problem is not in generating the file name, it is in
> > using that file later.
> 
> Is the filesystem on Windows not usually NTFS which stores filenames in
> UTF-16?

Yes, it is.

> So the file would be created with some UTF-16 name, even if it
> appears to programs in some 8-bit encoding depending on the code page.

Yes.  When the name of a file is requested by a program using 'char *'
multibyte arguments, Windows converts the UTF-16 file name stored in
the directory to the encoding it thinks the program expects.  That
encoding is the current system codepage.

> It seems relevant what the file name actually created is.  If it is not
> created with the correct name then it would not be possible to open it.

What is the "correct name"?  When a program creates a file using a
'char *' argument, Windows assumes the file name is in the current
system codepage, and converts it to UTF-16 when recording the file
name on disk.  So in my case, any bytes above 127 will be interpreted
as being in the CP1255 encoding, and will be converted to UTF-16 as
such.
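
A minimal sketch of that conversion, with Python's codecs standing in for
what Windows does internally when it receives a 'char *' file name.  The
helper name name_as_stored is hypothetical, and the byte string assumes
the file name was passed as Latin-1 bytes, as discussed in the quoted
text above.

```python
# Sketch only: Python codecs stand in for the conversion Windows performs
# when a 'char *' file name is passed to a file API.
# The helper name_as_stored is hypothetical, not a Windows or Perl API.

def name_as_stored(raw_name: bytes, system_codepage: str) -> str:
    """Decode the bytes the way the system codepage dictates."""
    return raw_name.decode(system_codepage)

# The bytes handed to the OS: the Latin-1 encoding of the intended name.
raw = "included_lat\u00een1.texi".encode("latin-1")
byte_value = raw[12]                      # the single byte above 127

# With CP1255 as the system codepage, that byte becomes a Hebrew letter.
stored = name_as_stored(raw, "cp1255")
codepoint = ord(stored[12])
utf16_bytes = stored[12].encode("utf-16-le")

print(hex(byte_value), hex(codepoint), utf16_bytes.hex())
```

The same bytes decoded as "latin-1" would give back the intended name, so
the result depends only on which codepage the system assumes.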

> Is it possible for you to find the "included_latמn1.texi" file in the
> Windows file explorer and check what its name really is?

That is its name as shown by the File Explorer.  Which is not
surprising, given the above explanation: the File Explorer reads the
file names directly in UTF-16.  And the same with Emacs when it runs
on Windows: I see there the same name "included_latמn1.texi" as in the
File Explorer, because Emacs also accesses the original UTF-16 encoded
file names.

The reason we see that character is that Perl, which created that
file, used 'char *' file APIs, so Windows assumed the file name was
in CP1255 and converted it to UTF-16 accordingly.  As a result, the
0xEE byte was interpreted as the letter מ and written to disk as
0x05DE, the UTF-16 encoding of that letter's codepoint.

But here's the catch: the native Windows port of GNU 'ls' I have here
shows that file as "included_latεn1.texi", because console Windows
programs generally use a different codepage (CP437 in my case) and
show non-ASCII text using a font that assumes CP437 encoding.  And the
MSYS version of 'ls' shows it as "included_lat?n1.texi", where the
question mark means that whatever encoding the MSYS programs are
expecting doesn't have any character which has the 0xEE byte as its
8-bit encoding, so that byte is replaced with '?' to indicate a
character that cannot be represented.  And so on and so forth.
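
The three displays can be reproduced with plain codec calls.  This is
only a sketch: the encoding the MSYS programs expect is not known here,
so ASCII is used as a stand-in assumption for "an encoding that cannot
represent the character", and the variable names are illustrative.

```python
# Sketch only: each step mimics one conversion with Python codecs.

# The name as stored on disk (what File Explorer and Emacs show).
stored = "included_lat\u05den1.texi"

# A 'char *' program gets the name converted to the system codepage.
as_bytes = stored.encode("cp1255")

# A console that renders those bytes with a CP437 font shows a Greek letter.
shown_cp437 = as_bytes.decode("cp437")

# ASSUMPTION: ASCII stands in for whatever encoding the MSYS tools expect;
# a character with no representation is replaced by '?'.
shown_msys = stored.encode("ascii", errors="replace").decode("ascii")

print(hex(as_bytes[12]), hex(ord(shown_cp437[12])), shown_msys)
```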

> I'm really doubtful that these tests can be made to work - if you are
> limited to an 8-bit encoding that is not Latin-1, how are tests using
> Latin-1 only characters going to work?  It seems easier for all involved
> to skip these tests.

I agree.  It might be possible to concoct a test that would work by
carefully choosing the bytes, but it would be unreliable, given the
different ports of Unix software used by people who care to run the
tests, and their different expectations and assumptions about
non-ASCII file names.

Non-ASCII file names on Windows behave sensibly only with the
'wchar_t *' file-name APIs, which ported Unix software almost never
uses, because switching to them requires significant surgery on
sources that use 'char *' file names all over.
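
A rough way to see the difference, again using Python codecs rather than
the Windows APIs themselves: a name survives the 'char *' path only if
every character exists in the codepage in use, whereas UTF-16 can hold
any of them.  The helper survives_encoding is hypothetical.

```python
# Sketch only: round-trips through an encoding model the 'char *' path
# (codepage) versus the 'wchar_t *' path (UTF-16).

def survives_encoding(name: str, encoding: str) -> bool:
    """True if the name can be encoded and decoded back unchanged."""
    try:
        return name.encode(encoding).decode(encoding) == name
    except UnicodeEncodeError:
        return False

latin1_name = "included_lat\u00een1.texi"   # the name the test intends
hebrew_name = "included_lat\u05den1.texi"   # the name actually stored

results = {
    ("latin1", "cp1255"): survives_encoding(latin1_name, "cp1255"),
    ("hebrew", "cp1255"): survives_encoding(hebrew_name, "cp1255"),
    ("hebrew", "cp437"): survives_encoding(hebrew_name, "cp437"),
    ("latin1", "utf-16-le"): survives_encoding(latin1_name, "utf-16-le"),
    ("hebrew", "utf-16-le"): survives_encoding(hebrew_name, "utf-16-le"),
}
for key, ok in results.items():
    print(key, ok)
```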


