
Re: Non-ASCII characters in @include search path


From: Patrice Dumas
Subject: Re: Non-ASCII characters in @include search path
Date: Thu, 24 Feb 2022 14:33:11 +0100

On Wed, Feb 23, 2022 at 07:31:52PM +0000, Gavin Smith wrote:
> 
> I think there is some misunderstanding here.  The filenames are decoded
> when read from the file according to the document encoding, and when the
> error messages are printed, the locale encoding is used.  All this is
> separate to the question of how to find the files on the filesystem.

The question is whether to use the locale encoding for the file names on
the filesystem or not.  But you answer that question below.

I checked the code a bit; there are definitely places where strings
coming from manuals, and decoded to the Perl internal codepoints,
are mixed with strings coming from the command line that are left
as is.  Hopefully mostly in tp/Texinfo/Convert/Converter.pm
determine_files_and_directory(), in addition to error messages related
to @include, @image and @verbatiminclude.
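
To illustrate the kind of mixing I mean, here is a minimal standalone
sketch (the strings and names are made up, not taken from the actual
code):

  use Encode qw(decode);

  # Bytes as read from a Latin-1 encoded manual ("café"):
  my $bytes_from_manual = "caf\xe9";
  # Decoded to Perl internal codepoints, per @documentencoding:
  my $manual_file = decode('iso-8859-1', $bytes_from_manual);

  # A directory from the command line is left as raw bytes; here,
  # UTF-8 bytes for "répertoire":
  my $outdir = "r\xc3\xa9pertoire";

  # Joining them mixes a character string with a byte string: Perl
  # upgrades the bytes as if they were Latin-1 codepoints, so the
  # UTF-8 bytes turn into mojibake:
  my $output_file = "$outdir/$manual_file";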

> > I also checked that in an 8-bit locale an @include file with an
> > accent in the name is not found (because the file name is encoded
> > to utf-8).
> 
> Agreed.  I made the following fix for this...

It fixes the NonXS parser (I modified where it is done, so as to do it
before locate_include_file, but kept your code), but not the XS
parser.  In the XS parser, the @include file name is converted to utf-8
upon reading.  If the file name is encoded in another encoding on the
filesystem, it won't be found (I tested, and it is indeed the case).

To do something similar to the NonXS parser, one would need, maybe
in Texinfo/XS/parsetexi/end_line.c in end_line_misc_line around line
1428, before fullpath = locate_include_file (text); is called, to
convert text to the @documentencoding unless it is utf-8 or ascii.
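
In Perl terms, this is roughly what the NonXS parser now does before
locating the file; a sketch only, with made-up values (the XS parser
would need the C equivalent, with iconv say, at that point):

  use Encode qw(encode);

  my $encoding = 'iso-8859-1';      # from @documentencoding
  my $text = "ann\x{e9}xe.texi";    # decoded @include argument

  my $file_name = $text;
  if ($encoding ne 'utf-8' and $encoding ne 'us-ascii') {
    # Re-encode to the document encoding so that the bytes handed
    # to the filesystem match how the included file is named:
    $file_name = encode($encoding, $text);
  }
  # $file_name would then go to the include file search, as in:
  # my $fullpath = locate_include_file($file_name);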

> I haven't had time to properly install and test a non-UTF-8 locale yet,
> so please test this (I've committed this change).

As I said above, it works for the NonXS parser.

> I understand that this would be for a Texinfo file encoded in an 8-bit
> encoding which includes a file whose name is in the same encoding on
> the filesystem.

Indeed.

> You wrote:
> > I think that your commit
> > e11835b62d8f3d43c608013d21683c72e9a54cc3 "@include file name encoding"
> > would still need to be modified in order to use a specific encoding to
> > encode the file name to and not simply use utf8::encode as the file
> > names encoding may not be utf8.  Using the locale encoding as the
> > default seems better to me, with a possibility to modify the value on
> > the command line, and FILE_NAMES_ENCODING_NAME could be used for that.
> > To be checked, but it seems to me that in the XS parser this information
> > should also be used where the include file name string (and maybe other
> > file names) should be converted to that encoding from utf-8 if that
> > encoding is different from utf-8.
> 
> Whatever we do, it should be concordant with TeX's filename handling.
> I imagine that TeX (except possibly on MS-Windows) would just use the
> bytes, so, so should we.
> 
> In any case, the cases we are dealing with are very rare here, but I
> just don't see that the situation is very common where somebody works
> in a non-UTF-8 locale, has all their filenames in this encoding, and
> recodes any files they download from the Internet or extract from a
> tar file into that encoding.  I've no insight into what use case we
> would be supporting by using the locale encoding to interpret any
> filenames.

It could also be the reverse: somebody works in a UTF-8 locale
with a manual in an 8-bit encoding and recodes the file names to
utf-8.

> It seems much more likely to me that somebody would be using a
> non-UTF-8 locale for whatever reason, and would download Texinfo
> files with UTF-8 names without recoding the names, and still
> expect to be able to build them.  (Even if they can't type the
> names in, it may get built with Makefile rules.)

To me both are possible.  Speaking for GNU/Linux, some years ago, when
there were still 8-bit locales, it would have been reasonable to
assume that people would process differently encoded manuals and recode
the file names without changing the manual itself (either 8-bit encoded
manuals in a utf8 locale or utf8 manuals in an 8-bit locale).  Today
this is less likely to happen, while your scenario is more likely, as
all the manuals should be converted to utf-8, all the locales should be
utf8, and more file names should be in utf8, even in 8-bit locales.

> Some filtering with a customization variable may be necessary for
> unusual operating systems and/or filesystems.

Yes, I'll add that afterwards if you don't.  I think that it will need
to be obeyed by the XS parser too, in the same way as the @include file
names should be converted from utf-8 to the @documentencoding.
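
A sketch of what I have in mind, with get_conf() as a made-up
stand-in for however the customization variable would actually be
looked up:

  use I18N::Langinfo qw(langinfo CODESET);
  use Encode qw(encode);

  # Stand-in for the real customization variable lookup:
  sub get_conf { return undef; }

  my $file_names_encoding = get_conf('FILE_NAMES_ENCODING_NAME');
  # Default to the locale encoding when the variable is not set:
  $file_names_encoding = langinfo(CODESET)
    if (!defined($file_names_encoding));

  # Encode a decoded file name to the bytes used on the filesystem:
  my $file_name = "ann\x{e9}xe.texi";
  my $file_name_bytes = encode($file_names_encoding, $file_name);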

> I've done this now.  It could be improved to match the data structure
> of Texinfo::Report more directly; then the array could simply be
> copied across in one go, rather than with individual calls to
> line_error.

It seems fine now as it is; there would be some speed gain, but in
most cases very little, as manuals are not supposed to produce lots
of error messages.  Also, I do not like the API in Texinfo::Report
that much; maybe I should modify that first.

-- 
Pat


