Re: Non-ASCII characters in @include search path

bug-texinfo

From:	Gavin Smith
Subject:	Re: Non-ASCII characters in @include search path
Date:	Sat, 26 Feb 2022 16:29:41 +0000

On Sat, Feb 26, 2022 at 12:17:46AM +0100, Patrice Dumas wrote:
> On Mon, Feb 21, 2022 at 08:46:56PM +0000, Gavin Smith wrote:
> > On Sun, Feb 20, 2022 at 10:32:00PM +0100, Patrice Dumas wrote:
> > > On Sun, Feb 20, 2022 at 05:27:51PM +0000, Gavin Smith wrote:
> > > > If the error message became something like
> > > > 
> > > > "nœud « �sseul� » non référencé"
> > > > 
> > > > then encoding this to UTF-8 would break the parts which already were in
> > > > UTF-8.
> > > 
> > > I just committed input decoding (command line, environment, translated
> > > messages) and output messages encoding.  I left file names as is, but
> > > prepared a customization variable for them.
> > > 
> > > Now the error message is:
> > > 
> > > testÃ©.texi:8: warning: nœud « ésseulé » non référencé
> > 
> > One way of fixing this would be to store the filename separately along with
> > the rest of the error message, and prepend the filename when it is output.
> > I can try to implement this.
> 
> I am reviewing the code to find where we mix file names that will be
> used as bytes at some point and character strings, and it is very common.
> 
> * unless I missed something, string constants are character strings. If
>   they are to appear mostly in file names we need to encode them at some
>   point, but it does not seem easy to me to decide when, except
>   when we are sure that the string will only be considered as a byte
>   sequence from then on.

If string constants in the Perl source code are purely ASCII then there
is no problem.  They can be used in error messages, inside the output
files, or used to open files on the filesystem.

For example, in HTML.pm, TOP_FILE is set as 'index.html'.  This can
be used in hyperlinks to that file as well as to create it.

There could be a problem if a variable like TOP_FILE was set from
the command line to some non-ASCII value.
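
To illustrate the difference, here is a small sketch in Python, used only
because its str/bytes split resembles Perl's character/byte strings; it is
not Texinfo code.  The value "début.html" is a made-up example of a
non-ASCII setting.

```python
# Python analogue only; the variable names are illustrative, not Texinfo API.

# A purely ASCII constant yields the same bytes in any ASCII-compatible
# encoding, so it is safe both in output text and as a file name.
top_file = "index.html"
ascii_utf8 = top_file.encode("utf-8")
ascii_latin1 = top_file.encode("latin-1")

# A hypothetical non-ASCII value set by the user: the bytes now depend on
# which encoding is chosen, so the choice has to be made deliberately.
custom = "début.html"
utf8_bytes = custom.encode("utf-8")      # b'd\xc3\xa9but.html'
latin1_bytes = custom.encode("latin-1")  # b'd\xe9but.html'
```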

> * many strings are used both in file names and in texts.  For example
>   the customization variable 'EXTENSION'.  Even strings that are almost
>   only used as bytes can appear in error messages, which means that we
>   need to keep the information somewhere on how to decode them.

It is no problem as long as the EXTENSION string is purely ASCII.

> * many strings can come from documents, as character strings or from
>   command line, possibly kept encoded.  For example document file name
>   can come from @setfilename or the command line (or customization
>   variable).

This is a bigger problem as the filename could be non-ASCII, unlike
the extension.
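
As a rough illustration of what goes wrong, the following Python sketch
(again an analogue, not the Perl code in Texinfo) reproduces the garbled
file name from the warning quoted earlier.  It assumes the file name arrives
as UTF-8 bytes and is joined to decoded message text as if each byte were
one Latin-1 character, which is, as far as I understand, what Perl does
when it upgrades a byte string.

```python
# Python analogue only; names here are illustrative.

# File name as raw bytes (assumed UTF-8), message as decoded text.
file_name_bytes = "testé.texi".encode("utf-8")
message = "warning: nœud « ésseulé » non référencé"

# Mixing them by treating every byte as one character gives mojibake
# once the whole line is later encoded to UTF-8 for output.
mixed = file_name_bytes.decode("latin-1") + ":8: " + message

# Decoding the file name with its actual encoding first avoids this.
fixed = file_name_bytes.decode("utf-8") + ":8: " + message
```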

I will try to understand the code and run some tests after I install
a non-UTF-8 locale.

> * it is much simpler to require customization variables from init
>   files to be character strings, which means that we need an API to
>   encode those we want to mix with bytes, and we cannot do this early so
>   it means more complexity.
> 
> For all those reasons, I really think that we should use character
> strings almost everywhere and encode when needed, such that there is
> no need to track down where a string comes from to be sure whether it
> is encoded or not.  We already decode and encode in many places as we
> have file names used in error messages combined with character strings,
> character strings from Texinfo manuals that need to be encoded.  The
> gain from avoiding decoding and encoding a few strings is not worth, in
> my opinion, the complexity of having strings that cannot be mixed.
> 
> In some cases, we can decide to consider encoded strings, still, but I
> think that it should only be if we are sure that they will not ever be
> mixed with decoded character strings.

I hope the complexity in dealing with filename encodings can be kept to a
minimum.  Doing it the way you say might be simpler, but we should check that
a few use cases work.  I want to see if any issues can be fixed with the
existing approach.
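
For reference, the idea quoted above of storing the file name separately
from the rest of the error message could look roughly like this.  It is a
Python sketch under my own assumptions, not the Texinfo implementation;
the Diagnostic class and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Diagnostic:
    """Hypothetical record keeping the file name apart from the message."""
    file_name: bytes  # raw bytes, never decoded
    line: int
    text: str         # decoded message text

    def to_bytes(self, output_encoding: str = "utf-8") -> bytes:
        # Only the message text is encoded; the file name bytes are
        # prepended unchanged at output time.
        prefix = self.file_name + b":" + str(self.line).encode("ascii") + b": "
        return prefix + self.text.encode(output_encoding)


diag = Diagnostic("testé.texi".encode("utf-8"), 8,
                  "warning: nœud « ésseulé » non référencé")
output_line = diag.to_bytes()
```

The file name is passed through untouched, so it is shown correctly whenever
the terminal expects the same encoding the file system uses.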
