bug-texinfo

Re: Non-ASCII characters in @include search path


From: Gavin Smith
Subject: Re: Non-ASCII characters in @include search path
Date: Sun, 20 Feb 2022 17:27:51 +0000

On Sun, Feb 20, 2022 at 04:13:43PM +0100, Patrice Dumas wrote:
> Another reason for decoding and encoding everything is error messages.
> I am actually a bit surprised that nobody ever complained that error
> messages are not encoded.

I am not sure what the best approach is.  Say there is code like:

        $self->_line_error(sprintf(__("bad or empty \@%s formal argument: %s"),
                                           $command, $formal_arg), $line_nr);

Now, I expect that the return value from __ WOULD be encoded.  But then
other strings are interpolated into it by sprintf.  If any of them are not
encoded for the output, a mismatch occurs.

From https://metacpan.org/pod/Locale::Messages:

> Note for Perl 5.6 and later: The returned string will always have the UTF-8 
> flag off by default. 

This is what you would expect from an encoded string.
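To make the mismatch concrete, here is a minimal sketch using only the core Encode module.  It does not call __ or Locale::Messages at all: $translated is a hand-built stand-in, on the assumption that gettext hands back UTF-8 bytes with the UTF-8 flag off, and $node_name is a stand-in for text that was decoded when the Texinfo file was read.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Stand-in for the return value of __(): UTF-8 *bytes*, UTF-8 flag off.
# The text is "nœud « %s » non référencé".
my $translated = encode('UTF-8',
    "n\x{153}ud \x{ab} %s \x{bb} non r\x{e9}f\x{e9}renc\x{e9}");

# Stand-in for a node name ("ésseulé") decoded to Perl characters.
my $node_name = "\x{e9}sseul\x{e9}";

# sprintf mixes bytes and characters.  Perl treats each byte of the
# format as a Latin-1 character, so the translated part is garbled.
my $mixed = sprintf($translated, $node_name);

# Decoding the format first keeps everything in characters.
my $clean = sprintf(decode('UTF-8', $translated), $node_name);

binmode(STDOUT, ':encoding(UTF-8)');
print "mixed: $mixed\n";
print "clean: $clean\n";
```

In the "mixed" line the translated words come out garbled while the node name is intact; in the "clean" line both parts are correct.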

> For example, in the following the file name is output correctly as it is
> not decoded, but the string from the Texinfo file is decoded but not
> encoded and hence ends up incorrect in the message.  Decoding everything
> and then encoding the error messages should allow to mix strings from
> different sources and different encodings.
> 
> $ ./texi2any.pl testé.texi
> testé.texi:8: warning: node `�sseul�' unreferenced

Suppose the translation for the word "node" was non-ASCII.  I'd expect
the translation for that word to be encoded correctly in the output, even
if the node name weren't.

I haven't been able to test it yet but there is a translation in French:

#: tp/Texinfo/Structuring.pm:429
#, perl-format
msgid "node `%s' unreferenced"
msgstr "nœud « %s » non référencé"

If the error message became something like

"nœud « �sseul� » non référencé"

then encoding this to UTF-8 would break the parts which already were in
UTF-8.
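A small sketch of that breakage, again with only the core Encode module.  The variable names are mine, and $already_utf8 is a hand-built stand-in for an already-encoded part of the message, not real gettext output.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# "référencé" as UTF-8 bytes, standing in for a part of the message
# that gettext has already encoded.
my $already_utf8 = encode('UTF-8', "r\x{e9}f\x{e9}renc\x{e9}");

# Encoding again treats every byte as a Latin-1 character, so each
# two-byte "é" sequence is encoded a second time.
my $twice = encode('UTF-8', $already_utf8);

printf "encoded once:  %d bytes\n", length $already_utf8;
printf "encoded twice: %d bytes\n", length $twice;

# A single decode no longer gets back to the original characters.
my $decoded_once = decode('UTF-8', $twice);
print "still bytes after one decode\n" if $decoded_once eq $already_utf8;
```

A terminal expecting UTF-8 would show the doubly encoded word as something like "rÃ©fÃ©rencÃ©".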

The only way out would seem to be different use of the gettext functions.

I don't see that there is an option in Locale::Messages or Locale::TextDomain
to get "unencoded" output, that is in Perl's internal string format.  The
closest that could be done is to always output to UTF-8, possibly set the UTF-8
flag on the resulting string, and then convert this to the final message
encoding at the end.

The other way would be to convert everything to the final encoding at the
time of interpolation.  I couldn't really see an easy way of doing this.

So my best idea at the moment for fixing the encoding of the error messages is:
* When calling gettext and related functions, always demand UTF-8, and convert
this back into Perl's internal coding afterwards.
* Convert the messages at the time they are output.

For example, if a node name is in EUC-JP, it would be decoded into Perl's
internal string format when the file is read.  The node name could then be
interpolated directly into an error message that had been decoded from UTF-8
in the same way.  If the user actually wanted error messages to be printed in
EUC-JP, then the whole error message would be converted to EUC-JP at the end,
when it is output.
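Roughly, the two steps could look like the following.  This is only a sketch under assumptions: gettext_utf8_bytes and gettext_decoded are hypothetical helpers, not functions from Locale::Messages or from the texi2any sources, and the stub simply returns the msgid as gettext does when there is no translation.  $message_encoding is likewise an invented name for whatever the user asked for.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical stand-in for a gettext lookup set up to always return
# UTF-8 bytes.  It returns the msgid itself (no translation found).
sub gettext_utf8_bytes {
    my ($msgid) = @_;
    return encode('UTF-8', $msgid);
}

# Hypothetical wrapper: convert to Perl's internal character strings
# immediately after the lookup.
sub gettext_decoded {
    my ($msgid) = @_;
    return decode('UTF-8', gettext_utf8_bytes($msgid));
}

# A node name as raw EUC-JP bytes (U+7BC0 U+70B9), decoded when "read".
my $input_bytes = encode('euc-jp', "\x{7bc0}\x{70b9}");
my $node_name   = decode('euc-jp', $input_bytes);

# Interpolation happens entirely on character strings.
my $message = sprintf(gettext_decoded("node `%s' unreferenced"), $node_name);

# Conversion to the message encoding happens once, at output time.
my $message_encoding = 'euc-jp';    # assumed user preference
my $output_bytes = encode($message_encoding, $message);
print STDERR $output_bytes, "\n";
```

The point of the arrangement is that sprintf never sees a mix of bytes and characters, and the choice of output encoding is confined to one place.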

As far as filename encoding goes, I suspect that filenames are used in
messages in only a few places in the source code, so the decoding of filenames
may be something that can be confined to those places.



> testé.texi:8: warning: node `' unreferenced
> 
> 
> $ cat testé.texi 
> \input texinfo.tex
> 
> @setfilename testé.info
> 
> @node Top
> @top Testé
> 
> @node ésseulé
> 
> @node Chapitré
> @chapter Chapitré


