Re: UTF-8 conversion problem in Texinfo 6.6 with TEXINFO_XS_PARSER
From: Gavin Smith
Subject: Re: UTF-8 conversion problem in Texinfo 6.6 with TEXINFO_XS_PARSER
Date: Sat, 23 Feb 2019 23:40:34 +0000
User-agent: Mutt/1.5.23 (2014-03-12)
On Wed, Feb 20, 2019 at 05:32:04AM +0200, Eli Zaretskii wrote:
> > I propose we make UTF-8 the default and clearly document this. Anybody
> > relying on a Latin-1 default is getting broken behaviour anyway in
> > various output formats.
>
> I think we should, yes.
>
> Thanks.
I've made a start on this in revision 94ef53e. However, the default output
encoding for Info files still needs to change, so that e.g. @'e is
output as é rather than as e', and so that a Local Variables section
stating the file encoding is written to the Info file. My current work on
this is below.
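To make the intended difference concrete, here is a minimal sketch in Perl.
The names acute_accent and info_trailer are hypothetical and are not
functions in texi2any, and the trailer layout is written from memory of what
Info files usually contain, so treat it as an assumption.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Hypothetical helper: render @'<letter> either as a precomposed Unicode
# character or as the ASCII fallback used when no encoding is enabled.
sub acute_accent {
    my ($letter, $to_utf8) = @_;
    return $to_utf8
        ? NFC($letter . "\x{0301}")   # base letter + COMBINING ACUTE ACCENT
        : $letter . "'";              # ASCII fallback, e.g. e'
}

# Hypothetical helper: the trailer that tells Emacs which encoding the
# Info file uses. The exact layout is an assumption.
sub info_trailer {
    my ($encoding) = @_;
    return "\x{1F}\nLocal Variables:\ncoding: $encoding\nEnd:\n";
}

# Print code points rather than the characters themselves, so the sketch
# does not depend on the encoding of the terminal.
printf "UTF-8 output:  U+%04X\n", ord(acute_accent('e', 1));
printf "ASCII output:  %s\n", acute_accent('e', 0);
```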
What I would like to do here is to avoid UTF-8 in the output in places where
it did not previously occur without @documentencoding UTF-8. In particular,
Unicode directional quotation marks and the bullet symbol for lists would
only be used when the document gives @documentencoding explicitly. This
would minimize the disruption for documents written in English.
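The rule can be sketched as a single predicate. This is a simplified
illustration of the condition the patch below adds, not the converter code
itself: use_unicode_glyphs is a hypothetical name, and the plain hash stands
in for the converter's get_conf calls.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of texi2any: decide whether plaintext/Info
# output may use non-ASCII glyphs such as curly quotes and bullets.
# $conf is a hash of customization variables; $documentencoding is the
# value of an explicit @documentencoding line, or undef if there was none.
sub use_unicode_glyphs {
    my ($conf, $documentencoding) = @_;
    return 0 unless $conf->{'ENABLE_ENCODING'};
    return 0 unless defined $conf->{'OUTPUT_ENCODING_NAME'}
        and $conf->{'OUTPUT_ENCODING_NAME'} eq 'utf-8';
    # UTF-8 being the default is not enough: the document must ask for it.
    return defined($documentencoding) ? 1 : 0;
}

my %conf = ('ENABLE_ENCODING' => 1, 'OUTPUT_ENCODING_NAME' => 'utf-8');
for my $enc (undef, 'UTF-8') {
    my $bullet = use_unicode_glyphs(\%conf, $enc) ? "\x{2022}" : '*';
    # Show the code point of the list marker that would be chosen.
    printf "%-8s -> U+%04X\n", $enc // '(none)', ord($bullet);
}
```

With this rule, a document with no @documentencoding line keeps its ASCII
quotes and list markers even though the output encoding defaults to UTF-8.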
Making various character encodings work properly is complex and
time-consuming. In texi2any there are several options and variables
that affect how it works, like ENABLE_ENCODING, INPUT_ENCODING_NAME,
OUTPUT_ENCODING_NAME, INPUT_PERL_ENCODING and OUTPUT_PERL_ENCODING.
Within the program there are various sets of options including parser
options and converter options that affect each other. I don't feel like
I understand it all very well. If there were some way of making it
simpler, that would be great. Having @include-specific file encodings,
as we were discussing, wouldn't help.
If anybody knows more about the problem that I describe in the comment in
the _open_in subroutine below, it would be helpful to hear from them.
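For anyone who wants to look at the symptom in isolation, here is a small
standalone sketch that does not use any Texinfo code. It writes a Latin-1
file whose first line is pure ASCII, reads it back through a UTF-8 decoding
layer, and counts the warnings seen after the first line and after the whole
file. My assumption is that the layer decodes a whole buffer at a time
rather than line by line, which would explain the early diagnostics; the
file contents and variable names are made up for the test.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small Latin-1 file: the first two lines are pure ASCII, the third
# contains the byte 0xE9 (e-acute in Latin-1), which is invalid as UTF-8.
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out, ':raw');
print $out "\\input texinfo\n", "\@documentencoding ISO-8859-1\n", "caf\xE9\n";
close($out) or die "close: $!";

# Collect warnings instead of printing them.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

open(my $in, '<', $path) or die "open: $!";
binmode($in, ':encoding(UTF-8)');

# Read only the first (ASCII) line, and note how many warnings we have.
my $first_line = <$in>;
my $after_first = scalar @warnings;

# Read the remainder of the file.
my @rest = <$in>;
close($in);
my $after_all = scalar @warnings;

print "warnings after first line: $after_first, ",
      "after whole file: $after_all\n";
```

If the first count is already non-zero, the invalid byte was diagnosed
before the line containing it was requested, so changing the binmode after
seeing @documentencoding comes too late for that buffer.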
diff --git a/tp/Texinfo/Convert/Plaintext.pm b/tp/Texinfo/Convert/Plaintext.pm
index 8d412ff..c04fef1 100644
--- a/tp/Texinfo/Convert/Plaintext.pm
+++ b/tp/Texinfo/Convert/Plaintext.pm
@@ -399,7 +399,8 @@ sub converter_initialize($)
%{$self->{'style_map'}} = %style_map;
if ($self->get_conf('ENABLE_ENCODING') and
$self->get_conf('OUTPUT_ENCODING_NAME')
- and $self->get_conf('OUTPUT_ENCODING_NAME') eq 'utf-8') {
+ and $self->get_conf('OUTPUT_ENCODING_NAME') eq 'utf-8'
+ and $self->{'extra'}->{'documentencoding'}) {
# cache this to avoid redoing calls to get_conf
$self->{'to_utf8'} = 1;
@@ -565,7 +566,8 @@ sub _process_text($$$)
$text = uc($text);
}
- if ($self->{'to_utf8'}) {
+ if ($self->{'to_utf8'}
+ and $self->{'extra'}->{'documentencoding'}) {
return Texinfo::Convert::Unicode::unicode_text($text,
$context->{'font_type_stack'}->[-1]->{'monospace'});
} elsif (!$context->{'font_type_stack'}->[-1]->{'monospace'}) {
diff --git a/tp/Texinfo/ParserNonXS.pm b/tp/Texinfo/ParserNonXS.pm
index a815979..d791caa 100644
--- a/tp/Texinfo/ParserNonXS.pm
+++ b/tp/Texinfo/ParserNonXS.pm
@@ -732,13 +732,35 @@ sub parse_texi_text($$;$$$$)
return $tree;
}
+sub _open_in {
+ my ($self, $filehandle, $file_name) = @_;
+
+ if (open($filehandle, $file_name)) {
+ if (defined($self->{'INPUT_PERL_ENCODING'})) {
+ if ($self->{'INPUT_PERL_ENCODING'} eq 'utf-8-strict') {
+ binmode($filehandle, ":utf8");
+ } else {
+ binmode($filehandle, ":encoding($self->{'INPUT_PERL_ENCODING'})")
+ if (defined($self->{'INPUT_PERL_ENCODING'}));
+ # For UTF-8, this would lead to errors in Latin-1 input the first time
+ # a line is read from the file, even though the binmode is changed
+ # later. Evidently Perl is checking ahead in the file to see if the
+ # input is valid.
+ }
+ }
+ return 1;
+ } else {
+ return 0;
+ }
+}
+
# parse a texi file
sub parse_texi_file($$)
{
my ($self, $file_name) = @_;
my $filehandle = do { local *FH };
- if (! open($filehandle, $file_name)) {
+ if (!_open_in($self, $filehandle, $file_name)) {
$self->document_error(sprintf(__("could not open %s: %s"),
$file_name, $!));
return undef;
@@ -785,6 +807,9 @@ sub parse_texi_file($$)
}];
$self->{'info'}->{'input_file_name'} = $file_name;
$self->{'info'}->{'input_directory'} = $directories;
+ $self->{'info'}->{'input_perl_encoding'} = $self->{'INPUT_PERL_ENCODING'};
+ $self->{'info'}->{'input_encoding_name'} = $self->{'INPUT_ENCODING_NAME'};
+
my $tree = $self->_parse_texi($root);
# Find 'text_root', which contains everything before first node/section.
@@ -2922,10 +2947,8 @@ sub _end_line($$$)
my $file = Texinfo::Common::locate_include_file($self, $text) ;
if (defined($file)) {
my $filehandle = do { local *FH };
- if (open ($filehandle, $file)) {
+ if (_open_in ($self, $filehandle, $file)) {
$included_file = 1;
- binmode($filehandle, ":encoding($self->{'INPUT_PERL_ENCODING'})")
- if (defined($self->{'INPUT_PERL_ENCODING'}));
print STDERR "Included $file($filehandle)\n" if
($self->{'DEBUG'});
my ($directories, $suffix);
($file, $directories, $suffix) = fileparse($file)