bug-texinfo

Re: UTF-8 conversion problem in Texinfo 6.6 with TEXINFO_XS_PARSER


From: Gavin Smith
Subject: Re: UTF-8 conversion problem in Texinfo 6.6 with TEXINFO_XS_PARSER
Date: Sat, 23 Feb 2019 23:40:34 +0000
User-agent: Mutt/1.5.23 (2014-03-12)

On Wed, Feb 20, 2019 at 05:32:04AM +0200, Eli Zaretskii wrote:
> > I propose we make UTF-8 the default and clearly document this.  Anybody 
> > relying on a Latin-1 default is getting broken behaviour anyway in 
> > various output formats.
> 
> I think we should, yes.
> 
> Thanks.

I've made a start on this in revision 94ef53e.  However, the default output 
encoding for Info files still needs to change, so that e.g. @'e is 
output as é rather than e', and so that a Local Variables section 
stating the file encoding is written to the Info file.  My current work 
on this is below.

What I would like to do here is to avoid introducing UTF-8 into the output 
in places where it did not occur before without @documentencoding UTF-8.  
This covers Unicode directional quotation marks and the bullet symbol for 
lists.  Avoiding them would minimize the disruption for documents written 
in English.
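
As a rough illustration of the two behaviours described above, here is a
small sketch.  The helpers format_acute and format_quoted and their flag
arguments are hypothetical stand-ins made up for this example; they are
not texi2any functions, and the real converter is organised differently.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for converter behaviour; not texi2any code.

# @'e: emit a real accented character when the output is UTF-8,
# and an ASCII fallback (letter followed by ') otherwise.
sub format_acute {
  my ($letter, $output_is_utf8) = @_;
  my %acute = ('e' => "\x{E9}", 'a' => "\x{E1}");
  return $acute{$letter} if ($output_is_utf8 and exists($acute{$letter}));
  return $letter . "'";
}

# Quotation marks: use Unicode directional quotes only when the document
# explicitly asked for an encoding with @documentencoding.
sub format_quoted {
  my ($text, $output_is_utf8, $explicit_documentencoding) = @_;
  if ($output_is_utf8 and $explicit_documentencoding) {
    return "\x{2018}" . $text . "\x{2019}";
  }
  return "'" . $text . "'";
}

binmode(STDOUT, ':encoding(UTF-8)');
print format_acute('e', 1), "\n";        # é
print format_acute('e', 0), "\n";        # e'
print format_quoted('foo', 1, 0), "\n";  # 'foo'
print format_quoted('foo', 1, 1), "\n";  # ‘foo’
```

The point of the third call is that a UTF-8 default on its own leaves the
ASCII quotes alone; only an explicit @documentencoding switches them.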

Making various character encodings work properly is complex and 
time-consuming.  In texi2any there are several options and variables 
that affect how it works, like ENABLE_ENCODING, INPUT_ENCODING_NAME,
OUTPUT_ENCODING_NAME, INPUT_PERL_ENCODING and OUTPUT_PERL_ENCODING.  
Within the program there are various sets of options including parser 
options and converter options that affect each other.  I don't feel like 
I understand it all very well.  If there were some way of making it 
simpler, that would be great.  Having @include-specific file encodings, 
as we were discussing, wouldn't help.


If anybody knows more about the problem that I've commented on in the
_open_in subroutine below, it would be helpful to hear from them.
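
For anyone who wants to experiment with this outside texi2any, here is a
minimal standalone sketch of opening a file and applying an encoding layer
with binmode.  The helper name open_with_encoding is hypothetical and the
temporary file is only there to make the example self-contained; it does
not reproduce the read-ahead problem, it only shows the calls involved.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper (not part of texi2any): open FILE_NAME for reading
# and, if an encoding name is given, apply it as a PerlIO layer.
sub open_with_encoding {
  my ($file_name, $perl_encoding) = @_;

  open(my $fh, '<', $file_name) or return;
  if (defined($perl_encoding)) {
    # Note the closing parenthesis inside the layer string.
    binmode($fh, ":encoding($perl_encoding)") or return;
  }
  return $fh;
}

# Write "caf" plus the single byte 0xE9 ("é" in Latin-1) and a newline.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
binmode($tmp_fh);
print $tmp_fh "caf\xE9\n";
close($tmp_fh);

my $in = open_with_encoding($tmp_name, 'iso-8859-1')
  or die "could not open $tmp_name: $!";
my $line = <$in>;
close($in);
chomp($line);

# After decoding, the line is four characters long and ends in U+00E9.
printf("%d characters, last is U+%04X\n",
       length($line), ord(substr($line, -1)));
# prints "4 characters, last is U+00E9"
```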



diff --git a/tp/Texinfo/Convert/Plaintext.pm b/tp/Texinfo/Convert/Plaintext.pm
index 8d412ff..c04fef1 100644
--- a/tp/Texinfo/Convert/Plaintext.pm
+++ b/tp/Texinfo/Convert/Plaintext.pm
@@ -399,7 +399,8 @@ sub converter_initialize($)
 
   %{$self->{'style_map'}} = %style_map;
   if ($self->get_conf('ENABLE_ENCODING') and $self->get_conf('OUTPUT_ENCODING_NAME')
-      and $self->get_conf('OUTPUT_ENCODING_NAME') eq 'utf-8') {
+      and $self->get_conf('OUTPUT_ENCODING_NAME') eq 'utf-8'
+      and $self->{'extra'}->{'documentencoding'}) {
     # cache this to avoid redoing calls to get_conf
     $self->{'to_utf8'} = 1;
 
@@ -565,7 +566,8 @@ sub _process_text($$$)
     $text = uc($text);
   }
 
-  if ($self->{'to_utf8'}) {
+  if ($self->{'to_utf8'}
+      and $self->{'extra'}->{'documentencoding'}) {
     return Texinfo::Convert::Unicode::unicode_text($text, 
             $context->{'font_type_stack'}->[-1]->{'monospace'});
   } elsif (!$context->{'font_type_stack'}->[-1]->{'monospace'}) {
diff --git a/tp/Texinfo/ParserNonXS.pm b/tp/Texinfo/ParserNonXS.pm
index a815979..d791caa 100644
--- a/tp/Texinfo/ParserNonXS.pm
+++ b/tp/Texinfo/ParserNonXS.pm
@@ -732,13 +732,35 @@ sub parse_texi_text($$;$$$$)
   return $tree;
 }
 
+sub _open_in {
+  my ($self, $filehandle, $file_name) = @_;
+
+  if (open($filehandle, $file_name)) {
+    if (defined($self->{'INPUT_PERL_ENCODING'})) {
+      if ($self->{'INPUT_PERL_ENCODING'} eq 'utf-8-strict') {
+        binmode($filehandle, ":utf8");
+      } else {
+        binmode($filehandle, ":encoding($self->{'INPUT_PERL_ENCODING'})")
+          if (defined($self->{'INPUT_PERL_ENCODING'}));
+        # For UTF-8, this would lead to errors in Latin-1 input the first time 
+        # a line is read from the file, even though the binmode is changed 
+        # later.  Evidently Perl is checking ahead in the file to see if the 
+        # input is valid.
+      }
+    }
+    return 1;
+  } else {
+    return 0;
+  }
+}
+
 # parse a texi file
 sub parse_texi_file($$)
 {
   my ($self, $file_name) = @_;
 
   my $filehandle = do { local *FH };
-  if (! open($filehandle, $file_name)) { 
+  if (!_open_in($self, $filehandle, $file_name)) {
     $self->document_error(sprintf(__("could not open %s: %s"), 
                                   $file_name, $!));
     return undef;
@@ -785,6 +807,9 @@ sub parse_texi_file($$)
         }];
   $self->{'info'}->{'input_file_name'} = $file_name;
   $self->{'info'}->{'input_directory'} = $directories;
+  $self->{'info'}->{'input_perl_encoding'} = $self->{'INPUT_PERL_ENCODING'};
+  $self->{'info'}->{'input_encoding_name'} = $self->{'INPUT_ENCODING_NAME'};
+
   my $tree = $self->_parse_texi($root);
 
   # Find 'text_root', which contains everything before first node/section.
@@ -2922,10 +2947,8 @@ sub _end_line($$$)
           my $file = Texinfo::Common::locate_include_file($self, $text) ;
           if (defined($file)) {
             my $filehandle = do { local *FH };
-            if (open ($filehandle, $file)) {
+            if (_open_in ($self, $filehandle, $file)) {
               $included_file = 1;
-              binmode($filehandle, ":encoding($self->{'INPUT_PERL_ENCODING'})")
-                if (defined($self->{'INPUT_PERL_ENCODING'}));
               print STDERR "Included $file($filehandle)\n" if ($self->{'DEBUG'});
               my ($directories, $suffix);
               ($file, $directories, $suffix) = fileparse($file)


