Re: Using iconv in stand-alone info

bug-texinfo

From:	Eli Zaretskii
Subject:	Re: Using iconv in stand-alone info
Date:	Thu, 24 Dec 2015 22:23:06 +0200

> Date: Thu, 24 Dec 2015 19:45:56 +0000
> From: Gavin Smith <address@hidden>
> Cc: Texinfo <address@hidden>
> 
> > Here's what I came up with, please see if it looks better now.
> 
> It looks okay as far as I can tell without testing it, except for this 
> addition:
> 
> >        else
> >          {
> >            utf8_char_ptr = utf8_char;
> >            /* i is width of UTF-8 character */
> >            degrade_utf8 (&utf8_char_ptr, &i);
> > +         /* If we are done, make sure iconv flushes the last character.  */
> > +         if (bytes_left <= 0)
> > +           {
> > +             utf8_char_ptr = utf8_char;
> > +             i = 4;
> > +             iconv (iconv_to_utf8, NULL, NULL,
> > +                    &utf8_char_ptr, &utf8_char_free);
> > +             if (utf8_char_ptr > utf8_char)
> > +               {
> > +                 utf8_char_ptr = utf8_char;
> > +                 degrade_utf8 (&utf8_char_ptr, &i);
> > +               }
> > +           }
> >          }
> 
> That's okay for that code path, but I wonder if we should also call
> iconv to flush the last character after the main loop exits because of
> this condition:
> 
>     if (iconv_ret != (size_t) -1)
>         /* Success: all of input converted. */
>         break;

But we do: that's the other hunk in the diffs:

-      if (iconv_ret != (size_t) -1)
+      /* Make sure libiconv flushes out the last converted character.
+        This is required when the conversion is stateful, in which
+        case libiconv might not output the last character, waiting to
+        see whether it should be combined with the next one.  */
+      if (iconv_ret != (size_t) -1
+         && text_buffer_iconv (&output_buf, iconv_to_output,
+                               NULL, NULL) != (size_t) -1)
         /* Success: all of input converted. */
         break;

> I'm trying to read the libc manual closely and, actually, it's
> probably not necessary:
> 
>      If all input from the input buffer is successfully converted and
>      stored in the output buffer, the function returns the number of
>      non-reversible conversions performed.  In all other cases the
>      return value is `(size_t) -1' and `errno' is set appropriately.
> 
> So if there's one character held back waiting for a following
> combining character, there won't be a positive return value indicating
> success.

The manual is inaccurate.  It shouldn't say "and stored in the output
buffer", or it should at least clarify that the last character is
sometimes not stored until the flushing call.

Look at the source code of 'iconv', the utility that comes with
libiconv, and you will see that it actually does make this last call
every time it finishes conversion.
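
To illustrate the flushing call described above, here is a minimal
sketch of the pattern: convert the input, then call 'iconv' once more
with a NULL input buffer so that anything still held in the conversion
state is written out.  The helper name convert_and_flush is
hypothetical and not part of info; the sketch assumes the output
buffer is large enough and does not handle E2BIG.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper (not part of info): convert INLEN bytes at IN
   from encoding FROM to encoding TO, storing the result in OUT, which
   holds OUTSIZE bytes.  Returns the number of bytes written to OUT,
   or (size_t) -1 on any error.  */
size_t
convert_and_flush (const char *from, const char *to,
                   const char *in, size_t inlen,
                   char *out, size_t outsize)
{
  iconv_t cd = iconv_open (to, from);
  char *inp = (char *) in;      /* iconv takes a non-const pointer */
  char *outp = out;
  size_t inleft = inlen;
  size_t outleft = outsize;
  size_t ret;

  if (cd == (iconv_t) -1)
    return (size_t) -1;

  /* Main conversion call.  */
  ret = iconv (cd, &inp, &inleft, &outp, &outleft);

  /* Flushing call: NULL input with a real output buffer asks iconv to
     emit whatever it is still holding back and to reset its state.  */
  if (ret != (size_t) -1)
    ret = iconv (cd, NULL, NULL, &outp, &outleft);

  iconv_close (cd);

  if (ret == (size_t) -1)
    return (size_t) -1;
  return (size_t) (outp - out);
}
```

For a stateless pair of encodings the second call should write
nothing, so making it unconditionally costs little.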

> But if that interpretation is correct, then why should the following
> be necessary?
> 
> +      /* Make sure libiconv flushes out the last converted character.
> +        This is required when the conversion is stateful, in which
> +        case libiconv might not output the last character, waiting to
> +        see whether it should be combined with the next one.  */
> +      if (iconv_ret != (size_t) -1
> +         && text_buffer_iconv (&output_buf, iconv_to_output,
> +                               NULL, NULL) != (size_t) -1)
> 
> So maybe it is necessary after exiting the main loop, and the wording
> in the manual is misleading.

The above _is_ when we are about to exit, so I'm not sure what you are
saying here.

>      /* If file is not in UTF-8, we degrade to ASCII in two steps:
>          first convert the character to UTF-8, then look up a replacement
>          string.  Note that mixing iconv_to_output and iconv_to_utf8
>          on the same input may not work well if the input encoding
>          is stateful.  We could deal with this by always converting to
>          UTF-8 first; then we could mix conversions on the UTF-8 stream. */
> 
> > Having played with this code, I must say that I feel it's based on
> > somewhat fragile assumptions whose validity is not clear to me.
> 
> It will take me some more time to respond to this. If you find code
> that you think is correct and works, by all means please go ahead and
> commit it.

I couldn't find better code, because the behavior of the input
pointer when 'iconv' fails with E2BIG is not documented.  I did
actually see it sometimes incremented by two characters' worth of
bytes when the conversion produced only one character in the output
buffer.
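
As an illustration of the E2BIG issue, the following is a rough sketch
of a conversion loop that does not assume any fixed relation between
how far the input pointer moved and how much output was produced.  It
trusts only the pointers and counts that 'iconv' itself updates, and
it finishes with the flushing call.  The helper name
convert_in_chunks and the 8-byte scratch buffer are assumptions made
for the example; this is not the code in info.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: push INLEN bytes at IN through the open
   descriptor CD using a deliberately small scratch buffer, appending
   the converted bytes to OUT (OUTSIZE bytes).  Returns the number of
   bytes stored in OUT, or (size_t) -1 on error.  */
size_t
convert_in_chunks (iconv_t cd, const char *in, size_t inlen,
                   char *out, size_t outsize)
{
  char chunk[8];                /* small on purpose, to provoke E2BIG */
  char *inp = (char *) in;
  size_t inleft = inlen;
  size_t total = 0;
  int flushing = 0;

  for (;;)
    {
      char *cp = chunk;
      size_t cleft = sizeof chunk;
      size_t ret;
      size_t produced;

      if (flushing)
        ret = iconv (cd, NULL, NULL, &cp, &cleft);
      else
        ret = iconv (cd, &inp, &inleft, &cp, &cleft);

      /* Anything other than "output buffer full" is a real error.  */
      if (ret == (size_t) -1 && errno != E2BIG)
        return (size_t) -1;

      /* Measure output by what was written, never by how far the
         input pointer advanced.  */
      produced = sizeof chunk - cleft;
      if (produced > outsize - total)
        return (size_t) -1;
      memcpy (out + total, chunk, produced);
      total += produced;

      if (ret == (size_t) -1)
        {
          /* E2BIG: retry with an empty scratch buffer; give up if no
             progress was made, to avoid looping forever.  */
          if (produced == 0)
            return (size_t) -1;
          continue;
        }

      if (flushing)
        return total;
      flushing = 1;             /* input consumed; flush state next */
    }
}
```

The caller is expected to open and close CD itself, which keeps any
shift state intact across calls if the input arrives in pieces.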
