Re: [PATCHES] Discard BOMs at stream start for UTF-{8, 16, 32} encodings

guile-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCHES] Discard BOMs at stream start for UTF-{8, 16, 32} encodings

From:	Ludovic Courtès
Subject:	Re: [PATCHES] Discard BOMs at stream start for UTF-{8, 16, 32} encodings
Date:	Thu, 31 Jan 2013 22:42:39 +0100
User-agent:	Gnus/5.130005 (Ma Gnus v0.5) Emacs/24.2 (gnu/linux)

Hi!

Mark H Weaver <address@hidden> skribis:

> I researched this some more, and discovered that removal of byte-order
> marks (BOMs) is the responsibility of iconv, which discards BOMs from
> the beginning of streams when using the UTF-16 or UTF-32 encodings, but
> *not* for UTF-16LE, UTF-16GE, UTF-32LE, UTF-32GE or any other encoding.
> It uses the BOM to determine the endianness of the stream, but other
> than that does *not* use it to guess the encoding, so there's no
> guesswork involved.  (Side note: iconv also inserts a BOM automatically
> when writing a stream using UTF-16 or UTF-32).

Are you talking about GNU iconv or iconv as specified by POSIX?

I can’t see any occurrence of “BOM” at
<http://pubs.opengroup.org/onlinepubs/9699919799/functions/iconv.html>.

> So thanks to iconv, we get UTF-{16,32} BOM removal for free.
> Unfortunately we have a nasty bug in 'get_iconv_codepoint' that leads to
> a buffer overrun and assertion failure when 'iconv' discards a BOM.

Good catch!

> The first patch below fixes this problem.  I ended up almost completely
> rewriting that function, partly because it was largely structured around
> a mistaken assumption that iconv will never consume input without
> producing output, and partly because it was quite inefficient (several
> unnecessary conditional branches in the loop) and IMO was rather
> difficult to read.

Great.  (I think ‘iconv’ semantics lead to tricky code, no matter what.)

>  get_iconv_codepoint (SCM port, scm_t_wchar *codepoint,
>                    char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
>  {

[...]

> -  for (output_size = 0, output = (char *) utf8_buf,
> -      bytes_consumed = 0, err = 0;
> -       err == 0 && output_size == 0
> -      && (bytes_consumed == 0 || byte_read != EOF);
> -       bytes_consumed++)
> +  for (;;)

Clarity is in the eye of the beholder, but to me this is a step backwards.

[...]

> +  /* NOTE: The following test assumes that the only special values
> +     (other than SCM_ICONV_UNINITIALIZED) are for UTF-8. */
> +  if (SCM_ICONV_SPECIAL_P (pt->input_cd))

Probably an indication that a more descriptive name is needed, as Andy
noted.

> @@ -2247,16 +2279,15 @@ scm_i_set_port_encoding_x (SCM port, const char 
> *encoding)
>         new_output_cd = iconv_open (encoding, "UTF-8");
>         if (new_output_cd == (iconv_t) -1)

Should be SCM_ICONV_UNINITIALIZED?

Thanks again for the research and fixes!

Ludo’.

[Prev in Thread]

Current Thread

[Next in Thread]

Re: byte-order marks, (continued)
- Re: byte-order marks, Neil Jerram, 2013/01/29
  - Re: byte-order marks, Andy Wingo, 2013/01/29
    - Re: byte-order marks, Neil Jerram, 2013/01/29

Prev by Date: Re: [PATCH] Do not scan for coding declarations in open-file
Next by Date: Re: [PATCH] Do not scan for coding declarations in open-file
Previous by thread: Re: [PATCHES] Discard BOMs at stream start for UTF-{8, 16, 32} encodings
Next by thread: Re: byte-order marks
Index(es):
- Date
- Thread