[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCHES] Discard BOMs at stream start for UTF-{8, 16, 32} encodings
From: |
Ludovic Courtès |
Subject: |
Re: [PATCHES] Discard BOMs at stream start for UTF-{8, 16, 32} encodings |
Date: |
Thu, 31 Jan 2013 22:42:39 +0100 |
User-agent: |
Gnus/5.130005 (Ma Gnus v0.5) Emacs/24.2 (gnu/linux) |
Hi!
Mark H Weaver <address@hidden> skribis:
> I researched this some more, and discovered that removal of byte-order
> marks (BOMs) is the responsibility of iconv, which discards BOMs from
> the beginning of streams when using the UTF-16 or UTF-32 encodings, but
> *not* for UTF-16LE, UTF-16GE, UTF-32LE, UTF-32GE or any other encoding.
> It uses the BOM to determine the endianness of the stream, but other
> than that does *not* use it to guess the encoding, so there's no
> guesswork involved. (Side note: iconv also inserts a BOM automatically
> when writing a stream using UTF-16 or UTF-32).
Are you talking about GNU iconv or iconv as specified by POSIX?
I can’t see any occurrence of “BOM” at
<http://pubs.opengroup.org/onlinepubs/9699919799/functions/iconv.html>.
> So thanks to iconv, we get UTF-{16,32} BOM removal for free.
> Unfortunately we have a nasty bug in 'get_iconv_codepoint' that leads to
> a buffer overrun and assertion failure when 'iconv' discards a BOM.
Good catch!
> The first patch below fixes this problem. I ended up almost completely
> rewriting that function, partly because it was largely structured around
> a mistaken assumption that iconv will never consume input without
> producing output, and partly because it was quite inefficient (several
> unnecessary conditional branches in the loop) and IMO was rather
> difficult to read.
Great. (I think ‘iconv’ semantics lead to tricky code, no matter what.)
> get_iconv_codepoint (SCM port, scm_t_wchar *codepoint,
> char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
> {
[...]
> - for (output_size = 0, output = (char *) utf8_buf,
> - bytes_consumed = 0, err = 0;
> - err == 0 && output_size == 0
> - && (bytes_consumed == 0 || byte_read != EOF);
> - bytes_consumed++)
> + for (;;)
Clarity is in the eye of the beholder, but to me this is a step backwards.
[...]
> + /* NOTE: The following test assumes that the only special values
> + (other than SCM_ICONV_UNINITIALIZED) are for UTF-8. */
> + if (SCM_ICONV_SPECIAL_P (pt->input_cd))
Probably an indication that a more descriptive name is needed, as Andy
noted.
> @@ -2247,16 +2279,15 @@ scm_i_set_port_encoding_x (SCM port, const char
> *encoding)
> new_output_cd = iconv_open (encoding, "UTF-8");
> if (new_output_cd == (iconv_t) -1)
Should be SCM_ICONV_UNINITIALIZED?
Thanks again for the research and fixes!
Ludo’.
- Re: byte-order marks, (continued)
- Re: byte-order marks, Ludovic Courtès, 2013/01/29
- Re: byte-order marks, Andy Wingo, 2013/01/30
- Re: byte-order marks, Ludovic Courtès, 2013/01/30
- Re: byte-order marks, Andy Wingo, 2013/01/31
- [PATCHES] Discard BOMs at stream start for UTF-{8,16,32} encodings, Mark H Weaver, 2013/01/30
- Re: [PATCHES] Discard BOMs at stream start for UTF-{8,16,32} encodings, Andy Wingo, 2013/01/31
- Re: [PATCHES] Discard BOMs at stream start for UTF-{8,16,32} encodings, Andy Wingo, 2013/01/31
- Re: [PATCHES] Discard BOMs at stream start for UTF-{8, 16, 32} encodings, Mark H Weaver, 2013/01/31
- Re: [PATCHES] Discard BOMs at stream start for UTF-{8, 16, 32} encodings,
Ludovic Courtès <=
Re: byte-order marks, Neil Jerram, 2013/01/29