Re: Reducing iconv-induced memory usage
From: Ludovic Courtès
Subject: Re: Reducing iconv-induced memory usage
Date: Wed, 27 Apr 2011 16:36:03 +0200
User-agent: Gnus/5.110015 (No Gnus v0.15) Emacs/23.3 (gnu/linux)
Hi Mark,
Mark H Weaver <address@hidden> writes:
> address@hidden (Ludovic Courtès) writes:
>> So, here’s the patch.
>>
>> It also makes UTF-8 input ~30% faster according to ports.bm (which
>> doesn’t benchmark output):
>
> Thanks for working on this. I haven't yet had time to fully review this
> patch, but here I will document the problems I see so far.
Thanks for the review!
> First of all, while looking at this patch, I've discovered another
> problem in ports.c: scm_char_ready_p does not consider the possibility
> of multibyte characters, and returns #t whenever there is at least one
> byte ready.
Indeed; let’s discuss it separately.
> Unicode requires that we reject as ill-formed any UTF-8 byte sequence in
> non-shortest form. For example, we must reject the byte sequence
> 0xC1 0x80, which a permissive reader would read as 0x40, since that
> code point can obviously be encoded as a single byte in UTF-8.
>
> We must also reject any UTF-8 byte sequence that corresponds to a
> surrogate code point (U+D800..U+DFFF), or to a code point greater than
> U+10FFFF.
>
> Table 3.7 of the Unicode 6.0.0 standard, reproduced below, concisely
> shows all well-formed UTF-8 byte sequences. The asterisks highlight
> continuation bytes that are constrained to a smaller range than the
> usual 80..BF.
>
>  code points        | byte[0] | byte[1]  | byte[2] | byte[3] |
>  ---------------------------------------------------------
> U+000000..U+00007F | 00..7F | | | |
> U+000080..U+0007FF | C2..DF | 80..BF | | |
> U+000800..U+000FFF | E0 | A0..BF* | 80..BF | |
> U+001000..U+00CFFF | E1..EC | 80..BF | 80..BF | |
> U+00D000..U+00D7FF | ED | 80..9F* | 80..BF | |
> U+00E000..U+00FFFF | EE..EF | 80..BF | 80..BF | |
> U+010000..U+03FFFF | F0 | 90..BF* | 80..BF | 80..BF |
> U+040000..U+0FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF |
> U+100000..U+10FFFF | F4 | 80..8F* | 80..BF | 80..BF |
> ---------------------------------------------------------
>
> So, for the code above corresponding to 2-byte sequences, it would
> suffice to verify that buf[0] >= 0xC2. The 3- and 4-byte cases are
> somewhat more constrained.
Indeed, thanks for educating me. ;-)
Could you add UTF-8 tests for such cases using the just-committed
‘test-decoding-error’ in ports.test?
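(As a reference point while writing such tests, Table 3-7 can be transcribed almost directly into C. The sketch below is only illustrative; `utf8_well_formed_length` is a hypothetical helper, not part of the patch:)

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the length in bytes of the well-formed UTF-8 sequence
   starting at BUF (with AVAIL bytes available), or 0 if the sequence
   is ill-formed.  Follows Table 3-7 of Unicode 6.0.0, including the
   tighter ranges on the starred continuation bytes.  */
static size_t
utf8_well_formed_length (const uint8_t *buf, size_t avail)
{
  if (avail == 0)
    return 0;

  uint8_t b0 = buf[0];

  if (b0 <= 0x7F)                       /* one byte: 00..7F */
    return 1;
  if (b0 < 0xC2)                        /* 80..BF: stray continuation;
                                           C0..C1: overlong */
    return 0;
  if (b0 <= 0xDF)                       /* two bytes: C2..DF 80..BF */
    return (avail >= 2 && (buf[1] & 0xC0) == 0x80) ? 2 : 0;
  if (b0 <= 0xEF)                       /* three bytes */
    {
      if (avail < 3 || (buf[2] & 0xC0) != 0x80)
        return 0;
      uint8_t lo = 0x80, hi = 0xBF;
      if (b0 == 0xE0)
        lo = 0xA0;                      /* reject overlong forms */
      else if (b0 == 0xED)
        hi = 0x9F;                      /* reject surrogates D800..DFFF */
      return (buf[1] >= lo && buf[1] <= hi) ? 3 : 0;
    }
  if (b0 <= 0xF4)                       /* four bytes */
    {
      if (avail < 4
          || (buf[2] & 0xC0) != 0x80 || (buf[3] & 0xC0) != 0x80)
        return 0;
      uint8_t lo = 0x80, hi = 0xBF;
      if (b0 == 0xF0)
        lo = 0x90;                      /* reject overlong forms */
      else if (b0 == 0xF4)
        hi = 0x8F;                      /* reject > U+10FFFF */
      return (buf[1] >= lo && buf[1] <= hi) ? 4 : 0;
    }
  return 0;                             /* F5..FF never start a sequence */
}
```

(Each ill-formed case from the discussion, 0xC1 0x80 overlong, surrogates, and code points past U+10FFFF, returns 0 here.)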
>> +  else if ((buf[0] & 0xf0) == 0xe0)
>> +    {
>> +      byte = scm_get_byte_or_eof (port);
>> +      if (byte == EOF || ((byte & 0xc0) != 0x80))
>> +        goto invalid_seq;
>> +
>> +      buf[1] = (scm_t_uint8) byte;
>> +      *len = 2;
>> +
>> +      byte = scm_get_byte_or_eof (port);
>> +      if (byte == EOF || ((byte & 0xc0) != 0x80))
>> +        goto invalid_seq;
>> +
>> +      buf[2] = (scm_t_uint8) byte;
>> +      *len = 3;
>> +
>> +      *codepoint = ((scm_t_wchar) buf[0] & 0x0f) << 12UL
>> +        | ((scm_t_wchar) buf[1] & 0x3f) << 6UL
>> +        | (buf[2] & 0x3f);
>> +    }
>> +  else
>> +    {
>
>
> That ^^^ should not simply be an "else". It must check that the first
> byte is valid.
Right.
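(Concretely, the final branch needs to reject bytes that can never start a sequence. A hypothetical first-byte classifier, just to illustrate the missing check:)

```c
#include <assert.h>
#include <stdint.h>

/* Return the total length (1-4) of the UTF-8 sequence that B0
   introduces, or 0 if B0 can never start a well-formed sequence:
   stray continuation bytes (80..BF), the overlong starters C0..C1,
   and F5..FF.  */
static int
utf8_sequence_length (uint8_t b0)
{
  if (b0 <= 0x7f)
    return 1;
  if (b0 >= 0xc2 && b0 <= 0xdf)
    return 2;
  if (b0 >= 0xe0 && b0 <= 0xef)
    return 3;
  if (b0 >= 0xf0 && b0 <= 0xf4)
    return 4;
  return 0;                     /* invalid leading byte */
}
```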
>> + invalid_seq:
>> +  /* Return the faulty byte.  */
>> +  scm_unget_byte (byte, port);
>
> This ungets only the last byte, but there may be up to 4 bytes to unget.
No, that’s done in ‘peek-char’.
>> +/* Read a codepoint from PORT and return it in *CODEPOINT.  Fill BUF
>> +   with the byte representation of the codepoint in PORT's encoding,
>> +   and set *LEN to the length in bytes of that representation.
>> +   Return 0 on success and an errno value on error.  */
>> +static int
>> +get_codepoint (SCM port, scm_t_wchar *codepoint,
>> +               char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
>> +{
>> +  int err;
>> +  scm_t_port *pt = SCM_PTAB_ENTRY (port);
>> +
>> +  if (pt->input_cd == (iconv_t) -1)
>> +    /* Initialize the conversion descriptors, if needed.  */
>> +    scm_i_set_port_encoding_x (port, pt->encoding);
>> +
>> +  if (pt->input_cd == (iconv_t) -1)
>> +    err = get_utf8_codepoint (port, codepoint, (scm_t_uint8 *) buf, len);
>> +  else
>> +    err = get_iconv_codepoint (port, codepoint, buf, len);
>
> From the code above, it appears that for UTF-8 ports,
> scm_i_set_port_encoding_x will necessarily be called once per character
> read. This seems rather inefficient.
Correct. Alas, I don’t know how to avoid this inefficiency in 2.0: we
can’t simply add a flag to ‘scm_t_port’, as that would break the ABI.
Ideas?
Besides, however inefficient it may seem, it’s still more efficient than
what we currently have, as I explained.
> Also, if we wish to support Latin-1 without iconv as well, the simple
> method above will not work.
Why would we want such a thing? :-)
The starting point for this patch was the observation that our Unicode
I/O converts to/from UTF-8, and then from UTF-8 to our internal
representation, and that it’s wasteful to use iconv to convert from
UTF-8 to UTF-8 when reading from/writing to a UTF-8 port.
> I would recommend adding an enum field to the port which for now only
> has two encoding schemes: ICONV or UTF8. Later, we could add LATIN1 and
> maybe ASCII as well. Given that this check must be done once per
> character, it seems better to do a switch on an enum than to strcmp with
> pt->encoding (as is done in scm_i_set_port_encoding_x).
Agreed; maybe something for ‘master’ once this version is in 2.0?
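(For the record, the shape of that enum dispatch could be something like the sketch below. Names are hypothetical, not the actual ‘scm_t_port’ fields; the point is that the strcmp happens once, when the encoding is set, and the per-character path only switches on an integer:)

```c
#include <assert.h>
#include <string.h>

/* Hypothetical encoding-scheme tag, classified once per port rather
   than once per character.  */
enum port_encoding_scheme
  {
    PORT_ENCODING_ICONV,        /* generic case: go through iconv */
    PORT_ENCODING_UTF8,         /* handled by the hand-written decoder */
    PORT_ENCODING_LATIN1        /* possible later addition */
  };

/* Classify ENCODING; meant to be called from the equivalent of
   scm_i_set_port_encoding_x, not from the read path.  */
static enum port_encoding_scheme
classify_encoding (const char *encoding)
{
  if (encoding != NULL && strcmp (encoding, "UTF-8") == 0)
    return PORT_ENCODING_UTF8;
  if (encoding != NULL && strcmp (encoding, "ISO-8859-1") == 0)
    return PORT_ENCODING_LATIN1;
  return PORT_ENCODING_ICONV;
}
```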
Thanks,
Ludo’.