Re: [PATCH] unistr/u8-strchr: speed up searching for ASCII characters
From: Pádraig Brady
Subject: Re: [PATCH] unistr/u8-strchr: speed up searching for ASCII characters
Date: Mon, 12 Jul 2010 00:38:57 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.8) Gecko/20100227 Thunderbird/3.0.3
On 11/07/10 15:20, Paolo Bonzini wrote:
> On 07/07/2010 03:44 PM, Pádraig Brady wrote:
>> Subject: [PATCH] unistr/u8-strchr: speed up searching for ASCII
>> characters
>>
>> * lib/unistr/u8-strchr.c (u8_strchr): Use strchr() for
>> the single byte case as it was measured to be 50% faster
>> than the existing code on x86 linux. Also add a comment
>> on why not to use memmem() for the moment for the multibyte case.
>
> If p is surely a valid UTF-8 string, you can do better in general like
> this. Say [q, q+q_len) points to a UTF-8 representation of uc:
>
> for (; (p = strchr (p, *q)) && memcmp (p+1, q+1, q_len-1); p += q_len)
> ;
>
> return p;
That would be an improvement if strchr() skipped a lot of p at a time,
enough to offset the function call overhead. However, many multibyte
UTF-8 characters share the same first byte, so I'm guessing there would
be lots of false positives in practice?
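To make the false-positive concern concrete, here is a rough sketch of the quoted loop, written out with the assignment parenthesised and with a counter added. This is not the gnulib code: find_u8_seq() and its false_hits parameter are names made up for illustration, and the sketch assumes p is valid, NUL-terminated UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not the gnulib implementation: find the byte
   sequence [q, q+q_len) in the NUL-terminated string p, which is ASSUMED
   to be valid UTF-8.  If false_hits is non-NULL it receives the number of
   times strchr() stopped on a matching lead byte that turned out not to
   start a match.  */
static const char *
find_u8_seq (const char *p, const char *q, size_t q_len, size_t *false_hits)
{
  size_t misses = 0;

  assert (q_len >= 1 && q_len <= 4);
  while ((p = strchr (p, *q)) != NULL
         && memcmp (p + 1, q + 1, q_len - 1) != 0)
    {
      /* The lead byte matched, so in valid UTF-8 this character is
         q_len bytes long and can be skipped as a whole.  */
      p += q_len;
      misses++;
    }
  if (false_hits != NULL)
    *false_hits = misses;
  return p;
}
```

For example, in the string "çèêé" every character is encoded with the lead byte 0xC3, so a search for "é" (0xC3 0xA9) makes strchr() return after every two bytes, and false_hits ends up at 3 before the match is found.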
>
> That's because once the first byte has matched, the length of the UTF-8
> character is known to be q_len. It's better than memmem if the startup
> cost of strchr is low enough (of course memcmp has to be
> inlined/unrolled/unswitched to get decent performance).
>
> Does the argument of u8_strchr have this guarantee? If not, the above
> code can read arbitrary memory.
I was wondering myself about what parts of gnulib/unistring could take
advantage of assuming valid UTF-8 strings. From my own notes on this
function, I have:
"Some possible optimizations would need to
be conditional on CONFIG_UNICODE_SAFETY (see u8_mblen).
Note also u8_mbtouc_unsafe() and u8_mbtouc(), the latter
detecting invalid UTF-8 chars even without --enable-safety.
So given the above I'm assuming that most of gnulib/unistring
assumes valid UTF-8 (which users can enforce on input with u8_check()),
and if a safe but inefficient implementation option is possible
then it should be within CONFIG_UNICODE_SAFETY. Note I found
no mention of --enable-safety in the gnulib/libunistring configure scripts."
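As an illustration of what "enforce on input" means here, below is a minimal sketch of a well-formedness check run once at the input boundary, after which the unchecked fast paths could be used. In practice one would call u8_check() from libunistring, which returns NULL for a well-formed string. The function utf8_is_valid() below is only a hypothetical stand-in so the sketch compiles without the library; it is not how u8_check() is implemented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for u8_check(): returns true if the n bytes at s
   are well-formed UTF-8, i.e. no stray or missing continuation bytes, no
   overlong forms, no surrogates, and nothing above U+10FFFF.  */
static bool
utf8_is_valid (const unsigned char *s, size_t n)
{
  size_t i = 0;

  assert (s != NULL || n == 0);
  while (i < n)
    {
      unsigned char c = s[i];
      size_t len;
      unsigned long cp;   /* decoded code point */
      unsigned long min;  /* smallest code point allowed for this length */

      if (c < 0x80)
        {
          i++;
          continue;
        }
      else if (c >= 0xC2 && c <= 0xDF)
        { len = 2; cp = c & 0x1F; min = 0x80; }
      else if (c >= 0xE0 && c <= 0xEF)
        { len = 3; cp = c & 0x0F; min = 0x800; }
      else if (c >= 0xF0 && c <= 0xF4)
        { len = 4; cp = c & 0x07; min = 0x10000; }
      else
        return false;     /* stray continuation byte or invalid lead byte */

      if (n - i < len)
        return false;     /* sequence truncated at end of input */
      for (size_t k = 1; k < len; k++)
        {
          if ((s[i + k] & 0xC0) != 0x80)
            return false; /* not a continuation byte */
          cp = (cp << 6) | (s[i + k] & 0x3F);
        }
      if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return false;     /* overlong, out of range, or surrogate */
      i += len;
    }
  return true;
}
```

A caller would run such a check once when the data enters the program and reject or repair the input on failure, so that later searches never see a lead byte without its continuation bytes.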
cheers,
Pádraig.