[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: memchr2 speed, gcc
From: |
Bruno Haible |
Subject: |
Re: memchr2 speed, gcc |
Date: |
Sat, 26 Apr 2008 11:37:53 +0200 |
User-agent: |
KMail/1.5.4 |
Eric Blake wrote:
> Looks good to me
Applied, thanks.
> except for:
>
> |
> | ! /* Compute auxiliary longword values:
> | ! repeated_one is a value which has a 1 in every byte.
> | ! repeated_c1 has a c1 in every byte.
> | ! repeated_c2 has a c2 in every byte. */
> | ! repeated_one = 0x01010101;
> | ! repeated_c1 = c1 | (c1 << 8);
> | ! repeated_c2 = c2 | (c2 << 8);
> | ! repeated_c1 |= repeated_c1 << 16;
> | ! repeated_c2 |= repeated_c2 << 16;
> | if (0xffffffffU < TYPE_MAXIMUM (longword))
> | {
> | ! repeated_one |= repeated_one << 31 << 1;
> | ! repeated_c1 |= repeated_c1 << 31 << 1;
> | ! repeated_c2 |= repeated_c2 << 31 << 1;
> | ! if (8 < sizeof (longword))
> | ! {
> | ! int i;
> | !
> | ! for (i = 64; i < sizeof (longword) * 8; i *= 2)
>
> Since we are already computing repeated_one, I'm wondering if it would be
> more efficient to compute repeated_c1 = repeated_one * c1 once at the end,
> rather than computing repeated_c1 in parallel with the repeated
> modifications to repeated_one.
You would be trading (on a 32-bit platform) 2 shifts and 2 OR against
1 32-bit multiplication, or (on a 64-bit platform) 3 shifts and 3 OR against
1 64-bit multiplication. On most platforms the multiplication is too
expensive. While there are some CPUs with enough die surface on the
multiplier,
- some older SPARCs need 29 cycles per multiplication IIRC,
- some newer CPUs need a time that depends on the number of bits in one
of the two operands - either 4 or up to 8 in this case.
So, if you don't precisely know the CPU type and its timings, I would not
do this change.
Bruno