bug-gnulib
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: memchr2 speed, gcc


From: Bruno Haible
Subject: Re: memchr2 speed, gcc
Date: Sat, 26 Apr 2008 11:37:53 +0200
User-agent: KMail/1.5.4

Eric Blake wrote:
> Looks good to me

Applied, thanks.

> except for:
> 
> |
> | !   /* Compute auxiliary longword values:
> | !      repeated_one is a value which has a 1 in every byte.
> | !      repeated_c1 has a c1 in every byte.
> | !      repeated_c2 has a c2 in every byte.  */
> | !   repeated_one = 0x01010101;
> | !   repeated_c1 = c1 | (c1 << 8);
> | !   repeated_c2 = c2 | (c2 << 8);
> | !   repeated_c1 |= repeated_c1 << 16;
> | !   repeated_c2 |= repeated_c2 << 16;
> |     if (0xffffffffU < TYPE_MAXIMUM (longword))
> |       {
> | !       repeated_one |= repeated_one << 31 << 1;
> | !       repeated_c1 |= repeated_c1 << 31 << 1;
> | !       repeated_c2 |= repeated_c2 << 31 << 1;
> | !       if (8 < sizeof (longword))
> | !   {
> | !     int i;
> | !
> | !     for (i = 64; i < sizeof (longword) * 8; i *= 2)
> 
> Since we are already computing repeated_one, I'm wondering if it would be
> more efficient to compute repeated_c1 = repeated_one * c1 once at the end,
> rather than computing repeated_c1 in parallel with the repeated
> modifications to repeated_one.

You would be trading (on a 32-bit platform) 2 shifts and 2 OR against
1 32-bit multiplication, or (on a 64-bit platform) 3 shifts and 3 OR against
1 64-bit multiplication.  On most platforms the multiplication is too
expensive. While there are some CPUs with enough die surface on the
multiplier,
  - some older SPARCs need 29 cycles per multiplication IIRC,
  - some newer CPUs need a time that depends on the number of bits in one
    of the two operands - either 4 or up to 8 in this case.
So, if you don't precisely know the CPU type and its timings, I would not
do this change.

Bruno





reply via email to

[Prev in Thread] Current Thread [Next in Thread]