bug-gnulib

Re: memchr2 speed, gcc


From: Bruno Haible
Subject: Re: memchr2 speed, gcc
Date: Tue, 4 Mar 2008 03:41:08 +0100
User-agent: KMail/1.5.4

Eric Blake wrote:
> +2008-03-01  Eric Blake  <address@hidden>
> +
> +     New module 'memchr2'.
> +     * modules/memchr2: New file.
> +     * modules/memchr2-tests: Likewise.
> +     * lib/memchr2.h: Likewise.
> +     * lib/memchr2.c: Likewise, based on memchr.c.

Wondering why you used 'uintmax_t' as basic word type, rather than the
'unsigned long' that memchr.c uses, I benchmarked this and a few other
variations of the memchr2.c implementation.
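
For context: on 32-bit x86, 'unsigned long' is typically 32 bits wide while
'uintmax_t' is typically 64 bits, so a uintmax_t word usually does not fit in
a single general-purpose register there. A minimal sketch for checking the
widths on a given platform follows; the helper names are made up for
illustration and do not come from memchr2.c.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helpers, for illustration only.  */
static unsigned int
bits_of_unsigned_long (void)
{
  return (unsigned int) (sizeof (unsigned long) * CHAR_BIT);
}

static unsigned int
bits_of_uintmax (void)
{
  return (unsigned int) (sizeof (uintmax_t) * CHAR_BIT);
}

/* Prints the storage width of both candidate word types.  */
static void
print_word_widths (void)
{
  /* uintmax_t can hold every unsigned long value, so it is never narrower.  */
  assert (bits_of_uintmax () >= bits_of_unsigned_long ());
  printf ("unsigned long: %u bits, uintmax_t: %u bits\n",
          bits_of_unsigned_long (), bits_of_uintmax ());
}
```

Calling print_word_widths () on the benchmark machine shows which word size
each of the variants M and L actually works with.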

Summary of results:
  - With gcc 3.2.2 and 4.2.2, the word type 'unsigned long' is more efficient.
  - With gcc 4.3-20080215, it is the opposite. But this version of gcc also
    exhibits mysterious performance characteristics.

Details about the variants of memchr2.c:

  - Variant M is the original one, with 'uintmax_t' as word type; variant L
    is the one with 'unsigned long'.

  - Variant O is the original one, with the test like this:
    ((((longword1 + magic_bits) ^ ~longword1) & ~magic_bits) != 0
     || (((longword2 + magic_bits) ^ ~longword2) & ~magic_bits) != 0)
    Variant S uses a simplified expression:
    (((((longword1 + magic_bits) ^ ~longword1)
       | ((longword2 + magic_bits) ^ ~longword2)) & ~magic_bits) != 0)

  - Variant X uses a __builtin_expect (..., 0) around this expression.
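
Variants O and S compute the same truth value, because
(a & m) != 0 || (b & m) != 0 holds exactly when ((a | b) & m) != 0 does; they
differ only in the code the compiler generates. The sketch below spells out
O, S and an X-style wrapper and checks that they agree. It assumes 32-bit
words and the constant 0x7efefeff in the style of memchr.c; the function and
macro names are made up for illustration, and memchr2.c may define
magic_bits differently.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 32-bit words and a memchr.c-style magic constant.  */
#define MAGIC_BITS UINT32_C (0x7efefeff)

/* Variant X: tell gcc that the condition is expected to be false.  */
#if defined __GNUC__
# define UNLIKELY(cond) __builtin_expect ((cond) != 0, 0)
#else
# define UNLIKELY(cond) ((cond) != 0)
#endif

/* The per-word term shared by all variants.  */
static uint32_t
term (uint32_t w)
{
  return (w + MAGIC_BITS) ^ ~w;
}

/* Variant O: two separate tests joined by ||.  */
static int
test_o (uint32_t longword1, uint32_t longword2)
{
  return ((term (longword1) & ~MAGIC_BITS) != 0
          || (term (longword2) & ~MAGIC_BITS) != 0);
}

/* Variant S: OR the two terms first, then mask and test once.  */
static int
test_s (uint32_t longword1, uint32_t longword2)
{
  return ((term (longword1) | term (longword2)) & ~MAGIC_BITS) != 0;
}

/* Variant SX: variant S wrapped in the branch hint.  */
static int
test_sx (uint32_t longword1, uint32_t longword2)
{
  return UNLIKELY (test_s (longword1, longword2)) ? 1 : 0;
}

/* Checks on a pseudo-random sample that all three forms agree.  */
static void
check_variants_agree (unsigned int samples)
{
  uint32_t state = 12345;
  unsigned int i;

  for (i = 0; i < samples; i++)
    {
      uint32_t w1, w2;

      /* Simple linear congruential generator, good enough for a sample.  */
      state = state * UINT32_C (1664525) + UINT32_C (1013904223);
      w1 = state;
      state = state * UINT32_C (1664525) + UINT32_C (1013904223);
      w2 = state;

      assert (test_o (w1, w2) == test_s (w1, w2));
      assert (test_s (w1, w2) == test_sx (w1, w2));
    }
}
```

The branch hint does not change the result either; it only influences how
gcc lays out the code around the test.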

Details about the compilers used:
  - gcc 3.2.2
  - gcc 4.2.2
  - gcc 4.3-20080215

CPU: x86 (Athlon-K7).

The attached test program and the variant file were compiled with -O2 -g
and linked. Then "time ./a.out 100000" was run two or three times, and
the average of the reported "user" times was taken. All times are in seconds.
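
For readers without the attachment, a rough sketch of this kind of
measurement follows. It is not the attached main.c: the byte-wise scan
memchr2_bytewise is only a stand-in for the variant under test, the helper
time_scans and the buffer size are made up for illustration, and clock ()
reports total processor time rather than the "user" figure printed by
"time".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Stand-in for the function under test: finds the first byte in the
   first N bytes of S that equals C1 or C2, or returns NULL.  */
static void *
memchr2_bytewise (const void *s, int c1_in, int c2_in, size_t n)
{
  const unsigned char *p = s;
  unsigned char c1 = (unsigned char) c1_in;
  unsigned char c2 = (unsigned char) c2_in;

  for (; n > 0; p++, n--)
    if (*p == c1 || *p == c2)
      return (void *) p;
  return NULL;
}

/* Runs REPETITIONS scans over a fixed buffer and returns the processor
   time used, in seconds.  */
static double
time_scans (unsigned long repetitions)
{
  static char buf[4096];
  volatile size_t sink = 0;
  unsigned long i;
  clock_t start, end;

  /* The only match sits in the last byte, so every scan covers the
     whole buffer.  */
  memset (buf, 'a', sizeof buf);
  buf[sizeof buf - 1] = 'z';
  assert (memchr2_bytewise (buf, 'y', 'z', sizeof buf)
          == buf + sizeof buf - 1);

  start = clock ();
  for (i = 0; i < repetitions; i++)
    {
      char *hit = memchr2_bytewise (buf, 'y', 'z', sizeof buf);
      sink += (size_t) (hit - buf);
    }
  end = clock ();

  (void) sink;
  return (double) (end - start) / CLOCKS_PER_SEC;
}
```

As with the numbers reported here, averaging a few runs of
time_scans (100000) is more meaningful than a single measurement.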

Results:
                          MO     MOX    MS     MSX    LO     LOX    LS     LSX

gcc-3.2.2                 6.75   -      6.28   -      5.84   -      4.13   -
gcc-3.2.2 -mcpu=athlon    6.72   -      5.16   -      4.68   -      5.25   -
gcc-4.2.2                 6.17   -      5.27   -      5.91   -      5.32   -
gcc-4.2.2 -mtune=athlon   6.14   -      4.98   -      5.36   -      5.25   -
gcc-4.3-ss                4.51   4.72   4.51   4.72   4.75   4.67   5.26   4.67
gcc-4.3-ss -mtune=athlon  4.69   4.39   4.69   4.39   4.75   4.67   4.75   4.68

Result interpretation:

- Variant O vs. variant S: no clear winner on either side.
- gcc 4.3 results are pretty random: Sometimes -mtune=athlon (tuning for the
  CPU actually used) is a win, sometimes a deterioration. Sometimes variant M
  is better than variant L, sometimes the opposite.
- But gcc 4.3's absolute results are mostly better than those of previous gcc
  versions; the exceptions are in the L variants (for example LS with plain
  gcc 3.2.2, at 4.13).
- Looking at the -mtune=athlon cases only:
  - Variant O vs. variant S: still no clear winner on either side.
  - Variant M vs. variant L: no clear winner here either.

Btw, how do you need to write code such that gcc uses the SSE3 instructions?

Bruno

Attachment: main.c
Description: Text Data

