bug-gnulib

Re: [PATCH] Improve sha*sum speed


From: Loïc Le Loarer
Subject: Re: [PATCH] Improve sha*sum speed
Date: Wed, 14 Sep 2011 00:12:34 +0200

Hi Pádraig,

2011/9/13 Pádraig Brady <address@hidden>:
> On 09/12/2011 03:49 PM, Loïc Le Loarer wrote:
>> Hi,
>>
>> Here are my latest results and patch. Please find the patches to
>> sha1.c, sha256.c and sha512.c attached and the "time" of the resulting
>> binaries in sha_benchs.log. For all binaries, in 64 and 32 bit modes
>> (.m32), I ran the command "\time sha*sum zero1G" 3 times, where zero1G
>> is a 10^9 byte file created by the command:
>> dd if=/dev/zero of=zero1G count=1 bs=1 seek=$(( 1000 * 1000 * 1000 - 1 ))
>
> Note using a sparse file should eliminate
> some I/O overhead and caching issues.
> I'm using: truncate -s1G 1G

Both commands do nearly the same thing. With truncate, the file is
completely sparse; with dd, it has one 4K page allocated at the end.
truncate is shorter and clearer, thanks for the tip.
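
For anyone wanting to check how much of such a test file is really
allocated, here is a minimal sketch in C, assuming a POSIX system. The
function names make_sparse and sparse_usage are made up for this
illustration, and the 512-byte unit for st_blocks is the Linux
convention rather than something POSIX guarantees. Note also that
"truncate -s1G" gives 2^30 bytes, slightly more than the 10^9 bytes of
the dd command.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create (or resize) `path` to `size` bytes without writing any data,
   similar in effect to "truncate -s".  Hypothetical helper name.
   Returns 0 on success, -1 on error. */
static int make_sparse(const char *path, off_t size)
{
  int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0)
    return -1;
  int rc = ftruncate(fd, size);
  if (close(fd) != 0)
    rc = -1;
  return rc;
}

/* Report the apparent size and the space actually allocated on disk.
   Assumption: st_blocks counts 512-byte units, as on Linux. */
static int sparse_usage(const char *path, off_t *apparent, off_t *allocated)
{
  struct stat st;
  if (stat(path, &st) != 0)
    return -1;
  *apparent = st.st_size;
  *allocated = (off_t) st.st_blocks * 512;
  return 0;
}
```

On a filesystem that supports sparse files, the allocated figure should
be zero or close to it for a file made this way, whereas the dd variant
would show one block in use.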

>> The compilation of coreutils was done using the command
>> make CFLAGS="-O3"
>
> I used -O2 -march=corei7-avx
>
>> for 64 bit version and
>> make CFLAGS="-m32 -O3"
>> for 32 bit version.
>>
>> gcc is version 4.4.5 (Ubuntu 10.10)
>
> gcc version 4.6.0 20110603 (Red Hat 4.6.0-10)
>
>> My CPU is a Sandy Bridge @2.5GHz.
>
> Sandy Bridge i3-2310M CPU @ 2.10GHz
>
>>
>> For sha1, the result is very close to Linus' version for git.
>>
>> I think it could be a good idea to include these patches to improve
>> the C versions; this is probably close to the best that can be done in
>> "pure" C.
>>
>> To improve further, assembly with or without SSE could be done in a
>> second pass.
>>
>> What do you think of that?
>>
>> I don't have a GCC farm access yet, so I can only test on my system for now.
>
> Just summarising your results for 1G of data
>
> sha1  \  orig    new
> 32 bit | 5.15s   2.93s
> 64 bit | 3.54s   2.59s
>
> I'm not seeing any improvement on my Sandy Bridge system?
>
> sha1  \  orig    new
> 64 bit | 5.5s   5.5s
>
> Is perhaps the new GCC better able to handle the old code?

If I redo the same test as you did with gcc 4.6.1, -O2
-march=corei7-avx, I get the following:
orig: 3.1s
v1: 3.1s

So gcc 4.6.1 gives a better result than 4.4 on the original
version and a worse result on v1.
I have created a v2, which is v1 with the rol macro replaced by inline
asm using the rol instruction, and here is the result:

v2: 2.91s

Still not as good as gcc 4.4 on v1...
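
To illustrate the kind of change between v1 and v2, here is a minimal
sketch, not the code from the attached patch. The names rol_c and
rol_asm are hypothetical, and the asm variant assumes GCC-style
extended asm with AT&T syntax on x86; elsewhere it falls back to the
plain C idiom.

```c
#include <stdint.h>

/* Portable rotate-left using the usual shift-and-or idiom.
   n must be in the range 1..31.  Hypothetical name. */
static inline uint32_t rol_c(uint32_t x, unsigned n)
{
  return (x << n) | (x >> (32 - n));
}

/* Same rotation, forced through the x86 "rol" instruction.
   The count goes in %cl, the value is rotated in place. */
static inline uint32_t rol_asm(uint32_t x, unsigned n)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
  __asm__ ("roll %%cl, %0" : "+r" (x) : "c" ((unsigned char) n) : "cc");
  return x;
#else
  return rol_c(x, n);   /* non-x86 or non-GCC fallback */
#endif
}
```

Both functions should return the same value for any n in 1..31, for
example rol_c(0x80000001u, 1) and rol_asm(0x80000001u, 1) both give 3.
Whether the compiler already turns the C idiom into a single rol is
exactly the gcc-dependent part.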

> Though you said you tried both gcc-4.6.1 and gcc-4.4.5 with
> no significant difference (maybe Red Hat have tweaks to their GCC?)

I did compare the two gcc versions on some version of the code, but I
may have made a mistake. In any case, I cannot reproduce those results.

> I am seeing a halving of the branch instructions though
> which should help a lot for Intel P4 CPUs for example.
> (see the attached perf output (obtained using the attached perf-hw script)).
> Actually, with GCC at -O3 rather than -O2, there is the same
> halving of branch instructions with either the new or the old code.
> I'd like to find out why your Sandy Bridge system
> is giving double the performance.

I have reproduced what you see on my system. Could you try to reproduce
my results with gcc 4.4?

Anyway, the results are clearly gcc-dependent, which is not very satisfying.

Best regards,
-- 
Loïc


