Re: af_alg benchmarks and performance

From: Matteo Croce
Subject: Re: af_alg benchmarks and performance
Date: Tue, 8 May 2018 12:50:54 +0200

On Tue, May 8, 2018 at 1:10 AM, Bruno Haible <address@hidden> wrote:
> Hi all,
> Thanks for your benchmarking help and explanations.
> Let me try to summarize.
> * We need to consider each of the algorithms md5, sha1, ..., sha256 separately,
>   because each algorithm has a different performance characteristic [1].
>   This is due to the following factors:
>     - Some non-Intel hardware has crypto devices. [2]
>     - Intel hardware has special instructions for specific crypto algorithms. [3][4]
>     - The Linux kernel has specially optimized code for specific crypto
>       algorithms. [4]
> * For the afalg_stream case (with regular files), for all algorithms,
>   kernel crypto is faster than user-space crypto, for sizes N > N_0.
>   Reasons:
>     1. The sendfile call avoids copying the file data to user-space.
>     2. The in-kernel crypto code _may_ (or may not) be faster than the
>        plain C code from gnulib.
> * For the afalg_buffer case (and, btw, also the afalg_stream case with
>   non-regular files), it depends on the algorithm and CPU capabilities:
>   * If the in-kernel crypto code has roughly the same speed as the plain
>     C code from gnulib,
>     then we observe that kernel crypto is always slower than user-space crypto,
>     because of the added overhead of copying the data to kernel space.
>   * If the in-kernel crypto code is faster than the plain C code from gnulib
>     by at least, say, 10%,
>     then kernel crypto is faster than user-space crypto, for sizes N > N_0,
>        because the faster algorithm outweighs the cost of copying the data to kernel space.
> * The reasons for our disappointment are:
>   - The original presentation [2] was misleading because, as Assaf noticed [5],
>     a large portion of the reported speedup (at least for Intel processors)
>     is due to a test case that
>       1. is a corner case,
>       2. exhibits a speedup that is due to sendfile(), not a different crypto
>          implementation.
>     Lesson to be learned: When you present a new feature and motivate it with
>     speedups, please always also include an _average_ use case (i.e. non-sparse
>     files, or memory regions not completely filled with zeroes)!
>   - We all have access to machines with x86_64 CPUs, and only some of them have
>     special crypto instructions.
>   - The system calls have some cost. [6]
> Bruno
> [1] https://lists.gnu.org/archive/html/bug-gnulib/2018-05/msg00043.html
> [2] https://lists.gnu.org/archive/html/bug-gnulib/2018-04/msg00062.html
> [3] https://en.wikipedia.org/wiki/AES_instruction_set
> [4] https://lists.gnu.org/archive/html/bug-gnulib/2018-05/msg00038.html
> [5] https://lists.gnu.org/archive/html/bug-gnulib/2018-04/msg00088.html
> [6] https://lists.gnu.org/archive/html/bug-gnulib/2018-05/msg00044.html

Hi Bruno,

I'm sorry that you found the presentation misleading.
I always used sha1 for my tests, without noticing that md5 was very
slow. Fortunately, other non-deprecated algorithms like sha256 are way faster.

I thought that the usual use case for sha1 and the other algorithms was
calculating hashes of big files, ISO images and VM disks; that's mainly
what I use the tool for: download a huge file, then check its hash for integrity.

I disagree about the sendfile() point: it was just an optimization, and
the gain over a read/write loop is negligible, even with the Meltdown
mitigations, which add overhead to syscalls:

$ cat 2g.bin |time -p src/sha1sum
752ef2367f479e79e4f0cded9c270c2890506ab0  -
real 1.74
user 0.01
sys 1.73
$ time src/sha1sum 2g.bin
752ef2367f479e79e4f0cded9c270c2890506ab0  2g.bin

real    0m1,677s
user    0m0,000s
sys     0m1,657s

As for the crypto instructions used by the kernel: yes, not all
machines have them, but since SSSE3 was introduced with the Core 2 Duo
in 2006, I expect that the majority of users' machines already support them.

I agree that the syscalls have some cost, and that af_alg is not
suitable for hashing very small buffers.
Doing thousands of iterations on a small buffer was not on my checklist;
it would have spotted the issue. But we can work around it by caching
the af_alg socket.

Matteo Croce
per aspera ad upstream
