af_alg benchmarks and performance

From: Bruno Haible
Subject: af_alg benchmarks and performance
Date: Tue, 08 May 2018 01:10:52 +0200
Hi all,

Thanks for your benchmarking help and explanations.

Let me try to summarize.

* We need to consider each of the algorithms md5, sha1, ..., sha256 separately,
  because each algorithm has different performance characteristics [1].
  This is due to the following factors:
    - Some non-Intel hardware has crypto devices. [2]
    - Intel hardware has special instructions for specific crypto algorithms.
    - The Linux kernel has specially optimized code for specific crypto
      algorithms. [4]

* For the afalg_stream case (with regular files), for all algorithms,
  kernel crypto is faster than user-space crypto, for sizes N > N_0.
  Two factors contribute:
    1. The sendfile call avoids copying the file data to user-space.
    2. The in-kernel crypto code _may_ (or may not) be faster than the
       plain C code from gnulib.
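
  To make point 1 concrete, here is a minimal sketch of the sendfile path
  through the Linux AF_ALG socket interface. It is not the gnulib code: the
  function name sketch_afalg_stream and its parameters are made up for
  illustration, it issues a single sendfile call, and it treats any short
  transfer as a failure so that a caller could fall back to user-space code.

```c
#include <linux/if_alg.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, for illustration only.
   Hash the first LEN bytes of the regular file FD with the kernel's
   implementation of ALG (e.g. "sha1") and store DIGEST_LEN bytes in DIGEST.
   Return 0 on success, -1 if the kernel path is unavailable or fails.  */
static int
sketch_afalg_stream (const char *alg, int fd, size_t len,
                     unsigned char *digest, size_t digest_len)
{
  struct sockaddr_alg sa;
  memset (&sa, 0, sizeof sa);
  sa.salg_family = AF_ALG;
  strcpy ((char *) sa.salg_type, "hash");
  strncpy ((char *) sa.salg_name, alg, sizeof sa.salg_name - 1);

  /* Control socket: selects the algorithm.  */
  int cfd = socket (AF_ALG, SOCK_SEQPACKET, 0);
  if (cfd < 0)
    return -1;
  if (bind (cfd, (struct sockaddr *) &sa, sizeof sa) != 0)
    {
      close (cfd);
      return -1;
    }

  /* Operation socket: receives the data and returns the digest.  */
  int ofd = accept (cfd, NULL, NULL);
  close (cfd);
  if (ofd < 0)
    return -1;

  /* The file data goes from the page cache to the crypto code without
     passing through a user-space buffer.  */
  int ret = 0;
  off_t off = 0;
  if (sendfile (ofd, fd, &off, len) != (ssize_t) len)
    ret = -1;
  else if (read (ofd, digest, digest_len) != (ssize_t) digest_len)
    ret = -1;

  close (ofd);
  return ret;
}
```

  The -1 return covers kernels without AF_ALG support as well as short
  transfers; in both cases the user-space code remains the fallback.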

* For the afalg_buffer case (and, btw, also the afalg_stream case with
  non-regular files), it depends on the algorithm and CPU capabilities:
  * If the in-kernel crypto code has roughly the same speed as the plain
    C code from gnulib,
    then we observe that kernel crypto is always slower than user-space crypto,
    because of the added overhead of copying the data to kernel space.
  * If the in-kernel crypto code is faster than the plain C code from gnulib
    by at least, say, 10%,
    then kernel crypto is faster than user-space crypto, for sizes N > N_0,
    because the faster algorithm outweighs the cost of copying the data to
    kernel space.
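
  The two sub-cases above can be illustrated with a toy linear cost model.
  This is only a sketch under stated assumptions: a fixed system call cost
  plus per-byte costs for copying and hashing. The function name and all
  four parameters are hypothetical, not measured values.

```c
#include <math.h>   /* INFINITY */

/* Toy model, for illustration only:
     user-space time:  N / user_rate
     kernel time:      syscall_cost + N / copy_rate + N / kernel_rate
   Rates are in bytes per second, syscall_cost in seconds.
   Return the break-even size N_0 in bytes, or INFINITY if kernel crypto
   never wins under this model.  */
static double
sketch_break_even_bytes (double syscall_cost, double user_rate,
                         double kernel_rate, double copy_rate)
{
  /* Time saved per byte by hashing in the kernel, after paying for the
     copy to kernel space.  */
  double gain_per_byte = 1.0 / user_rate - 1.0 / kernel_rate - 1.0 / copy_rate;

  if (gain_per_byte <= 0.0)
    return INFINITY;   /* same-speed code: the copy overhead always loses */

  return syscall_cost / gain_per_byte;
}
```

  With equal user_rate and kernel_rate the gain per byte is negative and the
  result is INFINITY, which matches the first sub-case. Once kernel_rate
  exceeds user_rate by enough to cover the copy, a finite N_0 appears, which
  matches the second.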

* The reasons for our disappointment are:
  - The original presentation [2] was misleading because, as Assaf noticed [5],
    a large portion of the reported speedup (at least for Intel processors)
    is due to a test case that
      1. is a corner case,
      2. exhibits a speedup that is due to sendfile(), not to a different
         crypto implementation.
    Lesson to be learned: When you present a new feature and motivate it with
    speedups, please always also include an _average_ use case (i.e. non-sparse
    files, or memory regions not completely filled with zeroes)!
  - We all have access to machines with x86_64 CPUs, and only some of them have
    special crypto instructions.
  - The system calls have some cost. [6]


