qemu-devel

From: Michal Prívozník
Subject: Re: [PATCH RFC 0/7] hostmem: NUMA-aware memory preallocation using ThreadContext
Date: Fri, 5 Aug 2022 13:01:10 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.11.0

On 7/21/22 14:07, David Hildenbrand wrote:
> This is a follow-up on "util: NUMA aware memory preallocation" [1] by
> Michal.
> 
> Setting the CPU affinity of threads from inside QEMU usually isn't
> easily possible, because we don't want QEMU -- once started and running
> guest code -- to be able to mess up the system. QEMU disallows relevant
> syscalls using seccomp, such that any such invocation will fail.
> 
> Especially for memory preallocation in memory backends, a poor CPU
> affinity can significantly increase guest startup time, for example, when
> running large VMs backed by huge/gigantic pages, because of NUMA effects.
> For NUMA-aware preallocation, we have to set the CPU affinity, however:
> 
> (1) Once preallocation threads are created during preallocation, management
>     tools cannot intervene anymore to change their affinity. These threads
>     are created automatically on demand.
> (2) QEMU cannot easily set the CPU affinity itself.
> (3) The CPU affinity derived from the NUMA bindings of the memory backend
>     might not necessarily be exactly the CPUs we actually want to use
>     (e.g., CPU-less NUMA nodes, CPUs that are pinned/used for other VMs).
> 
> There is an easy "workaround": if we have a thread with the right CPU
> affinity, we can simply create new threads on demand via that prepared
> context. So, all we have to do is set up such a context ahead of time,
> and then configure preallocation to create new threads via that
> environment.
> 
> So, let's introduce a user-creatable "thread-context" object that
> essentially consists of a context thread used to create new threads.
> QEMU can either try setting the CPU affinity itself ("cpu-affinity" and
> "node-affinity" properties), or upper layers can extract the thread id
> ("thread-id" property) to configure it externally.
> 
> Make memory-backends consume a thread-context object
> (via the "prealloc-context" property) and use it when preallocating to
> create new threads with the desired CPU affinity. Further, to make it
> easier to use, allow creation of "thread-context" objects, including
> setting the CPU affinity directly from QEMU, *before* enabling the
> sandbox option.
> 
> 
> Quick test on a system with 2 NUMA nodes:
> 
> Without CPU affinity:
>     time qemu-system-x86_64 \
>         -object memory-backend-memfd,id=md1,hugetlb=on,hugetlbsize=2M,size=64G,prealloc-threads=12,prealloc=on,host-nodes=0,policy=bind \
>         -nographic -monitor stdio
> 
>     real    0m5.383s
>     real    0m3.499s
>     real    0m5.129s
>     real    0m4.232s
>     real    0m5.220s
>     real    0m4.288s
>     real    0m3.582s
>     real    0m4.305s
>     real    0m5.421s
>     real    0m4.502s
> 
>     -> It heavily depends on the scheduler CPU selection
> 
> With CPU affinity:
>     time qemu-system-x86_64 \
>         -object thread-context,id=tc1,node-affinity=0 \
>         -object memory-backend-memfd,id=md1,hugetlb=on,hugetlbsize=2M,size=64G,prealloc-threads=12,prealloc=on,host-nodes=0,policy=bind,prealloc-context=tc1 \
>         -sandbox enable=on,resourcecontrol=deny \
>         -nographic -monitor stdio
> 
>     real    0m1.959s
>     real    0m1.942s
>     real    0m1.943s
>     real    0m1.941s
>     real    0m1.948s
>     real    0m1.964s
>     real    0m1.949s
>     real    0m1.948s
>     real    0m1.941s
>     real    0m1.937s
> 
> On reasonably large VMs, the speedup can be quite significant.
> 

I've timed 'virsh start' with a guest that has 47GB worth of 1GB
hugepages and seen the startup time basically halved (from 10.5s to
5.6s). The host has 4 NUMA nodes and I'm pinning the guest onto two of
them.

I've written the libvirt counterpart (which I'll post as soon as these
patches are merged). The way it works is that whenever .prealloc-threads=
is to be used AND QEMU is capable of thread-context, a thread-context
object is generated before every memory-backend-*, like this:

-object '{"qom-type":"thread-context","id":"tc-ram-node0","node-affinity":[2]}' \
-object '{"qom-type":"memory-backend-memfd","id":"ram-node0","hugetlb":true,"hugetlbsize":1073741824,"share":true,"prealloc":true,"prealloc-threads":16,"size":21474836480,"host-nodes":[2],"policy":"bind","prealloc-context":"tc-ram-node0"}' \
-numa node,nodeid=0,cpus=0,cpus=2,memdev=ram-node0 \
-object '{"qom-type":"thread-context","id":"tc-ram-node1","node-affinity":[3]}' \
-object '{"qom-type":"memory-backend-memfd","id":"ram-node1","hugetlb":true,"hugetlbsize":1073741824,"share":true,"prealloc":true,"prealloc-threads":16,"size":28991029248,"host-nodes":[3],"policy":"bind","prealloc-context":"tc-ram-node1"}' \


Now, it's not visible in this snippet, but my code does not reuse
thread-context objects. So if there's another memfd, it'll get its own TC:

-object '{"qom-type":"thread-context","id":"tc-memdimm0","node-affinity":[1]}' \
-object '{"qom-type":"memory-backend-memfd","id":"memdimm0","hugetlb":true,"hugetlbsize":1073741824,"share":true,"prealloc":true,"prealloc-threads":16,"size":1073741824,"host-nodes":[1],"policy":"bind","prealloc-context":"tc-memdimm0"}' \

The reason is that the logic generating memory-backends is very complex,
and separating out parts of it so that thread-context objects could be
generated first and reused by those backends would inevitably lead to
regressions. I guess my question is whether it's a problem that libvirt
would leave one additional thread, sleeping on a semaphore, for each
memory-backend (iff prealloc-threads are used).

Although, if I read the code correctly, a thread-context object can be
specified AFTER the memory backends that use it, because such objects are
parsed and created before the backends anyway. Well, something to think
about over the weekend.


> While this concept is currently only used for short-lived preallocation
> threads, nothing major speaks against reusing the concept for other
> threads that are harder to identify/configure -- except that
> we need additional (idle) context threads that are otherwise left unused.
> 
> [1] 
> https://lkml.kernel.org/r/ffdcd118d59b379ede2b64745144165a40f6a813.1652165704.git.mprivozn@redhat.com
> 
> Cc: Michal Privoznik <mprivozn@redhat.com>
> Cc: Igor Mammedov <imammedo@redhat.com>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: "Daniel P. Berrangé" <berrange@redhat.com>
> Cc: Eduardo Habkost <eduardo@habkost.net>
> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Cc: Eric Blake <eblake@redhat.com>
> Cc: Markus Armbruster <armbru@redhat.com>
> Cc: Richard Henderson <richard.henderson@linaro.org>
> Cc: Stefan Weil <sw@weilnetz.de>
> 
> David Hildenbrand (7):
>   util: Cleanup and rename os_mem_prealloc()
>   util: Introduce qemu_thread_set_affinity() and
>     qemu_thread_get_affinity()
>   util: Introduce ThreadContext user-creatable object
>   util: Add write-only "node-affinity" property for ThreadContext
>   util: Make qemu_prealloc_mem() optionally consume a ThreadContext
>   hostmem: Allow for specifying a ThreadContext for preallocation
>   vl: Allow ThreadContext objects to be created before the sandbox
>     option
> 
>  backends/hostmem.c            |  13 +-
>  hw/virtio/virtio-mem.c        |   2 +-
>  include/qemu/osdep.h          |  19 +-
>  include/qemu/thread-context.h |  58 ++++++
>  include/qemu/thread.h         |   4 +
>  include/sysemu/hostmem.h      |   2 +
>  meson.build                   |  16 ++
>  qapi/qom.json                 |  25 +++
>  softmmu/cpus.c                |   2 +-
>  softmmu/vl.c                  |  30 ++-
>  util/meson.build              |   1 +
>  util/oslib-posix.c            |  39 ++--
>  util/oslib-win32.c            |   8 +-
>  util/qemu-thread-posix.c      |  70 +++++++
>  util/qemu-thread-win32.c      |  12 ++
>  util/thread-context.c         | 363 ++++++++++++++++++++++++++++++++++
>  16 files changed, 637 insertions(+), 27 deletions(-)
>  create mode 100644 include/qemu/thread-context.h
>  create mode 100644 util/thread-context.c
> 

Reviewed-by: Michal Privoznik <mprivozn@redhat.com>

Michal



