From: Joao Martins
Subject: Re: [PATCH RFC 0/7] hostmem: NUMA-aware memory preallocation using ThreadContext
Date: Thu, 11 Aug 2022 11:50:05 +0100

On 8/9/22 19:06, David Hildenbrand wrote:
> On 09.08.22 12:56, Joao Martins wrote:
>> On 7/21/22 13:07, David Hildenbrand wrote:
>>> This is a follow-up on "util: NUMA aware memory preallocation" [1] by
>>> Michal.
>>>
>>> Setting the CPU affinity of threads from inside QEMU usually isn't
>>> easily possible, because we don't want QEMU -- once started and running
>>> guest code -- to be able to mess up the system. QEMU disallows relevant
>>> syscalls using seccomp, such that any such invocation will fail.
>>>
>>> Especially for memory preallocation in memory backends, the CPU affinity
>>> of the preallocation threads can significantly affect guest startup time,
>>> for example, when running large VMs backed by huge/gigantic pages, because
>>> of NUMA effects. For NUMA-aware preallocation, we have to set the CPU
>>> affinity, however:
>>>
>>> (1) Once preallocation threads are created during preallocation, management
>>>     tools can no longer intervene to change their affinity. These threads
>>>     are created automatically on demand.
>>> (2) QEMU cannot easily set the CPU affinity itself.
>>> (3) The CPU affinity derived from the NUMA bindings of the memory backend
>>>     might not necessarily be exactly the CPUs we actually want to use
>>>     (e.g., CPU-less NUMA nodes, CPUs that are pinned/used for other VMs).
>>>
>>> There is an easy "workaround". If we have a thread with the right CPU
>>> affinity, we can simply create new threads on demand via that prepared
>>> context. So, all we have to do is set up such a context ahead of time and
>>> then configure preallocation to create new threads via that environment.
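
(Side note, since this seems to be the crux of the series as I understand it:
on Linux a new thread inherits the CPU affinity of the thread that creates it,
which is why a pre-pinned "context" thread can act as a factory for
correctly-pinned preallocation threads. A minimal standalone sketch of just
that idea -- not the actual ThreadContext code, error handling omitted:)

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    static void *worker(void *arg)
    {
        /* Runs with whatever affinity the creating thread had. */
        return NULL;
    }

    /* The "context" thread: pinned once, then used as a thread factory. */
    static void *context_thread(void *arg)
    {
        cpu_set_t set;
        pthread_t w;

        CPU_ZERO(&set);
        CPU_SET(0, &set);   /* e.g. a CPU belonging to the desired NUMA node */
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        pthread_create(&w, NULL, worker, NULL);   /* worker inherits the mask */
        pthread_join(w, NULL);
        return NULL;
    }

    int main(void)
    {
        pthread_t ctx;

        pthread_create(&ctx, NULL, context_thread, NULL);
        pthread_join(ctx, NULL);
        return 0;
    }
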
>>>
>>> So, let's introduce a user-creatable "thread-context" object that
>>> essentially consists of a context thread used to create new threads.
>>> QEMU can either try setting the CPU affinity itself ("cpu-affinity",
>>> "node-affinity" property), or upper layers can extract the thread id
>>> ("thread-id" property) to configure it externally.
>>>
>>> Make memory-backends consume a thread-context object
>>> (via the "prealloc-context" property) and use it when preallocating to
>>> create new threads with the desired CPU affinity. Further, to make it
>>> easier to use, allow creation of "thread-context" objects, including
>>> setting the CPU affinity directly from QEMU, *before* enabling the
>>> sandbox option.
>>>
>>>
>>> Quick test on a system with 2 NUMA nodes:
>>>
>>> Without CPU affinity:
>>>     time qemu-system-x86_64 \
>>>         -object memory-backend-memfd,id=md1,hugetlb=on,hugetlbsize=2M,size=64G,prealloc-threads=12,prealloc=on,host-nodes=0,policy=bind \
>>>         -nographic -monitor stdio
>>>
>>>     real    0m5.383s
>>>     real    0m3.499s
>>>     real    0m5.129s
>>>     real    0m4.232s
>>>     real    0m5.220s
>>>     real    0m4.288s
>>>     real    0m3.582s
>>>     real    0m4.305s
>>>     real    0m5.421s
>>>     real    0m4.502s
>>>
>>>     -> It heavily depends on the scheduler CPU selection
>>>
>>> With CPU affinity:
>>>     time qemu-system-x86_64 \
>>>         -object thread-context,id=tc1,node-affinity=0 \
>>>         -object memory-backend-memfd,id=md1,hugetlb=on,hugetlbsize=2M,size=64G,prealloc-threads=12,prealloc=on,host-nodes=0,policy=bind,prealloc-context=tc1 \
>>>         -sandbox enable=on,resourcecontrol=deny \
>>>         -nographic -monitor stdio
>>>
>>>     real    0m1.959s
>>>     real    0m1.942s
>>>     real    0m1.943s
>>>     real    0m1.941s
>>>     real    0m1.948s
>>>     real    0m1.964s
>>>     real    0m1.949s
>>>     real    0m1.948s
>>>     real    0m1.941s
>>>     real    0m1.937s
>>>
>>> On reasonably large VMs, the speedup can be quite significant.
>>>
>> Really awesome work!
> 
> Thanks!
> 
>>
>> I am not sure I picked up this well while reading the series, but it seems
>> to me that prealloc is still serialized per memory-backend when solely
>> configured via the command line, right?
> 
> I think it's serialized in any case, even when preallocation is
> triggered manually using prealloc=on. I might be wrong, but any kind of
> object creation or property changes should be serialized by the BQL.
> 
> In theory, we can "easily" preallocate in our helper --
> qemu_prealloc_mem() -- concurrently when we don't have to bother about
> handling SIGBUS -- that is, when the kernel supports
> MADV_POPULATE_WRITE. Without MADV_POPULATE_WRITE on older kernels, we'll
> serialize in there as well.
> 
/me nods matches my understanding
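
For reference, the MADV_POPULATE_WRITE path boils down to something like the
sketch below -- a simplification, not qemu_prealloc_mem() itself, which as I
understand it also splits the range across the prealloc threads and falls
back to the memset/SIGBUS scheme on older kernels:

    #include <errno.h>
    #include <stddef.h>
    #include <sys/mman.h>

    /* Prefault [area, area + size) for writing without touching the memory
     * from userspace, so no SIGBUS handler is needed; safe to call from
     * multiple threads on disjoint ranges. Returns 0 or a negative errno. */
    static int populate_write(void *area, size_t size)
    {
    #ifdef MADV_POPULATE_WRITE
        if (madvise(area, size, MADV_POPULATE_WRITE) == 0) {
            return 0;
        }
        return -errno;
    #else
        return -ENOSYS;   /* kernel/headers < 5.14: fall back to memset + SIGBUS */
    #endif
    }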

>>
>> Meaning when we start prealloc we wait until the memory-backend
>> thread-context action is completed (per-memory-backend) even if other
>> to-be-configured memory-backends will use a thread-context on a separate
>> set of pinned CPUs on another node ... and wouldn't in theory "need" to
>> wait until the former prealloc finishes?
> 
> Yes. This series only takes care of NUMA-aware preallocation, but
> doesn't preallocate multiple memory backends in parallel.
> 
> In theory, it would be quite easy to preallocate concurrently: simply
> create the memory backend objects passed on the QEMU cmdline
> concurrently from multiple threads.
> 
Right

> In practice, we have to be careful I think with the BQL. But it doesn't
> sound horribly complicated to achieve that. We can perform all the
> synchronization under the BQL and only trigger the actual expensive
> preallocation (-> qemu_prealloc_mem()), which we know is MT-safe, with the
> BQL released.
> 
Right.

The small bit to take care of (AFAIU from the code) is deferring the wait for
all the memset threads to finish. The problem, on the command line at least,
is that we start memsetting but then immediately wait for all the threads to
finish. And because the context passed to memset is allocated on the stack,
we must wait there or we would lose that state. So it's mainly a matter of
moving the tracking to a global structure and deferring the point at which we
join all threads. With MADV_POPULATE_WRITE we know we are OK, but I wonder if
the SIGBUS scheme could be made to work too: register the handler only once,
have it look up the thread based on the address range it is handling (using
the just-MCEd address), and only unregister the handler once all prealloc
threads are finished.
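
To make that concrete, here is a toy sketch of what "global tracking +
deferred join" could look like -- purely illustrative, the names are made up
and the SIGBUS side is ignored entirely:

    #include <pthread.h>
    #include <stdlib.h>

    /* Hypothetical global registry of in-flight prealloc threads, so that
     * joining can be deferred until every backend has been kicked off. */
    typedef struct PreallocJob {
        pthread_t thread;
        struct PreallocJob *next;
    } PreallocJob;

    static PreallocJob *prealloc_jobs;
    static pthread_mutex_t prealloc_jobs_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Called by each backend right after spawning a memset thread. */
    static void prealloc_track(pthread_t thread)
    {
        PreallocJob *job = malloc(sizeof(*job));

        job->thread = thread;
        pthread_mutex_lock(&prealloc_jobs_lock);
        job->next = prealloc_jobs;
        prealloc_jobs = job;
        pthread_mutex_unlock(&prealloc_jobs_lock);
    }

    /* Called once, after all backends have started preallocating. */
    static void prealloc_join_all(void)
    {
        pthread_mutex_lock(&prealloc_jobs_lock);
        while (prealloc_jobs) {
            PreallocJob *job = prealloc_jobs;

            prealloc_jobs = job->next;
            pthread_join(job->thread, NULL);
            free(job);
        }
        pthread_mutex_unlock(&prealloc_jobs_lock);
    }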

Via QMP, I am not sure the BQL is the only "problem"; there might be some
monitor lock there too, or some sort of request-handling serialization where
only one thread processes QMP requests and dispatches them. Simply releasing
the BQL prior to prealloc doesn't do much by itself, though it may help other
work make progress while that is happening.

>>
>> Unless as you alluded in one of the last patches: we can pass these 
>> thread-contexts with
>> prealloc=off (and prealloc-context=NNN) while qemu is paused (-S) and have 
>> different QMP
>> clients set prealloc=on, and thus prealloc would happen concurrently per 
>> node?
> 
> I think we will serialize in any case when modifying properties. Can you
> give it a shot and see if it would work as of now? I doubt it, but I
> might be wrong.
> 

In a quick experiment with two monitors, each attempting to prealloc one node
in parallel, it takes the same 7 secs (on a small 2-node 128G test)
regardless. Your expectation indeed looks correct.

>>
>> We were thinking of extending it to leverage per-socket bandwidth,
>> essentially to parallelize this even further (we saw improvements with
>> something like that, but haven't tried this series yet). Likely this is
>> already possible with your work and I just didn't pick up on it, hence
>> making sure this is the case :)
> 
> With this series, you can essentially tell QEMU which physical CPUs to
> use for preallocating a given memory backend. But memory backends are
> not created+preallocated concurrently yet.
> 
Yeap, thanks for the context/info.


