Re: [PATCH RFC 0/7] hostmem: NUMA-aware memory preallocation using ThreadContext


From: David Hildenbrand
Subject: Re: [PATCH RFC 0/7] hostmem: NUMA-aware memory preallocation using ThreadContext
Date: Tue, 9 Aug 2022 20:06:47 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.11.0

On 09.08.22 12:56, Joao Martins wrote:
> On 7/21/22 13:07, David Hildenbrand wrote:
>> This is a follow-up on "util: NUMA aware memory preallocation" [1] by
>> Michal.
>>
>> Setting the CPU affinity of threads from inside QEMU usually isn't
>> easily possible, because we don't want QEMU -- once started and running
>> guest code -- to be able to mess up the system. QEMU disallows relevant
>> syscalls using seccomp, such that any such invocation will fail.
>>
>> Especially for memory preallocation in memory backends, the CPU affinity
>> can significantly impact guest startup time, for example, when running
>> large VMs backed by huge/gigantic pages, because of NUMA effects. For
>> NUMA-aware preallocation, we have to set the CPU affinity, however:
>>
>> (1) Once preallocation threads are created during preallocation, management
>>     tools can no longer intervene to change their affinity. These threads
>>     are created automatically on demand.
>> (2) QEMU cannot easily set the CPU affinity itself.
>> (3) The CPU affinity derived from the NUMA bindings of the memory backend
>>     might not necessarily be exactly the CPUs we actually want to use
>>     (e.g., CPU-less NUMA nodes, CPUs that are pinned/used for other VMs).
>>
>> There is an easy "workaround": a new thread inherits the CPU affinity of
>> the thread that creates it. So if we have a thread with the right CPU
>> affinity, we can simply create new threads on demand via that prepared
>> context. All we have to do is set up such a context ahead of time and then
>> configure preallocation to create new threads via that environment.
>>
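A minimal, QEMU-independent sketch of that trick (not this series' actual
implementation; names made up for illustration, assuming Linux + glibc):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdbool.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
    static void *(*worker_fn)(void *); /* one in-flight request, for brevity */
    static void *worker_arg;
    static bool pending;

    /* The "context": pinned once up front, then creates workers on demand. */
    static void *context_thread(void *opaque)
    {
        cpu_set_t *cpus = opaque;

        /* Still permitted here: runs before any seccomp sandbox is armed. */
        pthread_setaffinity_np(pthread_self(), sizeof(*cpus), cpus);

        pthread_mutex_lock(&lock);
        while (true) {
            while (!pending) {
                pthread_cond_wait(&cond, &lock);
            }
            pending = false;
            /* The child starts with *this* thread's affinity mask. */
            pthread_t tid;
            pthread_create(&tid, NULL, worker_fn, worker_arg);
            pthread_detach(tid);
        }
        return NULL;
    }

    /* Called from anywhere to create a worker via the prepared context. */
    static void create_thread_via_context(void *(*fn)(void *), void *arg)
    {
        pthread_mutex_lock(&lock);
        worker_fn = fn;
        worker_arg = arg;
        pending = true;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }
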
>> So, let's introduce a user-creatable "thread-context" object that
>> essentially consists of a context thread used to create new threads.
>> QEMU can either try setting the CPU affinity itself ("cpu-affinity",
>> "node-affinity" property), or upper layers can extract the thread id
>> ("thread-id" property) to configure it externally.
>>
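For the "configure it externally" path: once a management tool has read the
"thread-id" property, pinning the context thread from the outside boils down
to a plain sched_setaffinity() on that TID. A hypothetical helper, not part
of this series:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <sys/types.h>

    /* Pin a single thread (TID, not PID) to the given CPUs;
     * sched_setaffinity() on a TID affects only that thread. */
    static int pin_tid(pid_t tid, const int *cpus, int ncpus)
    {
        cpu_set_t set;
        int i;

        CPU_ZERO(&set);
        for (i = 0; i < ncpus; i++) {
            CPU_SET(cpus[i], &set);
        }
        return sched_setaffinity(tid, sizeof(set), &set);
    }
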
>> Make memory-backends consume a thread-context object
>> (via the "prealloc-context" property) and use it when preallocating to
>> create new threads with the desired CPU affinity. Further, to make it
>> easier to use, allow creation of "thread-context" objects, including
>> setting the CPU affinity directly from QEMU, *before* enabling the
>> sandbox option.
>>
>>
>> Quick test on a system with 2 NUMA nodes:
>>
>> Without CPU affinity:
>>     time qemu-system-x86_64 \
>>         -object memory-backend-memfd,id=md1,hugetlb=on,hugetlbsize=2M,size=64G,prealloc-threads=12,prealloc=on,host-nodes=0,policy=bind \
>>         -nographic -monitor stdio
>>
>>     real    0m5.383s
>>     real    0m3.499s
>>     real    0m5.129s
>>     real    0m4.232s
>>     real    0m5.220s
>>     real    0m4.288s
>>     real    0m3.582s
>>     real    0m4.305s
>>     real    0m5.421s
>>     real    0m4.502s
>>
>>     -> It heavily depends on the scheduler CPU selection
>>
>> With CPU affinity:
>>     time qemu-system-x86_64 \
>>         -object thread-context,id=tc1,node-affinity=0 \
>>         -object memory-backend-memfd,id=md1,hugetlb=on,hugetlbsize=2M,size=64G,prealloc-threads=12,prealloc=on,host-nodes=0,policy=bind,prealloc-context=tc1 \
>>         -sandbox enable=on,resourcecontrol=deny \
>>         -nographic -monitor stdio
>>
>>     real    0m1.959s
>>     real    0m1.942s
>>     real    0m1.943s
>>     real    0m1.941s
>>     real    0m1.948s
>>     real    0m1.964s
>>     real    0m1.949s
>>     real    0m1.948s
>>     real    0m1.941s
>>     real    0m1.937s
>>
>> On reasonably large VMs, the speedup can be quite significant.
>>
> Really awesome work!

Thanks!

> 
> I am not sure I picked up this well while reading the series, but it seems
> to me that prealloc is still serialized per memory-backend when solely
> configured by the command line, right?

I think it's serialized in any case, even when preallocation is
triggered manually using prealloc=on. I might be wrong, but any kind of
object creation or property changes should be serialized by the BQL.

In theory, we can "easily" preallocate concurrently in our helper --
qemu_prealloc_mem() -- when we don't have to bother with handling SIGBUS --
that is, when the kernel supports MADV_POPULATE_WRITE. Without
MADV_POPULATE_WRITE on older kernels, we'll serialize in there as well.
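
For context: MADV_POPULATE_WRITE (Linux 5.14+) reports populate failures as
an error return instead of delivering SIGBUS to whichever thread touched the
page, so there is no process-wide signal-handler state to share and callers
can run concurrently. Roughly, hand-waving over the QEMU specifics:

    #include <sys/mman.h>
    #include <stddef.h>
    #include <errno.h>

    #ifndef MADV_POPULATE_WRITE
    #define MADV_POPULATE_WRITE 23  /* since Linux 5.14 */
    #endif

    /* MT-safe: any number of threads can populate (disjoint) ranges
     * concurrently; failure comes back as an errno, not a SIGBUS. */
    static int populate_range(void *addr, size_t len)
    {
        if (madvise(addr, len, MADV_POPULATE_WRITE) == 0) {
            return 0;
        }
        /* The pre-5.14 fallback writes one byte per page instead and
         * needs a SIGBUS handler to catch allocation failures -- global
         * state that forces serialization across callers. */
        return -errno;
    }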

> 
> Meaning when we start prealloc we wait until the memory-backend
> thread-context action is completed (per memory-backend), even if other
> to-be-configured memory-backends will use a thread-context on a separate set
> of pinned CPUs on another node ... and wouldn't in theory "need" to wait
> until the former prealloc finishes?

Yes. This series only takes care of NUMA-aware preallocation, but
doesn't preallocate multiple memory backends in parallel.

In theory, it would be quite easy to preallocate concurrently: simply
create the memory backend objects passed on the QEMU cmdline
concurrently from multiple threads.

In practice, we have to be careful with the BQL, I think. But it doesn't
sound horribly complicated to achieve: we can perform all synchronized setup
under the BQL and only trigger the actual, expensive preallocation
(-> qemu_prealloc_mem()), which we know is MT-safe, with the BQL released.
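
The shape of it, in a generic pthreads sketch (nothing QEMU-specific; the
real thing would additionally have to juggle the BQL as described above):

    #include <pthread.h>
    #include <stddef.h>

    int populate_range(void *addr, size_t len);  /* as sketched earlier */

    struct backend_range {  /* one per memory backend */
        void *addr;
        size_t len;
    };

    static void *populate_worker(void *opaque)
    {
        struct backend_range *r = opaque;

        populate_range(r->addr, r->len);  /* MT-safe, see above */
        return NULL;
    }

    /* Kick off preallocation for all backends at once, then wait. */
    static void populate_all(struct backend_range *ranges, int n)
    {
        pthread_t tids[n];
        int i;

        for (i = 0; i < n; i++) {
            pthread_create(&tids[i], NULL, populate_worker, &ranges[i]);
        }
        for (i = 0; i < n; i++) {
            pthread_join(tids[i], NULL);
        }
    }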

> 
> Unless, as you alluded in one of the last patches: we can pass these
> thread-contexts with prealloc=off (and prealloc-context=NNN) while qemu is
> paused (-S) and have different QMP clients set prealloc=on, and thus prealloc
> would happen concurrently per node?

I think we will serialize in any case when modifying properties. Can you
give it a shot and see if it would work as of now? I doubt it, but I
might be wrong.

> 
> We were thinking to extend it to leverage per-socket bandwidth, essentially
> to parallelize this even further (we saw improvements with something like
> that but haven't tried this series yet). Likely this is already possible
> with your work and I didn't pick up on it, hence just making sure this is
> the case :)

With this series, you can essentially tell QEMU which physical CPUs to
use for preallocating a given memory backend. But memory backends are
not created+preallocated concurrently yet.

-- 
Thanks,

David / dhildenb



