Re: [PATCH] util: NUMA aware memory preallocation


From: Michal Prívozník
Subject: Re: [PATCH] util: NUMA aware memory preallocation
Date: Wed, 11 May 2022 15:16:55 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.7.0

On 5/10/22 11:12, Daniel P. Berrangé wrote:
> On Tue, May 10, 2022 at 08:55:33AM +0200, Michal Privoznik wrote:
>> When allocating large amounts of memory the task is offloaded
>> onto threads. These threads then use various techniques to
>> allocate the memory fully (madvise(), writing into the memory).
>> However, these threads are free to run on any CPU, which becomes
>> problematic on NUMA machines because it may happen that a thread
>> is running on a distant node.
>>
>> Ideally, this is something that a management application would
>> resolve, but we are not anywhere close to that. Firstly, memory
>> allocation happens before the monitor socket is even available. But
>> okay, that's what -preconfig is for. But then the problem is that
>> 'object-add' would not return until all memory is preallocated.
>>
>> Long story short, the management application has no way of learning
>> the TIDs of allocator threads, so it can't make them run NUMA aware.
> 
> So I'm wondering what the impact of this problem is for various
> scenarios.

The scenario I tested this with had no <emulatorpin/> configured, but
used 'virsh emulatorpin' afterwards to pin the emulator thread somewhere.
For those who are unfamiliar with libvirt, this is about placing the main
qemu TID (with the main eventloop) into a CGroup that restricts which
CPUs it can run on.

> 
> The default config for a KVM guest with libvirt is no CPU pinning
> at all. The kernel auto-places CPUs and decides on where RAM is to
> be allocated. So in this case, whether or not libvirt can talk to
> QMP in time to query threads is largely irrelevant, as we don't
> want to do placement in any case.
> 
> In theory the kernel should allocate RAM on the node local to
> where the process is currently executing. So as long as the
> guest RAM fits in available free RAM on the local node, RAM
> should be allocated from the node that matches the CPU running
> the QEMU main thread.
> 
> The challenge is if we spawn N more threads to do pre-alloc,
> these can be spread onto other nodes. I wonder if the kernel
> has any preference for keeping threads within a process on
> the same NUMA node?

That's not exactly what I saw. I too would have thought that initially
the prealloc thread could be spawned just anywhere, but that after a few
iterations the scheduler would realize which NUMA node the thread is
close to and automatically schedule it to run there. Well, that didn't
happen.
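
For anyone who wants to check this on their own host, here is a minimal
sketch (Python, not part of the patch) of one way to see which CPU a
thread last ran on. It assumes Linux procfs, where field 39 of the
per-thread stat file is the CPU the task last executed on. The helper
names last_cpu_from_stat and thread_last_cpu are made up for
illustration.

```python
import threading


def last_cpu_from_stat(stat_line):
    """Return the 'processor' field (field 39) of a /proc stat line."""
    # Field 2 (comm) may contain spaces and parentheses, so split after
    # the last ')' rather than on whitespace from the start.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 39 sits at index 36.
    return int(rest[36])


def thread_last_cpu(tid=None):
    """Return the CPU on which the given thread of this process last ran."""
    if tid is None:
        tid = threading.get_native_id()
    with open(f"/proc/self/task/{tid}/stat") as f:
        return last_cpu_from_stat(f.read())
```

Sampling thread_last_cpu() from inside an allocator thread every few
thousand pages and comparing the result against the CPUs of the target
node would show whether the thread ever migrates there.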

> 
> Overall, if libvirt is not applying pinning to the QEMU guest,
> then we're 100% reliant on the kernel to do something sensible,
> both for normal QEMU execution and for prealloc. Since we're
> not doing placement of QEMU RAM or CPUs, the logic in this
> patch won't do anything either.
> 
> 
> If the guest has more RAM than can fit on the local NUMA node,
> then we're doomed no matter what, even ignoring prealloc, there
> will be cross-node traffic. This scenario requires the admin to
> set up proper CPU/memory pinning for QEMU in libvirt.
> 
> If libvirt is doing CPU pinning (as instructed by the mgmt app
> above us), then when we first start QEMU, the process thread
> leader will get given affinity by libvirt prior to exec. This
> affinity will be the union of affinity for all CPUs that will
> be later configured.
> 
> The typical case for CPU pinning is that everything fits in
> one NUMA node, and so in this case, we don't need to do anything
> more. The prealloc threads will already be constrained to the
> right place by the affinity of the QEMU thread leader, so the
> logic in this patch will run, but it won't do anything that
> was not already done.
> 
> So we're left with the hardest case, where the guest is explicitly
> spread across multiple NUMA nodes. In this case the thread leader
> affinity will span many NUMA nodes, and so the prealloc threads
> will freely be placed across any CPU that is in the union of CPUs
> the guest is placed on. Just as with the non-pinned case, the
> prealloc will be at the mercy of the kernel making sensible
> placement decisions.

Indeed, but it's at least somewhat restricted. NB, in a real scenario
users will map guest NUMA nodes onto host ones in a 1:1 relationship.
And each guest NUMA node will have its own memdev=, i.e. its own set of
threads, so in the end, prealloc threads won't jump between host NUMA
nodes but stay local to the node they are allocating memory on.
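
To make the per-node idea concrete, here is a rough sketch (Python for
brevity, not the code from the patch): an allocator thread restricts
itself to the CPUs of its host node and then touches one byte per page.
It assumes the sysfs file /sys/devices/system/node/node<N>/cpulist and
a freshly mapped, zero-filled buffer. The names parse_cpulist, node_cpus
and prealloc_pinned are hypothetical.

```python
import mmap
import os


def parse_cpulist(text):
    """Parse a sysfs CPU list such as "0-3,8-11" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # a memory-only node has an empty cpulist
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def node_cpus(node):
    """Return the CPUs that belong to the given host NUMA node."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        return parse_cpulist(f.read())


def prealloc_pinned(buf, cpus):
    """Pin the calling thread to cpus, then fault in every page of buf."""
    # Never ask for CPUs outside the mask we were already given.
    allowed = set(cpus) & os.sched_getaffinity(0)
    if allowed:
        os.sched_setaffinity(0, allowed)  # pid 0 means the calling thread
    touched = 0
    for off in range(0, len(buf), mmap.PAGESIZE):
        buf[off] = 0  # first write to the page forces its allocation
        touched += 1
    return touched
```

One such thread per memdev would be started with something like
threading.Thread(target=prealloc_pinned, args=(buf, node_cpus(0))).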

> 
> The very last case is the only one where this patch can potentially
> be beneficial. The problem is that because libvirt is in charge of
> enforcing CPU affinity, IIRC, we explicitly block QEMU from doing
> anything with CPU affinity. So AFAICT, this patch should result in
> an error from sched_setaffinity when run under libvirt.

Yes, I had to disable capability dropping in qemu.conf.
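
Since the call can be refused, the pinning step would probably have to
be best-effort. A sketch of such a fallback (Python, hypothetical helper
try_pin; treating EPERM and EACCES as the "blocked" case is an
assumption, not something verified against what libvirt actually
enforces):

```python
import errno
import os


def try_pin(cpus, setaffinity=os.sched_setaffinity):
    """Try to pin the calling thread; return False if the call is refused."""
    try:
        setaffinity(0, cpus)  # pid 0 means the calling thread
        return True
    except OSError as e:
        if e.errno in (errno.EPERM, errno.EACCES):
            return False  # denied: carry on with unpinned prealloc
        raise  # anything else (e.g. EINVAL) is a real bug
```

The setaffinity parameter exists only so the refusal path can be
exercised without a sandbox.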

After all, I think maybe the right place to fix this is the kernel? I mean,
why don't prealloc threads converge to the nodes they are working with?

Michal



