
Re: [PATCH] util: NUMA aware memory preallocation


From: David Hildenbrand
Subject: Re: [PATCH] util: NUMA aware memory preallocation
Date: Wed, 11 May 2022 18:41:15 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.8.0

On 11.05.22 17:08, Daniel P. Berrangé wrote:
> On Wed, May 11, 2022 at 03:16:55PM +0200, Michal Prívozník wrote:
>> On 5/10/22 11:12, Daniel P. Berrangé wrote:
>>> On Tue, May 10, 2022 at 08:55:33AM +0200, Michal Privoznik wrote:
>>>> When allocating large amounts of memory the task is offloaded
>>>> onto threads. These threads then use various techniques to
>>>> allocate the memory fully (madvise(), writing into the memory).
>>>> However, these threads are free to run on any CPU, which becomes
>>>> problematic on NUMA machines because it may happen that a thread
>>>> is running on a distant node.
>>>>
>>>> Ideally, this is something that a management application would
>>>> resolve, but we are not anywhere close to that. Firstly, memory
>>>> allocation happens before the monitor socket is even available. But
>>>> okay, that's what -preconfig is for. But then the problem is that
>>>> 'object-add' would not return until all memory is preallocated.
>>>>
>>>> Long story short, management application has no way of learning
>>>> TIDs of allocator threads so it can't make them run NUMA aware.
>>>
>>> So I'm wondering what the impact of this problem is for various
>>> scenarios.
>>
>> The scenario which I tested this with was no <emulatorpin/> but using
>> 'virsh emulatorpin' afterwards to pin emulator thread somewhere. For
>> those which are unfamiliar with libvirt, this is about placing the main
>> qemu TID (with the main eventloop) into a CGroup that restricts on what
>> CPUs it can run.
>>
>>>
>>> The default config for a KVM guest with libvirt is no CPU pinning
>>> at all. The kernel auto-places CPUs and decides on where RAM is to
>>> be allocated. So in this case, whether or not libvirt can talk to
>>> QMP in time to query threads is largely irrelevant, as we don't
>>> want to do placement in any case.
>>>
>>> In theory the kernel should allocate RAM on the node local to
>>> where the process is currently executing. So as long as the
>>> guest RAM fits in available free RAM on the local node, RAM
>>> should be allocated from the node that matches the CPU running
>>> the QEMU main thread.
>>>
>>> The challenge is if we spawn N more threads to do pre-alloc,
>>> these can be spread onto other nodes. I wonder if the kernel
>>> has any preference for keeping threads within a process on
>>> the same NUMA node ?
>>
>> That's not exactly what I saw. I too would have thought that initially
>> the prealloc thread could be spawned just anywhere, but that after a few
>> iterations the scheduler would realize which NUMA node the thread is
>> close to and automatically schedule it to run there. Well, it didn't happen.
> 
> Thinking about it, this does make sense to some extent. When a
> thread is first spawned, how can the kernel know what region of
> memory it is about to start touching ? So at the very least the
> kernel scheduler can get it wrong initially. It would need something
> to watch memory access patterns to determine whether the initial
> decision was right or wrong, and fine tune it later.
> 
> Seems like the kernel typically tries to do the opposite to what
> we thought, and instead of moving CPUs, has ways to move the
> memory instead.
> 
> https://www.kernel.org/doc/html/latest/vm/page_migration.html
> 
>>> Overall, if libvirt is not applying pinning to the QEMU guest,
>>> then we're 100% reliant on the kernel to do something sensible,
>>> both for normal QEMU execution and for prealloc. Since we're
>>> not doing placement of QEMU RAM or CPUs, the logic in this
>>> patch won't do anything either.
>>>
>>>
>>> If the guest has more RAM than can fit on the local NUMA node,
>>> then we're doomed no matter what, even ignoring prealloc, there
>>> will be cross-node traffic. This scenario requires the admin to
>>> set up proper CPU/memory pinning for QEMU in libvirt.
>>>
>>> If libvirt is doing CPU pinning (as instructed by the mgmt app
>>> above us), then when we first start QEMU, the process thread
>>> leader will get given affinity by libvirt prior to exec. This
>>> affinity will be the union of affinity for all CPUs that will
>>> be later configured.
>>>
>>> The typical case for CPU pinning, is that everything fits in
>>> one NUMA node, and so in this case, we don't need to do anything
>>> more. The prealloc threads will already be constrained to the
>>> right place by the affinity of the QEMU thread leader, so the
>>> logic in this patch will run, but it won't do anything that
>>> was not already done.
>>>
>>> So we're left with the hardest case, where the guest is explicitly
>>> spread across multiple NUMA nodes. In this case the thread leader
>>> affinity will span many NUMA nodes, and so the prealloc threads
>>> will freely be placed across any CPU that is in the union of CPUs
>>> the guest is placed on. Just as with the non-pinned case, the
>>> prealloc will be at the mercy of the kernel making sensible
>>> placement decisions.
>>
>> Indeed, but it's at least somewhat restricted. NB, in a real scenario
>> users will map guest NUMA nodes onto host ones with 1:1 relationship.
>> And each guest NUMA node will have its own memdev=, i.e. its own set of
>> threads, so in the end, prealloc threads won't jump between host NUMA
>> nodes but stay local to the node they are allocating memory on.
> 
> Thinking about this from a completely different QEMU angle.
> 
> Right now the preallocation happens when we create the memory
> device, and takes place in threads spawned from the main
> QEMU thread.
> 
> We are doing memory placement in order that specific blocks of
> virtual RAM are co-located with specific virtual CPUs. IOW we
> know we already have some threads that will match locality of
> the RAM we have.
> 
> We are doing memory pre-allocation to give predictable
> performance once the VM starts, and to have a guarantee
> that the memory is actually available for use.
> 
> We don't actually need the memory pre-allocation to take
> place so early in object creation, as we have it right now.
> It just needs to be done before VM creation is done and
> vCPUs start guest code.
> 
> 
> From the POV of controlling QEMU VM resource usage, we don't
> really want memory pre-allocation to consume more host CPUs
> than we've assigned to the VM for its vCPUs.
> 
> So what if instead of creating throwaway threads for memory
> allocation early, we ran the preallocation in the vCPU
> threads before they start executing guest code ? This is
> still early enough to achieve our goals for preallocation.
> 
> These vCPU threads already have the right affinity set up. This
> ensures that the CPU burn for preallocation doesn't exceed
> what we've allowed for guest CPU usage in general, so resource
> limits will be naturally enforced.
> 
> Is this kind of approach possible ?

That sounds quite hackish. IMHO, we should make sure that any approach
we introduce is able to cope with both coldplugged and hotplugged
memory backends.

I agree with Paolo's comment that libvirt is trying to micromanage QEMU
here, which is the root of the issue IMHO.

For applicable VMs (!realtime?), Libvirt could simply allow QEMU to set
the affinity in case there cannot really be harm done. Not sure what
other interfaces the kernel could provide to allow Libvirt to restrict
the affinity to only some subset of nodes/cpus, so QEMU's harm could be
limited to that subset.
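
To make "QEMU sets the affinity itself" a bit more concrete, here is a
rough sketch, not the patch under discussion. It assumes Linux with
glibc, and the names parse_cpulist(), node_cpus(), struct prealloc_job
and prealloc_worker() are made up for illustration. The worker pins
itself to the CPUs of one host node (read from the node's sysfs cpulist
file) before touching every page of the range:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse a kernel-style CPU list such as "0-3,8" into a cpu_set_t.
 * Returns 0 on success, -1 on malformed or empty input. */
int parse_cpulist(const char *list, cpu_set_t *set)
{
    const char *p = list;

    CPU_ZERO(set);
    while (*p && *p != '\n') {
        char *end;
        long lo = strtol(p, &end, 10);
        long hi;

        if (end == p || lo < 0)
            return -1;
        hi = lo;
        if (*end == '-') {              /* range "lo-hi" */
            p = end + 1;
            hi = strtol(p, &end, 10);
            if (end == p || hi < lo)
                return -1;
        }
        if (hi >= CPU_SETSIZE)
            return -1;
        for (long c = lo; c <= hi; c++)
            CPU_SET((int)c, set);
        if (*end == ',')
            end++;
        else if (*end && *end != '\n')
            return -1;
        p = end;
    }
    return CPU_COUNT(set) > 0 ? 0 : -1;
}

/* Fill 'set' with the CPUs of host NUMA node 'node', as listed in sysfs. */
int node_cpus(int node, cpu_set_t *set)
{
    char path[64], buf[4096];
    FILE *f;
    int rc = -1;

    snprintf(path, sizeof(path),
             "/sys/devices/system/node/node%d/cpulist", node);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (fgets(buf, sizeof(buf), f))
        rc = parse_cpulist(buf, set);
    fclose(f);
    return rc;
}

/* Hypothetical job description for one preallocation thread. */
struct prealloc_job {
    char *addr;          /* start of the range to populate */
    size_t len;          /* length of the range in bytes */
    size_t page_size;    /* stride between touches */
    cpu_set_t cpus;      /* CPUs the worker may run on */
    int rc;              /* result of pthread_setaffinity_np() */
};

/* Thread body: restrict ourselves to job->cpus, then touch each page. */
void *prealloc_worker(void *opaque)
{
    struct prealloc_job *job = opaque;

    job->rc = pthread_setaffinity_np(pthread_self(),
                                     sizeof(job->cpus), &job->cpus);
    if (job->rc != 0)
        return NULL;
    for (size_t off = 0; off < job->len; off += job->page_size) {
        volatile char *p = job->addr + off;
        *p = *p;         /* read-modify-write forces the page in */
    }
    return NULL;
}
```

The caller would fill job->cpus via node_cpus() for the node the backend
is bound to and start the worker with pthread_create(). Note that the
affinity call can fail if the requested CPUs fall outside the cpuset
that libvirt placed QEMU in, which is exactly the restriction discussed
above.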

-- 
Thanks,

David / dhildenb



