From: Ani Sinha
Subject: Re: [PATCH] mem/x86: add processor address space check for VM memory
Date: Thu, 14 Sep 2023 16:51:16 +0530


> On 14-Sep-2023, at 2:07 PM, David Hildenbrand <david@redhat.com> wrote:
> 
> On 14.09.23 07:53, Ani Sinha wrote:
>>> On 12-Sep-2023, at 9:04 PM, David Hildenbrand <david@redhat.com> wrote:
>>> 
>>> [...]
>>> 
>>>>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>>>>> index 54838c0c41..d187890675 100644
>>>>> --- a/hw/i386/pc.c
>>>>> +++ b/hw/i386/pc.c
>>>>> @@ -908,9 +908,12 @@ static hwaddr pc_max_used_gpa(PCMachineState *pcms, uint64_t pci_hole64_size)
>>>>> {
>>>>>     X86CPU *cpu = X86_CPU(first_cpu);
>>>>> 
>>>>> -    /* 32-bit systems don't have hole64 thus return max CPU address */
>>>>> -    if (cpu->phys_bits <= 32) {
>>>>> -        return ((hwaddr)1 << cpu->phys_bits) - 1;
>>>>> +    /*
>>>>> +     * 32-bit systems don't have hole64, but we might have a region for
>>>>> +     * memory hotplug.
>>>>> +     */
>>>>> +    if (!(cpu->env.features[FEAT_8000_0001_EDX] & CPUID_EXT2_LM)) {
>>>>> +        return pc_pci_hole64_start() - 1;
>>>> Ok this is very confusing! I am looking at the pc_pci_hole64_start() 
>>>> function. I have a few questions …
>>>> (a) pc_get_device_memory_range() returns the size of the device memory as 
>>>> the difference between maxram_size and ram_size. But from what I 
>>>> understand, ram_size is the actual size of the RAM present and 
>>>> maxram_size is the max size of RAM *after* hot plugging additional 
>>>> memory. How can we assume that the additional available space is already 
>>>> occupied by hot plugged memory?
>>> 
>>> Let's take a look at an example:
>>> 
>>> $ ./build/qemu-system-x86_64 -m 8g,maxmem=16g,slots=1 \
>>>  -object memory-backend-ram,id=mem0,size=1g \
>>>  -device pc-dimm,memdev=mem0 \
>>>  -nodefaults -nographic -S -monitor stdio
>>> 
>>> (qemu) info mtree
>>> ...
>>> memory-region: system
>>>  0000000000000000-ffffffffffffffff (prio 0, i/o): system
>>>    0000000000000000-00000000bfffffff (prio 0, ram): alias ram-below-4g @pc.ram 0000000000000000-00000000bfffffff
>>>    0000000000000000-ffffffffffffffff (prio -1, i/o): pci
>>>      00000000000c0000-00000000000dffff (prio 1, rom): pc.rom
>>>      00000000000e0000-00000000000fffff (prio 1, rom): alias isa-bios @pc.bios 0000000000020000-000000000003ffff
>>>      00000000fffc0000-00000000ffffffff (prio 0, rom): pc.bios
>>>    00000000000a0000-00000000000bffff (prio 1, i/o): alias smram-region @pci 00000000000a0000-00000000000bffff
>>>    00000000000c0000-00000000000c3fff (prio 1, i/o): alias pam-pci @pci 00000000000c0000-00000000000c3fff
>>>    00000000000c4000-00000000000c7fff (prio 1, i/o): alias pam-pci @pci 00000000000c4000-00000000000c7fff
>>>    00000000000c8000-00000000000cbfff (prio 1, i/o): alias pam-pci @pci 00000000000c8000-00000000000cbfff
>>>    00000000000cc000-00000000000cffff (prio 1, i/o): alias pam-pci @pci 00000000000cc000-00000000000cffff
>>>    00000000000d0000-00000000000d3fff (prio 1, i/o): alias pam-pci @pci 00000000000d0000-00000000000d3fff
>>>    00000000000d4000-00000000000d7fff (prio 1, i/o): alias pam-pci @pci 00000000000d4000-00000000000d7fff
>>>    00000000000d8000-00000000000dbfff (prio 1, i/o): alias pam-pci @pci 00000000000d8000-00000000000dbfff
>>>    00000000000dc000-00000000000dffff (prio 1, i/o): alias pam-pci @pci 00000000000dc000-00000000000dffff
>>>    00000000000e0000-00000000000e3fff (prio 1, i/o): alias pam-pci @pci 00000000000e0000-00000000000e3fff
>>>    00000000000e4000-00000000000e7fff (prio 1, i/o): alias pam-pci @pci 00000000000e4000-00000000000e7fff
>>>    00000000000e8000-00000000000ebfff (prio 1, i/o): alias pam-pci @pci 00000000000e8000-00000000000ebfff
>>>    00000000000ec000-00000000000effff (prio 1, i/o): alias pam-pci @pci 00000000000ec000-00000000000effff
>>>    00000000000f0000-00000000000fffff (prio 1, i/o): alias pam-pci @pci 00000000000f0000-00000000000fffff
>>>    00000000fec00000-00000000fec00fff (prio 0, i/o): ioapic
>>>    00000000fed00000-00000000fed003ff (prio 0, i/o): hpet
>>>    00000000fee00000-00000000feefffff (prio 4096, i/o): apic-msi
>>>    0000000100000000-000000023fffffff (prio 0, ram): alias ram-above-4g @pc.ram 00000000c0000000-00000001ffffffff
>>>    0000000240000000-000000047fffffff (prio 0, i/o): device-memory
>>>      0000000240000000-000000027fffffff (prio 0, ram): mem0
>>> 
>>> 
>>> We requested 8G of boot memory, which is split between "<4G" memory and 
>>> ">=4G" memory.
>>> 
>>> We only place exactly 3G (0x0->0xbfffffff) under 4G, starting at address 0.
>> I can’t reconcile this with this code for q35:
>>     if (machine->ram_size >= 0xb0000000) {
>>         lowmem = 0x80000000; // low RAM then ends at 0x7fffffff, i.e. 2 GiB
>>     } else {
>>         lowmem = 0xb0000000; // low RAM then ends at 0xafffffff, i.e. 2.75 GiB
>>     }
>> You assigned 8 GiB to RAM, which is >= 0xb0000000 (2.75 GiB)
> 
> QEMU defaults to the "pc" machine. If you add "-M q35" you get:
> 
> address-space: memory
>  0000000000000000-ffffffffffffffff (prio 0, i/o): system
>    0000000000000000-000000007fffffff (prio 0, ram): alias ram-below-4g @pc.ram 0000000000000000-000000007fffffff
> [...]
>    0000000100000000-000000027fffffff (prio 0, ram): alias ram-above-4g @pc.ram 0000000080000000-00000001ffffffff
>    0000000280000000-00000004bfffffff (prio 0, i/o): device-memory
>      0000000280000000-00000002bfffffff (prio 0, ram): mem0
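
For reference, the i440fx ("pc") counterpart of the q35 snippet quoted above 
sits in hw/i386/pc_piix.c and looks roughly like this (paraphrased, assuming 
modern machine types; not the exact code):

    if (machine->ram_size >= 0xe0000000) {
        /* gigabyte_align is set on modern machine types */
        lowmem = pcmc->gigabyte_align ? 0xc0000000 : 0xe0000000;
    } else {
        lowmem = 0xe0000000;
    }

With -m 8g this picks lowmem = 0xc0000000 (3 GiB), matching ram-below-4g in 
the first mtree dump, while the quoted q35 code picks 0x80000000 (2 GiB), 
matching the dump right above.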
> 
> 
>>> 
>>> We leave the remainder (1G) of the <4G addresses available for I/O devices 
>>> (32bit PCI hole).
>>> 
>>> So we end up with 5G (0x100000000->0x23fffffff) of memory starting exactly 
>>> at address 4G.
>>> 
>>> "maxram_size - ram_size"=8G is the maximum amount of memory you can 
>>> hotplug. We use it to size the
>>> "device-memory" region:
>>> 
>>> 0x47fffffff - 0x240000000+1 = 0x240000000
>>> -> 9 GiB
>>> 
>>> We requested a to hotplug a maximum of "8 GiB", and sized the area slightly 
>>> larger to allow for some flexibility
>>> when it comes to placing DIMMs in that "device-memory" area.
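
For this example the numbers work out as follows (assuming, based on the 
dumps above, 1 GiB of alignment slack per DIMM slot on modern machine types):

    device_mem_base = ROUND_UP(4 GiB + 5 GiB above-4g RAM, 1 GiB) = 0x240000000
    device_mem_size = (16 GiB maxram - 8 GiB ram) + 1 GiB * 1 slot = 9 GiB

which is exactly the 0x240000000-0x47fffffff device-memory region in the 
mtree output above.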
>> Right, but here in this example you do not hot plug memory while the VM is 
>> running. We can hot plug 8G, yes, but the memory may not physically exist 
>> yet (and may never exist). How can we use this math to provision 
>> device-memory when the memory may not exist physically?
> 
> We simply reserve a region in GPA space into which we can coldplug and 
> hotplug up to a predefined maximum amount of memory.
> 
> What do you think is wrong with that?

The only issue I have is that even though we are accounting for it, the memory 
actually might not be physically present.

> 
>>> 
>>> We place that area for memory devices after the RAM. So it starts after the 
>>> 5G of ">=4G" boot memory.
>>> 
>>> 
>>> Long story short, based on the initial RAM size and the maximum RAM size, 
>>> you can construct the layout above and know exactly:
>>> a) How much memory is below 4G, starting at address 0 -> leaving 1G for 
>>> the 32bit PCI hole
>>> b) How much memory is above 4G, starting at address 4G.
>>> c) Where the region for memory devices starts (aligned, after b) ) and 
>>> how big it is.
>>> d) Where the 64bit PCI hole is (after c) ); see the sketch right below.
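
Condensing a)-d) into pseudo-C (a sketch only, not the actual QEMU code, and 
ignoring the per-machine lowmem quirks discussed above):

    below_4g     = MIN(ram_size, lowmem);                        /* a) starts at 0 */
    above_4g     = ram_size - below_4g;                          /* b) starts at 4G */
    devmem_base  = ROUND_UP(4 * GiB + above_4g, 1 * GiB);        /* c) device-memory start */
    devmem_size  = maxram_size - ram_size + slot_slack;          /*    ... and its size */
    hole64_start = ROUND_UP(devmem_base + devmem_size, 1 * GiB); /* d) 64bit PCI hole */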
>>> 
>>>> (b) Another question is, in pc_pci_hole64_start(), why are we adding this 
>>>> size to the start address?
>>>> } else if (pcmc->has_reserved_memory && (ms->ram_size < ms->maxram_size)) {
>>>>     pc_get_device_memory_range(pcms, &hole64_start, &size);
>>>>     if (!pcmc->broken_reserved_end) {
>>>>         hole64_start += size;
>>> 
>>> The 64bit PCI hole starts after "device-memory" above.
>>> 
>>> Apparently, we had to take care of some layout issues in machine types 
>>> before QEMU 2.5. You can assume that nowadays 
>>> "pcmc->broken_reserved_end" is never set. So the PCI64 hole is always 
>>> after the device-memory region.
>>> 
>>>> I think this is trying to put the hole after the device memory. But if the 
>>>> ram size is <=maxram_size then the hole is after the above_4G memory? Why?
>>> 
>>> I didn't quite get what the concern is, can you elaborate?
>> Oh I meant the else part here and made a typo, the else implies ram size == 
>> maxram_size
>>     } else {
>>         hole64_start = pc_above_4g_end(pcms);
>>     }
>> So in this case, there is no device_memory region?!
> 
> Yes. In this case ms->ram_size == ms->maxram_size and you cannot cold/hotplug 
> any memory devices.
> 
> See how pc_memory_init() doesn't call machine_memory_devices_init() in that 
> case.
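
The guard in question looks roughly like this (paraphrased, not the exact 
code):

    /* pc_memory_init(), paraphrased: the device-memory region only exists
     * when the user asked for more maxmem than boot RAM. */
    if (pcmc->has_reserved_memory &&
        machine->ram_size < machine->maxram_size) {
        ...
        machine_memory_devices_init(machine, device_mem_base, device_mem_size);
    }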
> 
> That's what the QEMU user asked for when *not* specifying maxmem (e.g., -m 
> 4g).
> 
> In order to cold/hotplug any memory devices, you have to tell QEMU ahead of 
> time how much memory
> you are intending to provide using memory devices (DIMM, NVDIMM, virtio-pmem, 
> virtio-mem).

So that means that when we are actually hot plugging the memory, there is no 
need to perform additional checks. They can be done statically when -m and 
maxmem= etc. are provided on the command line.

> 
> So when specifying, say, -m 4g,maxmem=20g, we can have memory devices 
> totalling 16g (20 - 4).
> We reserve a GPA space for device_memory that is at least 16g, into which 
> we can either coldplug (QEMU cmdline) or hotplug (qmp/hmp) memory later.
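
For example (with slots= added so that DIMM hotplug is actually allowed):

    $ ./build/qemu-system-x86_64 -m 4g,maxmem=20g,slots=2 ...
    (qemu) object_add memory-backend-ram,id=mem1,size=4G
    (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
    (qemu) info memory-devices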
> 
>> Another thing I do not understand is, for 32-bit,
>> above_4g_mem_start is 4 GiB and above_4g_mem_size = ram_size - lowmem.
>> So we are allocating “above-4G” RAM above the address space of the 
>> processor?!
>>> 
>>>> (c) in your above change, what does long mode have to do with any of 
>>>> this?
>>> 
>>> According to my understanding, 32bit (i386) doesn't have a 64bit hole. And 
>>> 32bit vs. 64bit (i386 vs. x86_64) is decided based on LM, not on the 
>>> address bits (which, as we learned, can exceed 32 because of PSE36 and 
>>> PAE).
>>> 
>>> But really, I just did what x86_cpu_realizefn() does to decide 32bit vs. 
>>> 64bit ;)
>>> 
>>>    /* For 64bit systems think about the number of physical bits to present.
>>>     * ideally this should be the same as the host; anything other than 
>>> matching
>>>     * the host can cause incorrect guest behaviour.
>>>     * QEMU used to pick the magic value of 40 bits that corresponds to
>>>     * consumer AMD devices but nothing else.
>>>     *
>>>     * Note that this code assumes features expansion has already been done
>>>     * (as it checks for CPUID_EXT2_LM), and also assumes that potential
>>>     * phys_bits adjustments to match the host have been already done in
>>>     * accel-specific code in cpu_exec_realizefn.
>>>     */
>>>    if (env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_LM) {
>>>    ...
>>>    } else {
>>>        /* For 32 bit systems don't use the user set value, but keep
>>>         * phys_bits consistent with what we tell the guest.
>>>         */
>> Ah I see. I missed this. But I still can’t understand why, for 32-bit, 
>> pc_pci_hole64_start() would be the right address for the max GPA?
> 
> You want "end of device memory region" if there is one, or
> "end of RAM" if there is none.
> 
> What pc_pci_hole64_start() does:
> 
> /*
> * The 64bit pci hole starts after "above 4G RAM" and
> * potentially the space reserved for memory hotplug.
> */
> 
> There is the
>       ROUND_UP(hole64_start, 1 * GiB);
> in there that is not really required for the !hole64 case. It
> shouldn't matter much in practice I think (besides an aligned value
> showing up in the error message).
> 
> We could factor out most of that calculation into a
> separate function, skipping that alignment to make that
> clearer.
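
Something like this hypothetical helper, maybe (name and shape invented here 
just to illustrate the suggestion, ignoring the pre-2.5 broken_reserved_end 
quirk as proposed):

    /* Unaligned variant of pc_pci_hole64_start(): end of the device-memory
     * region if there is one, end of boot RAM otherwise. */
    static hwaddr pc_unaligned_hole64_start(PCMachineState *pcms)
    {
        MachineState *ms = MACHINE(pcms);
        PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
        hwaddr base;
        ram_addr_t size;

        if (pcmc->has_reserved_memory && ms->ram_size < ms->maxram_size) {
            pc_get_device_memory_range(pcms, &base, &size);
            return base + size;
        }
        return pc_above_4g_end(pcms);
    }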

Yeah, this whole memory layout scheme is quite complicated and might benefit 
from a QEMU doc or a refactoring. 


