qemu-devel

Re: [PATCH] mem/x86: add processor address space check for VM memory


From: David Hildenbrand
Subject: Re: [PATCH] mem/x86: add processor address space check for VM memory
Date: Thu, 14 Sep 2023 10:37:40 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.13.0

On 14.09.23 07:53, Ani Sinha wrote:


On 12-Sep-2023, at 9:04 PM, David Hildenbrand <david@redhat.com> wrote:

[...]

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 54838c0c41..d187890675 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -908,9 +908,12 @@ static hwaddr pc_max_used_gpa(PCMachineState *pcms, uint64_t pci_hole64_size)
{
     X86CPU *cpu = X86_CPU(first_cpu);

-    /* 32-bit systems don't have hole64 thus return max CPU address */
-    if (cpu->phys_bits <= 32) {
-        return ((hwaddr)1 << cpu->phys_bits) - 1;
+    /*
+     * 32-bit systems don't have hole64, but we might have a region for
+     * memory hotplug.
+     */
+    if (!(cpu->env.features[FEAT_8000_0001_EDX] & CPUID_EXT2_LM)) {
+        return pc_pci_hole64_start() - 1;

OK, this is very confusing! I am looking at the pc_pci_hole64_start() function.
I have a few questions:
(a) pc_get_device_memory_range() returns the size of the device memory as the
difference between ram_size and maxram_size. But from what I understand,
ram_size is the actual size of the RAM present and maxram_size is the max size
of RAM *after* hot plugging additional memory. How can we assume that the
additional available space is already occupied by hot plugged memory?

Let's take a look at an example:

$ ./build/qemu-system-x86_64 -m 8g,maxmem=16g,slots=1 \
  -object memory-backend-ram,id=mem0,size=1g \
  -device pc-dimm,memdev=mem0 \
  -nodefaults -nographic -S -monitor stdio

(qemu) info mtree
...
memory-region: system
  0000000000000000-ffffffffffffffff (prio 0, i/o): system
    0000000000000000-00000000bfffffff (prio 0, ram): alias ram-below-4g @pc.ram 0000000000000000-00000000bfffffff
    0000000000000000-ffffffffffffffff (prio -1, i/o): pci
      00000000000c0000-00000000000dffff (prio 1, rom): pc.rom
      00000000000e0000-00000000000fffff (prio 1, rom): alias isa-bios @pc.bios 0000000000020000-000000000003ffff
      00000000fffc0000-00000000ffffffff (prio 0, rom): pc.bios
    00000000000a0000-00000000000bffff (prio 1, i/o): alias smram-region @pci 00000000000a0000-00000000000bffff
    00000000000c0000-00000000000c3fff (prio 1, i/o): alias pam-pci @pci 00000000000c0000-00000000000c3fff
    00000000000c4000-00000000000c7fff (prio 1, i/o): alias pam-pci @pci 00000000000c4000-00000000000c7fff
    00000000000c8000-00000000000cbfff (prio 1, i/o): alias pam-pci @pci 00000000000c8000-00000000000cbfff
    00000000000cc000-00000000000cffff (prio 1, i/o): alias pam-pci @pci 00000000000cc000-00000000000cffff
    00000000000d0000-00000000000d3fff (prio 1, i/o): alias pam-pci @pci 00000000000d0000-00000000000d3fff
    00000000000d4000-00000000000d7fff (prio 1, i/o): alias pam-pci @pci 00000000000d4000-00000000000d7fff
    00000000000d8000-00000000000dbfff (prio 1, i/o): alias pam-pci @pci 00000000000d8000-00000000000dbfff
    00000000000dc000-00000000000dffff (prio 1, i/o): alias pam-pci @pci 00000000000dc000-00000000000dffff
    00000000000e0000-00000000000e3fff (prio 1, i/o): alias pam-pci @pci 00000000000e0000-00000000000e3fff
    00000000000e4000-00000000000e7fff (prio 1, i/o): alias pam-pci @pci 00000000000e4000-00000000000e7fff
    00000000000e8000-00000000000ebfff (prio 1, i/o): alias pam-pci @pci 00000000000e8000-00000000000ebfff
    00000000000ec000-00000000000effff (prio 1, i/o): alias pam-pci @pci 00000000000ec000-00000000000effff
    00000000000f0000-00000000000fffff (prio 1, i/o): alias pam-pci @pci 00000000000f0000-00000000000fffff
    00000000fec00000-00000000fec00fff (prio 0, i/o): ioapic
    00000000fed00000-00000000fed003ff (prio 0, i/o): hpet
    00000000fee00000-00000000feefffff (prio 4096, i/o): apic-msi
    0000000100000000-000000023fffffff (prio 0, ram): alias ram-above-4g @pc.ram 00000000c0000000-00000001ffffffff
    0000000240000000-000000047fffffff (prio 0, i/o): device-memory
      0000000240000000-000000027fffffff (prio 0, ram): mem0


We requested 8G of boot memory, which is split between "<4G" memory and ">=4G" 
memory.

We only place exactly 3G (0x0->0xbfffffff) under 4G, starting at address 0.

I can’t reconcile this with this code for q35:

    if (machine->ram_size >= 0xb0000000) {
        lowmem = 0x80000000; // low RAM ends at 0x7fffffff, i.e. 2 GiB
    } else {
        lowmem = 0xb0000000; // low RAM ends at 0xafffffff, i.e. 2.75 GiB
    }

You assigned 8 GiB of RAM, which is > 0xb0000000 (2.75 GiB).


QEMU defaults to the "pc" machine. If you add "-M q35" you get:

address-space: memory
  0000000000000000-ffffffffffffffff (prio 0, i/o): system
    0000000000000000-000000007fffffff (prio 0, ram): alias ram-below-4g @pc.ram 0000000000000000-000000007fffffff
[...]
    0000000100000000-000000027fffffff (prio 0, ram): alias ram-above-4g @pc.ram 0000000080000000-00000001ffffffff
    0000000280000000-00000004bfffffff (prio 0, i/o): device-memory
      0000000280000000-00000002bfffffff (prio 0, ram): mem0
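
To make the q35 numbers easier to check, here is a minimal sketch of the split. q35_lowmem() mirrors the snippet quoted above; ram_below_4g() and ram_above_4g() are hypothetical helpers written for this mail, not QEMU functions.

```c
#include <assert.h>
#include <stdint.h>

#define GiB (1ULL << 30)

/* Mirrors the quoted q35 snippet: where RAM below 4G stops. */
static uint64_t q35_lowmem(uint64_t ram_size)
{
    return ram_size >= 0xb0000000ULL ? 0x80000000ULL : 0xb0000000ULL;
}

/* Hypothetical helper: amount of RAM that stays below 4G. */
static uint64_t ram_below_4g(uint64_t ram_size, uint64_t lowmem)
{
    assert(lowmem <= 4 * GiB);
    return ram_size < lowmem ? ram_size : lowmem;
}

/* Hypothetical helper: amount of RAM placed starting at address 4G. */
static uint64_t ram_above_4g(uint64_t ram_size, uint64_t lowmem)
{
    return ram_size - ram_below_4g(ram_size, lowmem);
}
```

With ram_size = 8 GiB this gives 2 GiB below 4G and 6 GiB above, which is consistent with the ram-below-4g and ram-above-4g ranges in the q35 output.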




We leave the remainder (1G) of the <4G addresses available for I/O devices 
(32bit PCI hole).

So we end up with 5G (0x100000000->0x23fffffff) of memory starting exactly at 
address 4G.

"maxram_size - ram_size"=8G is the maximum amount of memory you can hotplug. We
use it to size the "device-memory" region:

0x47fffffff - 0x240000000 + 1 = 0x240000000
-> 9 GiB

We requested to hotplug a maximum of "8 GiB", and sized the area slightly
larger to allow for some flexibility when it comes to placing DIMMs in that
"device-memory" area.
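
The arithmetic can be double-checked with a small sketch. range_size() just computes the size of an inclusive range. device_memory_size() encodes an assumption that is not spelled out in this thread: that the extra space is 1 GiB of alignment slack per slot. That assumption happens to reproduce the 9 GiB seen with slots=1; both function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define GiB (1ULL << 30)

/* Size of the inclusive address range [first, last]. */
static uint64_t range_size(uint64_t first, uint64_t last)
{
    assert(last >= first);
    return last - first + 1;
}

/*
 * Assumed sizing rule (not quoted in this thread): the hotpluggable amount
 * plus 1 GiB of slack per slot, so that each DIMM can be aligned.
 */
static uint64_t device_memory_size(uint64_t ram_size, uint64_t maxram_size,
                                   unsigned slots)
{
    assert(maxram_size >= ram_size);
    return (maxram_size - ram_size) + (uint64_t)slots * GiB;
}
```

For -m 8g,maxmem=16g,slots=1, both range_size(0x240000000, 0x47fffffff) and device_memory_size(8 GiB, 16 GiB, 1) come out at 9 GiB.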

Right but here in this example you do not hot plug memory while the VM is 
running. We can hot plug 8G yes, but the memory may not physically exist yet 
(and may never exist). How can we use this math to provision device-memory when 
the memory may not exist physically?

We simply reserve a region in GPA space into which we can coldplug and hotplug
memory, up to a predefined maximum amount.

What do you think is wrong with that?



We place that area for memory devices after the RAM. So it starts after the 5G of 
">=4G" boot memory.


Long story short, based on the initial RAM size and the maximum RAM size, you
can construct the layout above and exactly know
a) How much memory is below 4G, starting at address 0 -> leaving 1G for the 
32bit PCI hole
b) How much memory is above 4G, starting at address 4G.
c) Where the region for memory devices starts (aligned after b) ) and how big 
it is.
d) Where the 64bit PCI hole is (after c) )
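
Steps a) to d) can be put together in one rough sketch. This is not the QEMU code: struct gpa_layout and compute_layout() are invented for illustration, lowmem is passed in because it depends on the machine type, and the 1 GiB alignment and the per-slot slack are assumptions chosen to match the "info mtree" output in this thread.

```c
#include <assert.h>
#include <stdint.h>

#define GiB (1ULL << 30)

struct gpa_layout {
    uint64_t below_4g_size;  /* a) RAM starting at address 0 */
    uint64_t above_4g_size;  /* b) RAM starting at address 4G */
    uint64_t devmem_start;   /* c) start of the device-memory region */
    uint64_t devmem_size;    /* c) its size, 0 if there is no such region */
    uint64_t hole64_start;   /* d) start of the 64bit PCI hole */
};

static uint64_t round_up(uint64_t value, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (value + align - 1) & ~(align - 1);
}

static struct gpa_layout compute_layout(uint64_t ram_size,
                                        uint64_t maxram_size,
                                        uint64_t lowmem, unsigned slots)
{
    struct gpa_layout l = { 0 };
    uint64_t end;

    assert(maxram_size >= ram_size && lowmem <= 4 * GiB);

    l.below_4g_size = ram_size < lowmem ? ram_size : lowmem;
    l.above_4g_size = ram_size - l.below_4g_size;
    end = 4 * GiB + l.above_4g_size;

    if (maxram_size > ram_size) {
        /* Assumed: region is 1 GiB aligned, with 1 GiB slack per slot. */
        l.devmem_start = round_up(end, GiB);
        l.devmem_size = (maxram_size - ram_size) + (uint64_t)slots * GiB;
        end = l.devmem_start + l.devmem_size;
    }

    l.hole64_start = round_up(end, GiB);
    return l;
}
```

For the "pc" example (ram_size 8 GiB, maxram_size 16 GiB, lowmem 0xc0000000, one slot), the sketch yields 3 GiB below 4G, 5 GiB above, and a device-memory region of 9 GiB starting at 0x240000000, the same ranges as in the output above.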

(b) Another question is, in pc_pci_hole64_start(), why are we adding this size
to the start address?

    } else if (pcmc->has_reserved_memory && (ms->ram_size < ms->maxram_size)) {
        pc_get_device_memory_range(pcms, &hole64_start, &size);
        if (!pcmc->broken_reserved_end) {
            hole64_start += size;

The 64bit PCI hole starts after "device-memory" above.

Apparently, we have to take care of some layout issues before QEMU 2.5. You can
assume that nowadays, "pcmc->broken_reserved_end" is never set. So the 64bit
PCI hole is always after the device-memory region.

I think this is trying to put the hole after the device memory. But if the ram 
size is <=maxram_size then the hole is after the above_4G memory? Why?

I didn't quite get what the concern is, can you elaborate?

Oh I meant the else part here and made a typo, the else implies ram size == 
maxram_size

    } else {
        hole64_start = pc_above_4g_end(pcms);
    }

So in this case, there is no device_memory region?!

Yes. In this case ms->ram_size == ms->maxram_size and you cannot cold/hotplug 
any memory devices.

See how pc_memory_init() doesn't call machine_memory_devices_init() in that 
case.

That's what the QEMU user asked for when *not* specifying maxmem (e.g., -m 4g).

In order to cold/hotplug any memory devices, you have to tell QEMU ahead of
time how much memory you are intending to provide using memory devices (DIMM,
NVDIMM, virtio-pmem, virtio-mem).

So when specifying, say, -m 4g,maxmem=20g, we can have memory devices of a
total of 16g (20 - 4). We reserve a GPA space for device_memory that is at
least 16g, into which we can either coldplug (QEMU cmdline) or hotplug
(qmp/hmp) memory later.
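
As a small illustration of that budget, the following sketch checks whether another memory device would still fit. Both functions are hypothetical and only model the "maxmem - ram" accounting described here; they do not model slots, alignment, or placement inside the region.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GiB (1ULL << 30)

/* Total amount of memory that memory devices may provide. */
static uint64_t memory_device_budget(uint64_t ram_size, uint64_t maxram_size)
{
    assert(maxram_size >= ram_size);
    return maxram_size - ram_size;
}

/* Would a new device of new_size still fit into the remaining budget? */
static bool can_plug(uint64_t ram_size, uint64_t maxram_size,
                     uint64_t already_plugged, uint64_t new_size)
{
    uint64_t budget = memory_device_budget(ram_size, maxram_size);

    return already_plugged <= budget &&
           new_size <= budget - already_plugged;
}
```

With -m 4g,maxmem=20g the budget is 16g. With plain -m 4g (no maxmem) the budget is 0, so can_plug() rejects every device of non-zero size, matching the case where there is no device_memory region at all.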

Another thing I do not understand is, for 32-bit,
above_4g_mem_start is 4 GiB and above_4g_mem_size = ram_size - lowmem.
So we are allocating “above-4G” RAM above the address space of the processor?!


(c) In your above change, what does long mode have to do with all of this?

According to my understanding, 32bit (i386) doesn't have a 64bit hole. And
32bit vs. 64bit (i386 vs. x86_64) is decided based on LM, not on the address
bits (as we learned: PSE36 and PAE).

But really, I just did what x86_cpu_realizefn() does to decide 32bit vs. 64bit
;)

    /* For 64bit systems think about the number of physical bits to present.
     * ideally this should be the same as the host; anything other than matching
     * the host can cause incorrect guest behaviour.
     * QEMU used to pick the magic value of 40 bits that corresponds to
     * consumer AMD devices but nothing else.
     *
     * Note that this code assumes features expansion has already been done
     * (as it checks for CPUID_EXT2_LM), and also assumes that potential
     * phys_bits adjustments to match the host have been already done in
     * accel-specific code in cpu_exec_realizefn.
     */
    if (env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_LM) {
    ...
    } else {
        /* For 32 bit systems don't use the user set value, but keep
         * phys_bits consistent with what we tell the guest.
         */
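
To illustrate why the two checks differ, here is a small sketch. struct cpu_model is a made-up stand-in for the two CPU properties involved; the LM flag is bit 29 of CPUID leaf 0x80000001 EDX, which is what CPUID_EXT2_LM tests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Long mode flag: CPUID leaf 0x80000001, EDX bit 29. */
#define LM_BIT (1U << 29)

/* Hypothetical stand-in for the two CPU properties discussed here. */
struct cpu_model {
    uint32_t ext2_edx;   /* CPUID 0x80000001 EDX feature word */
    unsigned phys_bits;  /* physical address bits reported to the guest */
};

/* Old check: treats "at most 32 physical address bits" as 32bit. */
static bool is_32bit_by_phys_bits(const struct cpu_model *cpu)
{
    assert(cpu->phys_bits > 0 && cpu->phys_bits <= 64);
    return cpu->phys_bits <= 32;
}

/* Proposed check: 32bit means "no long mode". */
static bool is_32bit_by_lm(const struct cpu_model *cpu)
{
    return !(cpu->ext2_edx & LM_BIT);
}
```

A 32bit CPU with PAE (36 physical address bits, no LM) is classified as 32bit only by the second check; the first one would wrongly treat it like a 64bit system.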

Ah I see. I missed this. But I still can’t understand why for 32 bit, 
pc_pci_hole64_start() would be the right address for max gpa?

You want "end of device memory region" if there is one, or
"end of RAM" if there is none.

What pc_pci_hole64_start() does:

/*
 * The 64bit pci hole starts after "above 4G RAM" and
 * potentially the space reserved for memory hotplug.
 */

There is the
        ROUND_UP(hole64_start, 1 * GiB);
in there that is not really required for the !hole64 case. It
shouldn't matter much in practice I think (besides an aligned value
showing up in the error message).

We could factor out most of that calculation into a
separate function, skipping that alignment to make that
clearer.
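
A rough idea of what such a split could look like, purely as a sketch: all three function names are hypothetical, and the inputs are plain numbers instead of PCMachineState so that the effect of the alignment is easy to see.

```c
#include <assert.h>
#include <stdint.h>

#define GiB (1ULL << 30)

static uint64_t round_up(uint64_t value, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (value + align - 1) & ~(align - 1);
}

/*
 * Hypothetical factored-out part: first address after RAM and, if there is
 * one, after the device-memory region. No alignment is applied here.
 */
static uint64_t ram_and_devmem_end(uint64_t above_4g_end,
                                   uint64_t devmem_start,
                                   uint64_t devmem_size)
{
    if (devmem_size == 0) {
        return above_4g_end;
    }
    assert(devmem_start >= above_4g_end);
    return devmem_start + devmem_size;
}

/* The 64bit PCI hole keeps its 1 GiB alignment. */
static uint64_t hole64_start_from(uint64_t end)
{
    return round_up(end, GiB);
}

/* Without a 64bit hole, the highest used GPA needs no alignment. */
static uint64_t max_used_gpa_no_hole64(uint64_t end)
{
    assert(end > 0);
    return end - 1;
}
```

For example, if RAM above 4G ended at 0x160000000 and there were no device memory, the unaligned maximum GPA would be 0x15fffffff, while deriving it from the aligned hole start would give 0x17fffffff.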

--
Cheers,

David / dhildenb



