qemu-devel
Re: [PATCH v1] vfio/common: Separate vfio-pci ranges


From: Alex Williamson
Subject: Re: [PATCH v1] vfio/common: Separate vfio-pci ranges
Date: Mon, 11 Sep 2023 12:35:24 -0600

On Mon, 11 Sep 2023 11:12:55 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 11/09/2023 10:48, Duan, Zhenzhong wrote:
> >> -----Original Message-----
> >> From: Joao Martins <joao.m.martins@oracle.com>
> >> Sent: Monday, September 11, 2023 5:07 PM
> >> Subject: Re: [PATCH v1] vfio/common: Separate vfio-pci ranges
> >>
> >> On 11/09/2023 09:57, Duan, Zhenzhong wrote:  
> >>>> -----Original Message-----
> >>>> From: qemu-devel-bounces+zhenzhong.duan=intel.com@nongnu.org <qemu-
> >>>> devel-bounces+zhenzhong.duan=intel.com@nongnu.org> On Behalf Of Joao
> >>>> Martins
> >>>> Sent: Friday, September 8, 2023 5:30 PM
> >>>> Subject: [PATCH v1] vfio/common: Separate vfio-pci ranges
> >>>>
> >>>> QEMU computes the DMA logging ranges for two predefined ranges: 32-bit
> >>>> and 64-bit. In the OVMF case, when the dynamic MMIO window is enabled,
> >>>> QEMU includes in the 64-bit range the RAM regions at the lower part
> >>>> and vfio-pci device RAM regions which are at the top of the address
> >>>> space. This range contains a large gap and the size can be bigger than
> >>>> the dirty tracking HW limits of some devices (MLX5 has a 2^42 limit).
> >>>>
> >>>> To avoid such large ranges, introduce a new PCI range covering the
> >>>> vfio-pci device RAM regions, but only if the addresses are above 4GB,
> >>>> to avoid breaking potential SeaBIOS guests.
> >>>>
> >>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> >>>> [ clg: - wrote commit log
> >>>>       - fixed overlapping 32-bit and PCI ranges when using SeaBIOS ]
> >>>> Signed-off-by: Cédric Le Goater <clg@redhat.com>
> >>>> ---
> >>>> v2:
> >>>> * s/minpci/minpci64/
> >>>> * s/maxpci/maxpci64/
> >>>> * Expand comment to cover the pci-hole64 and why we don't do special
> >>>>  handling of pci-hole32.
> >>>> ---
> >>>> hw/vfio/common.c     | 71 +++++++++++++++++++++++++++++++++++++-------
> >>>> hw/vfio/trace-events |  2 +-
> >>>> 2 files changed, 61 insertions(+), 12 deletions(-)
> >>>>
> >>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >>>> index 237101d03844..134649226d43 100644
> >>>> --- a/hw/vfio/common.c
> >>>> +++ b/hw/vfio/common.c
> >>>> @@ -27,6 +27,7 @@
> >>>>
> >>>> #include "hw/vfio/vfio-common.h"
> >>>> #include "hw/vfio/vfio.h"
> >>>> +#include "hw/vfio/pci.h"
> >>>> #include "exec/address-spaces.h"
> >>>> #include "exec/memory.h"
> >>>> #include "exec/ram_addr.h"
> >>>> @@ -1400,6 +1401,8 @@ typedef struct VFIODirtyRanges {
> >>>>     hwaddr max32;
> >>>>     hwaddr min64;
> >>>>     hwaddr max64;
> >>>> +    hwaddr minpci64;
> >>>> +    hwaddr maxpci64;
> >>>> } VFIODirtyRanges;
> >>>>
> >>>> typedef struct VFIODirtyRangesListener {
> >>>> @@ -1408,6 +1411,31 @@ typedef struct VFIODirtyRangesListener {
> >>>>     MemoryListener listener;
> >>>> } VFIODirtyRangesListener;
> >>>>
> >>>> +static bool vfio_section_is_vfio_pci(MemoryRegionSection *section,
> >>>> +                                     VFIOContainer *container)
> >>>> +{
> >>>> +    VFIOPCIDevice *pcidev;
> >>>> +    VFIODevice *vbasedev;
> >>>> +    VFIOGroup *group;
> >>>> +    Object *owner;
> >>>> +
> >>>> +    owner = memory_region_owner(section->mr);
> >>>> +
> >>>> +    QLIST_FOREACH(group, &container->group_list, container_next) {
> >>>> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> >>>> +            if (vbasedev->type != VFIO_DEVICE_TYPE_PCI) {
> >>>> +                continue;
> >>>> +            }
> >>>> +            pcidev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> >>>> +            if (OBJECT(pcidev) == owner) {
> >>>> +                return true;
> >>>> +            }
> >>>> +        }
> >>>> +    }
> >>>> +
> >>>> +    return false;
> >>>> +}  
> >>>
> >>> What about simplifying it with memory_region_is_ram_device()?
> >>> This way vdpa devices could also be included.
> >>>  
> >>
> >> Note that the check is not interested in RAM (or any other kind of
> >> memory like VGA); that's covered in the 32-bit/64-bit ranges. It is
> >> rather interested in any PCI device RAM that would fall in the 64-bit
> >> PCI hole. Would memory_region_is_ram_device() really cover it? If so,
> >> I am all for the simplification.
> > 
> > Ram device is used not only by vfio-pci BARs but also by the host
> > notifiers of vdpa and vhost-user.
> > 
> > I have another question: I previously thought vfio-pci BARs are device
> > state and are saved/restored through the VFIO migration protocol, so we
> > don't need to dirty track them. Do I understand this wrong?
> 
> The general thinking of device dirty tracking is to track all addressable
> IOVAs. But you do raise a good question. My understanding is that migrating
> the BARs *as is* might be device-migration specific (not a guarantee?); the
> save file and precopy interface are the only places we transfer data from/to,
> and it's just opaque data, not BARs or anything formatted specifically, so if
> we migrate BARs it is hidden in whatever the device f/w wants to move. It
> might be that BARs aren't even needed, as they are a sort of scratch space
> from the h/w side. Ultimately, the dirty tracker is the one reporting the
> values, and if the device h/w chooses not to report those IOVAs as dirty,
> then nothing changes.
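For reference, the simplification discussed in the quoted exchange would look roughly like the sketch below. The helper name is made up for illustration; memory_region_is_ram_device() is the existing QEMU predicate and, as Zhenzhong notes, it matches vdpa and vhost-user host notifier regions as well as vfio-pci BARs:

static bool vfio_section_is_device_ram(MemoryRegionSection *section)
{
    /*
     * memory_region_is_ram_device() is true for regions backed by device
     * memory mapped into the guest: vfio-pci BARs, but also vdpa and
     * vhost-user host notifiers, so this test is broader than the
     * per-device walk done by vfio_section_is_vfio_pci() in the patch.
     */
    return memory_region_is_ram_device(section->mr);
}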

I think ram_device is somewhat controversial; only a few select device
types make use of it, so, for example, an emulated NIC not making use of
ram_device might still exist within the IOVA aperture of the device.

It does, however, seem strange to me to include the MMIO space of other
devices in the dirty tracking of the vfio device.  I'm not sure it's
really the vfio device's place to report MMIO within another device as
dirty.  The saving grace is that this likely doesn't really occur
currently, but should the vfio device even expend any resources on
tracking such ranges?

A concern with simply trying to define RAM vs PCI ranges is that QEMU
is pretty fluid about the VM layout.  For example, what prevents us
from having a NUMA configuration where one memory node is just as
disjoint from the rest as the RAM and PCI memory are here?  TBH, I
thought this was why we had the combine_iova_ranges() function in the
kernel, so that QEMU simply wrote through the dirty tracking ranges and
the driver combined them as necessary based on its own constraints.  Thanks,

Alex
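To make the combine_iova_ranges() point concrete, here is a rough sketch, not the kernel's actual code, of how a driver could fold an arbitrary set of user-supplied dirty-tracking ranges down to however many ranges its hardware supports, by repeatedly merging the two neighbouring ranges separated by the smallest gap. The type and function names are illustrative only:

#include <stdint.h>

typedef struct {
    uint64_t min;
    uint64_t max;
} DirtyRange;

/* 'ranges' must be sorted by 'min' and non-overlapping. */
static unsigned int combine_ranges(DirtyRange *ranges, unsigned int nr,
                                   unsigned int max_ranges)
{
    if (max_ranges == 0 || nr <= max_ranges) {
        return nr;
    }

    while (nr > max_ranges) {
        unsigned int best = 0;
        uint64_t best_gap = UINT64_MAX;

        /* Find the pair of neighbouring ranges with the smallest gap. */
        for (unsigned int i = 0; i + 1 < nr; i++) {
            uint64_t gap = ranges[i + 1].min - ranges[i].max;
            if (gap < best_gap) {
                best_gap = gap;
                best = i;
            }
        }

        /* Fold ranges[best + 1] into ranges[best] and close the hole. */
        ranges[best].max = ranges[best + 1].max;
        for (unsigned int i = best + 1; i + 1 < nr; i++) {
            ranges[i] = ranges[i + 1];
        }
        nr--;
    }

    return nr;
}

The in-kernel helper presumably differs in data structures and details, but the idea is the one Alex describes: user space reports the ranges it actually wants tracked, and the driver reconciles those with its own hardware limits (for example the 2^42 span limit mentioned for MLX5).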



