From: Duan, Zhenzhong
Subject: RE: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
Date: Mon, 16 Jun 2025 03:24:06 +0000


>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Subject: Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
>
>On 2025/5/28 15:12, Duan, Zhenzhong wrote:
>>
>>
>>> -----Original Message-----
>>> From: Nicolin Chen <nicolinc@nvidia.com>
>>> Subject: Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
>>>
>>> OK. Let me clarify this at the top as I see the gap here now:
>>>
>>> First, the vSMMU model is based on Zhenzhong's older series that
>>> keeps an ioas_id in the HostIOMMUDeviceIOMMUFD structure, whereas
>>> this RFCv3 series only keeps an hwpt_id. This ioas_id
>>> is allocated when a passthrough cdev attaches to a VFIO container.
>>>
>>> Second, the vSMMU model reuses the default IOAS via that ioas_id.
>>> Since the VFIO container doesn't allocate a nesting parent S2 HWPT
>>> (maybe it could?), the vSMMU allocates another S2 HWPT in the
>>> vIOMMU code.
>>>
>>> Third, the vSMMU model, for invalidation efficiency and HW Queue
>>> support, isolates all emulated devices out of the nesting-enabled
>>> vSMMU instance, suggested by Jason. So, only passthrough devices
>>> would use the nesting-enabled vSMMU instance, meaning there is no
>>> need of IOMMU_NOTIFIER_IOTLB_EVENTS:
>>
>> I see, then you need to check whether there is an emulated device under the
>> nesting-enabled vSMMU and fail if there is.
>>
>>> - MAP is not needed as there is no shadow page table. QEMU only
>>>    traps the page table pointer and forwards it to host kernel.
>>> - UNMAP is not needed as QEMU only traps invalidation requests
>>>    and forwards them to host kernel.
>>>
>>> (let's forget about the "address space switch" for MSI for now.)
>>>
>>> So, in the vSMMU model, there is actually no need for the iommu
>>> AS. And there is only one IOAS in the VM instance allocated by the
>>> VFIO container. And this IOAS manages the GPA->PA mappings. So,
>>> get_address_space() returns the system AS for passthrough devices.
>>>
>>> On the other hand, the VT-d model is a bit different. It's a giant
>>> vIOMMU for all devices (either passthrough or emulated). For all
>>> emulated devices, it needs IOMMU_NOTIFIER_IOTLB_EVENTS, i.e. the
>>> iommu address space returned via get_address_space().
>>>
>>> That being said, IOMMU_NOTIFIER_IOTLB_EVENTS should not be needed
>>> for passthrough devices, right?
>>
>> No, even if x-flts=on is configured on the QEMU cmdline, that only means the
>> virtual VT-d supports stage-1 translation; the guest can still choose to run
>> in legacy mode (stage 2), e.g., with kernel cmdline intel_iommu=on,sm_off.
>>
>> So before the guest runs, we don't know which kind of page table, stage 1 or
>> stage 2, the guest will use for this VFIO device. So we have to use the iommu
>> AS to catch stage 2's MAP events if the guest chooses stage 2.
>
>@Zhenzhong, if guest decides to use legacy mode then vIOMMU should switch
>the MRs of the device's AS, hence the IOAS created by VFIO container would
>be switched to using the IOMMU_NOTIFIER_IOTLB_EVENTS since the MR is
>switched to IOMMU MR. So it should be able to support shadowing the guest
>IO page table. Hence, this should not be a problem.
>
>@Nicolin, I think your major point is making the VFIO container IOAS a
>GPA IOAS (always return system AS in get_address_space op) and reusing it
>when setting nested translation. Is it? I think it should work if:
>1) we can let the vfio memory listener filter out the RO pages per vIOMMU's
>    request. But I don't want the get_address_space op to always return the
>    system AS, for the reason mentioned by Zhenzhong above.
>2) we can disallow emulated/passthru devices behind the same pcie-pci
>    bridge[1]. For emulated devices, the AS should switch to the iommu MR,
>    while for passthru devices, the AS needs to stick with the system MR so
>    that the VFIO container IOAS can stay a GPA IOAS. To support this, letting
>    the AS switch to the iommu MR and having a separate GPA IOAS is needed.
>    This separate GPA IOAS can be shared by all the passthru devices.
>
>[1]
>https://lore.kernel.org/all/SJ0PR11MB6744E2BA00BBE677B2B49BE99265A@SJ0PR11MB6744.namprd11.prod.outlook.com/#t
>
>So basically, we are ok with your idea. But we should decide if it is
>necessary to support the topology in 2). I think this is a general
>question. TBH, I don't have much information to judge if it is valuable.
>Perhaps, let's hear from more people.

Hi @Liu, Yi L @Nicolin Chen, for emulated/passthru devices behind the same
pcie-pci bridge, I have an idea: add a new PCI callback:

AddressSpace *(*get_address_space_extend)(PCIBus *bus, void *opaque,
                                          int devfn, bool accel_dev);

It passes in the real bus/devfn and a new param accel_dev, which is true for a
vfio device. VT-d implements this callback and returns a separate AS for a vfio
device if it's under a pcie-pci bridge and flts=on; otherwise it falls back to
calling .get_address_space(). This way, emulated devices and passthru devices
behind the same pcie-pci bridge can have different ASes.
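
To make the dispatch rule concrete, here is a minimal standalone sketch of how
such a callback might choose the AS. It is not QEMU code: AddressSpace and
PCIBus below are stand-in types, and SketchVTDState, its fields, the
behind_pcie_pci_bridge flag and the sketch_* function names are all
hypothetical. It also assumes, for simplicity, a single separate AS shared by
all vfio devices.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in types for illustration only; NOT the real QEMU definitions. */
typedef struct AddressSpace { const char *name; } AddressSpace;
typedef struct PCIBus { bool behind_pcie_pci_bridge; } PCIBus;

/* Hypothetical vIOMMU state: one iommu AS and one separate AS for vfio. */
typedef struct SketchVTDState {
    bool flts;              /* assumed to mirror flts=on */
    AddressSpace iommu_as;  /* what the existing callback would return */
    AddressSpace accel_as;  /* assumed separate AS for vfio devices */
} SketchVTDState;

/* Stand-in for the existing .get_address_space() callback. */
static AddressSpace *sketch_get_address_space(PCIBus *bus, void *opaque,
                                              int devfn)
{
    SketchVTDState *s = opaque;

    (void)bus;
    (void)devfn;
    return &s->iommu_as;
}

/*
 * Sketch of the proposed callback: return the separate AS only for a vfio
 * device that sits under a pcie-pci bridge while flts=on; in every other
 * case fall back to the existing callback.
 */
static AddressSpace *sketch_get_address_space_extend(PCIBus *bus, void *opaque,
                                                     int devfn, bool accel_dev)
{
    SketchVTDState *s = opaque;

    assert(bus != NULL && s != NULL);
    if (accel_dev && s->flts && bus->behind_pcie_pci_bridge) {
        return &s->accel_as;
    }
    return sketch_get_address_space(bus, opaque, devfn);
}
```

With this rule, an emulated device and a vfio device behind the same bridge
get different ASes only when flts=on; with flts=off, or outside a pcie-pci
bridge, both still go through the existing callback.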

If the above idea is acceptable, then the only obstacle is ERRATA_772415. Maybe
we can let VFIO check this errata and bypass RO mappings from the beginning?
Or we just block this VFIO device from running with flts=on if ERRATA_772415 is
present, and suggest running with flts=off?
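
The two options can be written down as two small predicates. This is only a
sketch to compare them, not VFIO or QEMU code: SketchHostCaps, its
errata_772415 field and both function names are hypothetical, and how the
errata would actually be detected on the host is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of what the host IOMMU reports. */
typedef struct SketchHostCaps {
    bool errata_772415;  /* assumed flag: host is affected by the errata */
} SketchHostCaps;

/*
 * Option 1: keep the device, but skip read-only sections when mapping.
 * Returns true if the section should be mapped.
 */
static bool sketch_should_map_section(const SketchHostCaps *caps,
                                      bool section_readonly)
{
    assert(caps != NULL);
    return !(caps->errata_772415 && section_readonly);
}

/*
 * Option 2: refuse the device outright when flts=on on an affected host.
 * Returns true if the device may be used with the given flts setting.
 */
static bool sketch_device_allowed(const SketchHostCaps *caps, bool flts)
{
    assert(caps != NULL);
    return !(caps->errata_772415 && flts);
}
```

Option 1 changes what gets mapped for every section, while option 2 is a
single check made once when the device is set up.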

Thanks
Zhenzhong


