From: Peter Xu
Subject: Re: [PATCH] hw/intc/ioapic: Update KVM routes before redelivering IRQ, on RTE update
Date: Thu, 9 Mar 2023 11:55:49 -0500

On Thu, Mar 09, 2023 at 09:16:08AM +0000, David Woodhouse wrote:
> On Wed, 2023-03-08 at 18:09 -0500, Peter Xu wrote:
> > On Mon, Mar 06, 2023 at 05:28:24PM +0000, David Woodhouse wrote:
> > > Indeed, I don't think we care about the in-kernel I/OAPIC. I don't
> > > think we care about the kernel knowing about e.g. "GSI #11" at all. We
> > > can just deliver it as MSI (for the I/OAPIC) or using KVM_INTERRUPT and
> > > the interrupt window as we do for the PIC. Which is why I'd happily rip
> > > that out and let it be delivered via the APIC intercept at 0xfeexxxxx.
> > > 
> > > The existing code which just keeps IRQ routes updated when they're
> > > valid is kind of OK, and well-behaved guests can function. But it isn't
> > > *right* in the case where they aren't valid.
> > > 
> > > What *ought* to happen is that the IOMMU should raise a fault at the
> > > moment the interrupt occurs, if the translation isn't valid. And we
> > > don't have that at all.
> > 
> > Right, that's definitely not ideal as an emulator.
> > 
> > > 
> > > As for why I care? I don't really *need* it, as I have everything I
> > > need for Xen PIRQ support already merged in 
> > > https://gitlab.com/qemu-project/qemu/-/commit/6096cf7877
> > > 
> > > So while the thread at
> > > https://lore.kernel.org/qemu-devel/aaef9961d210ac1873153bf3cf01d984708744fc.camel@infradead.org/
> > > was partly driven by expecting to need this for Xen PIRQ support
> > > (because in $DAYJOB I did those things in the other order and the PIRQ
> > > support ended up just being a trivially different translator like the
> > > IOMMU's IR)... I'd still quite like to fix it up in QEMU anyway, just
> > > for correctness and fidelity in the faulting cases.
> > > 
> > > We can do more efficient invalidation too, rather than blowing away the
> > > entire routing table every time. Just disconnect the IRQFD for the
> > > specific interrupts that get invalidated, and let them get fixed up
> > > again next time they occur.
> > 
> > I'm curious whether there's anything else besides the "correctness of
> > emulation" reason.
> 
> Nah, at this point it's just because it annoys me. I did this in
> another VMM and it works nicely, and I'd like QEMU to do the same. ;)
> 
> > I would think it nice if it existed, or were trivial to have, as you
> > said.  I just don't know whether it's that easy; so far it seems a new
> > kernel interface would still be needed, allowing a kernel irq to be
> > paused until QEMU translates it via some channel we provide.
> 
> It doesn't need a new kernel interface.
> 
> With IRQ remapping we have a userspace I/OAPIC, so how we deliver its
> MSIs is entirely up to us — we absolutely don't need the kernel to have
> *any* KVM_IRQ_ROUTING_IRQCHIP entries.

Oh, so it's about the split irqchip only.  Yes, indeed it looks possible
to do this without changing the kernel.
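
(So IIUC, in the slow path the userspace I/OAPIC would inject the
already-translated message with KVM_SIGNAL_MSI, with no routing entry
involved at all.  A rough, untested sketch; vm_fd and the translated
address / data are placeholders:)

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Inject one already-translated interrupt as an MSI.  No
     * KVM_IRQ_ROUTING_IRQCHIP (or any other routing) entry involved. */
    static int inject_translated_msi(int vm_fd, uint64_t addr, uint32_t data)
    {
        struct kvm_msi msi = {
            .address_lo = (uint32_t)addr,
            .address_hi = (uint32_t)(addr >> 32),
            .data       = data,
        };
        /* > 0: delivered; 0: blocked by guest state; < 0: error */
        return ioctl(vm_fd, KVM_SIGNAL_MSI, &msi);
    }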

> The only IRQs that are handled fully in the kernel are events arriving
> on some eventfd which is bound as an IRQFD to some IRQ in the KVM
> routing table. (Mostly MSIs coming from VFIO).
> 
> If we want to "pause" one of those, all we have to do is unbind the
> IRQFD and then we can handle that eventfd in userspace instead. Which
> is what we do *anyway* in the case where IRQFD support isn't available.
> 
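(Right, and the "pause" itself is nothing more than KVM_IRQFD with the
deassign flag, after which userspace polls / reads the eventfd again.
Roughly, untested; vm_fd / event_fd / gsi are placeholders:)

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Detach the IRQFD for one GSI, so the next event on 'event_fd' is
     * seen (and translated) by userspace instead of by the kernel. */
    static int pause_irqfd(int vm_fd, int event_fd, uint32_t gsi)
    {
        struct kvm_irqfd irqfd = {
            .fd    = (uint32_t)event_fd,
            .gsi   = gsi,
            .flags = KVM_IRQFD_FLAG_DEASSIGN,
        };
        return ioctl(vm_fd, KVM_IRQFD, &irqfd);
    }
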
> In
> https://lore.kernel.org/kvm/20201027143944.648769-1-dwmw2@infradead.org/
> I *optimised* that dance, so userspace doesn't have to stop listening
> on the eventfd when the IRQFD is bound; the IRQFD code consumes the
> event first. But we can live without that in older kernels.
> 
> Basically, it's just treating the IRQFD support as an *optimisation*,
> and retaining the 'slow path' of handling the event in userspace and
> injecting the resulting MSI.
> 
> It's just that in that slow path, as we're translating and injecting
> the MSI, we *also* update the IRQ routing table and reattach the IRQFD
> for the next time that interrupt fires. Like a cache.
> 
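(In QEMU terms I suppose that slow path would look roughly like the
sketch below.  The IRQEntry struct is invented for illustration; the
kvm_irqchip_* helpers are the existing ones in accel/kvm:)

    #include "qemu/osdep.h"
    #include "sysemu/kvm.h"
    #include "hw/pci/pci.h"
    #include "hw/pci/msi.h"

    typedef struct IRQEntry {          /* invented, for illustration */
        int virq;                      /* GSI / routing table slot */
        PCIDevice *dev;                /* originating device */
        EventNotifier *notifier;       /* the device's eventfd */
    } IRQEntry;

    static void slow_path_deliver(KVMState *s, IRQEntry *e,
                                  MSIMessage translated)
    {
        /* Deliver this one interrupt from userspace... */
        kvm_irqchip_send_msi(s, translated);

        /* ...then refill the "cache": update the route and reattach the
         * IRQFD so the next interrupt takes the in-kernel fast path. */
        kvm_irqchip_update_msi_route(s, e->virq, translated, e->dev);
        kvm_irqchip_commit_routes(s);
        kvm_irqchip_add_irqfd_notifier_gsi(s, e->notifier, NULL, e->virq);
    }
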
> And we stash an invalidation cookie (for Intel-IOMMU the IRTE index)
> for each routing table entry, so that when asked to invalidate a
> *range* of cookies (that's how the Intel IOMMU invalidate works), we
> can just detach the IRQFD from those ones, letting them get handled in
> userspace next time and retranslated (with associated fault report, as
> appropriate).

We can maintain a cookie per entry in the routing table in userspace, I
think, and ignore the cookies when applying the table to KVM (perhaps we'd
need to regenerate a separate table when applying, since KVM won't
recognize the cookies?).  Besides that, do we need other infrastructure to
link an entry / GSI back to the device that registered it?  I assume we'd
need to tear down the irqfd if there is one, and rehook the event to a
userspace handler here too.
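
(Something like this is what I'm picturing for the cookie side; all names
below are invented, untested:)

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* The cookie stays on the QEMU side only; the table we hand to KVM
     * via KVM_SET_GSI_ROUTING never contains it. */
    typedef struct RouteCookie {
        uint32_t virq;     /* the routing table entry / GSI */
        uint32_t cookie;   /* e.g. the IRTE index for Intel IR */
        bool attached;     /* is the IRQFD fast path currently live? */
    } RouteCookie;

    static void detach_irqfd(uint32_t virq);   /* hypothetical teardown */

    /* Invalidate a [start, end) range of cookies: drop the IRQFD for the
     * matching entries, so their next interrupt is retranslated in
     * userspace (and can fault if the translation is no longer valid). */
    static void invalidate_cookie_range(RouteCookie *tab, size_t n,
                                        uint32_t start, uint32_t end)
    {
        for (size_t i = 0; i < n; i++) {
            if (tab[i].attached &&
                tab[i].cookie >= start && tab[i].cookie < end) {
                detach_irqfd(tab[i].virq);
                tab[i].attached = false;
            }
        }
    }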

There are four devices that can hook into this, IIUC.  Besides the IOAPIC
and VFIO, there are also ivshmem and vhost.  We'll need to change all four
devices to implement this.

Besides the changeset (which still seems non-trivial to me, even before
evaluating whether it'll be worth the effort), one concern I have right
now is whether delaying the 1st IRQ would cause a regression in any case.

I think you may have better knowledge here than me on how guests behave in
the IRQ subsystem.  For example, is there any case where IRQs are modified
/ invalidated frequently (perhaps mask / unmask?), such that a lot of IRQs
would be delivered more slowly than before?  After this change, the IRQ
setup / cache overhead is paid on the 1st IRQ that fires, rather than when
the IRQ was established / unmasked.

This also reminds me (just out of curiosity) of what happens without IR at
all: say, what if a device sets up a wrong MSI (with a messed-up MSI data
register) on bare metal?  Does QEMU properly emulate that case too?

-- 
Peter Xu



