From: David Woodhouse
Subject: Re: [PATCH] hw/intc/ioapic: Update KVM routes before redelivering IRQ, on RTE update
Date: Thu, 09 Mar 2023 19:56:08 +0000
User-agent: Evolution 3.44.4-0ubuntu1

On Thu, 2023-03-09 at 11:55 -0500, Peter Xu wrote:
> On Thu, Mar 09, 2023 at 09:16:08AM +0000, David Woodhouse wrote:
> > The only IRQs that are handled fully in the kernel are events arriving
> > on some eventfd which is bound as an IRQFD to some IRQ in the KVM
> > routing table. (Mostly MSIs coming from VFIO).
> > 
> > If we want to "pause" one of those, all we have to do is unbind the
> > IRQFD and then we can handle that eventfd in userspace instead. Which
> > is what we do *anyway* in the case where IRQFD support isn't available.
> > 
> > In
> > https://lore.kernel.org/kvm/20201027143944.648769-1-dwmw2@infradead.org/
> > I *optimised* that dance, so userspace doesn't have to stop listening
> > on the eventfd when the IRQFD is bound; the IRQFD code consumes the
> > event first. But we can live without that in older kernels.
> > 
> > Basically, it's just treating the IRQFD support as an *optimisation*,
> > and retaining the 'slow path' of handling the event in userspace and
> > injecting the resulting MSI.
> > 
> > It's just that in that slow path, as we're translating and injecting
> > the MSI, we *also* update the IRQ routing table and reattach the IRQFD
> > for the next time that interrupt fires. Like a cache.
> > 
> > And we stash an invalidation cookie (for Intel-IOMMU the IRTE index)
> > for each routing table entry, so that when asked to invalidate a
> > *range* of cookies (that's how the Intel IOMMU invalidate works), we
> > can just detach the IRQFD from those ones, letting them get handled in
> > userspace next time and retranslated (with associated fault report, as
> > appropriate).
> 
> We can maintain a cookie per entry in the routing table in userspace, I
> think, and ignore those when applying to KVM (perhaps we need to
> regenerate another table when applying, as KVM won't recognize the
> cookies). Besides that, do we need other infrastructure to link this
> entry / GSI back to the device that registered it?  Since I assume we
> need to tear down irqfds if there are any, and rehook the event to a
> userspace handler here too.
> 

KVM doesn't need to recognise the cookies. Only the userspace code
which handles the core IRQ routing table and the IRQFD assignment.

I *have* just abused some S390-based fields for the cookie and a
'stale' bit...

/* 64-bit cookie for IOMMU to use for invalidation choices */
#define ire_ir_cookie(ire) ((ire)->u.adapter.ind_offset)

/* Flags, to indicate a stale entry that needs retranslating */
#define ire_user_flags(ire) ((ire)->u.adapter.summary_offset)
#define IRE_USER_FLAG_STALE             1

... but it could be done in a separate data structure just as well.
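
In QEMU terms the slow path would look something like this. Just a
sketch: the kvm_irqchip_*() calls are the real QEMU APIs, but
iommu_translate_msi() and the routing-entry bookkeeping are invented
here for illustration:

/*
 * Sketch only: handle the event in userspace, then refill the "cache"
 * so the next event for this route goes straight through KVM.
 */
static void slow_path_deliver(KVMState *s, struct kvm_irq_routing_entry *ire,
                              EventNotifier *n, int virq, MSIMessage msg,
                              PCIDevice *dev)
{
    MSIMessage translated;
    uint64_t cookie;

    /* Translate through the emulated IOMMU, noting the IRTE index. */
    if (iommu_translate_msi(&msg, &translated, &cookie)) {
        return; /* translation fault already raised to the guest */
    }

    /* Deliver *this* interrupt from userspace... */
    kvm_irqchip_send_msi(s, translated);

    /* ...then update the route and reattach the IRQFD for next time. */
    kvm_irqchip_update_msi_route(s, virq, translated, dev);
    kvm_irqchip_commit_routes(s);
    ire_ir_cookie(ire) = cookie;
    ire_user_flags(ire) &= ~IRE_USER_FLAG_STALE;
    kvm_irqchip_add_irqfd_notifier_gsi(s, n, NULL, virq);
}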

Given a range of cookies to invalidate, the core code just detaches the
IRQFD for any entry in the KVM IRQ routing table that has a cookie
within that range.
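
Something like this, say (again a sketch; foreach_user_route() and the
UserRoute structure are invented stand-ins for a userspace mirror of
the routing table, while the macros are the ones above):

static void ir_invalidate_range(KVMState *s, uint64_t first, uint64_t last)
{
    UserRoute *ur;

    /* Any route whose cookie falls in the invalidated range goes back
     * to being handled (and retranslated) in userspace next time. */
    foreach_user_route(ur) {
        struct kvm_irq_routing_entry *ire = &ur->kroute;

        if (ire_ir_cookie(ire) < first || ire_ir_cookie(ire) > last) {
            continue;
        }
        ire_user_flags(ire) |= IRE_USER_FLAG_STALE;
        kvm_irqchip_remove_irqfd_notifier_gsi(s, &ur->notifier, ur->virq);
    }
}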

> There are four devices that can hook onto this, IIUC.  Besides IOAPIC and
> VFIO, there's also ivshmem and vhost.  IIUC we'll need to change all
> four devices to implement this.
> 
> Besides the changeset (which still seems non-trivial to me, without yet
> evaluating whether it'll be worth the effort), one concern I have right
> now is whether delaying the 1st IRQ would regress in any case.

It's fine. In QEMU you don't *have* to delay the first IRQ; you *can*
prepopulate the cache at the moment the guest programs a device's MSI
table, for example.
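
Purely as a sketch, assuming some callback when the guest programs or
unmasks an MSI-X entry (the hook itself is invented, but the route
update calls are the real QEMU APIs):

static void msi_entry_programmed(KVMState *s, PCIDevice *dev,
                                 int virq, MSIMessage msg)
{
    /* Same translate-and-cache path as the slow path, just driven by
     * the guest's table write instead of by an actual event. */
    kvm_irqchip_update_msi_route(s, virq, msg, dev);
    kvm_irqchip_commit_routes(s);
}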

In a certain other implementation, we don't prepopulate, so the first
IRQ does get handled in userspace every time, because we want to keep
track of whether a given MSI has *ever* fired. And there's been
absolutely no issue with that latency.

> I think you may have better knowledge here than me on how guest behaves in
> IRQ subsystem.  For example, is there any case where IRQs can be modified /
> invalidated frequently (perhaps mask / unmask?)

Mask/unmask shouldn't invalidate the cache. 

>  so there can be a lot of IRQs delivered more slowly than before?
> Because after this change the IRQ setup / cache overhead will be
> applied to the 1st IRQ being triggered rather than when the IRQ was
> established / unmasked.

We've launched... a lot... of guests with this model and not seen any
issues :)

I'll knock up a prototype in QEMU and we can reason about it far more
coherently. I think it actually ends up being a simplification, leading
to easier-to-understand code.

> This also reminded me (just out of curiosity) of what would happen
> without IR at all: say, what if a device set up a wrong MSI (with a
> messed-up MSI data register) on bare metal?  Does QEMU properly
> emulate that too?

For the most part, MSI isn't special; it's *just* a memory write
transaction. Traditionally, we ask the device to write to some address
in the 0xFEExxxxx range, which happens not to be actual memory, but is
the APIC.

But if you want to just pin a data structure in memory, and give the
device the address of some 32-bit field in that data structure, then
*poll* for completion to see when the device wrote there... that's just
fine.

That would *generally* work in QEMU too, since we mostly raise MSI from
devices by doing that actual stl_le_phys() call.
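
I.e., modulo the per-device bus-master address space and requester-ID
attributes that the real delivery path uses, it boils down to:

static void raise_msi(MSIMessage msg)
{
    /* An MSI is a 32-bit little-endian store of msg.data to
     * msg.address. 0xFEExxxxx hits the APIC; ordinary RAM just gets
     * written, and the guest can poll it. */
    stl_le_phys(&address_space_memory, msg.address, msg.data);
}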

The caveat is the "special" KVM way of encoding APIC IDs > 255, by
putting the high bits of the destination APIC ID into the high bits of
the 64-bit MSI target address. That means that if it were handled as an
actual memory transaction, it would *miss* the APIC at
0x00000000FEExxxxx and really go to memory.

Which is why in some (post-translation) cases we have to treat it
"specially" as an MSI instead of just a memory write. Which I think is
actually the reason for that explicit kvm_set_irq() in ioapic_service()
which I mentioned at the start of this thread.
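
So the post-translation delivery has to do something like this (a
sketch; the exact predicate depends on how KVM's extended destination
ID encoding is negotiated):

static void deliver_msi(KVMState *s, MSIMessage msg)
{
    if (msg.address >> 32) {
        /* High APIC ID bits live in the high address bits; a real
         * memory transaction would miss the APIC, so hand the MSI
         * straight to KVM instead. */
        kvm_irqchip_send_msi(s, msg);
    } else {
        stl_le_phys(&address_space_memory, msg.address, msg.data);
    }
}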

You'll note that when I added basically the same special case to
pci_msi_trigger() in commit 6096cf7877 I felt it warranted at least 5
lines of comment to explain itself :)




