qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] pci: Abort if pci_add_capability fails


From: Akihiko Odaki
Subject: Re: [PATCH] pci: Abort if pci_add_capability fails
Date: Wed, 31 Aug 2022 21:32:50 +0900

On Wed, Aug 31, 2022 at 5:18 PM Markus Armbruster <armbru@redhat.com> wrote:
>
> Alex Williamson <alex.williamson@redhat.com> writes:
>
> > On Tue, 30 Aug 2022 13:37:35 +0200
> > Markus Armbruster <armbru@redhat.com> wrote:
> >>        if (!offset) {
> >>            offset = pci_find_space(pdev, size);
> >>            /* out of PCI config space is programming error */
> >>            assert(offset);
> >>        } else {
> >>            /* Verify that capabilities don't overlap.  Note: device 
> >> assignment
> >>             * depends on this check to verify that the device is not 
> >> broken.
> >>             * Should never trigger for emulated devices, but it's helpful
> >>             * for debugging these. */
> >>
> >> The comment makes me suspect that device assignment of a broken device
> >> could trigger the error.  It goes back to
> >>
> >> commit c9abe111209abca1b910e35c6ca9888aced5f183
> >> Author: Jan Kiszka <jan.kiszka@siemens.com>
> >> Date:   Wed Aug 24 14:29:30 2011 +0200
> >>
> >>     pci: Error on PCI capability collisions
> >>
> >>     Nothing good can happen when we overlap capabilities. This may happen
> >>     when plugging in assigned devices or when devices models contain bugs.
> >>     Detect the overlap and report it.
> >>
> >>     Based on qemu-kvm commit by Alex Williamson.
> >>
> >>     Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> >>     Acked-by: Don Dutile <ddutile@redhat.com>
> >>     Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >>
> >> If this is still correct, then your patch is a regression: QEMU is no
> >> longer able to gracefully handle assignment of a broken device.  Does
> >> this matter?  Alex, maybe?
> >
> > Ok, that was a long time ago.  I have some vague memories of hitting
>
> Indeed!
>
> > something like this with a Broadcom NIC, but a google search for the
> > error string doesn't turn up anything recently.  So there's a fair
> > chance this wouldn't break anyone initially.
>
> I like your careful phrasing.
>
> > Even back when the above patch was proposed, there were some
> > suggestions to turn the error path into an abort, which I pushed back
> > on since clearly enumerating capabilities of a device can occur due to
> > a hot-plug and we don't necessarily have control of the device being
> > added.  This is only more true with the possibility of soft-devices out
> > of tree, through things like vfio-user.
>
> Valid point.
>
> When an compiled-in device model asks for overlapping PCI capabilities,
> it's a programming error.
>
> When an assigned device (be it physical or virtual) has overlapping PCI
> capabilities, it's bad input.
>
> Abort on programming error is okay.  Abort on bad input isn't.
>
> I think callers of the former sort should pass &error_abort to
> pci_add_capability(), and callers of the latter sort should handle the
> error.

Looking at vfio's implementation, it seems it makes its own effort to
make capabilities not overlap. Taking account of that, I have sent a
new version titled "[PATCH v2] pci: Assert that capabilities never
overlap", which has an updated message and removes the comment in
pci_add_capability().

>
> > Personally I think the right approach is to support an error path such
> > that we can abort when triggered by a cold-plug device, while simply
> > rejecting a broken hot-plug device, but that seems to be the minority
> > opinion among QEMU developers afaict.  Thanks,
>
> I'm in the "aborting on programming errors is okay" camp.
>
> Recovery from certain programming errors is possible.  However,
> different programming errors can manifest themselves the same way, and
> not all need permit safe recovery.
>
> Say we detect overlapping PCI capabilities when hot-plugging a virtual
> device model.
>
> If this is because the programmer passed an incorrect literal offset, we
> can recover safely by failing the hot plug.
>
> But perhaps the offset comes from a table, and some other bug scribbled
> over it.  Continuing the process is *unsafe*, and may increase the
> damage and/or obstruct the root cause.
>
> The former kind of bug is unlikely to survive even cursory testing.
>

Whatever kind of programming error it is, it is dangerous to attempt
to recover from it using an error handling code path untested for a
long period. I think it is better to have some common code to recover
from assertion failures. For example, we may have some option to print
"oops" just like Linux does when an assertion fails and let the user
handle it. That is quite out of scope of this change though.

Regards,
Akihiko Odaki



reply via email to

[Prev in Thread] Current Thread [Next in Thread]