qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] pci: Abort if pci_add_capability fails


From: Markus Armbruster
Subject: Re: [PATCH] pci: Abort if pci_add_capability fails
Date: Wed, 31 Aug 2022 10:18:22 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)

Alex Williamson <alex.williamson@redhat.com> writes:

> On Tue, 30 Aug 2022 13:37:35 +0200
> Markus Armbruster <armbru@redhat.com> wrote:
>>        if (!offset) {
>>            offset = pci_find_space(pdev, size);
>>            /* out of PCI config space is programming error */
>>            assert(offset);
>>        } else {
>>            /* Verify that capabilities don't overlap.  Note: device 
>> assignment
>>             * depends on this check to verify that the device is not broken.
>>             * Should never trigger for emulated devices, but it's helpful
>>             * for debugging these. */
>> 
>> The comment makes me suspect that device assignment of a broken device
>> could trigger the error.  It goes back to
>> 
>> commit c9abe111209abca1b910e35c6ca9888aced5f183
>> Author: Jan Kiszka <jan.kiszka@siemens.com>
>> Date:   Wed Aug 24 14:29:30 2011 +0200
>> 
>>     pci: Error on PCI capability collisions
>>     
>>     Nothing good can happen when we overlap capabilities. This may happen
>>     when plugging in assigned devices or when devices models contain bugs.
>>     Detect the overlap and report it.
>>     
>>     Based on qemu-kvm commit by Alex Williamson.
>>     
>>     Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
>>     Acked-by: Don Dutile <ddutile@redhat.com>
>>     Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>> 
>> If this is still correct, then your patch is a regression: QEMU is no
>> longer able to gracefully handle assignment of a broken device.  Does
>> this matter?  Alex, maybe?
>
> Ok, that was a long time ago.  I have some vague memories of hitting

Indeed!

> something like this with a Broadcom NIC, but a google search for the
> error string doesn't turn up anything recently.  So there's a fair
> chance this wouldn't break anyone initially.

I like your careful phrasing.

> Even back when the above patch was proposed, there were some
> suggestions to turn the error path into an abort, which I pushed back
> on since clearly enumerating capabilities of a device can occur due to
> a hot-plug and we don't necessarily have control of the device being
> added.  This is only more true with the possibility of soft-devices out
> of tree, through things like vfio-user.

Valid point.

When an compiled-in device model asks for overlapping PCI capabilities,
it's a programming error.

When an assigned device (be it physical or virtual) has overlapping PCI
capabilities, it's bad input.

Abort on programming error is okay.  Abort on bad input isn't.

I think callers of the former sort should pass &error_abort to
pci_add_capability(), and callers of the latter sort should handle the
error.

> Personally I think the right approach is to support an error path such
> that we can abort when triggered by a cold-plug device, while simply
> rejecting a broken hot-plug device, but that seems to be the minority
> opinion among QEMU developers afaict.  Thanks,

I'm in the "aborting on programming errors is okay" camp.

Recovery from certain programming errors is possible.  However,
different programming errors can manifest themselves the same way, and
not all need permit safe recovery.

Say we detect overlapping PCI capabilities when hot-plugging a virtual
device model.

If this is because the programmer passed an incorrect literal offset, we
can recover safely by failing the hot plug.

But perhaps the offset comes from a table, and some other bug scribbled
over it.  Continuing the process is *unsafe*, and may increase the
damage and/or obstruct the root cause.

The former kind of bug is unlikely to survive even cursory testing.




reply via email to

[Prev in Thread] Current Thread [Next in Thread]