qemu-discuss
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Segfault in hw/scsi/scsi-disk.c caused by null pointer


From: Peter Maydell
Subject: Re: Segfault in hw/scsi/scsi-disk.c caused by null pointer
Date: Tue, 9 Aug 2022 17:15:35 +0100

On Tue, 9 Aug 2022 at 16:37, Denis Krienbühl <denis@href.ch> wrote:
> We are running a Ceph 15.2.16 cluster at work, connect VMs to it using 
> librbd1 14.2.22, running on the QEMU 4.2 release by Ubuntu’s Cloud Archive 
> (4.2-3ubuntu6.23~cloud0).
>
> A few of our VMs randomly trigger a segfault in qemu-system-x86_64. This 
> seems to only happen when lots of volume detach requests are executed in 
> parallel. We can somewhat reliably reproduce this in our staging environment 
> (it still takes thousands of detaches and sometimes hours of running tests 
> until it occurs once). In production it is a lot rarer, but it does occur 
> with some regularity.
>
> Using crash dumps and debug symbols I was able to pin-point the cause of the 
> segfault to a problem with the `scsi_disk_emulate_inquiry` function. Within, 
> QEMU tries to access two SCSIDiskState variables that are both `0x0`, when 
> QEMU expects them to be characters.
>
> 1. `s->version`
> 2. `s->vendor`
>
> I found out that in other parts of the code (`scsi_block_realize`) those 
> variables are not expected to always be valid characters, so I copied that 
> over and created my own release of the distro package with this patch (here 
> diffed against master):
>
> diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
> index efee6739f9..419d6e28f1 100644
> --- a/hw/scsi/scsi-disk.c
> +++ b/hw/scsi/scsi-disk.c
> @@ -775,6 +775,14 @@ static int scsi_disk_emulate_inquiry(SCSIRequest *req, 
> uint8_t *outbuf)
>          return -1;
>      }
>
> +    /* Avoid null-pointers leading to segfaults below */
> +    if (!s->version) {
> +        s->version = g_strdup(qemu_hw_version());
> +    }
> +    if (!s->vendor) {
> +        s->vendor = g_strdup("QEMU");
> +    }
> +
>      /* PAGE CODE == 0 */
>      buflen = req->cmd.xfer;
>      if (buflen > SCSI_MAX_INQUIRY_LEN) {

If these are NULL, then something has gone weirdly wrong (possibly
memory corruption or other Bad Things).

At least in the version of this code in upstream git, the
'realize' method does:

    if (!s->version) {
        s->version = g_strdup(qemu_hw_version());
    }
    if (!s->vendor) {
        s->vendor = g_strdup("QEMU");
    }

which is to say: when the device is created, if the user didn't
specify version/vendor strings via QOM properties, we set them to
these defaults. So at all points in execution after this, these
pointers can't be NULL.

> 1. I don’t have a way to reproduce this in the master.
> 2. I don’t know if this is safe.
> 3. I don’t know if this just masks the issue and the problem lies somewhere 
> deeper.

Number 3.

My wild guess is that there's a race condition somewhere such
that when you're doing this huge amount of detaches, very rarely
a disk is detached and deleted but this INQUIRY request is
incorrectly still sent to the disk (which being a freed object,
might be overwritten with other stuff). But that is purely a guess.

> So.. should this be something I create a bug report for?

If you can repro this on current head-of-git, or at least on
the most recent release, then yes, file a bug report.

thanks
-- PMM



reply via email to

[Prev in Thread] Current Thread [Next in Thread]