qemu-devel

Re: hang in virtio-failover-test (s390 host)


From: Peter Maydell
Subject: Re: hang in virtio-failover-test (s390 host)
Date: Thu, 24 Mar 2022 13:01:19 +0000

On Thu, 24 Mar 2022 at 11:53, Laurent Vivier <lvivier@redhat.com> wrote:
>
> On 24/03/2022 12:11, Peter Maydell wrote:
> > This is a backtrace from virtio-failover-test, which had hung
> > on the s390 gitlab CI runner. Both processes were using CPU,
> > so this is some kind of livelock, not a deadlock.
> >
> > Looking more closely at the virtio-net-failover process, in the function
> > test_migrate_off_abort() we have executed 'migrate_cancel' and then go
> > into a loop waiting for 'status' to be "cancelled", with aborts if
> > it is either "failed" or "active". But the status the QEMU process
> > returns is "completed", so we loop forever waiting for a status change
> > that will never come (I assume).
> >
>
> It means the migration has been completed before we tried to cancel it.
> The test doesn't fail but is not valid.
>
> Could you try this:
>
> diff --git a/tests/qtest/virtio-net-failover.c b/tests/qtest/virtio-net-failover.c
> index 80292eecf65f..80cda4ca28ce 100644
> --- a/tests/qtest/virtio-net-failover.c
> +++ b/tests/qtest/virtio-net-failover.c
> @@ -1425,6 +1425,11 @@ static void test_migrate_off_abort(gconstpointer opaque)
>           ret = migrate_status(qts);
>
>           status = qdict_get_str(ret, "status");
> +        if (strcmp(status, "completed") == 0) {
> +            g_test_skip("Failed to cancel the migration");
> +            qobject_unref(ret);
> +            goto out;
> +        }
>           if (strcmp(status, "cancelled") == 0) {
>               qobject_unref(ret);
>               break;
> @@ -1437,6 +1442,7 @@ static void test_migrate_off_abort(gconstpointer opaque)
>       check_one_card(qts, true, "standby0", MAC_STANDBY0);
>       check_one_card(qts, true, "primary0", MAC_PRIMARY0);
>
> +out:
>       qos_object_destroy((QOSGraphObject *)vdev);
>       machine_stop(qts);
>   }

Looks plausible, but I can't currently get this hang to reproduce
(it's probably a fairly rare intermittent failure), so I can't really
test a fix in any meaningful way.

It looks like there are several other loops in other tests in
this file which also need to check for "completed".

I would suggest maybe using check_migration_status() instead
of hand-rolling loops here, except that that function seems
to assert on an unexpected "completed" status whereas we want
the test to skip. It could probably be improved to be usable
here, though.
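
As a rough illustration of the shape such a helper might take, here is
a standalone sketch. It is not the existing check_migration_status()
and does not use libqtest: wait_for_cancel(), enum wait_result, and
the status_fn callback (standing in for migrate_status() plus
qdict_get_str()) are all hypothetical names, and the bounded poll
count is an assumption added so that the loop cannot spin forever.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical outcome of waiting for a cancel request to take effect. */
enum wait_result {
    WAIT_CANCELLED,   /* migration was cancelled as requested */
    WAIT_COMPLETED,   /* migration finished first: caller should skip */
    WAIT_UNEXPECTED,  /* "failed" or "active": caller should fail the test */
    WAIT_TIMEOUT      /* status never settled within max_polls */
};

/* Hypothetical callback: returns the current migration status string. */
typedef const char *(*status_fn)(void *opaque);

/*
 * Poll the status until it reaches a terminal value. "completed" is
 * reported to the caller instead of being asserted on or ignored, so
 * the caller can choose to skip the test.
 */
static enum wait_result wait_for_cancel(status_fn next_status, void *opaque,
                                        int max_polls)
{
    assert(next_status != NULL);

    for (int i = 0; i < max_polls; i++) {
        const char *status = next_status(opaque);

        if (strcmp(status, "cancelled") == 0) {
            return WAIT_CANCELLED;
        }
        if (strcmp(status, "completed") == 0) {
            return WAIT_COMPLETED;
        }
        if (strcmp(status, "failed") == 0 || strcmp(status, "active") == 0) {
            return WAIT_UNEXPECTED;
        }
        /* Any other status (e.g. "cancelling"): keep polling. */
    }
    return WAIT_TIMEOUT;
}

/* Scripted status source, used only to exercise the sketch. */
struct script {
    const char **statuses;
    int len;
    int pos;
};

static const char *script_next(void *opaque)
{
    struct script *s = opaque;
    const char *status = s->statuses[s->pos];

    /* Once the script runs out, keep repeating its last entry. */
    if (s->pos < s->len - 1) {
        s->pos++;
    }
    return status;
}
```

A caller in the test would then map WAIT_COMPLETED to g_test_skip()
and WAIT_UNEXPECTED or WAIT_TIMEOUT to a test failure, rather than
repeating the strcmp() chain in every loop.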

thanks
-- PMM


