qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] monitor: Fix order in monitor_cleanup()


From: Kevin Wolf
Subject: Re: [PATCH] monitor: Fix order in monitor_cleanup()
Date: Fri, 12 Feb 2021 15:22:40 +0100

Am 29.01.2021 um 13:53 hat Markus Armbruster geschrieben:
> I ran into odd behavior the other day, and bisected it to this commit.
> 
>     $ qemu-system-x86_64 -display none -chardev 
> socket,id=qmp,path=test-qmp,server=on,wait=off -mon mode=control,chardev=qmp
> 
> In another terminal, create a bunch of FIFOs, then use them to have some
> in-band commands block, with out-of-band commands interleaved just
> because:
> 
>     $ for ((i=0; i<20; i++)); do mkfifo fifo$i; done
>     $ cat oob-test2
>     {"execute": "qmp_capabilities", "arguments": {"enable": ["oob"]}}
>     {"exec-oob": "migrate-pause", "id": 0}
>     {"execute": "memsave", "id": 1, "arguments": {"val": 0, "size": 4096, 
> "filename": "fifo1"}}
>     {"exec-oob": "migrate-pause", "id": 2}
>     {"execute": "memsave", "id": 3, "arguments": {"val": 0, "size": 4096, 
> "filename": "fifo3"}}
>     {"exec-oob": "migrate-pause", "id": 4}
>     {"execute": "memsave", "id": 5, "arguments": {"val": 0, "size": 4096, 
> "filename": "fifo5"}}
>     {"exec-oob": "migrate-pause", "id": 6}
>     {"execute": "memsave", "id": 7, "arguments": {"val": 0, "size": 4096, 
> "filename": "fifo7"}}
>     {"exec-oob": "migrate-pause", "id": 8}
>     {"execute": "memsave", "id": 9, "arguments": {"val": 0, "size": 4096, 
> "filename": "fifo9"}}
>     {"exec-oob": "migrate-pause", "id": 10}
>     {"execute": "memsave", "id": 11, "arguments": {"val": 0, "size": 4096, 
> "filename": "fifo11"}}
>     {"exec-oob": "migrate-pause", "id": 12}
>     {"execute": "memsave", "id": 13, "arguments": {"val": 0, "size": 4096, 
> "filename": "fifo13"}}
>     {"exec-oob": "migrate-pause", "id": 14}
>     {"execute": "memsave", "id": 15, "arguments": {"val": 0, "size": 4096, 
> "filename": "fifo15"}}
>     {"exec-oob": "migrate-pause", "id": 16}
>     {"execute": "memsave", "id": 17, "arguments": {"val": 0, "size": 4096, 
> "filename": "fifo17"}}
>     {"exec-oob": "migrate-pause", "id": 18}
>     {"execute": "memsave", "id": 19, "arguments": {"val": 0, "size": 4096, 
> "filename": "fifo19"}}
>     {"exec-oob": "migrate-pause", "id": 20}
>     $ socat -t99999 STDIO UNIX-CONNECT:$HOME/work/images/test-qmp <oob-test2
>     {"QMP": {"version": {"qemu": {"micro": 50, "minor": 1, "major": 5}, 
> "package": "v5.1.0-2224-g8db1efd3f3"}, "capabilities": ["oob"]}}
>     {"return": {}}
>     {"id": 0, "error": {"class": "GenericError", "desc": "migrate-pause is 
> currently only supported during postcopy-active state"}}
>     {"id": 2, "error": {"class": "GenericError", "desc": "migrate-pause is 
> currently only supported during postcopy-active state"}}
>     {"id": 4, "error": {"class": "GenericError", "desc": "migrate-pause is 
> currently only supported during postcopy-active state"}}
>     {"id": 6, "error": {"class": "GenericError", "desc": "migrate-pause is 
> currently only supported during postcopy-active state"}}
>     {"id": 8, "error": {"class": "GenericError", "desc": "migrate-pause is 
> currently only supported during postcopy-active state"}}
>     {"id": 10, "error": {"class": "GenericError", "desc": "migrate-pause is 
> currently only supported during postcopy-active state"}}
>     {"id": 12, "error": {"class": "GenericError", "desc": "migrate-pause is 
> currently only supported during postcopy-active state"}}
>     {"id": 14, "error": {"class": "GenericError", "desc": "migrate-pause is 
> currently only supported during postcopy-active state"}}
>     {"id": 16, "error": {"class": "GenericError", "desc": "migrate-pause is 
> currently only supported during postcopy-active state"}}
> 
> Looking good: the out-of-band commands jump the queue until the queue is
> too full for jumping.
> 
> Now go back to the first terminal, and hit C-c.
> 
> Before this commit, the second terminal shows the shutdown event
> 
>     {"timestamp": {"seconds": 1611923623, "microseconds": 528169}, "event": 
> "SHUTDOWN", "data": {"guest": false, "reason": "host-signal"}}
> 
> and the first terminal shows
> 
>     ^Cqemu-system-x86_64: terminating on signal 2
> 
> QEMU terminates with exit status 0.  Good (except for the exit status,
> but let's ignore that).
> 
> After the commit, the second terminal additionally shows the error reply
> for (in-band) command 1
> 
>     {"id": 1, "error": {"class": "GenericError", "desc": "Could not open 
> 'fifo1': Interrupted system call"}}
>     {"timestamp": {"seconds": 1611923812, "microseconds": 520891}, "event": 
> "SHUTDOWN", "data": {"guest": false, "reason": "host-signal"}}
> 
> and the first terminal still shows
> 
>     ^Cqemu-system-x86_64: terminating on signal 2
> 
> However, QEMU does *not* terminate.  When I hit C-c again, the second
> terminal gives me the next in-band reply
> 
>     {"id": 3, "error": {"class": "GenericError", "desc": "Could not open 
> 'fifo3': Interrupted system call"}}
> 
> and the first one another
> 
>     ^C
> 
> Hitting C-c some more gives me more in-band replies and more ^C.

The problem is that monitor_qmp_dispatcher_co() doesn't check whether it
should shut down unless it would have to wait for a new request.

So 'memsave' tries to open the FIFO, this blocks and ^C results in EINTR
for the open(), which makes the 'memsave' command fail. Then
monitor_qmp_dispatcher_co() tries to execute the rest of the queued
commands, i.e. the next 'memsave' that will hang in the same way.

Fixing this is easy enough:

diff --git a/monitor/qmp.c b/monitor/qmp.c
index 43880fa623..46939537b4 100644
--- a/monitor/qmp.c
+++ b/monitor/qmp.c
@@ -227,6 +227,10 @@ void coroutine_fn monitor_qmp_dispatcher_co(void *data)
          */
         qatomic_mb_set(&qmp_dispatcher_co_busy, false);

+        if (qmp_dispatcher_co_shutdown) {
+            return;
+        }
+
         while (!(req_obj = monitor_qmp_requests_pop_any_with_lock())) {
             /*
              * No more requests to process.  Wait to be reentered from

> The ninth C-c gives me the error reply for (in-band) command 17, and a crash:
> 
> Terminal 1 now shows
> 
>     ^Cqemu-system-x86_64-qemu: terminating on signal 2
>     ^C^C^C^C^C^C^C^Cqemu-system-x86_64-qemu: ../util/async.c:343: 
> aio_ctx_finalize: Assertion `flags & BH_DELETED' failed.
>     Aborted (core dumped)

So all of this happens inside of monitor_cleanup(), while waiting for
monitor_qmp_dispatcher_co() to shut down:

    AIO_WAIT_WHILE(qemu_get_aio_context(),
                   (aio_poll(iohandler_get_aio_context(), false),
                    qatomic_mb_read(&qmp_dispatcher_co_busy)));

Importantly, this is _after_ calling iothread_stop(), which made sure
that all pending BHs in the monitor iothread are executed.

What now happens is that monitor_qmp_dispatcher_co() wants to resume
the monitor. This schedules a new BH on the iothread, which was already
supposed to be inactive.

When finally all requests are handled and monitor_cleanup() continues
after the polling loop and calls iothread_destroy(), we notice that
there is a pending BH where there shouldn't be any and abort.

I think this means that the commit should have moved even the
iothread_stop() call to below the polling loop. I can't reproduce the
problem any more with the fix above, but I think the current order in
monitor_cleanup() is still a (possibly latent) bug.

So why did all of that work before 357bda95?

If the old code didn't crash as described in the commit message, it
would just free all kinds of monitor resources while the coroutine was
still running. This includes removing all pending requests from the
queue. I guess this is what accidentally made it "work" previously.

Kevin




reply via email to

[Prev in Thread] Current Thread [Next in Thread]