qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] monitor: Fix order in monitor_cleanup()


From: Markus Armbruster
Subject: Re: [PATCH] monitor: Fix order in monitor_cleanup()
Date: Mon, 15 Feb 2021 13:17:05 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux)

Kevin Wolf <kwolf@redhat.com> writes:

> Am 29.01.2021 um 13:53 hat Markus Armbruster geschrieben:
>> I ran into odd behavior the other day, and bisected it to this commit.
>> 
>>     $ qemu-system-x86_64 -display none -chardev 
>> socket,id=qmp,path=test-qmp,server=on,wait=off -mon mode=control,chardev=qmp
>> 
>> In another terminal, create a bunch of FIFOs, then use them to have some
>> in-band commands block, with out-of-band commands interleaved just
>> because:
>> 
>>     $ for ((i=0; i<20; i++)); do mkfifo fifo$i; done
>>     $ cat oob-test2
>>     {"execute": "qmp_capabilities", "arguments": {"enable": ["oob"]}}
>>     {"exec-oob": "migrate-pause", "id": 0}
>>     {"execute": "memsave", "id": 1, "arguments": {"val": 0, "size": 4096, 
>> "filename": "fifo1"}}
>>     {"exec-oob": "migrate-pause", "id": 2}
>>     {"execute": "memsave", "id": 3, "arguments": {"val": 0, "size": 4096, 
>> "filename": "fifo3"}}
>>     {"exec-oob": "migrate-pause", "id": 4}
>>     {"execute": "memsave", "id": 5, "arguments": {"val": 0, "size": 4096, 
>> "filename": "fifo5"}}
>>     {"exec-oob": "migrate-pause", "id": 6}
>>     {"execute": "memsave", "id": 7, "arguments": {"val": 0, "size": 4096, 
>> "filename": "fifo7"}}
>>     {"exec-oob": "migrate-pause", "id": 8}
>>     {"execute": "memsave", "id": 9, "arguments": {"val": 0, "size": 4096, 
>> "filename": "fifo9"}}
>>     {"exec-oob": "migrate-pause", "id": 10}
>>     {"execute": "memsave", "id": 11, "arguments": {"val": 0, "size": 4096, 
>> "filename": "fifo11"}}
>>     {"exec-oob": "migrate-pause", "id": 12}
>>     {"execute": "memsave", "id": 13, "arguments": {"val": 0, "size": 4096, 
>> "filename": "fifo13"}}
>>     {"exec-oob": "migrate-pause", "id": 14}
>>     {"execute": "memsave", "id": 15, "arguments": {"val": 0, "size": 4096, 
>> "filename": "fifo15"}}
>>     {"exec-oob": "migrate-pause", "id": 16}
>>     {"execute": "memsave", "id": 17, "arguments": {"val": 0, "size": 4096, 
>> "filename": "fifo17"}}
>>     {"exec-oob": "migrate-pause", "id": 18}
>>     {"execute": "memsave", "id": 19, "arguments": {"val": 0, "size": 4096, 
>> "filename": "fifo19"}}
>>     {"exec-oob": "migrate-pause", "id": 20}
>>     $ socat -t99999 STDIO UNIX-CONNECT:$HOME/work/images/test-qmp <oob-test2
>>     {"QMP": {"version": {"qemu": {"micro": 50, "minor": 1, "major": 5}, 
>> "package": "v5.1.0-2224-g8db1efd3f3"}, "capabilities": ["oob"]}}
>>     {"return": {}}
>>     {"id": 0, "error": {"class": "GenericError", "desc": "migrate-pause is 
>> currently only supported during postcopy-active state"}}
>>     {"id": 2, "error": {"class": "GenericError", "desc": "migrate-pause is 
>> currently only supported during postcopy-active state"}}
>>     {"id": 4, "error": {"class": "GenericError", "desc": "migrate-pause is 
>> currently only supported during postcopy-active state"}}
>>     {"id": 6, "error": {"class": "GenericError", "desc": "migrate-pause is 
>> currently only supported during postcopy-active state"}}
>>     {"id": 8, "error": {"class": "GenericError", "desc": "migrate-pause is 
>> currently only supported during postcopy-active state"}}
>>     {"id": 10, "error": {"class": "GenericError", "desc": "migrate-pause is 
>> currently only supported during postcopy-active state"}}
>>     {"id": 12, "error": {"class": "GenericError", "desc": "migrate-pause is 
>> currently only supported during postcopy-active state"}}
>>     {"id": 14, "error": {"class": "GenericError", "desc": "migrate-pause is 
>> currently only supported during postcopy-active state"}}
>>     {"id": 16, "error": {"class": "GenericError", "desc": "migrate-pause is 
>> currently only supported during postcopy-active state"}}
>> 
>> Looking good: the out-of-band commands jump the queue until the queue is
>> too full for jumping.
>> 
>> Now go back to the first terminal, and hit C-c.
>> 
>> Before this commit, the second terminal shows the shutdown event
>> 
>>     {"timestamp": {"seconds": 1611923623, "microseconds": 528169}, "event": 
>> "SHUTDOWN", "data": {"guest": false, "reason": "host-signal"}}
>> 
>> and the first terminal shows
>> 
>>     ^Cqemu-system-x86_64: terminating on signal 2
>> 
>> QEMU terminates with exit status 0.  Good (except for the exit status,
>> but let's ignore that).
>> 
>> After the commit, the second terminal additionally shows the error reply
>> for (in-band) command 1
>> 
>>     {"id": 1, "error": {"class": "GenericError", "desc": "Could not open 
>> 'fifo1': Interrupted system call"}}
>>     {"timestamp": {"seconds": 1611923812, "microseconds": 520891}, "event": 
>> "SHUTDOWN", "data": {"guest": false, "reason": "host-signal"}}
>> 
>> and the first terminal still shows
>> 
>>     ^Cqemu-system-x86_64: terminating on signal 2
>> 
>> However, QEMU does *not* terminate.  When I hit C-c again, the second
>> terminal gives me the next in-band reply
>> 
>>     {"id": 3, "error": {"class": "GenericError", "desc": "Could not open 
>> 'fifo3': Interrupted system call"}}
>> 
>> and the first one another
>> 
>>     ^C
>> 
>> Hitting C-c some more gives me more in-band replies and more ^C.
>
> The problem is that monitor_qmp_dispatcher_co() doesn't check whether it
> should shut down unless it would have to wait for a new request.
>
> So 'memsave' tries to open the FIFO, this blocks and ^C results in EINTR
> for the open(), which makes the 'memsave' command fail.

Perhaps the command should retry after EINTR.  Out of scope here.

>                                                         Then
> monitor_qmp_dispatcher_co() tries to execute the rest of the queued
> commands, i.e. the next 'memsave' that will hang in the same way.
>
> Fixing this is easy enough:
>
> diff --git a/monitor/qmp.c b/monitor/qmp.c
> index 43880fa623..46939537b4 100644
> --- a/monitor/qmp.c
> +++ b/monitor/qmp.c
> @@ -227,6 +227,10 @@ void coroutine_fn monitor_qmp_dispatcher_co(void *data)
>           */
>          qatomic_mb_set(&qmp_dispatcher_co_busy, false);
>
> +        if (qmp_dispatcher_co_shutdown) {
> +            return;
> +        }
> +
>          while (!(req_obj = monitor_qmp_requests_pop_any_with_lock())) {
>              /*
>               * No more requests to process.  Wait to be reentered from
>
>> The ninth C-c gives me the error reply for (in-band) command 17, and a crash:
>> 
>> Terminal 1 now shows
>> 
>>     ^Cqemu-system-x86_64-qemu: terminating on signal 2
>>     ^C^C^C^C^C^C^C^Cqemu-system-x86_64-qemu: ../util/async.c:343: 
>> aio_ctx_finalize: Assertion `flags & BH_DELETED' failed.
>>     Aborted (core dumped)
>
> So all of this happens inside of monitor_cleanup(), while waiting for
> monitor_qmp_dispatcher_co() to shut down:
>
>     AIO_WAIT_WHILE(qemu_get_aio_context(),
>                    (aio_poll(iohandler_get_aio_context(), false),
>                     qatomic_mb_read(&qmp_dispatcher_co_busy)));
>
> Importantly, this is _after_ calling iothread_stop(), which made sure
> that all pending BHs in the monitor iothread are executed.
>
> What now happens is that monitor_qmp_dispatcher_co() wants to resume
> the monitor. This schedules a new BH on the iothread, which was already
> supposed to be inactive.
>
> When finally all requests are handled and monitor_cleanup() continues
> after the polling loop and calls iothread_destroy(), we notice that
> there is a pending BH where there shouldn't be any and abort.
>
> I think this means that the commit should have moved even the
> iothread_stop() call to below the polling loop. I can't reproduce the
> problem any more with the fix above, but I think the current order in
> monitor_cleanup() is still a (possibly latent) bug.
>
> So why did all of that work before 357bda95?
>
> If the old code didn't crash as described in the commit message, it
> would just free all kinds of monitor resources while the coroutine was
> still running. This includes removing all pending requests from the
> queue. I guess this is what accidentally made it "work" previously.

Awesome.

Thanks for writing up your analysis!




reply via email to

[Prev in Thread] Current Thread [Next in Thread]