qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v2 1/1] osdep: asynchronous teardown for shutdown on Linux


From: Daniel P . Berrangé
Subject: Re: [PATCH v2 1/1] osdep: asynchronous teardown for shutdown on Linux
Date: Thu, 4 Aug 2022 17:56:34 +0100
User-agent: Mutt/2.2.6 (2022-06-05)

On Wed, Aug 03, 2022 at 07:31:41PM +0200, Claudio Imbrenda wrote:
> This patch adds support for asynchronously tearing down a VM on Linux.
> 
> When qemu terminates, either naturally or because of a fatal signal,
> the VM is torn down. If the VM is huge, it can take a considerable
> amount of time for it to be cleaned up. In case of a protected VM, it
> might take even longer than a non-protected VM (this is the case on
> s390x, for example).
> 
> Some users might want to shut down a VM and restart it immediately,
> without having to wait. This is especially true if management
> infrastructure like libvirt is used.
> 
> This patch implements a simple trick on Linux to allow qemu to return
> immediately, with the teardown of the VM being performed
> asynchronously.
> 
> If the new commandline option -async-teardown is used, a new process is
> spawned from qemu at startup, using the clone syscall, in such way that
> it will share its address space with qemu.
> 
> The new process will then simpy wait until qemu terminates, and then it
> will exit itself.
> 
> This allows qemu to terminate quickly, without having to wait for the
> whole address space to be torn down. The teardown process will exit
> after qemu, so it will be the last user of the address space, and
> therefore it will take care of the actual teardown.
> 
> The teardown process will share the same cgroups as qemu, so both
> memory usage and cpu time will be accounted properly.
> 
> This feature can already be used with libvirt by adding the following
> to the XML domain definition:
> 
>   <commandline xmlns="http://libvirt.org/schemas/domain/qemu/1.0";>
>   <arg value='-async-teardown'/>
>   </commandline>
> 
> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
> ---
>  include/qemu/osdep.h |  2 ++
>  os-posix.c           |  5 ++++
>  qemu-options.hx      | 17 ++++++++++++++
>  util/osdep.c         | 55 ++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 79 insertions(+)


> diff --git a/util/osdep.c b/util/osdep.c
> index 60fcbbaebe..bb0baf97a0 100644
> --- a/util/osdep.c
> +++ b/util/osdep.c
> @@ -23,6 +23,15 @@
>   */
>  #include "qemu/osdep.h"
>  #include "qapi/error.h"
> +
> +#ifdef CONFIG_LINUX
> +#include <sys/types.h>
> +#include <sys/select.h>
> +#include <sys/unistd.h>
> +#include <sys/syscall.h>
> +#include <signal.h>
> +#endif
> +
>  #include "qemu/cutils.h"
>  #include "qemu/sockets.h"
>  #include "qemu/error-report.h"
> @@ -512,6 +521,52 @@ const char *qemu_hw_version(void)
>      return hw_version;
>  }
>  
> +#ifdef __linux__
> +static int async_teardown_fn(void *arg)
> +{
> +    sigset_t all_signals;
> +    fd_set r, w, e;
> +    int fd;
> +
> +    /* open a pidfd descriptor for the parent qemu process */
> +    fd = syscall(__NR_pidfd_open, getppid(), 0);

We ought to open this FD in the parent process to avoid a race
where the parent crashes immediately after clone() and gets
reparented to 'init' before this child process calls pidfd_open,
otherwise it'll sit around waiting for init to exit.

> +    /* if something went wrong, or if the file descriptor is too big */
> +    if ((fd < 0) || (fd >= FD_SETSIZE)) {
> +        _exit(1);
> +    }
> +    /* zero all fd sets */
> +    FD_ZERO(&r);
> +    FD_ZERO(&w);
> +    FD_ZERO(&e);
> +    /* set the fd for the pidfd in the "read" set */
> +    FD_SET(fd, &r);
> +    /* block all signals */
> +    sigfillset(&all_signals);
> +    sigprocmask(SIG_BLOCK, &all_signals, NULL);

Technnically this is racy as there's still a window in which the
child is running when signals are not blocked.

> +    /* wait for the pid to disappear -> fd will appear as ready for read */
> +    (void) select(fd + 1, &r, &w, &e, NULL);

While using pidfd can work, a stronger protection would be to use

   prctl(PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0);

this guarantees that the kernel will deliver SIGKILL to this
process immediately when the parent QEMU exits.

We should probably do both in fact.

> +
> +    /*
> +     * Close all file descriptors that might have been inherited from the
> +     * main qemu process when doing clone. This is needed to make libvirt
> +     * happy.
> +     */
> +    close_range(0, ~0U, 0);

Shouldn't we close all the FDs immediately when this process is
created, rather than only at the end when we're exiting. I don't
see there's any need to keep them open. Doing it immediately
would be better when using prctl(PR_SET_PDEATHSIG)

> +    _exit(0);
> +}
> +
> +void init_async_teardown(void)
> +{
> +    const int size = 8192; /* should be more than enough */
> +    char *stack = malloc(size);
> +

You need to block all signals here.

> +    /* start a new process sharing the address space with qemu */
> +    clone(async_teardown_fn, stack + size, CLONE_VM, NULL, NULL, NULL, NULL);

And unblock signals again here.

That way the "everything blocked"  mask is inherited by the child
from the very first moment of its existance.

> +}
> +#else /* __linux__ */
> +void init_async_teardown(void) {}
> +#endif
> +
>  #ifdef _WIN32
>  static void socket_cleanup(void)
>  {
> -- 
> 2.37.1
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




reply via email to

[Prev in Thread] Current Thread [Next in Thread]