qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] tests/qtest/migration-test: Disable migration/multifd/tcp/pl


From: Daniel P . Berrangé
Subject: Re: [PATCH] tests/qtest/migration-test: Disable migration/multifd/tcp/plain/cancel
Date: Tue, 14 Mar 2023 17:48:04 +0000
User-agent: Mutt/2.2.9 (2022-11-12)

On Tue, Mar 14, 2023 at 12:46:34PM -0400, Peter Xu wrote:
> On Tue, Mar 14, 2023 at 10:11:53AM +0000, Dr. David Alan Gilbert wrote:
> > OK, I think I kind of see what's happening here, one for Peter Xu.
> > If I'm right it's a race something like:
> >   a) The test harness tells the source it wants to enter postcopy
> >   b) The harness then waits for the source to stop
> >   c) ... and the dest to start 
> > 
> >   It's blocked on one of b&c but can't tell which
> > 
> >   d) The main thread in the dest is waiting for the postcopy recovery fd
> >     to be opened
> >   e) But I think the source is still trying to send normal precopy RAM
> >     and perhaps hasn't got around yet to opening that socket yet????
> >   f) But I think the dest isn't reading from the main channel at that
> >     point because of (d)
> 
> I think this analysis is spot on.  Thanks Dave!
> 
> Src qemu does this with below order:
> 
>         1. setup preempt channel
>         1.1. connect()  --> this is done in another thread
>         1.2. sem_wait(postcopy_qemufile_src_sem) --> make sure it's created
>         2. prepare postcopy package (LISTEN, non-iterable states, ping-3, RUN)
>         3. send the package
> 
> So logically the sequence is guaranteed so that when LISTEN cmd is
> processed, we should have connect()ed already.
> 
> But I think there's one thing missing on dest.. since the accept() on the
> dest node should be run in the main thread, meanwhile the LISTEN cmd is
> also processed on the main thread, even if the listening socket is trying
> to kick the main thread to do the accept() (so the connection has
> established) it won't be able to kick the final accept() as main thread is
> waiting in the semaphore.  That caused a deadlock.
> 
> A simple fix I can think of is moving the wait channel operation outside
> the main thread, e.g. to the preempt thread.
> 
> I've attached that simple fix.  Peter, is it easy to verify it?  I'm not
> sure the reproducability, fine by me too if it's easier to just disable
> preempt tests for 8.0 release.
> 
> Thanks,
> 
> -- 
> Peter Xu

> From 92f2f90d2eb270ca158479bfd9a5a855ec7ddf4d Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@redhat.com>
> Date: Tue, 14 Mar 2023 12:24:02 -0400
> Subject: [PATCH] migration: Wait on preempt channel in preempt thread
> 
> QEMU main thread will wait until dest preempt channel established during
> processing the LISTEN command (within the whole postcopy PACKAGED data), by
> waiting on the semaphore postcopy_qemufile_dst_done.
> 
> That's racy, because it's possible that the dest QEMU main thread hasn't
> yet accept()ed the new connection when processing the LISTEN event.  The
> sem_wait() will yield the main thread without being able to run anything
> else including the accept() of the new socket, which can cause deadlock
> within the main thread.
> 
> To avoid the race, move the "wait channel" from main thread to the preempt
> thread right at the start.
> 
> Reported-by: Peter Maydell <peter.maydell@linaro.org>
> Fixes: 5655aab079 ("migration: Postpone postcopy preempt channel to be after 
> main")
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  migration/postcopy-ram.c | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)

Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>


This description of the bug & proposed fixed matches what I could
infer as the flaw from the stack trace's Peter provided.

> 
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index f54f44d899..41c0713650 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -1197,11 +1197,6 @@ int postcopy_ram_incoming_setup(MigrationIncomingState 
> *mis)
>      }
>  
>      if (migrate_postcopy_preempt()) {
> -        /*
> -         * The preempt channel is established in asynchronous way.  Wait
> -         * for its completion.
> -         */
> -        qemu_sem_wait(&mis->postcopy_qemufile_dst_done);
>          /*
>           * This thread needs to be created after the temp pages because
>           * it'll fetch RAM_CHANNEL_POSTCOPY PostcopyTmpPage immediately.
> @@ -1668,6 +1663,12 @@ void *postcopy_preempt_thread(void *opaque)
>  
>      qemu_sem_post(&mis->thread_sync_sem);
>  
> +    /*
> +     * The preempt channel is established in asynchronous way.  Wait
> +     * for its completion.
> +     */
> +    qemu_sem_wait(&mis->postcopy_qemufile_dst_done);
> +
>      /* Sending RAM_SAVE_FLAG_EOS to terminate this thread */
>      qemu_mutex_lock(&mis->postcopy_prio_thread_mutex);
>      while (1) {
> -- 
> 2.39.1
> 


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




reply via email to

[Prev in Thread] Current Thread [Next in Thread]