From: Peter Xu
Subject: Re: [RFC PATCH 1/3] migration/multifd: Move channels_ready semaphore
Date: Tue, 10 Oct 2023 17:40:59 -0400

On Tue, Oct 10, 2023 at 05:00:37PM -0400, Peter Xu wrote:
> On Fri, Sep 22, 2023 at 11:53:17AM -0300, Fabiano Rosas wrote:
> > Commit d2026ee117 ("multifd: Fix the number of channels ready") moved
> > the "post" of channels_ready to the start of the multifd_send_thread()
> > loop and added a missing "wait" at multifd_send_sync_main(). While it
> > does work, the placement of the wait goes against what the rest of the
> > code does.
> > 
> > The sequence at multifd_send_thread() is:
> > 
> >     qemu_sem_post(&multifd_send_state->channels_ready);
> >     qemu_sem_wait(&p->sem);
> >     <work>
> >     if (flags & MULTIFD_FLAG_SYNC) {
> >         qemu_sem_post(&p->sem_sync);
> >     }
> > 
> > Which means that the sending thread makes itself available
> > (channels_ready) and waits for more work (sem). So the sequence in the
> > migration thread should be to check if any channel is available
> > (channels_ready), give it some work and set it off (sem):
> > 
> >     qemu_sem_wait(&multifd_send_state->channels_ready);
> 
> Here it means we have at least 1 free send thread, then...
> 
> >     <enqueue work>
> >     qemu_sem_post(&p->sem);
> 
> ... here we enqueue some work for the current thread (pointed to by "i"),
> no matter whether it's free or not, as "i" may not always point to a free
> thread.
> 
> >     if (flags & MULTIFD_FLAG_SYNC) {
> >         qemu_sem_wait(&p->sem_sync);
> >     }
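
Side note: to visualize it, the pairing above boils down to something like the
following minimal stand-alone sketch.  It uses POSIX semaphores/mutexes instead
of QEMU's qemu_sem_*/qemu_mutex_* wrappers, and the Channel fields and helper
names are simplified and hypothetical, not the actual multifd code:

  #include <pthread.h>
  #include <semaphore.h>
  #include <stdbool.h>
  #include <stdio.h>

  #define FLAG_SYNC 1

  typedef struct {
      sem_t sem;            /* work available for this channel (p->sem)       */
      sem_t sem_sync;       /* channel finished a SYNC job (p->sem_sync)      */
      pthread_mutex_t lock; /* stands in for p->mutex                         */
      int flags;            /* MULTIFD_FLAG_SYNC gets OR'ed in here           */
      bool quit;
  } Channel;

  static sem_t channels_ready;   /* multifd_send_state->channels_ready */

  static void *send_thread(void *opaque)
  {
      Channel *p = opaque;

      for (;;) {
          sem_post(&channels_ready);   /* "I am available"                    */
          sem_wait(&p->sem);           /* wait for more work                  */

          pthread_mutex_lock(&p->lock);
          int flags = p->flags;        /* snapshot the flags for this job     */
          bool quit = p->quit;
          p->flags = 0;
          pthread_mutex_unlock(&p->lock);

          if (quit) {
              return NULL;
          }

          /* <work>: fill and send one packet here */

          if (flags & FLAG_SYNC) {
              sem_post(&p->sem_sync);  /* sync job done, wake the main thread */
          }
      }
  }

  /* Migration-thread side: wait for a free channel, then hand work to "i". */
  static void enqueue_work(Channel *p, int flags)
  {
      sem_wait(&channels_ready);       /* at least one channel is idle        */

      pthread_mutex_lock(&p->lock);
      p->flags |= flags;               /* <enqueue work>                      */
      pthread_mutex_unlock(&p->lock);

      sem_post(&p->sem);               /* kick channel "i"                    */

      if (flags & FLAG_SYNC) {
          sem_wait(&p->sem_sync);      /* wait for the sync to complete       */
      }
  }

  int main(void)
  {
      Channel ch = { .flags = 0, .quit = false };
      pthread_t tid;

      sem_init(&channels_ready, 0, 0);
      sem_init(&ch.sem, 0, 0);
      sem_init(&ch.sem_sync, 0, 0);
      pthread_mutex_init(&ch.lock, NULL);
      pthread_create(&tid, NULL, send_thread, &ch);

      enqueue_work(&ch, 0);            /* a normal page job */
      enqueue_work(&ch, FLAG_SYNC);    /* a sync job        */

      pthread_mutex_lock(&ch.lock);
      ch.quit = true;
      pthread_mutex_unlock(&ch.lock);
      sem_post(&ch.sem);
      pthread_join(tid, NULL);
      printf("both jobs completed\n");
      return 0;
  }

The point is only the ordering: the worker advertises itself via channels_ready
before blocking on its own sem, and the producer consumes one channels_ready
credit per job it hands out.
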
> 
> So I must confess I have never fully digested how these sem/mutex/.. work in
> multifd, since the 1st day they were introduced.. so please take the comment
> below with a grain of salt..
> 
> It seems to me that the current design allows >1 pending_job for a thread.
> Here the current code doesn't do "wait(channels_ready)" because it doesn't
> need to - it simply always queues a MULTIFD_FLAG_SYNC pending job on the
> thread, and waits for it to run.
> 
> From that POV I think I can understand why "wait(channels_ready)" is not
> needed here.  But then I'm confused because we don't have a real QUEUE to
> put those requests in; we simply apply this:
> 
> multifd_send_sync_main():
>         p->flags |= MULTIFD_FLAG_SYNC;
> 
> This happens even if the send thread is busy handling a batch of pages and
> accessing p->flags.  I think it can actually race with the send thread
> reading the flag at the exact same time:
> 
> multifd_send_thread():
>             multifd_send_fill_packet(p);
>             flags = p->flags;  <-------------- here
> 
> And whether it sees MULTIFD_FLAG_SYNC is unpredictable.  If it sees it,
> it'll post(sem_sync) in this round.  If it doesn't see it, it'll
> post(sem_sync) in the next round.  Either way, we'll generate an empty
> multifd packet on the wire I think, even though I don't know whether that's
> needed at all...

A correction: since it's protected by p->mutex, I think we will only get an
empty multifd packet when we have pending_jobs==2.. because then we'll see
pending_job==2 with p->flags==SYNC, send the pages along with flags=SYNC to
dest, and kick sem_sync on src; after that we find another pending_jobs==1
even though p->flags is now zero.  The next multifd packet will contain only
the header (flags=0) and no pages.
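
To make that concrete, here is a tiny stand-alone walk-through.  It is not
QEMU code, it only simulates the bookkeeping described above to show why
pending_jobs==2 plus a SYNC flag produces one packet carrying pages+SYNC
followed by one header-only packet:

  #include <stdio.h>

  #define FLAG_SYNC 1

  int main(void)
  {
      /* State of one send channel right after multifd_send_sync_main() has
       * OR'ed in MULTIFD_FLAG_SYNC while a page job was already queued. */
      int pending_jobs = 2;      /* one page job + one sync job              */
      int flags = FLAG_SYNC;     /* p->flags |= MULTIFD_FLAG_SYNC            */
      int queued_pages = 128;    /* pages of the first job (made-up number)  */

      while (pending_jobs > 0) {
          /* fill one packet: consume whatever is queued right now */
          int pages_in_packet = queued_pages;
          int packet_flags = flags;
          queued_pages = 0;
          flags = 0;
          pending_jobs--;

          /* (on the source, post(sem_sync) would happen here when
           *  packet_flags contains SYNC) */
          printf("packet: pages=%d flags=%s\n",
                 pages_in_packet, (packet_flags & FLAG_SYNC) ? "SYNC" : "0");
          /* 1st iteration: pages=128 flags=SYNC  (pages + sync together)
           * 2nd iteration: pages=0   flags=0     (header-only packet)      */
      }
      return 0;
  }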

> 
> I'm not sure whether we should fix it in a more complete form, by not
> sending that empty multifd packet at all?  It only contains the header
> without any real page inside, IIUC, so it seems to be a waste of
> resources.  What we want here is only to kick sem_sync?

Thinking about it more, now I'm unsure whether sync is really working as
expected in general..

IIUC the SYNC message is used to flush all pages from source to destination.
We need that because we want to order the different versions of guest
pages, making sure the new version of a page always arrives later than its
old version, so that after all pages are migrated we can be sure all guest
pages on dest are the latest.

Let's define "version X for page Y" as PyVx.  Version 1 of page 2 is P2V1.

So if without SYNC, a race can happen like this:

  sender 1         sender 2           receiver 1            receiver 2
  --------         --------           ----------            ----------

  send P1V1
          ...P1 changed content.. queued again in sender 2...
                   send P1V2
          ...If we got unlucky on receiving order of P1 versions...
                                                            receive P1V2
                                      receive P1V1

So if receiver 1 gets P1V1 after receiver 2 got P1V2, we'll ultimately have
P1V1 on dst, which is old data, causing data corruption after migration.

Now we have the SYNC packet, but would it always work?  I'll discuss it
using the latest RAM_SAVE_FLAG_MULTIFD_FLUSH sync message:

  src main    sender 1     sender 2    dst main  receiver 1  receiver 2
  --------    --------     --------    --------  ----------  ----------

              send P1V1
  send MULTIFD_FLUSH   
          ...P1 changed.. queued again in sender 2...
                           send P1V2
                                       receive MULTIFD_FLUSH
                                       (but since nothing received, flush nothing)
                                                             receive P1V2
                                                 receive P1V1

IIUC the problem is that MULTIFD_FLUSH currently does not depend on dest QEMU
having received all the pages sent so far.  Since the main channel is also a
separate channel from the other multifd channels, I don't see why the above
cannot happen.

I think the problem would go away if, e.g., src QEMU required a SYNC_ACK
from dest QEMU, making sure dest QEMU has digested all the sent pages.  Or,
if we always sent the same page via the same channel, e.g. chosen by a
hash(page_index).
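
For the second idea, here is a tiny hypothetical sketch of picking the multifd
channel from the page index, so two versions of the same page can never be
reordered across channels.  The names are made up and this is not an actual
patch, only an illustration of the hashing idea:

  #include <stdint.h>
  #include <stdio.h>

  static int channel_for_page(uint64_t page_index, int num_channels)
  {
      /* Any stable function of the page index works; plain modulo is already
       * enough to give the per-page ordering guarantee. */
      return (int)(page_index % (uint64_t)num_channels);
  }

  int main(void)
  {
      int nch = 4;  /* hypothetical number of multifd channels */

      /* P1V1 and P1V2 always land on the same channel and therefore arrive
       * in send order; P2 may use a different channel. */
      printf("page 1, V1 -> channel %d\n", channel_for_page(1, nch));
      printf("page 1, V2 -> channel %d\n", channel_for_page(1, nch));
      printf("page 2, V1 -> channel %d\n", channel_for_page(2, nch));
      return 0;
  }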

I have a feeling that we have a bug and just never hit it, because we don't
send P1V1 and P1V2 that close together; we only do that once per whole
iteration looping over all ramblocks in find_dirty_block().  But the bug
seems possible, and I'll be very happy to be proven wrong by anyone..

-- 
Peter Xu



