qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH 0/2] dump-guest-memory: Add blocker for migration


From: Peter Xu
Subject: Re: [PATCH 0/2] dump-guest-memory: Add blocker for migration
Date: Fri, 3 Sep 2021 12:08:42 -0400

On Wed, Sep 01, 2021 at 01:35:18PM +0200, Markus Armbruster wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > Markus,
> >
> > On Wed, Aug 25, 2021 at 09:54:12AM +0200, Markus Armbruster wrote:
> >> Peter Xu <peterx@redhat.com> writes:
> >> 
> >> > Both dump-guest-memory and live migration have vm state cached 
> >> > internally.
> >> > Allowing them to happen together means the vm state can be messed up.  
> >> > Simply
> >> > block live migration for dump-guest-memory.
> >> >
> >> > One trivial thing to mention is we should still allow dump-guest-memory 
> >> > even if
> >> > -only-migratable is specified, because that flag should majorly be used 
> >> > to
> >> > guarantee not adding devices that will block migration by accident.  
> >> > Dump guest
> >> > memory is not like that - it'll only block for the seconds when it's 
> >> > dumping.
> >> 
> >> I recently ran into a similarly unusual use of migration blockers:
> >> 
> >>     Subject: -only-migrate and the two different uses of migration blockers
> >>      (was: spapr_events: Sure we may ignore migrate_add_blocker() failure?)
> >>     Date: Mon, 19 Jul 2021 13:00:20 +0200 (5 weeks, 1 day, 20 hours ago)
> >>     Message-ID: <87sg0amuuz.fsf_-_@dusky.pond.sub.org>
> >> 
> >>     We appear to use migration blockers in two ways:
> >> 
> >>     (1) Prevent migration for an indefinite time, typically due to use of
> >>     some feature that isn't compatible with migration.
> >> 
> >>     (2) Delay migration for a short time.
> >> 
> >>     Option -only-migrate is designed for (1).  It interferes with (2).
> >> 
> >>     Example for (1): device "x-pci-proxy-dev" doesn't support migration.  
> >> It
> >>     adds a migration blocker on realize, and deletes it on unrealize.  With
> >>     -only-migrate, device realize fails.  Works as designed.
> >> 
> >>     Example for (2): spapr_mce_req_event() makes an effort to prevent
> >>     migration degrate the reporting of FWNMIs.  It adds a migration blocker
> >>     when it receives one, and deletes it when it's done handling it.  This
> >>     is a best effort; if migration is already in progress by the time FWNMI
> >>     is received, we simply carry on, and that's okay.  However, option
> >>     -only-migrate sabotages the best effort entirely.
> >> 
> >>     While this isn't exactly terrible, it may be a weakness in our thinking
> >>     and our infrastructure.  I'm bringing it up so the people in charge are
> >>     aware :)
> >> 
> >> https://lists.nongnu.org/archive/html/qemu-devel/2021-07/msg04723.html
> >> 
> >> Downthread there, Dave Gilbert opined
> >> 
> >>     It almost feels like they need a way to temporarily hold off
> >>     'completion' of migratio - i.e. the phase where we stop the CPU and
> >>     write the device data;  mind you you'd also probably want it to stop
> >>     cold-migrates/snapshots?
> >
> > Yeah, maybe spapr_mce_req_event() can be another candidate of the internal
> > version of migration_add_blocker().
> >
> > I can add a patch to replace it if anyone likes me to.
> >
> > Both cold and live snapshot should have checked migration blockers, I think.
> > E.g., cold snapshot has:
> >
> > bool save_snapshot(const char *name, bool overwrite, const char *vmstate,
> >                   bool has_devices, strList *devices, Error **errp)
> > {
> >     [...]
> >     if (migration_is_blocked(errp)) {
> >         return false;
> >     }
> >     [...]
> > }
> >
> > While the live snapshot shares similar code in migrate_prepare().
> >
> > So looks safe that nothing wrong should happen within add/del pair of 
> > blockers.
> >
> > However I do see that it's possible we'll allow the add_blocker to succeed 
> > even
> > if during cold snapshot, because migration_is_idle() in 
> > migration_add_blocker()
> > only checks migration state, not RUN_STATE_SAVE_VM.  So I'm wondering 
> > whether
> > we'd like one more patch to cover that too, like:
> >
> > ---8<---
> >
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 41429680c2..9c602a4ac1 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -2055,15 +2055,16 @@ void migrate_init(MigrationState *s)
> >  
> >  int migrate_add_blocker_internal(Error *reason, Error **errp)
> >  {
> > -    if (migration_is_idle()) {
> > -        migration_blockers = g_slist_prepend(migration_blockers, reason);
> > -        return 0;
> > +    /* Snapshots are similar to migrations, so check RUN_STATE_SAVE_VM 
> > too. */
> > +    if (runstate_check(RUN_STATE_SAVE_VM) || !migration_is_idle()) {
> > +        error_propagate_prepend(errp, error_copy(reason),
> > +                                "disallowing migration blocker "
> > +                                "(migration in progress) for: ");
> > +        return -EBUSY;
> >      }
> >  
> > -    error_propagate_prepend(errp, error_copy(reason),
> > -                            "disallowing migration blocker "
> > -                            "(migration in progress) for: ");
> > -    return -EBUSY;
> > +    migration_blockers = g_slist_prepend(migration_blockers, reason);
> > +    return 0;
> >  }
> >  
> >  int migrate_add_blocker(Error *reason, Error **errp)
> > ---8<---
> >
> > It'll by accident also cover guest dump which also sets RUN_STATE_SAVE_VM, 
> > but
> > I think that's ok.
> >
> > Thanks,
> 
> I delayed answering this in the hope of finding the time to think.  No
> luck.  This is a quick answer, hopefully better than nothing and not too
> confused.
> 
> "Snapshot should honor migration blockers" feels like an independent
> issue.  I can't comment on it, because I haven't done my thinking there.

Yes it's independent, but I do think they're similar to live migration.  Let's
wait for Dave or Juan to chim in.

> 
> I'm not sure you fully got Dave's suggestion to "temporarily hold off
> 'completion'".  Let me explain (Dave, please correct misunderstandings,
> if any).
> 
> Migration blockers and migration are mutually exclusive: you can't
> migrate while such blockers are in effect (trying to immediately fails),
> and you can't add such blockers while migration is in progress.
> 
> The temporary blockers Dave suggested do not block migration, only its
> *completion* phase.  This makes sense *only* for short-lived blockers.
> If such temporary blockers are in effect when migration is ready to
> enter its completion phase, that entry is delayed until the temporary
> blockers are gone.  If you try to add a temporary blocker while
> migration is in its completion phase, the add blocks until migration
> exists its completion phase.

I see, that sounds like what a mutex would do (which we'll hold during
completion phase of migration), and it looks reasonable.

But I think it's not a blocker of not having a bigger protection - so far my
approach should be blocking the whole migration.  To be explicit:

  - If we dump-guest-memory during migration (not only completion phase,
    anytime after it starts), it fails dump-guest-mem.

  - If we try to live migrate during guest-dump-memory, it fails migration.

So basically that's a "bigger mutex".  We may want to shrink it further, but
IMHO it can also be done on top (and we need some more work to do to justify
everything will still be okay).

It shouldn't be in a rush, imo, and it should be low priority because we don't
really have those two collide each other, not to mention when dump-guest-memory
during completing of migration.

Thanks,

-- 
Peter Xu




reply via email to

[Prev in Thread] Current Thread [Next in Thread]