qemu-devel


From: Alex Williamson
Subject: Re: [PATCH 4/9] vfio/migration: Skip pre-copy if dirty page tracking is not supported
Date: Tue, 17 May 2022 11:22:32 -0600

On Tue, 17 May 2022 13:08:44 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, May 17, 2022 at 10:00:45AM -0600, Alex Williamson wrote:
> 
> > > This is really intended to be a NOP from where things are now, as if
> > > you use mlx5 live migration without a patch like this then it causes a
> > > botched pre-copy since everything just ends up permanently dirty.
> > > 
> > > If it makes more sense we could abort the pre-copy too - in the end
> > > there will be dirty tracking so I don't know if I'd invest in a big
> > > adventure to fully define non-dirty tracking migration.  
> > 
> > How is pre-copy currently "botched" without a patch like this?  If it's
> > simply that the pre-copy doesn't converge and the downtime constraints
> > don't allow the VM to enter stop-and-copy, that's the expected behavior
> > AIUI, and supports backwards compatibility with existing SLAs.  
> 
> It means it always fails - that certainly isn't working live
> migration. There is no point in trying to converge something that we
> already know will never converge.

If we eliminate the pre-copy phase then it's not so much live migration
anyway.  Trying to converge is indeed useless work, but afaik it's that
useless work that generates the data that management tools can use to
determine that SLAs cannot be achieved in a compatible way.
 
> > I'm assuming that by setting this new skip_precopy flag that we're
> > forcing the VM to move to stop-and-copy, regardless of any other SLA
> > constraints placed on the migration.    
> 
> That does seem like a defect in this patch, any SLA constraints should
> still all be checked under the assumption all ram is dirty.

The migration iteration function certainly tries to compare remaining
bytes to a threshold based on bandwidth and downtime.  The exit path
added here is the same as it would take if we had achieved our
threshold limit.  It's not clear to me that we're checking the downtime
limit elsewhere, or that we have the data to do so if we don't transfer
anything from which to estimate bandwidth.

> > It seems like a better solution would be to expose to management
> > tools that the VM contains a device that does not support the
> > pre-copy phase so that downtime expectations can be adjusted.  
> 
> I don't expect this to be a real use case though..
> 
> Remember, you asked for this patch when you wanted qemu to have good
> behavior when kernel support for legacy dirty tracking is removed
> before we merge v2 support.

Is wanting good behavior a controversial point?  Did we define this as
the desired good behavior?  Ref?  Thanks,

Alex



