qemu-devel

From: Alex Williamson
Subject: Re: [PATCH v2 03/20] vfio/migration: Add VFIO migration pre-copy support
Date: Wed, 1 Mar 2023 12:55:59 -0700

On Wed, 1 Mar 2023 20:49:28 +0200
Avihai Horon <avihaih@nvidia.com> wrote:

> On 27/02/2023 19:43, Alex Williamson wrote:
> >
> > On Mon, 27 Feb 2023 13:26:00 -0400
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >  
> >> On Mon, Feb 27, 2023 at 09:14:44AM -0700, Alex Williamson wrote:
> >>  
> >>> But we have no requirement to send all init_bytes before stop-copy.
> >>> This is a hack to achieve a theoretical benefit that a driver might be
> >>> able to improve the latency on the target by completing another
> >>> iteration.  
> >> I think this is another half-step at this point..
> >>
> >> The goal is to not stop the VM until the target VFIO driver has
> >> completed loading initial_bytes.
> >>
> >> This signals that the time consuming pre-setup is completed in the
> >> device and we don't have to use downtime to do that work.
> >>
> >> We've measured this in our devices and the time-shift can be
> >> significant, like seconds levels of time removed from the downtime
> >> period.
> >>
> >> Stopping the VM before this pre-setup is done is simply extending the
> >> stopped VM downtime.
> >>
> >> Really what we want is to have the far side acknowledge that
> >> initial_bytes has completed loading.
> >>
> >> To remind, what mlx5 is doing here with precopy is time-shifting work,
> >> not data. We want to put expensive work (ie time) into the period when
> >> the VM is still running and have less downtime.
> >>
> >> This challenges the assumption built into QEMU that all data has equal
> >> time cost and that downtime can be estimated simply by scaling the
> >> estimated data. We have a data-size-independent time component to deal
> >> with as well.  
> > As I mentioned before, I understand the motivation, but imo the
> > implementation is exploiting the interface it extended in order to force
> > a device driven policy which is specifically not a requirement of the
> > vfio migration uAPI.  It sounds like there's more work required in the
> > QEMU migration interfaces to properly factor this information into the
> > algorithm.  Until then, this seems like a follow-on improvement unless
> > you can convince the migration maintainers that providing false
> > information in order to force another pre-copy iteration is a valid use
> > of passing the threshold value to the driver.  
> 
> In my previous message I suggested dropping this exploit and instead 
> changing the QEMU migration API to introduce the concept of pre-copy 
> initial bytes -- data that must be transferred before the source VM 
> stops (which is different from the current @must_precopy, which 
> represents data that can be transferred even while the VM is stopped).
> We could do it by adding a new parameter "init_precopy_size" to the 
> state_pending_{estimate,exact} handlers, and every migration user 
> (RAM, block, etc.) could use it.
> We would also change the migration algorithm to take this new 
> parameter into account when deciding to move to stop-copy.
> 
> Of course this will have to be approved by migration maintainers first, 
> but if it's done in a standard way such as above, via the migration API, 
> would it be OK by you to go this way?

I still think we're conflating information and requirements by allowing
a device to impose a policy which keeps QEMU in pre-copy.  AIUI, what
we're trying to do is maximize the time separation between the
initial_bytes from the device and the end-of-stream.  But knowing the
data size of initial_bytes is not really all that useful.

If we think about the limits of network bandwidth, all data transfers
approach zero time, but the startup latency of the target device that
we're trying to maximize here is fixed.  By prioritizing initial_bytes,
we're separating in space the beginning of target device setup from the
end-of-stream, but that's only an approximation of time, which is what
QEMU really needs to know to honor downtime requirements.

So it seems like what we need here is both a preface buffer size and a
target device latency.  The QEMU pre-copy algorithm should factor both
the remaining data size and the device latency into deciding when to
transition to stop-copy, thereby allowing the device to feed actually
relevant data into the algorithm rather than dictate its behavior.
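As a rough illustration of that decision, a stop-copy predicate that folds a fixed device setup latency in alongside the usual data-size estimate might look like the following. All names and the formula are illustrative assumptions, not QEMU's actual migration algorithm:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch: decide whether to transition to stop-copy by
 * estimating total downtime as transfer time plus a data-size
 * independent device setup latency reported by the driver. */
static bool should_stop_copy(uint64_t remaining_bytes,
                             uint64_t bandwidth_bytes_per_s,
                             uint64_t device_latency_ms,
                             uint64_t downtime_limit_ms)
{
    /* Time to move the remaining data at the measured bandwidth. */
    uint64_t transfer_ms = remaining_bytes * 1000 / bandwidth_bytes_per_s;
    /* The device latency term is fixed: as bandwidth grows,
     * transfer_ms approaches zero but this component remains. */
    return transfer_ms + device_latency_ms <= downtime_limit_ms;
}
```

Note how this captures the point about the limits of network bandwidth: with a small remaining_bytes but a large device_latency_ms, the predicate still refuses to stop the VM, because the target's startup latency alone would exceed the downtime budget.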
Thanks,

Alex
