
Re: [PATCH v2 03/20] vfio/migration: Add VFIO migration pre-copy support


From: Jason Gunthorpe
Subject: Re: [PATCH v2 03/20] vfio/migration: Add VFIO migration pre-copy support
Date: Mon, 6 Mar 2023 15:01:30 -0400

On Wed, Mar 01, 2023 at 03:39:17PM -0700, Alex Williamson wrote:
> On Wed, 1 Mar 2023 17:12:51 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Mar 01, 2023 at 12:55:59PM -0700, Alex Williamson wrote:
> > 
> > > So it seems like what we need here is both a preface buffer size and a
> > > target device latency.  The QEMU pre-copy algorithm should factor both
> > > the remaining data size and the device latency into deciding when to
> > > transition to stop-copy, thereby allowing the device to feed actually
> > > relevant data into the algorithm rather than dictate its behavior.  
> > 
> > I don't know that we can realistically estimate startup latency,
> > especially having the sender estimate latency on the receiver.
> 
> Knowing that the target device is compatible with the source is a point
> towards making an educated guess.
> 
> > I feel like trying to overlap the device start-up with the STOP
> > phase is an unnecessary optimization? How do you see it benefiting?
> 
> If we can't guarantee that there's some time difference between sending
> initial bytes immediately at the end of pre-copy vs immediately at the
> beginning of stop-copy, does that mean any handling of initial bytes is
> an unnecessary optimization?

Sure, if the device doesn't implement an initial_bytes startup phase
then it is all pointless, but those devices should probably return 0
for initial_bytes? If we see a non-zero initial_bytes and assume it
indicates a startup phase, why not act on it?
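
For reference, a rough sketch of what querying that looks like on the
data_fd while in PRE_COPY (simplified, no real error handling; this
assumes the VFIO_MIG_GET_PRECOPY_INFO uAPI and is not the patch itself):

#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/*
 * Sketch only: ask the device how much start-up ("preface") data it has
 * queued while the migration data_fd is in PRE_COPY.  A device with no
 * start-up phase would be expected to report initial_bytes == 0.
 */
static int query_initial_bytes(int data_fd, uint64_t *initial_bytes)
{
    struct vfio_precopy_info info;

    memset(&info, 0, sizeof(info));
    info.argsz = sizeof(info);

    if (ioctl(data_fd, VFIO_MIG_GET_PRECOPY_INFO, &info)) {
        return -errno;
    }

    *initial_bytes = info.initial_bytes;
    return 0;
}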

> I'm imagining that completing initial bytes triggers some
> initialization sequence in the target host driver which runs in
> parallel to the remaining data stream, so in practice, even if sent at
> the beginning of stop-copy, the target device gets a head start.

It isn't parallel in mlx5. The load operation of the initial bytes on
the receiver will execute the load command, and that command will take
an amount of time roughly proportional to how much data is in the
device. IIRC the mlx5 VFIO driver will block the read until this
finishes.

It is convoluted, but it is ultimately allocating (potentially a lot
of) pages in the hypervisor kernel, so the time predictability is not
very good.

Other device types we are looking at might make network connections at
this step - eg a storage device might open a network connection to its
back end. This could be unpredictably long in degenerate cases.

> > I've been thinking of this from the perspective that we should always
> > ensure device startup is completed; it is time that has to be paid,
> > so why pay it during STOP?
> 
> Creating a policy for QEMU to send initial bytes in a given phase
> doesn't ensure startup is complete.  There's no guaranteed time
> difference between sending that data and the beginning of stop-copy.

As I've said, to really do a good job here we want to have the sender
wait until the receiver completes startup, and not just treat it as a
unidirectional byte-stream. That isn't what this patch does, though.

> QEMU is trying to achieve a downtime goal, where it estimates network
> bandwidth to get a data size threshold, and then polls devices for
> remaining data.  That downtime goal might exceed the startup latency of
> the target device anyway, where it's then the operator's choice to pay
> that time in stop-copy, or stalled on the target.

If you are saying there should be a policy flag ('optimize for total
migration time' vs 'optimize for minimum downtime'), that seems
reasonable, though I wonder who would pick the first option.
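
For context, the convergence test described in the quoted paragraph is
roughly the following (names are illustrative, not QEMU's actual code):

#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative only: switch to stop-copy once the estimated transfer
 * time for the remaining data fits within the downtime limit.  Note
 * that device start-up latency on the receiver does not appear in this
 * test at all.
 */
static bool ready_for_stop_copy(uint64_t pending_bytes,
                                uint64_t bandwidth_bytes_per_ms,
                                uint64_t downtime_limit_ms)
{
    uint64_t threshold_bytes = bandwidth_bytes_per_ms * downtime_limit_ms;

    return pending_bytes <= threshold_bytes;
}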
 
> But if we actually want to ensure startup of the target is complete,
> then drivers should be able to return both data size and estimated time
> for the target device to initialize.  That time estimate should be
> updated by the driver based on if/when initial_bytes is drained.  The
> decision whether to continue iterating pre-copy would then be based on
> both the maximum remaining device startup time and the calculated time
> based on remaining data size.

That seems complicated. Why not just wait for the other side to
acknowledge it has started the device? Then we aren't trying to guess.
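
For comparison, the heuristic in the quoted paragraph would look
something like this (all names hypothetical; it still only guesses at
the start-up time instead of getting an acknowledgement):

#include <stdbool.h>
#include <stdint.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))

/*
 * Hypothetical sketch of the proposal above: keep iterating pre-copy
 * until both the data-based estimate and the driver's reported
 * start-up time fit in the downtime budget.  Whether the two overlap
 * or add is exactly the kind of guesswork in question.
 */
static bool ready_for_stop_copy_with_startup(uint64_t pending_bytes,
                                             uint64_t bandwidth_bytes_per_ms,
                                             uint64_t device_startup_ms,
                                             uint64_t downtime_limit_ms)
{
    uint64_t transfer_ms = pending_bytes / MAX(bandwidth_bytes_per_ms, 1);

    return MAX(transfer_ms, device_startup_ms) <= downtime_limit_ms;
}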

AFAIK waiting on the receiver sort of happens implicitly in this patch
because, once initial_bytes is pushed, the data that follows it will
block on the pending load and the single socket will backpressure until
the load is done. Horrible, yes, but it is where QEMU is at. multi-fd
is really important :)
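
Concretely, the receiver-side flow that produces that backpressure is
something like this (illustrative only, not QEMU or driver code):

#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Illustrative: the write of the start-up preface into the device may
 * block for as long as the device load takes, and only afterwards do
 * we return to the socket for the rest of the stream, so the sender's
 * writes back up on the single socket for the whole load duration.
 */
static int receive_vfio_state(int sock_fd, int data_fd,
                              const void *preface, size_t preface_len)
{
    char buf[65536];
    ssize_t n;

    /* Load the preface; this is where the long, blocking load runs. */
    if (write(data_fd, preface, preface_len) < 0) {
        return -errno;
    }

    /* Only now drain the rest of the migration stream. */
    while ((n = read(sock_fd, buf, sizeof(buf))) > 0) {
        if (write(data_fd, buf, n) < 0) {
            return -errno;
        }
    }

    return n < 0 ? -errno : 0;
}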

Jason


