On Mon, 27 Feb 2023 13:26:00 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:
On Mon, Feb 27, 2023 at 09:14:44AM -0700, Alex Williamson wrote:
But we have no requirement to send all init_bytes before stop-copy.
This is a hack to achieve a theoretical benefit that a driver might be
able to improve the latency on the target by completing another
iteration.
I think this is another half-step at this point..
The goal is to not stop the VM until the target VFIO driver has
completed loading initial_bytes.
This signals that the time consuming pre-setup is completed in the
device and we don't have to use downtime to do that work.
We've measured this in our devices and the time-shift can be
significant, on the order of seconds removed from the downtime
period.
Stopping the VM before this pre-setup is done is simply extending the
stopped VM downtime.
Really what we want is to have the far side acknowledge that
initial_bytes has completed loading.
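A toy sketch of that gating condition, in Python for brevity. Everything here is hypothetical: the MigrationState fields, the may_stop_vm() helper and in particular the target_initial_bytes_loaded acknowledgement are assumptions for illustration, not part of the vfio migration uAPI or of QEMU today.

```python
from dataclasses import dataclass


@dataclass
class MigrationState:
    # All fields are hypothetical names used only for this sketch.
    pending_bytes: int                 # data still to be sent in stop-copy
    bandwidth_bytes_per_s: float       # measured transfer rate
    downtime_limit_s: float            # configured downtime budget
    target_initial_bytes_loaded: bool  # assumed ack from the far side


def may_stop_vm(state: MigrationState) -> bool:
    """Allow stop-copy only if the remaining data fits the downtime budget
    AND the target has reported that initial_bytes finished loading."""
    estimated_s = state.pending_bytes / state.bandwidth_bytes_per_s
    fits_budget = estimated_s <= state.downtime_limit_s
    return fits_budget and state.target_initial_bytes_loaded
```

With 10 MB pending at 1 GB/s and a 0.3 s budget, may_stop_vm() returns True only once the acknowledgement flag is set; without it the VM keeps running even though the data estimate already fits.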
To remind, what mlx5 is doing here with precopy is time-shifting work,
not data. We want to put expensive work (i.e. time) into the period when
the VM is still running and have less downtime.
This challenges the assumption built into QEMU that all data costs equal
time and that it can estimate downtime simply by scaling the estimated
data. We have a data-size independent time component to deal with as
well.
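To put rough numbers on the difference between the two models, here is a minimal Python sketch. The function names and the example figures (100 MB pending, 1 GB/s, 2 s of fixed setup) are assumptions chosen for illustration; they are not measurements and not QEMU's actual code.

```python
def downtime_scaled(pending_bytes: int, bandwidth_bytes_per_s: float) -> float:
    """Data-only model: downtime is pending data divided by bandwidth."""
    return pending_bytes / bandwidth_bytes_per_s


def downtime_with_fixed_cost(pending_bytes: int,
                             bandwidth_bytes_per_s: float,
                             fixed_setup_s: float) -> float:
    """Same model plus a data-size independent term for device setup that
    was not finished while the VM was still running (hypothetical value)."""
    return downtime_scaled(pending_bytes, bandwidth_bytes_per_s) + fixed_setup_s


if __name__ == "__main__":
    pending = 100_000_000       # 100 MB left for stop-copy (assumed)
    bandwidth = 1_000_000_000   # 1 GB/s (assumed)
    print(downtime_scaled(pending, bandwidth))                # about 0.1 s
    print(downtime_with_fixed_cost(pending, bandwidth, 2.0))  # about 2.1 s
```

The data-only estimate says the migration fits a 300 ms budget, while the fixed term alone blows it, which is the component the scaling model cannot see.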
As I mentioned before, I understand the motivation, but imo the
implementation is exploiting the interface it extended in order to force
a device driven policy which is specifically not a requirement of the
vfio migration uAPI. It sounds like there's more work required in the
QEMU migration interfaces to properly factor this information into the
algorithm. Until then, this seems like a follow-on improvement unless
you can convince the migration maintainers that providing false
information in order to force another pre-copy iteration is a valid use
of passing the threshold value to the driver.