qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH 1/2] migration: Fix rdma migration failed


From: Fabiano Rosas
Subject: Re: [PATCH 1/2] migration: Fix rdma migration failed
Date: Fri, 22 Sep 2023 12:59:37 -0300

Peter Xu <peterx@redhat.com> writes:

> On Wed, Sep 20, 2023 at 05:04:11PM +0800, Li Zhijian wrote:
>> From: Li Zhijian <lizhijian@cn.fujitsu.com>
>> 
>> Destination will fail with:
>> qemu-system-x86_64: rdma: Too many requests in this message 
>> (3638950032).Bailing.
>> 
>> migrate with RDMA is different from tcp. RDMA has its own control
>> message, and all traffic between RDMA_CONTROL_REGISTER_REQUEST and
>> RDMA_CONTROL_REGISTER_FINISHED should not be disturbed.
>> 
>> find_dirty_block() will be called during RDMA_CONTROL_REGISTER_REQUEST
>> and RDMA_CONTROL_REGISTER_FINISHED, it will send a extra traffic to
>> destination and cause migration to fail.
>> 
>> Since there's no existing subroutine to indicate whether it's migrated
>> by RDMA or not, and RDMA is not compatible with multifd, we use
>> migrate_multifd() here.
>> 
>> Fixes: 294e5a4034 ("multifd: Only flush once each full round of memory")
>> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
>> ---
>>  migration/ram.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>> 
>> diff --git a/migration/ram.c b/migration/ram.c
>> index 9040d66e61..89ae28e21a 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -1399,7 +1399,8 @@ static int find_dirty_block(RAMState *rs, 
>> PageSearchStatus *pss)
>>          pss->page = 0;
>>          pss->block = QLIST_NEXT_RCU(pss->block, next);
>>          if (!pss->block) {
>> -            if (!migrate_multifd_flush_after_each_section()) {
>> +            if (migrate_multifd() &&
>> +                !migrate_multifd_flush_after_each_section()) {
>>                  QEMUFile *f = rs->pss[RAM_CHANNEL_PRECOPY].pss_channel;
>>                  int ret = multifd_send_sync_main(f);
>>                  if (ret < 0) {
>> -- 
>> 2.31.1
>> 
>
> Maybe better to put that check at the entry of
> migrate_multifd_flush_after_each_section()?
>
> I also hope that some day there's no multifd function called in generic
> migration code paths..

I wonder what happened with that MigrationOps idea. We added the
ram_save_target_page pointer and nothing else. It seems like it could be
something in the direction of allowing different parts of the migration
code to provide different behavior without having to put these explicit
checks all over the place.

And although we're removing the QEMUFile hooks, those also looked like
they could help mitigate these cross-layer interactions. (I'm NOT
advocating bringing them back).



reply via email to

[Prev in Thread] Current Thread [Next in Thread]