From: Roman Kagan
Subject: Re: [PATCH 1/4] block/nbd: fix drain dead-lock because of nbd reconnect-delay
Date: Wed, 3 Feb 2021 13:53:05 +0300

On Thu, Sep 03, 2020 at 10:02:58PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> We pause the reconnect process during a drained section. So, if we have
> some requests waiting for reconnect, we should cancel them, otherwise
> they deadlock the drained section.
> 
> How to reproduce:
> 
> 1. Create an image:
>    qemu-img create -f qcow2 xx 100M
> 
> 2. Start NBD server:
>    qemu-nbd xx
> 
> 3. Start the VM with a second NBD disk on node2, like this:
> 
>   ./build/x86_64-softmmu/qemu-system-x86_64 -nodefaults -drive \
>      file=/work/images/cent7.qcow2 -drive \
>      driver=nbd,server.type=inet,server.host=192.168.100.5,server.port=10809,reconnect-delay=60 \
>      -vnc :0 -m 2G -enable-kvm -vga std
> 
> 4. Access the VM through VNC (or some other way?), and check that the
>    NBD drive works:
> 
>    dd if=/dev/sdb of=/dev/null bs=1M count=10
> 
>    - the command should succeed.
> 
> 5. Now, kill the nbd server, and run dd in the guest again:
> 
>    dd if=/dev/sdb of=/dev/null bs=1M count=10
> 
> Now QEMU is trying to reconnect, and the dd-generated requests are
> waiting for the connection (they will wait up to 60 seconds (see the
> reconnect-delay option above) and then fail). But suddenly the VM may
> hang completely in the deadlock. You may need to increase the
> reconnect-delay period to catch the deadlock.
> 
> The VM doesn't respond because the drain deadlock happens in the CPU
> thread with the global mutex taken. That's not a good thing by itself
> and is not fixed by this commit (the real fix is to use iothreads).
> Still, this commit fixes the drain deadlock itself.
> 
> Note: we could probably instead continue to reconnect during the
> drained section. To achieve this, we may move negotiation to the
> connect thread to make it independent of the bs aio context. But
> expanding the drained section doesn't seem good anyway. So, for now,
> let's fix the bug in the simplest way.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  block/nbd.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/block/nbd.c b/block/nbd.c
> index 9daf003bea..912ea27be7 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -242,6 +242,11 @@ static void coroutine_fn nbd_client_co_drain_begin(BlockDriverState *bs)
>      }
>  
>      nbd_co_establish_connection_cancel(bs, false);
> +
> +    if (s->state == NBD_CLIENT_CONNECTING_WAIT) {
> +        s->state = NBD_CLIENT_CONNECTING_NOWAIT;
> +        qemu_co_queue_restart_all(&s->free_sema);
> +    }
>  }
>  
>  static void coroutine_fn nbd_client_co_drain_end(BlockDriverState *bs)

This basically defeats the whole purpose of reconnect: if the NBD client
is trying to reconnect, drain effectively cancels that and makes all
in-flight requests complete with an error.
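
To make the effect concrete, here is a minimal standalone sketch (not
QEMU code; only the state names and the comments mirror block/nbd.c, the
rest is illustrative) of what the new drain_begin hunk does to a request
that is parked waiting for the reconnect:

/* toy model, illustrative only, not QEMU code */
#include <stdio.h>

typedef enum {
    NBD_CLIENT_CONNECTED,
    NBD_CLIENT_CONNECTING_WAIT,   /* queue requests and wait for reconnect */
    NBD_CLIENT_CONNECTING_NOWAIT, /* fail requests instead of waiting */
    NBD_CLIENT_QUIT,
} NBDClientState;

static NBDClientState state = NBD_CLIENT_CONNECTING_WAIT;

/* What the patched nbd_client_co_drain_begin() boils down to. */
static void drain_begin(void)
{
    if (state == NBD_CLIENT_CONNECTING_WAIT) {
        state = NBD_CLIENT_CONNECTING_NOWAIT;
        /* in QEMU: qemu_co_queue_restart_all(&s->free_sema); */
    }
}

/* What a woken-up queued request effectively does. */
static int request_after_wakeup(void)
{
    return state == NBD_CLIENT_CONNECTED ? 0  /* proceed */
                                         : -1 /* fail with an error */;
}

int main(void)
{
    drain_begin();
    printf("request result after drain_begin: %d\n", request_after_wakeup());
    return 0;
}

Before the patch such a request would stay queued on free_sema with the
reconnect paused for the whole drained section, which is the deadlock
the commit message describes; after it, the request is woken and fails,
so the drain can finish.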

I'm not suggesting reverting this patch (it's now in the tree as commit
8c517de24a), because the deadlock is no better, but I'm afraid the only
real fix is to implement reconnect during the drained section.  I'm
still trying to get my head around it, so no patch yet, but I just
wanted to bring this up in case anybody beats me to it.

Thanks,
Roman.


