From: 王洪浩
Subject: Re: [External] Re: [PATCH 2/2] coroutine: take exactly one batch from global pool at a time
Date: Wed, 26 Aug 2020 14:06:19 +0800

The purpose of this patch is to improve performance without increasing
memory consumption.

My test case:
QEMU command line arguments:
-drive file=/dev/nvme2n1p1,format=raw,if=none,id=local0,cache=none,aio=native \
    -device virtio-blk,id=blk0,drive=local0,iothread=iothread0,num-queues=4 \
-drive file=/dev/nvme3n1p1,format=raw,if=none,id=local1,cache=none,aio=native \
    -device virtio-blk,id=blk1,drive=local1,iothread=iothread1,num-queues=4 \

Run these two fio jobs at the same time:
[job-vda]
filename=/dev/vda
iodepth=64
ioengine=libaio
rw=randrw
bs=4k
size=300G
rwmixread=80
direct=1
numjobs=2
runtime=60

[job-vdb]
filename=/dev/vdb
iodepth=64
ioengine=libaio
rw=randrw
bs=4k
size=300G
rwmixread=90
direct=1
numjobs=2
loops=1
runtime=60

Without this patch (3 runs):
total iops: 278548.1, 312374.1, 276638.2
With this patch (3 runs):
total iops: 368370.9, 335693.2, 327693.1

An 18.9% improvement on average.

In addition, we also use a distributed block storage backend whose I/O
latency is much higher than that of local NVMe devices because of the
network overhead, so it needs a higher iodepth (>= 256) to reach its
maximum throughput.
Without this patch, there is a more than 5% chance of calling
`qemu_coroutine_new` and the IOPS stays below 100K; with this patch,
the IOPS is about 260K.

On the other hand, a simpler way to reduce or eliminate the cost of
`qemu_coroutine_new` would be to increase POOL_BATCH_SIZE, but that
would also bring much more memory consumption, which we don't want.
That is the point of this patch.
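
To make the intended structure concrete, below is a minimal, self-contained
sketch of the idea in C11 atomics (not the actual QEMU patch): a global
lock-free stack whose elements are whole batches of POOL_BATCH_SIZE
coroutines, plus a thread-local batch so the common allocation/release paths
touch no atomics. The names coroutine_t, batch_t, co_alloc, co_release and
the helper coroutine_new are placeholders invented for this sketch, and a
production version would additionally have to guard the pop path against the
ABA problem.

/*
 * Sketch only: global lock-free (Treiber) stack of fixed-size coroutine
 * batches, with a per-thread batch for atomic-free fast paths.
 */
#include <stdatomic.h>
#include <stdlib.h>

#define POOL_BATCH_SIZE 16

typedef struct coroutine { struct coroutine *next; } coroutine_t;

typedef struct batch {
    struct batch *next;                      /* link in the global stack */
    unsigned count;                          /* coroutines currently held */
    coroutine_t *items[POOL_BATCH_SIZE];
} batch_t;

static _Atomic(batch_t *) global_stack;      /* stack of *full* batches */
static __thread batch_t *local_batch;        /* thread-local pool, no atomics */

static coroutine_t *coroutine_new(void)      /* stand-in for qemu_coroutine_new */
{
    return calloc(1, sizeof(coroutine_t));
}

/* Slow release path: hand a full batch back to the global stack. */
static void global_push(batch_t *b)
{
    batch_t *head = atomic_load_explicit(&global_stack, memory_order_relaxed);
    do {
        b->next = head;
    } while (!atomic_compare_exchange_weak_explicit(&global_stack, &head, b,
                                                    memory_order_release,
                                                    memory_order_relaxed));
}

/* Slow allocation path: take exactly one whole batch from the global stack. */
static batch_t *global_pop(void)
{
    batch_t *head = atomic_load_explicit(&global_stack, memory_order_acquire);
    while (head &&
           !atomic_compare_exchange_weak_explicit(&global_stack, &head,
                                                  head->next,
                                                  memory_order_acquire,
                                                  memory_order_acquire)) {
        /* CAS failed; retry with the updated head */
    }
    return head;
}

/* Fast path: serve from the thread-local batch; refill only when empty. */
coroutine_t *co_alloc(void)
{
    if (!local_batch || local_batch->count == 0) {
        batch_t *b = global_pop();
        if (!b) {
            return coroutine_new();          /* pools exhausted */
        }
        free(local_batch);
        local_batch = b;
    }
    return local_batch->items[--local_batch->count];
}

/* Fast path: return to the thread-local batch; push only when it is full. */
void co_release(coroutine_t *co)
{
    if (!local_batch) {
        local_batch = calloc(1, sizeof(batch_t));
    }
    if (local_batch->count == POOL_BATCH_SIZE) {
        global_push(local_batch);
        local_batch = calloc(1, sizeof(batch_t));
    }
    local_batch->items[local_batch->count++] = co;
}

The real patch of course builds on QEMU's existing coroutine pool code rather
than on standalone structures like these; the sketch only illustrates why the
fast paths need no atomics and why memory stays bounded at roughly
POOL_BATCH_SIZE coroutines per thread plus whatever sits in the global stack.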

Stefan Hajnoczi <stefanha@redhat.com> wrote on Tue, Aug 25, 2020 at 10:52 PM:
>
> On Mon, Aug 24, 2020 at 12:31:21PM +0800, wanghonghao wrote:
> > This patch replaces the global coroutine queue with a lock-free stack whose
> > elements are coroutine queues. Threads can push coroutine queues onto the
> > stack or take queues from it, and each coroutine queue holds exactly
> > POOL_BATCH_SIZE coroutines. Note that the stack is not strictly LIFO, but
> > that is enough for a buffer pool.
> >
> > Coroutines are put into thread-local pools first on release. Now the fast
> > paths of both allocation and release are atomic-free, and there won't be
> > too many coroutines remaining in a single thread since POOL_BATCH_SIZE has
> > been reduced to 16.
> >
> > In practice, I've run a VM with two block devices bound to two different
> > iothreads, and run fio with iodepth 128 on each device. Without this patch
> > it maintains around 400 coroutines and has about a 1% chance of calling
> > `qemu_coroutine_new`. With this patch, it maintains no more than 273
> > coroutines and doesn't call `qemu_coroutine_new` after the initial
> > allocations.
>
> Does throughput or IOPS change?
>
> Is the main purpose of this patch to reduce memory consumption?
>
> Stefan


