[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH] coroutine: resize pool periodically instead of limiting size
From: |
Stefan Hajnoczi |
Subject: |
Re: [PATCH] coroutine: resize pool periodically instead of limiting size |
Date: |
Thu, 9 Sep 2021 16:51:44 +0100 |
On Thu, Sep 09, 2021 at 09:37:20AM +0100, Daniel P. Berrangé wrote:
> On Wed, Sep 01, 2021 at 05:09:23PM +0100, Stefan Hajnoczi wrote:
> > It was reported that enabling SafeStack reduces IOPS significantly
> > (>25%) with the following fio benchmark on virtio-blk using a NVMe host
> > block device:
> >
> > # fio --rw=randrw --bs=4k --iodepth=64 --runtime=1m --direct=1 \
> > --filename=/dev/vdb --name=job1 --ioengine=libaio --thread \
> > --group_reporting --numjobs=16 --time_based \
> > --output=/tmp/fio_result
> >
> > Serge Guelton and I found that SafeStack is not really at fault, it just
> > increases the cost of coroutine creation. This fio workload exhausts the
> > coroutine pool and coroutine creation becomes a bottleneck. Previous
> > work by Honghao Wang also pointed to excessive coroutine creation.
> >
> > Creating new coroutines is expensive due to allocating new stacks with
> > mmap(2) and mprotect(2). Currently there are thread-local and global
> > pools that recycle old Coroutine objects and their stacks but the
> > hardcoded size limit of 64 for thread-local pools and 128 for the global
> > pool is insufficient for the fio benchmark shown above.
>
> Rather than keeping around a thread local pool of coroutine
> instances, did you ever consider keeping around a pool of
> allocated stacks ? Essentially it seems like you're syaing
> the stack allocation is the problem due to it using mmap()
> instead of malloc() and thus not benefiting from any of the
> performance tricks malloc() impls use to avoid repeated
> syscalls on every allocation. If 'qemu_alloc_stack' and
> qemu_free_stack could be made more intelligent by caching
> stacks, then perhaps the coroutine side can be left "dumb" ?
What is the advantage of doing that? Then the Coroutine struct needs to
be malloced each time. Coroutines are the only users of
qemu_alloc_stack(), so I think pooling the Coroutines is optimal.
> >
> > This patch changes the coroutine pool algorithm to a simple thread-local
> > pool without a size limit. Threads periodically shrink the pool down to
> > a size sufficient for the maximum observed number of coroutines.
> >
> > This is a very simple algorithm. Fancier things could be done like
> > keeping a minimum number of coroutines around to avoid latency when a
> > new coroutine is created after a long period of inactivity. Another
> > thought is to stop the timer when the pool size is zero for power saving
> > on threads that aren't using coroutines. However, I'd rather not add
> > bells and whistles unless they are really necessary.
> >
> > The global pool is removed by this patch. It can help to hide the fact
> > that local pools are easily exhausted, but it's doesn't fix the root
> > cause. I don't think there is a need for a global pool because QEMU's
> > threads are long-lived, so let's keep things simple.
>
> QEMU's threads may be long-lived, but are the workloads they service
> expected to be consistent over time?
>
> eg can we ever get a situation where Thread A has a peak of activity
> causing caching of many coroutines, and then go idle for a long time,
> while Thread B then has a peak and is unable to take advantage of the
> cache from Thread A that is now unused ?
It's possible to trigger that case, but it's probably not a real-world
scenario. It requires a storage controller that is emulated in the vCPU
thread (like AHCI) and a workload that alternates between vCPUs.
However, that configuration isn't expected to perform well anyway
(virtio-blk and virtio-scsi use IOThreads instead of running emulation
in vCPU threads because this reduces the cost of the vmexit).
Stefan
signature.asc
Description: PGP signature