qemu-block

Re: [RFC 1/2] block: Split padded I/O vectors exceeding IOV_MAX


From: Hanna Czenczek
Subject: Re: [RFC 1/2] block: Split padded I/O vectors exceeding IOV_MAX
Date: Thu, 16 Mar 2023 10:43:38 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.7.1

On 15.03.23 19:25, Eric Blake wrote:
On Wed, Mar 15, 2023 at 01:13:29PM +0100, Hanna Czenczek wrote:
When processing vectored guest requests that are not aligned to the
storage request alignment, we pad them by adding head and/or tail
buffers for a read-modify-write cycle.

The guest can submit I/O vectors up to IOV_MAX (1024) in length, but
with this padding, the vector can exceed that limit.  As of
4c002cef0e9abe7135d7916c51abce47f7fc1ee2 ("util/iov: make
qemu_iovec_init_extended() honest"), we refuse to pad vectors beyond the
limit, instead returning an error to the guest.

To the guest, this appears as a random I/O error.  We should not return
an I/O error to the guest when it issued a perfectly valid request.

Before 4c002cef0e9abe7135d7916c51abce47f7fc1ee2, we just made the vector
longer than IOV_MAX, which generally seems to work (because the guest
assumes a smaller alignment than we really have, file-posix's
raw_co_prw() will generally see bdrv_qiov_is_aligned() return false and
thus emulate the request, so the IOV_MAX limit does not matter).  However,
that does not seem exactly great.
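
(For concreteness, a minimal sketch of the limit problem described above – the helper below is invented purely for illustration and is not the actual QEMU code:)

/* Illustration only -- not the actual QEMU helper. */
#include <limits.h>    /* IOV_MAX (1024 on Linux) */
#include <stdbool.h>

/*
 * Padding a vectored request can add one head and one tail entry, so a
 * guest vector that already uses all IOV_MAX entries would grow to 1026.
 */
static bool padded_vector_fits(int guest_niov, bool need_head, bool need_tail)
{
    int padded_niov = guest_niov + (need_head ? 1 : 0) + (need_tail ? 1 : 0);

    return padded_niov <= IOV_MAX;
}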

I see two ways to fix this problem:
1. We split such long requests into two requests.
2. We join some elements of the vector into new buffers to make it
    shorter.

I am wary of (1), because it seems like it may have unintended side
effects.

(2) on the other hand seems relatively simple to implement, with
hopefully few side effects, so this patch does that.
Agreed that approach 2 is more conservative.

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2141964
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
  block/io.c | 139 ++++++++++++++++++++++++++++++++++++++++++++++++++---
  util/iov.c |   4 --
  2 files changed, 133 insertions(+), 10 deletions(-)
+/*
+ * If padding has made the IOV (`pad->local_qiov`) too long (more than IOV_MAX
+ * elements), collapse some elements into a single one so that it adheres to the
+ * IOV_MAX limit again.
+ *
+ * If collapsing, `pad->collapse_buf` will be used as a bounce buffer of length
+ * `pad->collapse_len`.  `pad->collapsed_qiov` will contain the previous entries
+ * (before collapsing), so that bdrv_padding_destroy() can copy the bounce
+ * buffer content back for read requests.
+ *
+ * Note that we will not touch the padding head or tail entries here.  We cannot
+ * move them to a bounce buffer, because for RMWs, both head and tail expect to
+ * be in an aligned buffer with scratch space after (head) or before (tail) to
+ * perform the read into (because the whole buffer must be aligned, but head's
+ * and tail's lengths naturally cannot be aligned, because they provide padding
+ * for unaligned requests).  A collapsed bounce buffer for multiple IOV elements
+ * cannot provide such scratch space.
+ *
+ * Therefore, this function collapses the first IOV elements after the
+ * (potential) head element.
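
(As an aside, a rough sketch of the "aligned buffer with scratch space" layout described for the head entry above – simplified, with an invented helper, not code from the patch:)

/* Illustration only -- simplified, not code from the patch. */
#include <stdlib.h>
#include <sys/uio.h>

/*
 * The head padding entry exposes only the unaligned head bytes, but its
 * allocation spans a whole alignment unit, so the RMW read has aligned
 * scratch space right after those bytes.  A bounce buffer shared with
 * other guest entries could not provide that layout.
 */
static struct iovec make_head_entry(size_t align, size_t head_len)
{
    void *buf = aligned_alloc(align, align);    /* one full aligned unit */
    struct iovec head = {
        .iov_base = buf,        /* padding bytes at the start of the unit */
        .iov_len  = head_len,   /* only head_len bytes go into the qiov */
    };

    return head;
}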
It looks like you blindly pick the first one or two non-padding iovs
at the front of the array.  Would it be any wiser (in terms of less
memmove() action or even a smaller bounce buffer) to pick iovs at the
end of the array, and/or a sequential search for the smallest
neighboring iovs?  Or is that a micro-optimization that costs more
than it saves?

Right, I didn’t think of picking near the end.  That makes sense indeed!  If not for performance, at least it allows dropping the non-trivial comment for the memmove().

As for searching for the smallest buffers, I’m not sure.  I think it can make sense performance-wise – iterating over 1024 elements will probably pay off quickly when the buffer sizes actually differ.  My main concern is that it would be more complicated, and I just don’t think that’s worth it for such a rare case.
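
(For illustration, the mechanics of collapsing two entries at the end into one bounce entry might look roughly like the sketch below – the names are made up, it ignores the tail padding entry that would have to stay last, and the real patch additionally keeps the original entries around so reads can be copied back:)

/* Illustration only -- not the patch code. */
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>

/*
 * Merge the last two entries of an iovec array into a single bounce-buffer
 * entry, shrinking the array by one.  For writes, the data must be copied
 * into the bounce buffer up front; for reads, it has to be copied back into
 * the original entries once the request completes (not shown here).
 */
static void collapse_last_two(struct iovec *iov, unsigned *niov, bool is_write)
{
    struct iovec *a = &iov[*niov - 2];
    struct iovec *b = &iov[*niov - 1];
    size_t len = a->iov_len + b->iov_len;
    char *bounce = malloc(len);

    if (is_write) {
        memcpy(bounce, a->iov_base, a->iov_len);
        memcpy(bounce + a->iov_len, b->iov_base, b->iov_len);
    }

    a->iov_base = bounce;
    a->iov_len = len;
    *niov -= 1;
}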

Would it be any easier to swap the order of padding vs. collapsing?
That is, we already know the user is giving us a long list of iovs; if
it is 1024 elements long, and we can detect that padding will be
needed, should we collapse before padding instead of padding, finding
that we now have 1026, and memmove'ing back into 1024?

I’d prefer that, but it’s difficult.  We need the temporary QIOV (pad->local_qiov) so we can merge entries, but this is only created by qemu_iovec_init_extended().

We can try to move the collapsing into qemu_iovec_init_extended(), but it would still be a bit awkward, and would probably blow up that function’s interface (it’s in util/iov.c, so we can’t really use the BdrvRequestPadding object there directly).  I think, in the end, not much would change functionally, so I’d rather keep the order as it is (unless someone has a good idea here).

But logic-wise, your patch looks correct to me.

Reviewed-by: Eric Blake <eblake@redhat.com>

Thanks!

Hanna



