From: Vladimir Sementsov-Ogievskiy
Subject: Re: [PATCH 2/4] block: Split padded I/O vectors exceeding IOV_MAX
Date: Mon, 20 Mar 2023 13:31:13 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.4.2

On 17.03.23 20:50, Hanna Czenczek wrote:
When processing vectored guest requests that are not aligned to the
storage request alignment, we pad them by adding head and/or tail
buffers for a read-modify-write cycle.
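For illustration: with a 4 KiB request alignment, a guest request covering
bytes [512, 1536) gets a 512-byte head and a 2560-byte tail, so that the
padded request covers exactly the one aligned 4 KiB block overlapping it.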

The guest can submit I/O vectors up to IOV_MAX (1024) in length, but
with this padding, the vector can exceed that limit.  As of
4c002cef0e9abe7135d7916c51abce47f7fc1ee2 ("util/iov: make
qemu_iovec_init_extended() honest"), we refuse to pad vectors beyond the
limit, instead returning an error to the guest.
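For context, the check added by that commit amounts to something like this
(sketch from memory; the exact variable name may differ):

    if (total_niov > IOV_MAX) {
        /* refuse to build a vector with more than IOV_MAX elements */
        return -EINVAL;
    }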

To the guest, this appears as a random I/O error.  We should not return
an I/O error to the guest when it issued a perfectly valid request.

Before 4c002cef0e9abe7135d7916c51abce47f7fc1ee2, we just made the vector
longer than IOV_MAX, which generally seemed to work (because the guest
assumes a smaller alignment than we really have, file-posix's
raw_co_prw() will generally see bdrv_qiov_is_aligned() return false and
thus emulate the request, so that IOV_MAX does not matter).  However,
relying on that does not seem exactly great.

I see two ways to fix this problem:
1. We split such long requests into two requests.
2. We join some elements of the vector into new buffers to make it
    shorter.

I am wary of (1), because it seems like it may have unintended side
effects.

(2) on the other hand seems relatively simple to implement, with
hopefully few side effects, so this patch does that.
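To illustrate the idea behind (2), a minimal sketch of the collapsing step
(my illustration only, not the patch code; @align and @collapse_count are
hypothetical inputs here, and the write direction is shown):

    /* Merge the first collapse_count elements of @iov into one aligned
     * bounce buffer so the padded vector fits into IOV_MAX */
    size_t collapse_len = 0;
    for (int i = 0; i < collapse_count; i++) {
        collapse_len += iov[i].iov_len;
    }
    void *bounce = qemu_memalign(align, collapse_len);
    size_t off = 0;
    for (int i = 0; i < collapse_count; i++) {
        /* for writes, fill the bounce buffer up front; for reads, the
         * data is instead copied back after the request completes */
        memcpy((char *)bounce + off, iov[i].iov_base, iov[i].iov_len);
        off += iov[i].iov_len;
    }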

To do this, the use of qemu_iovec_init_extended() in bdrv_pad_request()
is effectively replaced by the new function bdrv_create_padded_qiov(),
which not only wraps the request IOV with padding head/tail, but also
ensures that the resulting vector will not have more than IOV_MAX
elements.  Putting that functionality into qemu_iovec_init_extended() is
infeasible, because collapsing elements requires allocating a bounce buffer;
supporting that would require many more parameters (buffer alignment, how to
initialize the buffer, and out parameters like the buffer, its length, and the
original elements), which is not reasonable.
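Just to illustrate the parameter explosion that would entail (a purely
hypothetical signature, not proposed anywhere):

    int qemu_iovec_init_extended_collapsing(QEMUIOVector *qiov,
                                            void *head_buf, size_t head_len,
                                            struct iovec *mid_iov, int mid_niov,
                                            size_t mid_offset, size_t mid_len,
                                            void *tail_buf, size_t tail_len,
                                            size_t collapse_align, bool is_write,
                                            void **out_collapse_buf,
                                            size_t *out_collapse_len,
                                            QEMUIOVector *out_pre_collapse_qiov);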

Conversely, it is not difficult to move qemu_iovec_init_extended()'s
functionality into bdrv_create_padded_qiov() by using public
qemu_iovec_* functions, so that is what this patch does.
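Roughly, the wrapping then looks like this (simplified sketch omitting the
collapse case; the exact field usage is my assumption, not the patch code):

    qemu_iovec_init(&pad->local_qiov, padded_niov);
    if (pad->head) {
        qemu_iovec_add(&pad->local_qiov, pad->buf, pad->head);
    }
    /* add the guest's elements (possibly already collapsed) */
    qemu_iovec_concat_iov(&pad->local_qiov, iov, niov, iov_offset, bytes);
    if (pad->tail) {
        qemu_iovec_add(&pad->local_qiov,
                       pad->buf + pad->buf_len - pad->tail, pad->tail);
    }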

Because bdrv_pad_request() was the only "serious" user of
qemu_iovec_init_extended(), the next patch will remove the latter
function, so the functionality is not implemented twice.

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2141964
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
  block/io.c | 153 +++++++++++++++++++++++++++++++++++++++++++++++++----
  1 file changed, 143 insertions(+), 10 deletions(-)

diff --git a/block/io.c b/block/io.c
index 8974d46941..1e9cdba17a 100644
--- a/block/io.c
+++ b/block/io.c

[..]

+    pad->write = write;
+
      return true;
  }
@@ -1545,6 +1561,18 @@ zero_mem:
 static void bdrv_padding_destroy(BdrvRequestPadding *pad)

Maybe rename it to _finalize, to stress that it's not only freeing memory?

  {
+    if (pad->collapse_bounce_buf) {
+        if (!pad->write) {
+            /*
+             * If padding required elements in the vector to be collapsed into a
+             * bounce buffer, copy the bounce buffer content back
+             */
+            qemu_iovec_from_buf(&pad->pre_collapse_qiov, 0,
+                                pad->collapse_bounce_buf, pad->collapse_len);
+        }
+        qemu_vfree(pad->collapse_bounce_buf);
+        qemu_iovec_destroy(&pad->pre_collapse_qiov);
+    }
      if (pad->buf) {
          qemu_vfree(pad->buf);
          qemu_iovec_destroy(&pad->local_qiov);
@@ -1552,6 +1580,101 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
      memset(pad, 0, sizeof(*pad));
  }
+/*
+ * Create pad->local_qiov by wrapping @iov in the padding head and tail, while
+ * ensuring that the resulting vector will not exceed IOV_MAX elements.
+ *
+ * To ensure this, when necessary, the first couple of elements (up to three)

Maybe "first two or three elements"?

+ * of @iov are merged into pad->collapse_bounce_buf and replaced by a reference
+ * to that bounce buffer in pad->local_qiov.
+ *
+ * After performing a read request, the data from the bounce buffer must be
+ * copied back into pad->pre_collapse_qiov (e.g. by bdrv_padding_destroy()).
+ */
+static int bdrv_create_padded_qiov(BlockDriverState *bs,
+                                   BdrvRequestPadding *pad,
+                                   struct iovec *iov, int niov,
+                                   size_t iov_offset, size_t bytes)
+{
+    int padded_niov, surplus_count, collapse_count;
+
+    /* Assert this invariant */
+    assert(niov <= IOV_MAX);
+
+    /*
+     * Cannot pad if resulting length would exceed SIZE_MAX.  Returning an error
+     * to the guest is not ideal, but there is little else we can do.  At least
+     * this will practically never happen on 64-bit systems.
+     */
+    if (SIZE_MAX - pad->head < bytes ||
+        SIZE_MAX - pad->head - bytes < pad->tail)
+    {
+        return -EINVAL;
+    }
+
+    /* Length of the resulting IOV if we just concatenated everything */
+    padded_niov = !!pad->head + niov + !!pad->tail;
+
+    qemu_iovec_init(&pad->local_qiov, MIN(padded_niov, IOV_MAX));
+
+    if (pad->head) {
+        qemu_iovec_add(&pad->local_qiov, pad->buf, pad->head);
+    }
+
+    /*
+     * If padded_niov > IOV_MAX, we cannot just concatenate everything.
+     * Instead, merge the first couple of elements of @iov to reduce the number

Maybe "first two or three elements"?

+     * of vector elements as necessary.
+     */
+    if (padded_niov > IOV_MAX) {


[..]

@@ -1653,8 +1786,8 @@ int coroutine_fn bdrv_co_preadv_part(BdrvChild *child,
          flags |= BDRV_REQ_COPY_ON_READ;
      }
-    ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad,
-                           NULL, &flags);
+    ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, false,
+                           &pad, NULL, &flags);
      if (ret < 0) {
          goto fail;
      }

a bit later:

tracked_request_end(&req);
bdrv_padding_destroy(&pad);


Now the request is formally finished inside bdrv_padding_destroy()... I'm not
sure whether that really violates anything, but it seems safer to swap these
two calls, i.e. (just a sketch of the order I mean):
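bdrv_padding_destroy(&pad);   /* copies bounce-buffer data back while the request is still tracked */
tracked_request_end(&req);

With that: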

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>



PS: I feel there is still some small room for optimization here:

move the logic to bdrv_init_padding(), and

1. allocate only one buffer
2. attach the new collapse area to the head or tail padding
3. avoid creating an extra iov slice, maybe with the help of some new
qemu_iovec_* API that can control the number of copied/to-be-copied iovs
and/or calculate the number of iovs in a qiov/qiov_offset/bytes slice
(see the hypothetical sketch below)
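Something like this, for example (a purely hypothetical signature, just to
illustrate item 3):

    /* Hypothetical: return the number of iovec elements spanned by the
     * slice [offset, offset + bytes) of @qiov */
    int qemu_iovec_subvec_niov(QEMUIOVector *qiov, size_t offset, size_t bytes);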



--
Best regards,
Vladimir



