Re: [PATCH v1 2/3] io: Add zerocopy and errqueue

qemu-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v1 2/3] io: Add zerocopy and errqueue

From:	Leonardo Bras Soares Passos
Subject:	Re: [PATCH v1 2/3] io: Add zerocopy and errqueue
Date:	Thu, 2 Sep 2021 03:38:11 -0300

Hello Daniel, thank you for the feedback!

Comments inline.

On Tue, Aug 31, 2021 at 9:57 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote:
> > MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket
> > send calls. It does so by avoiding copying user data into kernel buffers.
> >
> > To make it work, three steps are needed:
> > 1 - A setsockopt() system call, enabling SO_ZEROCOPY
> > 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall
> > 3 - Process the socket's error queue, dealing with any error
>
> AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY.
>
> It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg()
> from a synchronous call to an asynchronous call.

You are correct. I tried to adapt io_writev() interface to use MSG_ZEROCOPY in
a somehow synchronous way, but returning errp (and sometimes closing the
channel because of it) was a poor implementation.

>
> It is forbidden to overwrite/reuse/free the buffer passed to sendmsg
> until an asynchronous completion notification has been received from
> the socket error queue. These notifications are not required to
> arrive in-order, even for a TCP stream, because the kernel hangs on
> to the buffer if a re-transmit is needed.
>
> https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
>
>   "Page pinning also changes system call semantics. It temporarily
>    shares the buffer between process and network stack. Unlike with
>    copying, the process cannot immediately overwrite the buffer
>    after system call return without possibly modifying the data in
>    flight. Kernel integrity is not affected, but a buggy program
>    can possibly corrupt its own data stream."
>

By the above piece of documentation, I get there is no problem in
overwriting the buffer, but a corrupt, or newer version of the memory may
be sent instead of the original one. I am pointing this out because there
are workloads like page migration that would not be impacted, given
once the buffer is changed, it will dirty the page and it will be re-sent.

But I agree with you.
It's not a good choice to expect all the users to behave like that,
and since an interface for dealing with those errors is not provided
to the using code, there is no way of using that in other scenarios.

> AFAICT, the design added in this patch does not provide any way
> to honour these requirements around buffer lifetime.
>
> I can't see how we can introduce MSG_ZEROCOPY in any seemless
> way. The buffer lifetime requirements imply need for an API
> design that is fundamentally different for asynchronous usage,
> with a callback to notify when the write has finished/failed.
>

That is a good point.
Proposing a new optional method like io_async_writev() + an error
checking mechanism could do the job.
io_async_writev() could fall-back to io_writev() in cases where it's not
implemented.

I am not sure about the error checking yet.
Options I can see are:
1 - A callback, as you suggested, which IIUC would be provided by
code using the QIOChannel, and would only fix the reported errors,
leaving the responsibility of checking for errors to the IOChannel code.

2 - A new method, maybe io_async_errck(), which would be called
whenever the using code wants to deal with pending errors. It could
return an array/list of IOVs that need to be re-sent, for example,
and code using QIOChannel could deal with it however it wants.

[snip]

> >   * qio_channel_set_cork:
> >   * @ioc: the channel object
> > diff --git a/io/channel-socket.c b/io/channel-socket.c
> > index e377e7303d..a69fec7315 100644
> > --- a/io/channel-socket.c
> > +++ b/io/channel-socket.c
> > @@ -26,8 +26,10 @@
> >  #include "io/channel-watch.h"
> >  #include "trace.h"
> >  #include "qapi/clone-visitor.h"
> > +#include <linux/errqueue.h>
>
> That's going to fail to biuld on non-Linux

Good catch, thanks!

[snip]

> > @@ -571,6 +623,14 @@ static ssize_t qio_channel_socket_writev(QIOChannel 
> > *ioc,
> >                           "Unable to write to socket");
> >          return -1;
> >      }
> > +
> > +    if ((flags & MSG_ZEROCOPY) && sioc->zerocopy_enabled) {
> > +        sioc->errq_pending += niov;
> > +        if (sioc->errq_pending > SOCKET_ERRQ_THRESH) {
> > +            qio_channel_socket_errq_proc(sioc, errp);
> > +        }
> > +    }
>
> This silently looses any errors set from upto the final
> SOCKET_ERRQ_THRESH write() calls with MSG_ZEROCOPY set.

You are right.

>
> Also means if you have a series of writes without
> MSG_ZEROCOPY, it'll delay checking any pending
> errors.

That's expected... if there are only happening sends without MSG_ZEROCOPY,
it means the ones sent with zerocopy can wait. The problem would be
the above case.

>
> I would suggest checkig in close(), but as mentioned
> earlier, I think the design is flawed because the caller
> fundamentally needs to know about completion for every
> single write they make in order to know when the buffer
> can be released / reused.

Well, there could be a flush mechanism (maybe in io_sync_errck(),
activated with a
parameter flag, or on a different method if callback is preferred):
In the MSG_ZEROCOPY docs, we can see that the example includes using a poll()
syscall after each packet sent, and this means the fd gets a signal after each
sendmsg() happens, with error or not.

We could harness this with a poll() and a relatively high timeout:
- We stop sending packets, and then call poll().
- Whenever poll() returns 0, it means a timeout happened, and so it
took too long
without sendmsg() happening, meaning all the packets are sent.
- If it returns anything else, we go back to fixing the errors found (re-send)

The problem may be defining the value of this timeout, but it could be
called only
when zerocopy is active.

What do you think?

>
> > +static void
> > +qio_channel_socket_set_zerocopy(QIOChannel *ioc,
> > +                                bool enabled)
> > +{
> > +    QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
> > +    int v = enabled ? 1 : 0;
> > +    int ret;
> > +
> > +    ret = qemu_setsockopt(sioc->fd, SOL_SOCKET, SO_ZEROCOPY, &v, 
> > sizeof(v));
> > +    if (ret >= 0) {
> > +        sioc->zerocopy_enabled = true;
> > +    }
> > +}
>
> Surely we need to tell the caller wether this succeeed, otherwise
> the later sendmsg() is going to fail with EINVAL on older kernels
> where MSG_ZEROCOPY is not supported.

Yeah, that was the idea on sioc->zerocopy_enabled, but in fact it
should have been
something like 'sioc->zerocopy_enabled = (ret >= 0) && enabled ', this
way it would
reflect zerocopy only if it was enabled and the syscall worked, and not_zerocopy
otherwise.

or something like this, if we want it to stick with zerocopy if
setting it off fails.
if (ret >= 0) {
    sioc->zerocopy_enabled = enabled;
}

>
>
> > diff --git a/io/channel-tls.c b/io/channel-tls.c
> > index 4ce890a538..bf44b0f7b0 100644
> > --- a/io/channel-tls.c
> > +++ b/io/channel-tls.c
> > @@ -350,6 +350,16 @@ static void qio_channel_tls_set_delay(QIOChannel *ioc,
> >      qio_channel_set_delay(tioc->master, enabled);
> >  }
> >
> > +
> > +static void qio_channel_tls_set_zerocopy(QIOChannel *ioc,
> > +                                         bool enabled)
> > +{
> > +    QIOChannelTLS *tioc = QIO_CHANNEL_TLS(ioc);
> > +
> > +    qio_channel_set_zerocopy(tioc->master, enabled);
> > +}
>
> This is going to be unsafe because gnutls will discard/reuse the
> buffer for the ciphertext after every write(). I can't see a
> way to safely enable MSG_ZEROCOPY when TLS is used.

Yeah, that makes sense.
It would make more sense to implement KTLS, as IIRC it kind of does
'zerocopy', since it saves the encrypted data directly in kernel buffer.

We could implement KTLS as io_async_writev() for Channel_TLS, and change this
flag to async_enabled. If KTLS is not available, it would fall back to
using gnuTLS
on io_writev, just like it would happen in zerocopy.

>
>
> Regards,
> Daniel
> --
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>

[Prev in Thread]

Current Thread

[Next in Thread]

Re: [PATCH v1 2/3] io: Add zerocopy and errqueue, Daniel P . Berrangé, 2021/09/01
- Re: [PATCH v1 2/3] io: Add zerocopy and errqueue, Peter Xu, 2021/09/01
  - Re: [PATCH v1 2/3] io: Add zerocopy and errqueue, Daniel P . Berrangé, 2021/09/01
- Re: [PATCH v1 2/3] io: Add zerocopy and errqueue, Leonardo Bras Soares Passos, 2021/09/02
- Re: [PATCH v1 2/3] io: Add zerocopy and errqueue, Leonardo Bras Soares Passos <=
  - Re: [PATCH v1 2/3] io: Add zerocopy and errqueue, Daniel P . Berrangé, 2021/09/02
    - Re: [PATCH v1 2/3] io: Add zerocopy and errqueue, Leonardo Bras Soares Passos, 2021/09/02
    - Re: [PATCH v1 2/3] io: Add zerocopy and errqueue, Daniel P . Berrangé, 2021/09/02
    - Re: [PATCH v1 2/3] io: Add zerocopy and errqueue, Leonardo Bras Soares Passos, 2021/09/02
    - Re: [PATCH v1 2/3] io: Add zerocopy and errqueue, Daniel P . Berrangé, 2021/09/02
    - Re: [PATCH v1 2/3] io: Add zerocopy and errqueue, Dr. David Alan Gilbert, 2021/09/07
    - Re: [PATCH v1 2/3] io: Add zerocopy and errqueue, Peter Xu, 2021/09/07
    - Re: [PATCH v1 2/3] io: Add zerocopy and errqueue, Dr. David Alan Gilbert, 2021/09/08
    - Re: [PATCH v1 2/3] io: Add zerocopy and errqueue, Peter Xu, 2021/09/08
    - Re: [PATCH v1 2/3] io: Add zerocopy and errqueue, Dr. David Alan Gilbert, 2021/09/09

Prev by Date: Re: [PATCH v6 10/10] ACPI ERST: step 6 of bios-tables-test
Next by Date: How to comminicate between two QOM model?
Previous by thread: Re: [PATCH v1 2/3] io: Add zerocopy and errqueue
Next by thread: Re: [PATCH v1 2/3] io: Add zerocopy and errqueue
Index(es):
- Date
- Thread