Re: [PATCH] net: add initial support for AF_XDP network backend


From: Ilya Maximets
Subject: Re: [PATCH] net: add initial support for AF_XDP network backend
Date: Mon, 10 Jul 2023 12:56:09 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.10.0

On 7/10/23 05:51, Jason Wang wrote:
> On Fri, Jul 7, 2023 at 7:21 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>>
>> On 7/7/23 03:43, Jason Wang wrote:
>>> On Fri, Jul 7, 2023 at 3:08 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>>>
>>>> On Wed, 5 Jul 2023 at 02:02, Jason Wang <jasowang@redhat.com> wrote:
>>>>>
>>>>> On Mon, Jul 3, 2023 at 5:03 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>>>>>
>>>>>> On Fri, 30 Jun 2023 at 09:41, Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>
>>>>>>> On Thu, Jun 29, 2023 at 8:36 PM Stefan Hajnoczi <stefanha@gmail.com> 
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> On Thu, 29 Jun 2023 at 07:26, Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>>
>>>>>>>>> On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi <stefanha@gmail.com> 
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi 
>>>>>>>>>>> <stefanha@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasowang@redhat.com> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi 
>>>>>>>>>>>>> <stefanha@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasowang@redhat.com> 
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets 
>>>>>>>>>>>>>>> <i.maximets@ovn.org> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 6/27/23 04:54, Jason Wang wrote:
>>>>>>>>>>>>>>>>> On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets 
>>>>>>>>>>>>>>>>> <i.maximets@ovn.org> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 6/26/23 08:32, Jason Wang wrote:
>>>>>>>>>>>>>>>>>>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang 
>>>>>>>>>>>>>>>>>>> <jasowang@redhat.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets 
>>>>>>>>>>>>>>>>>>>> <i.maximets@ovn.org> wrote:
>>>>>>>>>>>>>>>>>> It is noticeably more performant than a tap with vhost=on
>>>>>>>>>>>>>>>>>> in terms of PPS.  So, that might be one case.  Taking into
>>>>>>>>>>>>>>>>>> account that just the RCU lock and unlock in the virtio-net
>>>>>>>>>>>>>>>>>> code takes more time than a packet copy, some batching on
>>>>>>>>>>>>>>>>>> the QEMU side should improve performance significantly.
>>>>>>>>>>>>>>>>>> And it shouldn't be too hard to implement.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Performance over virtual interfaces may potentially be
>>>>>>>>>>>>>>>>>> improved by creating a kernel thread for async Tx, similar
>>>>>>>>>>>>>>>>>> to what io_uring allows.  Currently, Tx on non-zero-copy
>>>>>>>>>>>>>>>>>> interfaces is synchronous, and that doesn't scale well.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Interestingly, there is actually a lot of "duplication"
>>>>>>>>>>>>>>>>> between io_uring and AF_XDP:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1) both have a similar memory model (user-registered memory)
>>>>>>>>>>>>>>>>> 2) both use rings for communication
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I wonder if we can let io_uring talk directly to AF_XDP.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Well, if we submit poll() in the QEMU main loop via io_uring,
>>>>>>>>>>>>>>>> then we can avoid the cost of the synchronous Tx for
>>>>>>>>>>>>>>>> non-zero-copy modes, i.e. for virtual interfaces.  The
>>>>>>>>>>>>>>>> io_uring thread in the kernel will be able to perform the
>>>>>>>>>>>>>>>> transmission for us.
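
For illustration only, a minimal liburing sketch of that idea; xsk_fd is
assumed to be the descriptor of an already-bound AF_XDP socket, and none of
this code is from the patch under discussion:

    #include <liburing.h>
    #include <poll.h>

    /* Hand the wait for socket writability to an io_uring worker instead
     * of blocking the QEMU main loop in poll()/sendto(). */
    static int xsk_submit_pollout(struct io_uring *ring, int xsk_fd)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        if (!sqe) {
            return -1;  /* submission queue is full */
        }
        /* A completion shows up on the CQ once the socket is writable. */
        io_uring_prep_poll_add(sqe, xsk_fd, POLLOUT);
        return io_uring_submit(ring);
    }
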
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It would be nice if we could use an iothread/vhost rather than
>>>>>>>>>>>>>>> the main loop, even if io_uring can use kthreads.  We could
>>>>>>>>>>>>>>> avoid the memory translation cost.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The QEMU event loop (AioContext) has io_uring code
>>>>>>>>>>>>>> (util/fdmon-io_uring.c) but it's disabled at the moment. I'm 
>>>>>>>>>>>>>> working
>>>>>>>>>>>>>> on patches to re-enable it and will probably send them in July. 
>>>>>>>>>>>>>> The
>>>>>>>>>>>>>> patches also add an API to submit arbitrary io_uring operations 
>>>>>>>>>>>>>> so
>>>>>>>>>>>>>> that you can do stuff besides file descriptor monitoring. Both 
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> main loop and IOThreads will be able to use io_uring on Linux 
>>>>>>>>>>>>>> hosts.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Just to make sure I understand. If we still need a copy from 
>>>>>>>>>>>>> guest to
>>>>>>>>>>>>> io_uring buffer, we still need to go via memory API for GPA which
>>>>>>>>>>>>> seems expensive.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Vhost seems to be a shortcut for this.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm not sure how exactly you're thinking of using io_uring.
>>>>>>>>>>>>
>>>>>>>>>>>> Simply using io_uring for the event loop (file descriptor 
>>>>>>>>>>>> monitoring)
>>>>>>>>>>>> doesn't involve an extra buffer, but the packet payload still 
>>>>>>>>>>>> needs to
>>>>>>>>>>>> reside in AF_XDP umem, so there is a copy between guest memory and
>>>>>>>>>>>> umem.
>>>>>>>>>>>
>>>>>>>>>>> So there would be a translation from GPA to HVA (unless io_uring
>>>>>>>>>>> supports 2 stages), which needs to go via the QEMU memory core.
>>>>>>>>>>> And this part seems to be very expensive according to my tests in
>>>>>>>>>>> the past.
>>>>>>>>>>
>>>>>>>>>> Yes, but in the current approach where AF_XDP is implemented as a 
>>>>>>>>>> QEMU
>>>>>>>>>> netdev, there is already QEMU device emulation (e.g. virtio-net)
>>>>>>>>>> happening. So the GPA to HVA translation will happen anyway in device
>>>>>>>>>> emulation.
>>>>>>>>>
>>>>>>>>> Just to make sure we're on the same page.
>>>>>>>>>
>>>>>>>>> I meant, AF_XDP can do more than, e.g., 10 Mpps.  So if we still use
>>>>>>>>> the QEMU netdev, it would be very hard to achieve that if we stick to
>>>>>>>>> using the QEMU memory core translations, which need to take care of
>>>>>>>>> too much extra stuff.  That's why I suggest using vhost in io threads,
>>>>>>>>> which only cares about RAM, so the translation could be very fast.
>>>>>>>>
>>>>>>>> What does using "vhost in io threads" mean?
>>>>>>>
>>>>>>> It means a vhost userspace dataplane that is implemented via io threads.
>>>>>>
>>>>>> AFAIK this does not exist today. QEMU's built-in devices that use
>>>>>> IOThreads don't use vhost code. QEMU vhost code is for vhost kernel,
>>>>>> vhost-user, or vDPA but not built-in devices that use IOThreads. The
>>>>>> built-in devices implement VirtioDeviceClass callbacks directly and
>>>>>> use AioContext APIs to run in IOThreads.
>>>>>
>>>>> Yes.
>>>>>
>>>>>>
>>>>>> Do you have an idea for using vhost code for built-in devices? Maybe
>>>>>> it's fastest if you explain your idea and its advantages instead of me
>>>>>> guessing.
>>>>>
>>>>> It's something like what I proposed in [1]:
>>>>>
>>>>> 1) a vhost that is implemented via IOThreads
>>>>> 2) memory translation is done via the vhost memory table/IOTLB
>>>>>
>>>>> The advantages are:
>>>>>
>>>>> 1) No 3rd-party application like a DPDK application
>>>>> 2) The attack surface is reduced
>>>>> 3) Better understanding/interactions with the device model for things
>>>>>    like RSS and IOMMU
>>>>>
>>>>> There could be some disadvantages, but they're not obvious to me :)
>>>>
>>>> Why is QEMU's native device emulation API not the natural choice for
>>>> writing built-in devices? I don't understand why the vhost interface
>>>> is desirable for built-in devices.
>>>
>>> Unless the memory helpers (like address translation) were fully optimized
>>> to satisfy this 10M+ PPS.
>>>
>>> Not sure if this is too hard, but last time I benchmarked, perf told me
>>> most of the time was spent in the translation.
>>>
>>> Using vhost is a workaround since its memory model is much simpler, so it
>>> can skip lots of memory sections like I/O, ROM, etc.
>>
>> So, we can have a thread running as part of the QEMU process that
>> implements vhost functionality for a virtio-net device.  And this thread
>> has an optimized way to access memory.  What prevents the current
>> virtio-net emulation code from accessing memory in the same optimized way?
> 
> The current emulation uses memory core accessors, which need to take care
> of a lot of stuff like MMIO or even P2P.  That kind of stuff has not been a
> concern for vhost since day 0.  You can do some experiments on this, e.g.
> just dropping packets after fetching them from the TX ring.

If I'm reading that right, the virtio implementation uses address space
caching: a memory listener keeps pre-translated addresses of the interesting
memory regions, and reads then go through address_space_read_cached(), which
bypasses all of the memory address translation logic on a cache hit.  That
sounds pretty similar to how the memory table is prepared for vhost.
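
For reference, a minimal sketch of that cached-access pattern; the function
and variable names here are illustrative, not taken from the virtio code:

    #include "qemu/osdep.h"
    #include "exec/memory.h"

    /* Translate a guest-physical range once, then read through the cache;
     * cache hits bypass the full memory-core translation path. */
    static bool read_guest_cached(AddressSpace *as, hwaddr gpa,
                                  void *buf, hwaddr len)
    {
        MemoryRegionCache cache = MEMORY_REGION_CACHE_INVALID;
        bool ok;

        if (address_space_cache_init(&cache, as, gpa, len, false) < 0) {
            return false;  /* translation failed */
        }
        ok = address_space_read_cached(&cache, 0, buf, len) == MEMTX_OK;
        address_space_cache_destroy(&cache);
        return ok;
    }
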

> 
>> I.e., we likely don't actually need to implement the whole vhost-virtio
>> communication protocol in order to have faster memory access from the
>> device emulation code.  I mean, if vhost can access device memory faster,
>> why can't the device itself?
> 
> I'm not saying it can't but it would end up with something similar to
> vhost. And that's why I'm saying using vhost is a shortcut (at least
> for a POC).
> 
> Thanks
> 
>>
>> With that, we could probably split the "datapath" part of the virtio-net
>> emulation into a separate thread driven by an iothread loop.
>>
>> Then add a batch API for communication with a network backend (af-xdp) to
>> avoid per-packet calls.
>>
>> These are 3 more or less independent tasks that should allow performance
>> similar to a full-fledged vhost control and dataplane implementation
>> inside QEMU.
>>
>> Or am I missing something? (Probably)
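
Regarding the batch API mentioned above, a rough sketch of what a
burst-oriented TX path toward an AF_XDP socket could look like with the
libxdp xsk helpers; the socket/umem setup and frame addresses are assumed to
exist elsewhere, and this is not code from the patch:

    #include <xdp/xsk.h>
    #include <sys/socket.h>

    /* Queue a whole burst of frames with one reservation, one submit and
     * at most one kick, instead of one call per packet. */
    static unsigned int xsk_tx_burst(struct xsk_socket *xsk,
                                     struct xsk_ring_prod *tx,
                                     const __u64 *addrs, const __u32 *lens,
                                     __u32 n)
    {
        __u32 idx;

        if (xsk_ring_prod__reserve(tx, n, &idx) != n) {
            return 0;  /* TX ring is full, retry later */
        }
        for (__u32 i = 0; i < n; i++) {
            struct xdp_desc *d = xsk_ring_prod__tx_desc(tx, idx + i);
            d->addr = addrs[i];  /* umem offset of the frame */
            d->len = lens[i];
        }
        xsk_ring_prod__submit(tx, n);

        /* Copy mode needs a kick; zero-copy mode may not. */
        if (xsk_ring_prod__needs_wakeup(tx)) {
            sendto(xsk_socket__fd(xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);
        }
        return n;
    }
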
>>
>>>
>>> Thanks
>>>
>>>>
>>>>>
>>>>> It's something like linking SPDK/DPDK to Qemu.
>>>>
>>>> Sergio Lopez tried loading vhost-user devices as shared libraries that
>>>> run in the QEMU process. It worked as an experiment but wasn't pursued
>>>> further.
>>>>
>>>> I think that might make sense in specific cases where there is an
>>>> existing vhost-user codebase that needs to run as part of QEMU.
>>>>
>>>> In this case the AF_XDP code is new, so it's not a case of moving
>>>> existing code into QEMU.
>>>>
>>>>>
>>>>>>
>>>>>>>>>> Regarding pinning - I wonder if that's something that can be refined
>>>>>>>>>> in the kernel by adding an AF_XDP flag that enables on-demand pinning
>>>>>>>>>> of umem. That way only rx and tx buffers that are currently in use
>>>>>>>>>> will be pinned. The disadvantage is the runtime overhead to pin/unpin
>>>>>>>>>> pages. I'm not sure whether it's possible to implement this, I 
>>>>>>>>>> haven't
>>>>>>>>>> checked the kernel code.
>>>>>>>>>
>>>>>>>>> It requires the device to do page faults, which is not commonly
>>>>>>>>> supported nowadays.
>>>>>>>>
>>>>>>>> I don't understand this comment.  AF_XDP processes each rx/tx
>>>>>>>> descriptor.  At that point it can call get_user_pages() or similar in
>>>>>>>> order to pin the page.  When the memory is no longer needed, it can
>>>>>>>> put those pages.  No fault mechanism is needed.  What am I missing?
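
A kernel-side sketch of that on-demand pinning idea, purely illustrative and
not existing AF_XDP code; it pins the single page backing one descriptor
before use and releases it after the completion is harvested:

    #include <linux/mm.h>

    static struct page *xsk_pin_desc_page(unsigned long uaddr)
    {
        struct page *page;

        /* FOLL_WRITE: RX buffers are written by the kernel/NIC. */
        if (pin_user_pages_fast(uaddr & PAGE_MASK, 1, FOLL_WRITE, &page) != 1)
            return NULL;
        return page;
    }

    static void xsk_unpin_desc_page(struct page *page)
    {
        unpin_user_page(page);
    }
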
>>>>>>>
>>>>>>> Ok, I think I kind of get you; you mean doing the pinning while
>>>>>>> processing rx/tx buffers?  It's not easy since GUP itself is not very
>>>>>>> fast, and it would hurt PPS for sure.
>>>>>>
>>>>>> Yes. It's not as fast as permanently pinning rx/tx buffers, but it
>>>>>> supports unpinned guest RAM.
>>>>>
>>>>> Right, it's a balance between pin and PPS. PPS seems to be more
>>>>> important in this case.
>>>>>
>>>>>>
>>>>>> There are variations on this approach, like keeping a certain amount
>>>>>> of pages pinned after they have been used so the cost of
>>>>>> pinning/unpinning can be avoided when the same pages are reused in the
>>>>>> future, but I don't know how effective that is in practice.
>>>>>>
>>>>>> Is there a more efficient approach without relying on hardware page
>>>>>> fault support?
>>>>>
>>>>> I guess so, I see some slides that say device page fault is very slow.
>>>>>
>>>>>>
>>>>>> My understanding is that hardware page fault support is not yet
>>>>>> deployed. We'd be left with pinning guest RAM permanently or using a
>>>>>> runtime pinning/unpinning approach like I've described.
>>>>>
>>>>> Probably.
>>>>>
>>>>> Thanks
>>>>>
>>>>>>
>>>>>> Stefan
>>>>>>
>>>>>
>>>>
>>>
>>
> 



