

From: Cam Macdonell
Subject: Re: [Qemu-devel] [PATCH 1/2] docs: update ivshmem device spec
Date: Thu, 26 Jun 2014 08:12:53 -0600


Hi,

Thank you for everyone's interest and work on this.  Sorry I haven't been better about responding.  I will offer my knowledge where it helps.  And the server is GPL, in case that was seen as an issue.

On Mon, Jun 23, 2014 at 8:18 AM, Claudio Fontana <address@hidden> wrote:
Hi,

we read through this quickly today, and these are some of the questions that we think
come up when reading it. We think we have figured out the answers to some of these
questions, but I think it's important to put this information into the
documentation.

I will quote the file in its entirety, and insert some questions inline.

> Device Specification for Inter-VM shared memory device
> ------------------------------------------------------
>
> The Inter-VM shared memory device is designed to share a region of memory to
> userspace in multiple virtual guests.

What does "to userspace" mean in this context? The userspace of the host, or the userspace in the guest?

The memory is intended to be shared between userspaces in the guests.  However, since the memory is a POSIX shm region, it is visible on the host too.
 

What about "The Inter-VM shared memory device is designed to share a memory region (created on the host via the POSIX shared memory API) between multiple QEMU processes running different guests. In order for all guests to be able to pick up the shared memory area, it is modeled by QEMU as a PCI device exposing said memory to the guest as a PCI BAR."

Whether in those guests the memory region is used in kernel space or userspace, or whether those terms even have any meaning, is guest-dependent, I would think (I am thinking of OSv here, where the application and kernel execute at the same privilege level and in the same address space).

I'm not exactly clear on what you're asking here.  The region is visible to both the guest kernel and userspace (once mounted).  
 

> The memory region does not belong to any
> guest, but is a POSIX memory object on the host.

Ok that's clear.
One thing I would ask, though I don't know if it makes sense to mention it here, is: who creates this memory object on the host?
I understand that in some cases it's the contributed server (what you provide in contrib/), and in some cases it's the "user" of this device who has to write some server code for that, but is it true that the qemu process itself can also create this memory object on its own, without any external process needed? Is that the use case for host<->guest only?


(Answering based on my original server code) When using the server, the server creates it.  Without the server, each qemu process will check whether it exists; if it does, it will use it.  If it does not exist, the qemu process will create it.
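
For illustration only, a minimal host-side sketch of that open-or-create behaviour (the function name and error handling are made up for the example; this is not the actual server or QEMU code, and a real implementation would also check the size of an already-existing object, as described further below):

    #include <fcntl.h>      /* O_RDWR, O_CREAT */
    #include <sys/mman.h>   /* shm_open, mmap */
    #include <sys/stat.h>   /* mode constants */
    #include <unistd.h>     /* ftruncate, close */

    /* Open the POSIX shm object if it exists, otherwise create and size it. */
    static void *open_or_create_shm(const char *name, size_t size)
    {
        int fd = shm_open(name, O_RDWR | O_CREAT, 0666);
        if (fd < 0)
            return NULL;
        if (ftruncate(fd, size) < 0) {      /* sets the size on first creation */
            close(fd);
            return NULL;
        }
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);                          /* the mapping stays valid */
        return p == MAP_FAILED ? NULL : p;
    }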
 
> Optionally, the device may
> support sending interrupts to other guests sharing the same memory region.

This opens up a lot of questions which are partly answered later (if I understand correctly, not only interrupts are involved, but a complete communication protocol involving registers in BAR0), but what about staying a bit general here, like:
"Optionally, the device may also provide a communication mechanism between guests sharing the same memory region. More details about that in the section 'OPTIONAL ivshmem guest to guest communication protocol'."

Thinking out loud, I wonder whether this communication mechanism should be part of this device in QEMU, or whether it should be provided at another layer...


>
>
> The Inter-VM PCI device
> -----------------------
>
> *BARs*
>
> The device supports three BARs.  BAR0 is a 1 Kbyte MMIO region to support
> registers.  BAR1 is used for MSI-X when it is enabled in the device.  BAR2 is
> used to map the shared memory object from the host.  The size of BAR2 is
> specified when the guest is started and must be a power of 2 in size.

Are BAR0 and BAR1 optional? That's what I would think from reading the whole document, but I'm still not sure.
Am I forced to map BAR0 and BAR1 anyway? I don't think so, but...


They do not need to be mapped; you can leave them unmapped if you don't want to use them.
 
If so, can we separate the explanation into the base shared memory feature, and a separate section which explains the OPTIONAL communication mechanism, and the OPTIONAL MSI-X BAR?

For example, say that I am a potential ivshmem user (which I am), and I am interested in the shared memory but want to use my own communication mechanism and protocol between guests. Can we make it so that I don't have to wonder whether some of the info I read applies or not?
The solution to that I think is to put all the OPTIONAL parts into separate sections.

>
> *Registers*

Ok, so this should, I think, go into one such OPTIONAL section.

>
> The device currently supports 4 registers of 32-bits each.  Registers
> are used for synchronization between guests sharing the same memory object when
> interrupts are supported (this requires using the shared memory server).

So use of BAR0 goes together with interrupts, and goes together with the shared memory server (is it the one contributed in contrib/?)

>
> The server assigns each VM an ID number and sends this ID number to the QEMU
> process when the guest starts.
>
> enum ivshmem_registers {
>     IntrMask = 0,
>     IntrStatus = 4,
>     IVPosition = 8,
>     Doorbell = 12
> };
>
> The first two registers are the interrupt mask and status registers.  Mask and
> status are only used with pin-based interrupts.  They are unused with MSI
> interrupts.
>
> Status Register: The status register is set to 1 when an interrupt occurs.
>
> Mask Register: The mask register is bitwise ANDed with the interrupt status
> and the result will raise an interrupt if it is non-zero.  However, since 1 is
> the only value the status will be set to, it is only the first bit of the mask
> that has any effect.  Therefore interrupts can be masked by setting the first
> bit to 0 and unmasked by setting the first bit to 1.
>
> IVPosition Register: The IVPosition register is read-only and reports the
> guest's ID number.  The guest IDs are non-negative integers.  When using the
> server, since the server is a separate process, the VM ID will only be set when
> the device is ready (shared memory is received from the server and accessible via
> the device).  If the device is not ready, the IVPosition will return -1.
> Applications should ensure that they have a valid VM ID before accessing the
> shared memory.

So the guest ID number is 32 bits, but the guest ID field in the doorbell is only 16 bits; can we be
more explicit about this? Does it follow that the maximum number of guests
is 65536?

Yes, for each server and its corresponding memory region.
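
To make the split concrete, a hypothetical helper (not part of the spec) that builds a Doorbell value could look like this:

    #include <stdint.h>

    /* High 16 bits: destination guest ID; low 16 bits: interrupt vector.
     * IVPosition is read as a 32-bit value, but only IDs up to 0xffff
     * can be addressed through the Doorbell register. */
    static inline uint32_t ivshmem_doorbell_value(uint16_t dest_id, uint16_t vector)
    {
        return ((uint32_t)dest_id << 16) | vector;
    }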
 

>
> Doorbell Register:  To interrupt another guest, a guest must write to the
> Doorbell register.  The doorbell register is 32-bits, logically divided into
> two 16-bit fields.  The high 16-bits are the guest ID to interrupt and the low
> 16-bits are the interrupt vector to trigger.  The semantics of the value
> written to the doorbell depends on whether the device is using MSI or a regular
> pin-based interrupt.  In short, MSI uses vectors while regular interrupts set the
> status register.
>
> Regular Interrupts
>
> If regular interrupts are used (due to either a guest not supporting MSI or the
> user specifying not to use them on startup) then the value written to the lower
> 16-bits of the Doorbell register is arbitrary and will trigger an
> interrupt in the destination guest.
>
> Message Signalled Interrupts
>
> An ivshmem device may support multiple MSI vectors.  If so, the lower 16-bits
> written to the Doorbell register must be between 0 and the maximum number of
> vectors the guest supports.  The lower 16 bits written to the doorbell is the
> MSI vector that will be raised in the destination guest.  The number of MSI
> vectors is configurable but it is set when the VM is started.
>
> The important thing to remember with MSI is that it is only a signal, no status
> is set (since MSI interrupts are not shared).  All information other than the
> interrupt itself should be communicated via the shared memory region.  Devices
> supporting multiple MSI vectors can use different vectors to indicate different
> events have occurred.  The semantics of interrupt vectors are left to the
> user's discretion.
>
>

Maybe an example of a full exchange would be useful to explain the use of these registers, making the protocol used for communication clear; or does this only provide mechanisms that can be used by someone else to implement a protocol?
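
To sketch one possible exchange (purely illustrative, not a protocol defined by the device: it assumes MSI-X, BAR0 already mapped at regs and BAR2 at shm, picks vector 0 arbitrarily, and leaves out the memory-barrier/ring-buffer machinery a real protocol would need):

    #include <stdint.h>
    #include <string.h>

    enum ivshmem_registers { IntrMask = 0, IntrStatus = 4, IVPosition = 8, Doorbell = 12 };

    /* Sender: place a message in the shared region, then ring the peer. */
    static void send_to_peer(volatile uint32_t *regs, char *shm,
                             uint16_t peer_id, const char *msg)
    {
        strcpy(shm, msg);                                     /* payload travels via BAR2 */
        regs[Doorbell / 4] = ((uint32_t)peer_id << 16) | 0;   /* MSI vector 0 */
    }

    /* Receiver: when MSI vector 0 fires there is no status to read (MSI is
     * only a signal), so the handler just consumes the data from BAR2. */
    static void on_vector0(const char *shm, char *buf, size_t len)
    {
        strncpy(buf, shm, len - 1);
        buf[len - 1] = '\0';
    }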



> IVSHMEM host services
> ---------------------
>
> This part is optional (see *Usage in the Guest* section below)

Ok this section is optional, but its role is not that clear to me.

So are there exactly 3 ways this can be used:

1) shared memory only, PCI BAR2
2) full device including registers in BAR0 but no MSI
3) full device including registers in BAR0 and MSI support in BAR1
?



>
> To handle notifications between users of an ivshmem device, an ivshmem server has
> been added. This server is responsible for creating the shared memory and
> creating a set of eventfds for each user of the shared memory.

Ok, this is the first time eventfds are mentioned, after we spoke about interrupts in the earlier section...

The interrupts are transported between QEMU processes using eventfds.  They are delivered into the guest as regular interrupts or MSI-X, and can be delivered to user level using eventfds with UIO.
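
A minimal guest userspace sketch of picking these interrupts up through UIO (this uses the plain read() interface of UIO; the /dev/uio0 name and the presence of an ivshmem UIO driver bound to the device are assumptions for the example):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/uio0", O_RDONLY);
        if (fd < 0)
            return 1;
        for (;;) {
            uint32_t count;
            /* A read() on a UIO device blocks until an interrupt arrives and
             * returns the cumulative interrupt count as a 32-bit value. */
            if (read(fd, &count, sizeof(count)) == sizeof(count))
                printf("interrupt received, count so far: %u\n", count);
        }
    }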
 

> It behaves as a
> proxy between the different ivshmem clients (QEMU): giving the shared memory fd
> to each client,

telling each client which /dev/name to shm_open? 

No, it passes a file descriptor for the region using SCM_RIGHTS.  When using the server, the qemu clients do not know the name of the shm region.
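
For reference, receiving such a descriptor is the standard SCM_RIGHTS pattern on a UNIX domain socket; this is the generic shape of it, not the actual ivshmem client code:

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Receive one file descriptor over a connected UNIX domain socket. */
    static int recv_fd(int sock)
    {
        char byte;
        struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
        char ctrl[CMSG_SPACE(sizeof(int))];
        struct msghdr msg = {
            .msg_iov = &iov, .msg_iovlen = 1,
            .msg_control = ctrl, .msg_controllen = sizeof(ctrl),
        };
        if (recvmsg(sock, &msg, 0) <= 0)
            return -1;
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
        if (!cmsg || cmsg->cmsg_level != SOL_SOCKET || cmsg->cmsg_type != SCM_RIGHTS)
            return -1;
        int fd;
        memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));
        return fd;     /* this fd can then be mmap()ed like the shm object */
    }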
 

> allocating eventfds to new clients and broadcasting to all
> clients when a client disappears.

What about VM Ids, are they also decided and shared by the server?

Yes, the server hands out increasing VM Ids.
 

>
> Apart from the current ivshmem implementation in QEMU, an ivshmem client can be
> written for debug, for development purposes, or to implement notifications
> between host and guests.
>
>
> Usage in the Guest
> ------------------
>
> The guest should map BAR0 to access the registers (an array of 32-bit ints
> allows simple writing) and map BAR2 to access the shared memory region itself.

Ok, but can I avoid mapping BAR0 if I don't use the registers?

Yes
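
For example, on a Linux guest you can map just BAR2 through sysfs and never touch BAR0/BAR1 at all (the PCI address below is machine-specific and only a placeholder; len may be smaller than the BAR if you only need part of the region):

    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static void *map_ivshmem_bar2(size_t len)
    {
        /* look the actual device up under /sys/bus/pci/devices/ */
        int fd = open("/sys/bus/pci/devices/0000:00:04.0/resource2", O_RDWR);
        if (fd < 0)
            return NULL;
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        return p == MAP_FAILED ? NULL : p;
    }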
 

> The size of the shared memory region is specified when the guest (or shared
> memory server) is started. A guest may map the whole shared memory region or
> only part of it.

So what does it mean here: I can choose to start the optional server contributed in contrib/
with a shared memory size parameter determining the size of the actual shared memory region,
and then the guest has the option to map only part of that?

You do not need to map the whole region.
 

Or can the guest (or better, the QEMU process running the guest) also create the shared memory region by itself?
Which parameters control these behaviours?

When giving a shared memory region name "foo":

    -device ivshmem,shm=foo,size=2048,use64=1
 

1) if the 'foo' memory object doesn't exist, the qemu process will create it
2) if 'foo' already exists it will use it
3) if the object exists but does not match the size specified, ivshmem will exit.


Btw, I would expect there to be a separate section with all the QEMU command line configuration parameters and their effect on the behavior of this device. Also for the contributed code in contrib/, especially the server, we need documentation about the command line parameters, environment variables, and whatever else can be configured, and what effect they have.

>
> ivshmem provides an optional notification mechanism through eventfds handled by
> QEMU that will trigger interrupts in guests. This mechanism is enabled when
> using an ivshmem-server which must be started prior to VMs and which serves as a
> proxy for exchanging eventfds.

Here too, a simple description of such a sequence of exchanges would be welcome; I would not mind some ASCII art either.

>
> It is your choice how to use the ivshmem device.

Good :)

> - the simple way, you don't need anything else than what is already in QEMU.

If the server becomes part of the QEMU package, then this sentence is a bit unclear, right? This was probably written before the server was contributed to QEMU, right?

>   You can map the shared memory in guest, then use it in userland as you see fit

In userland..?  Can I create the shared memory just by running a qemu process with some parameters? Does this mean I now share memory between guest and host? If I run multiple guests providing the same device name, can I make them use the same shared memory without the need for any server?

Yes, the server is only necessary for the interrupt behaviour.
 

>   (memnic for example works this way http://dpdk.org/browse/memnic),

I'll check that out..

> - the more advanced way, basically, if you want an event mechanism between the
>   VMs using your ivshmem device. In this case, then you will most likely want to
>   write a kernel driver that will handle interrupts.

Ok.

Let me ask you this: what about virtio?
Can I take this shared memory implementation and run virtio on top of it, since virtio already has primitives for communication?

I understand this would restrict me to 1-to-1 communication, while with the optional server in contrib/ I would have any-to-any communication available.

But what about the 1-to-1 guest-to-guest case: is it in theory possible to put virtio on top of ivshmem and use that to make the two guests communicate?

This is just a list of questions that we came up with, but please, anybody, weigh in with additional questions, comments, and feedback. In particular I would like to know whether the idea of virtio guest-to-guest communication is possible and realistic, maybe with a minimal extension of virtio, or if I am being insane.


There was originally a virtio-based version of ivshmem; you can see the discussion around it from sometime in 2009.  I think you could use virtio over ivshmem, but the 1-to-1 case is quite limiting.  Virtio is well optimized for what it does, and so it was decided to keep the two separate.

HTH,

Cam 


Thank you,

Claudio




