Re: [RFC 1/2] vhost-user: Add interface for virtio-fs migration
From: Stefan Hajnoczi
Subject: Re: [RFC 1/2] vhost-user: Add interface for virtio-fs migration
Date: Wed, 15 Mar 2023 09:58:44 -0400
On Mon, Mar 13, 2023 at 06:48:32PM +0100, Hanna Czenczek wrote:
> Add a virtio-fs-specific vhost-user interface to facilitate migrating
> back-end-internal state. We plan to migrate the internal state simply
Luckily the interface does not need to be virtio-fs-specific since it
only transfers opaque data. Any stateful device can use this for
migration. Please make it generic both at the vhost-user protocol
message level and at the QEMU vhost API level.
> as a binary blob after the streaming phase, so all we need is a way to
> transfer such a blob from and to the back-end. We do so by using a
> dedicated area of shared memory through which the blob is transferred in
> chunks.
Keeping the migration data transfer separate from the vhost-user UNIX
domain socket is a good idea since the amount of data could be large and
may congest the UNIX domain socket. The shared memory interface solves
this.
Where I get lost is why it needs to be shared memory instead of simply
an fd? On the source, the front-end could read the fd until EOF and
transfer the opaque data. On the destination, the front-end could write
to the fd and then close it. I think that would be simpler than the
shared memory interface and could potentially support zero-copy via
splice(2) (QEMU doesn't need to look at the data being transferred!).
Here is an outline of an fd-based interface:
- SET_DEVICE_STATE_FD: The front-end passes a file descriptor for
transferring device state.
The @direction argument:
- SAVE: the back-end transfers an outgoing device state over the fd.
- LOAD: the back-end transfers an incoming device state over the fd.
The @phase argument:
- STOPPED: the device is stopped.
- PRE_COPY: reserved for future use.
- POST_COPY: reserved for future use.
The back-end transfers data over the fd according to @direction and
@phase upon receiving the SET_DEVICE_STATE_FD message.
There are loose ends like how the message interacts with the virtqueue
enabled state, what happens if multiple SET_DEVICE_STATE_FD messages are
sent, etc. I have ignored them for now.
What I wanted to mention about the fd-based interface is:
- It's just one message. The I/O activity happens via the fd and does
not involve GET_STATE/SET_STATE messages over the vhost-user domain
socket.
- Buffer management is up to the front-end and back-end implementations
and a bit simpler than the shared memory interface.
Did you choose the shared memory approach because it has certain
advantages?
>
> This patch adds the following vhost operations (and implements them for
> vhost-user):
>
> - FS_SET_STATE_FD: The front-end passes a dedicated shared memory area
> to the back-end. This area will be used to transfer state via the
> other two operations.
> (After the transfer FS_SET_STATE_FD detaches the shared memory area
> again.)
>
> - FS_GET_STATE: The front-end asks the back-end to place a chunk of
> internal state into the shared memory area.
>
> - FS_SET_STATE: The front-end puts a chunk of internal state into the
> shared memory area, and asks the back-end to fetch it.
>
> On the source side, the back-end is expected to serialize its internal
> state either when FS_SET_STATE_FD is invoked, or when FS_GET_STATE is
> invoked the first time. On subsequent FS_GET_STATE calls, it memcpy()s
> parts of that serialized state into the shared memory area.
>
> On the destination side, the back-end is expected to collect the state
> blob over all FS_SET_STATE calls, and then deserialize and apply it once
> FS_SET_STATE_FD detaches the shared memory area.
What is the rationale for waiting to receive the entire incoming state
before parsing it rather than parsing it in a streaming fashion? Can
this be left as an implementation detail of the vhost-user back-end so
that there's freedom in choosing either approach?
>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> include/hw/virtio/vhost-backend.h | 9 ++
> include/hw/virtio/vhost.h | 68 +++++++++++++++
> hw/virtio/vhost-user.c | 138 ++++++++++++++++++++++++++++++
> hw/virtio/vhost.c | 29 +++++++
> 4 files changed, 244 insertions(+)
>
> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> index ec3fbae58d..fa3bd19386 100644
> --- a/include/hw/virtio/vhost-backend.h
> +++ b/include/hw/virtio/vhost-backend.h
> @@ -42,6 +42,12 @@ typedef int (*vhost_backend_init)(struct vhost_dev *dev,
> void *opaque,
> typedef int (*vhost_backend_cleanup)(struct vhost_dev *dev);
> typedef int (*vhost_backend_memslots_limit)(struct vhost_dev *dev);
>
> +typedef ssize_t (*vhost_fs_get_state_op)(struct vhost_dev *dev,
> + uint64_t state_offset, size_t size);
> +typedef int (*vhost_fs_set_state_op)(struct vhost_dev *dev,
> + uint64_t state_offset, size_t size);
> +typedef int (*vhost_fs_set_state_fd_op)(struct vhost_dev *dev, int memfd,
> + size_t size);
> typedef int (*vhost_net_set_backend_op)(struct vhost_dev *dev,
> struct vhost_vring_file *file);
> typedef int (*vhost_net_set_mtu_op)(struct vhost_dev *dev, uint16_t mtu);
> @@ -138,6 +144,9 @@ typedef struct VhostOps {
> vhost_backend_init vhost_backend_init;
> vhost_backend_cleanup vhost_backend_cleanup;
> vhost_backend_memslots_limit vhost_backend_memslots_limit;
> + vhost_fs_get_state_op vhost_fs_get_state;
> + vhost_fs_set_state_op vhost_fs_set_state;
> + vhost_fs_set_state_fd_op vhost_fs_set_state_fd;
> vhost_net_set_backend_op vhost_net_set_backend;
> vhost_net_set_mtu_op vhost_net_set_mtu;
> vhost_scsi_set_endpoint_op vhost_scsi_set_endpoint;
> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> index a52f273347..b1ad9785dd 100644
> --- a/include/hw/virtio/vhost.h
> +++ b/include/hw/virtio/vhost.h
> @@ -336,4 +336,72 @@ int vhost_dev_set_inflight(struct vhost_dev *dev,
> struct vhost_inflight *inflight);
> int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t queue_size,
> struct vhost_inflight *inflight);
> +
> +/**
> + * vhost_fs_set_state_fd(): Share memory with a virtio-fs vhost
> + * back-end for transferring internal state for the purpose of
> + * migration. Calling this function again will have the back-end
> + * unregister (free) the previously shared memory area.
> + *
> + * @dev: The vhost device
> + * @memfd: File descriptor associated with the shared memory to share.
> + * If negative, no memory area is shared, only releasing the
> + * previously shared area, and announcing the end of transfer
> + * (which, on the destination side, should lead to the
> + * back-end deserializing and applying the received state).
> + * @size: Size of the shared memory area
> + *
> + * Returns 0 on success, and -errno on failure.
> + */
> +int vhost_fs_set_state_fd(struct vhost_dev *dev, int memfd, size_t size);
> +
> +/**
> + * vhost_fs_get_state(): Request the virtio-fs vhost back-end to place
> + * a chunk of migration state into the shared memory area negotiated
> + * through vhost_fs_set_state_fd(). May only be used for migration,
> + * and only by the source side.
> + *
> + * The back-end-internal migration state is treated as a binary blob,
> + * which is transferred in chunks to fit into the shared memory area.
> + *
> + * @dev: The vhost device
> + * @state_offset: Offset into the state blob of the first byte to be
> + * transferred
> + * @size: Number of bytes to transfer at most; must not exceed the
> + * size of the shared memory area
> + *
> + * On success, returns the number of bytes that remain in the full
> + * state blob from the beginning of this chunk (i.e. the full size of
> + * the blob, minus @state_offset). When transferring the final chunk,
> + * this may be less than @size. The shared memory will contain the
> + * requested data, starting at offset 0 into the SHM, and counting
> + * `MIN(@size, returned value)` bytes.
> + *
> + * On failure, returns -errno.
> + */
> +ssize_t vhost_fs_get_state(struct vhost_dev *dev, uint64_t state_offset,
> + uint64_t size);
> +
> +/**
> + * vhost_fs_set_state(): Request the virtio-fs vhost back-end to fetch
> + * a chunk of migration state from the shared memory area negotiated
> + * through vhost_fs_set_state_fd(). May only be used for migration,
> + * and only by the destination side.
> + *
> + * The back-end-internal migration state is treated as a binary blob,
> + * which is transferred in chunks to fit into the shared memory area.
> + *
> + * The front-end (i.e. the caller) must transfer the whole state to
> + * the back-end, without holes.
> + *
> + * @dev: The vhost device
> + * @state_offset: Offset into the state blob of the first byte to be
> + * transferred
> + * @size: Length of the chunk to transfer; must not exceed the size of
> + * the shared memory area
> + *
> + * Returns 0 on success, and -errno on failure.
> + */
> +int vhost_fs_set_state(struct vhost_dev *dev, uint64_t state_offset,
> + uint64_t size);
> #endif
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index e5285df4ba..7fd1fb1ed3 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -130,6 +130,9 @@ typedef enum VhostUserRequest {
> VHOST_USER_REM_MEM_REG = 38,
> VHOST_USER_SET_STATUS = 39,
> VHOST_USER_GET_STATUS = 40,
> + VHOST_USER_FS_SET_STATE_FD = 41,
> + VHOST_USER_FS_GET_STATE = 42,
> + VHOST_USER_FS_SET_STATE = 43,
> VHOST_USER_MAX
> } VhostUserRequest;
>
> @@ -210,6 +213,15 @@ typedef struct {
> uint32_t size; /* the following payload size */
> } QEMU_PACKED VhostUserHeader;
>
> +/*
> + * Request and reply payloads of VHOST_USER_FS_GET_STATE, and request
> + * payload of VHOST_USER_FS_SET_STATE.
> + */
> +typedef struct VhostUserFsState {
> + uint64_t state_offset;
> + uint64_t size;
> +} VhostUserFsState;
> +
> typedef union {
> #define VHOST_USER_VRING_IDX_MASK (0xff)
> #define VHOST_USER_VRING_NOFD_MASK (0x1 << 8)
> @@ -224,6 +236,7 @@ typedef union {
> VhostUserCryptoSession session;
> VhostUserVringArea area;
> VhostUserInflight inflight;
> + VhostUserFsState fs_state;
> } VhostUserPayload;
>
> typedef struct VhostUserMsg {
> @@ -2240,6 +2253,128 @@ static int vhost_user_net_set_mtu(struct vhost_dev *dev, uint16_t mtu)
> return 0;
> }
>
> +static int vhost_user_fs_set_state_fd(struct vhost_dev *dev, int memfd,
> + size_t size)
> +{
> + int ret;
> +    bool reply_supported = virtio_has_feature(dev->protocol_features,
> +                                              VHOST_USER_PROTOCOL_F_REPLY_ACK);
> + VhostUserMsg msg = {
> + .hdr = {
> + .request = VHOST_USER_FS_SET_STATE_FD,
> + .flags = VHOST_USER_VERSION,
> + .size = sizeof(msg.payload.u64),
> + },
> + .payload.u64 = size,
> + };
> +
> + if (reply_supported) {
> + msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
> + }
> +
> + if (memfd < 0) {
> + assert(size == 0);
> + ret = vhost_user_write(dev, &msg, NULL, 0);
> + } else {
> + ret = vhost_user_write(dev, &msg, &memfd, 1);
> + }
> + if (ret < 0) {
> + return ret;
> + }
> +
> + if (reply_supported) {
> + return process_message_reply(dev, &msg);
> + }
> +
> + return 0;
> +}
> +
> +static ssize_t vhost_user_fs_get_state(struct vhost_dev *dev,
> + uint64_t state_offset,
> + size_t size)
> +{
> + int ret;
> + VhostUserMsg msg = {
> + .hdr = {
> + .request = VHOST_USER_FS_GET_STATE,
> + .flags = VHOST_USER_VERSION,
> + .size = sizeof(msg.payload.fs_state),
> + },
> + .payload.fs_state = {
> + .state_offset = state_offset,
> + .size = size,
> + },
> + };
> +
> + ret = vhost_user_write(dev, &msg, NULL, 0);
> + if (ret < 0) {
> + return ret;
> + }
> +
> + ret = vhost_user_read(dev, &msg);
> + if (ret < 0) {
> + return ret;
> + }
> +
> + if (msg.hdr.request != VHOST_USER_FS_GET_STATE) {
> + error_report("Received unexpected message type: "
> + "Expected %d, received %d",
> + VHOST_USER_FS_GET_STATE, msg.hdr.request);
> + return -EPROTO;
> + }
> +
> +    if (msg.hdr.size != sizeof(VhostUserFsState)) {
> +        error_report("Received unexpected message length: "
> +                     "Expected %zu, received %" PRIu32,
> +                     sizeof(VhostUserFsState), msg.hdr.size);
> +        return -EPROTO;
> +    }
> +
> +    if (msg.payload.fs_state.size > SSIZE_MAX) {
> +        error_report("Remaining state size returned by back end is too high: "
> +                     "Expected up to %zd, reported %" PRIu64,
> +                     SSIZE_MAX, msg.payload.fs_state.size);
> +        return -EPROTO;
> +    }
> +
> + return msg.payload.fs_state.size;
> +}
> +
> +static int vhost_user_fs_set_state(struct vhost_dev *dev,
> + uint64_t state_offset,
> + size_t size)
> +{
> + int ret;
> +    bool reply_supported = virtio_has_feature(dev->protocol_features,
> +                                              VHOST_USER_PROTOCOL_F_REPLY_ACK);
> + VhostUserMsg msg = {
> + .hdr = {
> + .request = VHOST_USER_FS_SET_STATE,
> + .flags = VHOST_USER_VERSION,
> + .size = sizeof(msg.payload.fs_state),
> + },
> + .payload.fs_state = {
> + .state_offset = state_offset,
> + .size = size,
> + },
> + };
> +
> + if (reply_supported) {
> + msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
> + }
> +
> + ret = vhost_user_write(dev, &msg, NULL, 0);
> + if (ret < 0) {
> + return ret;
> + }
> +
> + if (reply_supported) {
> + return process_message_reply(dev, &msg);
> + }
> +
> + return 0;
> +}
> +
> static int vhost_user_send_device_iotlb_msg(struct vhost_dev *dev,
> struct vhost_iotlb_msg *imsg)
> {
> @@ -2716,4 +2851,7 @@ const VhostOps user_ops = {
> .vhost_get_inflight_fd = vhost_user_get_inflight_fd,
> .vhost_set_inflight_fd = vhost_user_set_inflight_fd,
> .vhost_dev_start = vhost_user_dev_start,
> + .vhost_fs_set_state_fd = vhost_user_fs_set_state_fd,
> + .vhost_fs_get_state = vhost_user_fs_get_state,
> + .vhost_fs_set_state = vhost_user_fs_set_state,
> };
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index a266396576..ef8252c90e 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -2075,3 +2075,32 @@ int vhost_net_set_backend(struct vhost_dev *hdev,
>
> return -ENOSYS;
> }
> +
> +int vhost_fs_set_state_fd(struct vhost_dev *dev, int memfd, size_t size)
> +{
> + if (dev->vhost_ops->vhost_fs_set_state_fd) {
> + return dev->vhost_ops->vhost_fs_set_state_fd(dev, memfd, size);
> + }
> +
> + return -ENOSYS;
> +}
> +
> +ssize_t vhost_fs_get_state(struct vhost_dev *dev, uint64_t state_offset,
> + uint64_t size)
> +{
> + if (dev->vhost_ops->vhost_fs_get_state) {
> + return dev->vhost_ops->vhost_fs_get_state(dev, state_offset, size);
> + }
> +
> + return -ENOSYS;
> +}
> +
> +int vhost_fs_set_state(struct vhost_dev *dev, uint64_t state_offset,
> + uint64_t size)
> +{
> + if (dev->vhost_ops->vhost_fs_set_state) {
> + return dev->vhost_ops->vhost_fs_set_state(dev, state_offset, size);
> + }
> +
> + return -ENOSYS;
> +}
> --
> 2.39.1
>