Consider that we want to debug a virtio issue: only 2) or 3) is what we really care about.
So for kick, what you did (assuming vhost is on) is:
virtio_device_event_kick()
virtio_queue_notify()
event_notifier_set()
It looks to me you're actually testing whether the ioeventfd is correctly set by QEMU.
For call you did:
virtio_device_event_call()
event_notifier_set()
A test of whether the irqfd is correctly set by QEMU. So none of these are virtio-specific, yet you introduce a virtio-specific command to debug non-virtio functions.
In the case of what you mentioned for vring_need_event(), what we really want is to dump the virtqueue state from the guest. This might require some work to extend the virtio spec (e.g., to dump device state such as indices, event suppression fields, and wrap counters).
Suppose the issue is only randomly reproducible in a production environment; we should always take 4) into consideration, because we will not be able to know where the root cause is at the very beginning of the bug analysis.
So if it turns out to be an issue with the irqchip, how will you do the debugging then? I guess what would really be helpful is a QMP command to dump irqchip status/information.
Thank you very much for the suggestion. That is a different problem, and we may consider it as future work.
This patchset introduces changes/events to help narrow down where the root cause may be, in order to facilitate diagnostics (especially for production-environment issues that are not easy to reproduce).
This was initially proposed for vhost only, and I was going to export ioeventfd/irqfd from vhost to the admin via sysfs. Finally, I realized it would be better to implement this in QEMU.
QEMU initializes the eventfds (ioeventfd and irqfd) and offloads them to KVM/vhost. The VM side puts requests on the ring buffer and kicks the doorbell (via ioeventfd), while the backend vhost side sends responses back and signals the IRQ (via irqfd).
Unfortunately, sometimes there is an issue with virtio/vhost such that a kick/call is missed/ignored, or even never triggered. The example mentioned in the patchset cover letter is with virtio-net (I assume vhost=on), where a kick to the ioeventfd was ignored due to a pci-bridge/hotplug issue.
So this is not a good example, since it was a chipset bug. You would need to use other tools to debug chipset code, wouldn't you?
As this issue is only randomly reproducible, we will not be able to realize it is a chipset bug at the very beginning.
While the link is about vhost-net, it is applicable to vhost-scsi as well.
Suppose DEBUG_UNASSIGNED is not enabled, the developer will need to investigate
all of blk-mq, scsi, virtio-scsi (ioeventfd), vhost-scsi (target), pci-bridge
and pci-hotplug, in order to identify the root cause.
The "call/kick" interface is used to narrow down and verify the analysis,
especially when many kernel components are involved.
The hotplug window is very small, but the IO hangs permanently. I did test that kicking the doorbell again helps recover the IO, so I was able to conclude this was due to a lost kick.
The loss of an irq/doorbell is painful, especially in a production environment where we are not able to attach to QEMU via gdb. While the patchset is only for QEMU, the Xen PV driver used to experience loss-of-IRQ issues as well, e.g., linux kernel commit 4704fe4f03a5 ("xen/events: mask events when changing their VCPU binding").
So looking at the git history, we can see there is little possibility that the missed event is due to virtio/vhost itself. So the commit you mention here is not a good example either, since it's not a netfront/netback bug.
I mentioned the Xen issue just to explain that loss-of-event/irq issues do happen and are very painful. Another Xen example (equivalent to KVM VFIO):
https://urldefense.com/v3/__https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=56348df32bbc782e63b6e3fb978b80e015ae76e7__;!!GqivPVa7Brio!JY2OqmcXAmza_G2gR-dQwV2Oa0hGG_6trVkxSUMocoYi4A_VXwZbzVn_VA9yx10i7Hk$
Sorry, I can't figure out how this is related to VFIO or virtio. It should be reproducible for devices without using eventfd?
Yes, although eventfd is not involved, other drivers/virtualization stacks may encounter the loss of an irq/kick as well. There is no relation between Xen and vfio/virtio. That's why a diagnostic interface is appreciated.
In my opinion, the 'diagnostic' is not only to collect data,
Usually, collecting the data is the first step :)
but also to introduce events/changes (e.g., inject an IRQ) and then monitor/observe what happens to the stalled VM.
It might be helpful yes, but it's also very dangerous.
That's why I mentioned this idea also helps for VFIO (not only VFIO, but actually blk-mq+nvme+pci+kvm+vfio, assuming it is for NVMe passthrough).
So for the case of event call, what you did is:
static void virtio_device_event_call(VirtQueue *vq, bool eventfd,
                                     Error **errp)
{
#ifdef DEBUG_VIRTIO_EVENT
    printf("The 'call' event is triggered for path=%s, queue=%d, irqfd=%d.\n",
           object_get_canonical_path(OBJECT(vq->vdev)),
           vq->queue_index, eventfd);
#endif

    if (eventfd) {
        virtio_set_isr(vq->vdev, 0x1);
        event_notifier_set(&vq->guest_notifier);
    } else {
        virtio_irq(vq);
    }
}
This means that when eventfd is set, you bypass the MSI mask, which is very dangerous to use in a production environment. And if you do check masking, it won't help much if the MSI is masked wrongly.
You are right.
Only vhost-net is dangerous, because it masks a vector by registering an alternative masked_notifier with KVM, while virtio-blk/vhost-scsi/virtio-scsi unregister the guest notifier.
This can help narrow down whether an IO/networking hang is due to a lost IRQ/doorbell (or the VM's MSI-X being erroneously masked), especially in a production environment. It can also be used as a workaround so that the VM owner does not need to reboot the VM.
So having such an extra workaround is pretty dangerous in a production environment, where I think we need to be conservative, which means we should collect information instead of generating artificial events. And it doesn't help if the workaround can be triggered through a management API.
I agree with this.
This depends on the administrator. This workaround should only be used in very limited and special cases.
In addition, VFIO will benefit from it. We will be able to test whether injecting an IRQ on purpose helps when the driver (e.g., PCI, NVMe, Ethernet) developers blame the hang on a lost IRQ with KVM/VFIO. (It seems there is more chance to lose an IRQ during CPU hotplug or when changing IRQ affinity.)
I think we would not gain much by introducing a dedicated mechanism for such a corner case.
As Dave replied to the prior RFC, QEMU already supports hmp_ioport_write to trigger an ioport write on purpose.
If that applies, I would rather have an hmp_mem_write; then we can test both MSI and doorbell. But again, these are very dangerous to use in a production environment.
This is just not convenient for a production-environment administrator. We would need to first obtain the virtio PCI info (e.g., via "lspci -vv"), and then prepare the command after calculating the address of the doorbell. Something bad may happen if the developer does not calculate the address correctly.
It won't be worse than hmp_ioport_write I think?
I always believe it is worth adding hmp_mem_write().
While it won't be worse than hmp_ioport_write(), in my opinion it is not as easy/convenient as writing to the eventfd.
It would be much easier for a developer to just ask the administrator to "call" queue X of a specific virtio device.
We can have commands like "info virtio" which show all the MSI/doorbell information for the user. Or we can limit its use to virtio and vfio devices only, to avoid unexpected results.
So far, the method of this patchset is to introduce a "DeviceEvent event" to "struct DeviceClass". Only virtio-pci-xxx and vfio (future work) will implement this interface.
The Linux block layer also supports the below to kick the IO queue on purpose:
echo "kick" > /sys/kernel/debug/block/sda/state
This might be fine for a hardware device but not for virtio. The device can choose to poll the virtqueue instead of depending on the doorbell to work. And for the networking subsystem, we don't have such things; instead, ethtool supports dumping rings and vendor-specific state, which could be used for dumping virtqueue state in this case.
This is just another example to help explain the philosophy behind the "kick/call" idea: sometimes triggering an event on purpose helps us narrow down and verify our analysis of a kernel bug, especially one that is only randomly reproducible in a production environment.
I understand it is possibly not proper to introduce such interface to QEMU.
That's why I used to send out the RFC.
https://urldefense.com/v3/__https://lists.nongnu.org/archive/html/qemu-devel/2021-01/msg03441.html__;!!GqivPVa7Brio!JY2OqmcXAmza_G2gR-dQwV2Oa0hGG_6trVkxSUMocoYi4A_VXwZbzVn_VA9yu-n97gA$
In my opinion, this interface is pretty useful when the diagnosis involves many kernel components, or when developers from different components are working on the same bug, no matter whether the root cause is in virtio or not.
So for virtio, it's not hard to generate events without those interfaces. E.g., for networking you can generate some traffic and trace wherever you suspect the event (kick/call) could be blocked.
Suppose the vhost-net backend is TUN. Once the virtio-net RX path is stuck and its vring is full, the ring used by tun_net_xmit()-->ptr_ring_produce() will be full as well. I do not have a way to generate traffic for the RX path in such a situation.
Right, but as discussed, we need interface to dump virtqueue state, then it
would be very easy to start with.
I still prefer hmp_mem_write()/read(), which looks more generic; at the same time, we can add more debug information like:
1) statistics such as eventfd counters
2) device information, register layout, doorbell/MSI-X information
3) irqchip information
Would you mind helping with the questions below?
1. Regardless about kick/call or hmp_mem_write(), is it safe to add such
interfaces? I think it is safe because:
(1) This affects only specific VM (QEMU), not all/others.
(2) It is dangerous only when sysadmin triggers the events on purpose. If this
interface is dangerous, both "(qemu) mce 0 1 0xb200000000000000 0x0 0x0 0x0" (to
inject uncorrected error) and "echo c > /proc/sysrq-trigger" (to crash kernel)
will be dangerous as well.
(3) While this is implemented only for vhost-scsi-pci and vhost-vhost-pci, I do not see an issue for the host kernel. It would be a security bug if reading/writing an eventfd from userspace could crash kernel space.
(4) We primarily use this interface when the VM is running into an issue (unless we use it as a workaround).
Besides the above, I think it's only "safe" if we clearly define the semantics of this command. E.g.:
1) Does it work at the EventNotifier (eventfd) level or the virtio/vfio level?