From: Markus Armbruster
Subject: Re: [PATCH v2] monitor/qmp: fix race on CHR_EVENT_CLOSED without OOB
Date: Thu, 08 Apr 2021 16:10:31 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux)

Thomas Lamprecht <t.lamprecht@proxmox.com> writes:

> On 08.04.21 14:49, Markus Armbruster wrote:
>> Kevin Wolf <kwolf@redhat.com> writes:
>>> On 08.04.2021 at 11:21, Markus Armbruster wrote:
>>>> Should this go into 6.0?
>>>
>>> This is something that the responsible maintainer needs to decide.
>> 
>> Yes, and that's me.  I'm soliciting opinions.
>> 
>>> If it helps you with the decision, and if I understand correctly, it is
>>> a regression from 5.1, but was already broken in 5.2.
>> 
>> It helps.
>> 
>> Even more helpful would be a risk assessment: what's the risk of
>> applying this patch now vs. delaying it?
>
> Stefan is on vacation this week, but I can share some information; maybe it
> helps.
>
>> 
>> If I understand Stefan correctly, Proxmox observed VM hangs.  How
>> frequent are these hangs?  Did they result in data corruption?
>
>
> They were not highly frequent, but frequent enough to get a bit over a dozen
> reports in our forum, which normally means something is off but it's limited
> to certain HW, storage tech used, or load patterns.
>
> We initially had a hard time reproducing this, but a user finally managed to
> send us a backtrace of a hanging VM, and with that information we could pin
> it down well enough, and Stefan came up with a good reproducer (see v1 of
> this patch).

Excellent work, props!

> We didn't get any reports of actual data corruption due to this, but the VM
> hangs completely, so a user killing it could theoretically produce some; but
> only for programs running in the guest that were not made power-loss safe
> anyway...
>
>> 
>> How confident do we feel about the fix?
>> 
>
> I cannot comment from a technical POV, but I can share the feedback we got
> with it.
>
> Some context about reach:
> We have rolled the fix out to all repository stages that already had a build
> of 5.2; that gives a reach of about 100k to 300k installations. We only have
> some rough stats about the sites that access the repository daily, so we
> cannot really tell who actually updated to the new version, but there are
> some quite update-happy people in the community. With that in mind, and my
> experience with the feedback loop of rolling out updates, I'd figure ~25k is
> a lower bound one can assume without going out on a limb.
>
> Positive feedback from users:
> We got some positive feedback from people who had run into this at least
> once per week, confirming that the issue is fixed. In total almost a dozen
> users reported improvements, a good chunk of them among those who reported
> the problem in the first place.
>
> Mixed feedback:
> We had one user who reported still getting QMP timeouts, but whose VMs did
> not hang anymore (could be high load or the like). Only one user reported
> that it did not help; we are still investigating there. They have quite high
> CPU pressure stats, so it may actually be another issue, but we cannot tell
> for sure yet.
>
> Negative feedback:
> We had no users reporting new or worse problems in that direction, at least
> as far as I'm aware.
>
> Note: we do not currently use OOB, so the above does not speak for the OOB
> case at all.

Thanks!



