Re: [PATCH v2] monitor/qmp: fix race on CHR_EVENT_CLOSED without OOB
From: Markus Armbruster
Subject: Re: [PATCH v2] monitor/qmp: fix race on CHR_EVENT_CLOSED without OOB
Date: Thu, 08 Apr 2021 16:10:31 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux)
Thomas Lamprecht <t.lamprecht@proxmox.com> writes:
> On 08.04.21 14:49, Markus Armbruster wrote:
>> Kevin Wolf <kwolf@redhat.com> writes:
>>> Am 08.04.2021 um 11:21 hat Markus Armbruster geschrieben:
>>>> Should this go into 6.0?
>>>
>>> This is something that the responsible maintainer needs to decide.
>>
>> Yes, and that's me. I'm soliciting opinions.
>>
>>> If it helps you with the decision, and if I understand correctly, it is
>>> a regression from 5.1, but was already broken in 5.2.
>>
>> It helps.
>>
>> Even more helpful would be a risk assessment: what's the risk of
>> applying this patch now vs. delaying it?
>
> Stefan is on vacation this week, but I can share some information, maybe it
> helps.
>
>>
>> If I understand Stefan correctly, Proxmox observed VM hangs. How
>> frequent are these hangs? Did they result in data corruption?
>
>
> They were not highly frequent, but frequent enough to yield a bit over a
> dozen reports in our forum, which normally means something is off but
> limited to certain hardware, storage tech, or load patterns.
>
> We initially had a hard time reproducing this, but a user finally sent us
> a backtrace of a hanging VM, and with that information we could pin it
> down far enough that Stefan came up with a good reproducer (see v1 of
> this patch).
Excellent work, props!
> We didn't get any reports of actual data corruption due to this, but the
> VM hangs completely, so a user killing it could theoretically produce
> corruption; but only for programs running in the guest that were not made
> power-loss safe anyway...
>
>>
>> How confident do we feel about the fix?
>>
>
> I cannot comment from a technical POV, but I can share the feedback we
> got.
>
> Some context about reach:
> We have rolled the fix out to all repository stages that already had a
> build of 5.2, which has a reach of about 100k to 300k installations. We
> only have some rough stats about the sites that access the repository
> daily and cannot really tell who actually updated to the new version, but
> there are some quite update-happy people in the community, so with that
> in mind and my experience of the feedback loop of rolling out updates,
> I'd figure a lower bound one can assume without going out on a limb is
> ~25k.
>
> Positive feedback from users:
> We got positive feedback from people who had run into this at least once
> per week, reporting that the issue is fixed. In total almost a dozen
> users reported improvements, a good chunk of them among those who
> reported the problem in the first place.
>
> Mixed feedback:
> We had one user who still reported QMP timeouts, but whose VMs no longer
> hung (could be high load or the like). Only one user reported that it did
> not help; we are still investigating there. They have quite high CPU
> pressure stats, so it may actually be another issue, but we cannot tell
> for sure yet.
>
> Negative feedback:
> We had no users reporting new or worse problems in that direction, at
> least from what I'm aware of.
>
> Note, we do not use OOB currently, so the above does not speak for the
> OOB case at all.
Thanks!