From: Laszlo Ersek
Subject: Re: [PATCH] acpi: cpuhp: fix guest-visible maximum access size to the legacy reg block
Date: Wed, 1 Mar 2023 09:03:51 +0100

Hello Christian,

On 3/1/23 08:17, Christian Ehrhardt wrote:
> On Thu, Jan 5, 2023 at 8:14 AM Laszlo Ersek <lersek@redhat.com> wrote:
>>
>> On 1/4/23 13:35, Michael S. Tsirkin wrote:
>>> On Wed, Jan 04, 2023 at 10:01:38AM +0100, Laszlo Ersek wrote:
>>>> The modern ACPI CPU hotplug interface was introduced in the following
>>>> series (aa1dd39ca307..679dd1a957df), released in v2.7.0:
>>>>
>>>>   1  abd49bc2ed2f docs: update ACPI CPU hotplug spec with new protocol
>>>>   2  16bcab97eb9f pc: piix4/ich9: add 'cpu-hotplug-legacy' property
>>>>   3  5e1b5d93887b acpi: cpuhp: add CPU devices AML with _STA method
>>>>   4  ac35f13ba8f8 pc: acpi: introduce AcpiDeviceIfClass.madt_cpu hook
>>>>   5  d2238cb6781d acpi: cpuhp: implement hot-add parts of CPU hotplug
>>>>                   interface
>>>>   6  8872c25a26cc acpi: cpuhp: implement hot-remove parts of CPU hotplug
>>>>                   interface
>>>>   7  76623d00ae57 acpi: cpuhp: add cpu._OST handling
>>>>   8  679dd1a957df pc: use new CPU hotplug interface since 2.7 machine type
>>>>
> ...
>>
>> The solution to the riddle
> 
> Hi,
> just to add to this nicely convoluted case an FYI to everyone involved
> back then,
> the fix seems to have caused a regression [1] in - as far as I've
> found - an edge case.
> 
> [1]: https://gitlab.com/qemu-project/qemu/-/issues/1520

After reading the gitlab case, here's my theory on it:

- Without the patch applied, the CPU hotplug register block in QEMU is
broken. Effectively, it has *always* been broken; to put it differently,
you have most likely *never* seen a QEMU in which the CPU hotplug
register block was not broken. (A rough sketch of where the
guest-visible access size of such a register block is declared follows
after these points.) The reason is that the only QEMU release without
the breakage (as far as a guest could see it!) was v5.0.0, but the
breakage was exposed to the guest again as early as v5.1.0 (IOW, in the
5.* series, the first stable release already exposed the issue), and the
symptom has existed ever since (up to and including 7.2).

- With the register block broken, OVMF's multiprocessing is broken, and
the random chaos just happens to play out in a way that makes OVMF think
it's running on a uniprocessor system.

- With the register block *fixed* (commit dab30fbe applied), OVMF
actually boots up your VCPUs. With MT-TCG, this translates to as many
host-side VCPU threads running in your QEMU process as you have VCPUs.

- Furthermore, if your OVMF build includes the SMM driver stack, then
each UEFI variable update will require all VCPUs to enter SMM. All VCPUs
entering SMM is a "thundering herd" event, so it seriously spins up all
your host-side threads. (I assume the SMM-enabled binaries are what you
refer to as "signed OVMF cases" in the gitlab ticket.)

- If you overcommit the VCPUs (#vcpus > #pcpus), then your host-side
threads will be competing for PCPUs. On s390x, there is apparently some
bottleneck in QEMU's locking or in the host kernel or wherever else that
penalizes (#threads > #pcpus) heavily, while on other host arches, the
penalty is (apparently) not as severe.
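
As promised above, here is a rough illustration (NOT the actual cpuhp
code, and NOT the patch; the handler names are made up) of where an I/O
register block in QEMU declares the access sizes a guest may use. The
point is only that the guest-visible limit lives in MemoryRegionOps, and
that declaring it too small means wider guest accesses are rejected by
the memory core, so the firmware does not get the real register contents
back -- which is the kind of guest-visible breakage discussed above.
Within the QEMU tree this would compile against the real definitions in
"exec/memory.h":

/* Illustrative only -- not the actual cpuhp register block. */
#include "qemu/osdep.h"
#include "exec/memory.h"

static uint64_t cpuhp_example_read(void *opaque, hwaddr addr,
                                   unsigned size)
{
    return 0; /* placeholder */
}

static void cpuhp_example_write(void *opaque, hwaddr addr, uint64_t data,
                                unsigned size)
{
    /* placeholder */
}

static const MemoryRegionOps cpuhp_example_ops = {
    .read = cpuhp_example_read,
    .write = cpuhp_example_write,
    .endianness = DEVICE_LITTLE_ENDIAN,
    .valid = {
        /* Guest-visible constraint: accesses wider than
         * .max_access_size are refused by the memory core, so if the
         * guest firmware issues (say) 4-byte reads while this says 1,
         * the firmware never sees the real register contents.
         */
        .min_access_size = 1,
        .max_access_size = 4,
    },
    .impl = {
        /* The device model itself may still be written as 1-byte
         * callbacks; the memory core splits wider valid accesses.
         */
        .max_access_size = 1,
    },
};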

So, the QEMU fix actually "only exposes" the high penalty of the MT-TCG
VCPU thread overcommit that appears characteristic of s390x hosts.
You've not seen this symptom before because, regardless of how many
VCPUs you've specified in the past, OVMF never actually attempted to
bring them up, due to the hotplug regblock breakage "masking" the actual
VCPU counts (the present-at-boot VCPU count and the possible maximum
VCPU count).

Here's a test you could try: go back to QEMU v5.0.0 *precisely*, and try
to reproduce the symptom. I expect that it should reproduce.

Here's another test you can try: with latest QEMU, boot an x86 Linux
guest, but using SeaBIOS, not OVMF, on your s390x host. Then, in the
Linux guest, run as many busy loops (e.g. in the shell) as there are
VCPUs. Compare the behavior between #vcpus = #pcpus vs. #vcpus > #pcpus.
The idea here is of course to show that the impact of overcommitting x86
VCPUs on s390x is not specific to OVMF. Note that I don't *fully* expect
this test to confirm the expectation, because the guest workload will be
very different: in the Linux guest case, your VCPUs will not be
attempting to enter SMM *or* to access pflash, so the paths exercised in
QEMU will be very different. But, the test may still be worth a try.
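
If it helps, here is a trivial, hypothetical guest-side helper for that
comparison -- nothing more than one spinning thread per VCPU; the file
name and usage are of course made up:

/* busy.c - spawn one spinning thread per VCPU inside the Linux guest.
 * Build: gcc -O2 -pthread busy.c -o busy
 * Run:   ./busy <number-of-vcpus>
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static void *spin(void *arg)
{
    (void)arg;
    for (;;) {
        /* burn cycles; keep the compiler from optimizing the loop away */
        __asm__ volatile ("" ::: "memory");
    }
    return NULL;
}

int main(int argc, char **argv)
{
    long n = (argc > 1) ? strtol(argv[1], NULL, 10) : 1;
    pthread_t tid;

    if (n < 1) {
        n = 1;
    }
    for (long i = 0; i < n; i++) {
        if (pthread_create(&tid, NULL, spin, NULL) != 0) {
            perror("pthread_create");
            return 1;
        }
    }
    pthread_join(tid, NULL); /* never returns; keeps the process alive */
    return 0;
}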

Yet another test (or more like, information gathering): re-run the
problematic case, while printing the OVMF debug log (the x86 debug
console) to stdout, and visually determine at what part(s) the slowdown
hits. (I guess you can also feed the debug console log through some
timestamping utility like "logger".) I suspect it's going to be those
log sections that relate to SMM entry -- initial SMBASE relocation, and
then whenever UEFI variables are modified.
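
Purely as a hypothetical example of what I mean by a timestamping
utility: a few lines of C that prefix each debug console line with the
elapsed seconds, so the slow sections stand out. The QEMU invocation in
the comment is just one common way to route the x86 debug console to
stdout; adjust to your setup.

/* ts.c - prefix each input line with elapsed seconds.
 * Build: gcc -O2 ts.c -o ts
 * Use:   qemu-system-x86_64 ... \
 *          -global isa-debugcon.iobase=0x402 -debugcon stdio | ./ts
 */
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec start, now;
    char line[4096];

    clock_gettime(CLOCK_MONOTONIC, &start);
    while (fgets(line, sizeof(line), stdin)) {
        clock_gettime(CLOCK_MONOTONIC, &now);
        double t = (now.tv_sec - start.tv_sec) +
                   (now.tv_nsec - start.tv_nsec) / 1e9;
        printf("[%10.3f] %s", t, line);
        fflush(stdout);
    }
    return 0;
}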

Preliminary advice: don't overcommit VCPUs in the setup at hand, or else
please increase the timeout. :)

In edk2, a way to mitigate said "thundering herd" problem *supposedly*
exists (using unicast SMIs rather than broadcast ones), but that
configuration of the core SMM components in edk2 had always been
extremely unstable when built into OVMF *and* running on QEMU/KVM. So we
opted for broadcast SMIs (supporting which actually required some QEMU
patches). Broadcast SMIs generate larger spikes in host load, but
regarding guest functionality, they are much more stable/robust.

Laszlo



