Re: [PATCH v1 0/1] COLO: migrate dirty ram pages before colo checkpoint


From: Derek Su
Subject: Re: [PATCH v1 0/1] COLO: migrate dirty ram pages before colo checkpoint
Date: Thu, 13 Aug 2020 18:27:32 +0800

On Fri, Jul 31, 2020 at 3:52 PM Lukas Straub <lukasstraub2@web.de> wrote:
>
> On Sun, 21 Jun 2020 10:10:03 +0800
> Derek Su <dereksu@qnap.com> wrote:
>
> > This series reduces the guest's downtime during a COLO checkpoint
> > by migrating as many dirty RAM pages as possible before the checkpoint.
> >
> > If the iteration count reaches COLO_RAM_MIGRATE_ITERATION_MAX or the
> > pending RAM size drops below 'x-colo-migrate-ram-threshold', stop the
> > RAM migration and do the COLO checkpoint.
> >
> > Test environment:
> > Both the primary VM and the secondary VM have 1 GiB of RAM and a
> > 10GbE NIC for FT traffic.
> > One fio buffered write job runs in the guest.
> > The result shows that the total primary VM downtime is reduced by ~40%.
> >
> > Please help review it; suggestions are welcome.
> > Thanks.
>
> Hello Derek,
> Sorry for the late reply.
> I think this is not a good idea, because it unnecessarily introduces a delay
> between the checkpoint request and the checkpoint itself and thus hurts
> network-bound workloads through increased network latency. Workloads that are
> independent of the network don't cause many checkpoints anyway, so it doesn't
> help there either.
>

Hello, Lukas & Zhanghailiang

Thanks for your opinions.
I went through my patch again, and I am a little confused, so I would
like to dig into it more.
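
To recap how the patch is meant to behave, the pre-checkpoint phase is
essentially the loop below. This is only a simplified sketch, not the code
from the series: the iteration limit and the 'x-colo-migrate-ram-threshold'
parameter come from the cover letter, while the helper names and the
threshold field name are my assumptions based on the upstream migration code.

```
/*
 * Sketch of the pre-checkpoint RAM migration loop (simplified).
 * COLO_RAM_MIGRATE_ITERATION_MAX and the threshold parameter are from
 * the cover letter; the qemu_savevm_state_* helpers and the parameter
 * field name are assumptions, not necessarily what the series uses.
 */
static void colo_migrate_ram_before_checkpoint(MigrationState *s)
{
    uint64_t threshold = s->parameters.x_colo_migrate_ram_threshold;
    int iteration = 0;

    while (iteration++ < COLO_RAM_MIGRATE_ITERATION_MAX) {
        uint64_t pend_pre = 0, pend_compat = 0, pend_post = 0;

        /* How much dirty RAM is still waiting to be sent? */
        qemu_savevm_state_pending(s->to_dst_file, 0,
                                  &pend_pre, &pend_compat, &pend_post);

        /* Stop early once the remaining dirty RAM is small enough. */
        if (pend_pre + pend_compat <= threshold) {
            break;
        }

        /* Send another round of dirty pages; the guests keep running. */
        qemu_savevm_state_iterate(s->to_dst_file, false);
    }
}
```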

In this patch, colo_migrate_ram_before_checkpoint() runs before
COLO_MESSAGE_CHECKPOINT_REQUEST is sent, so the SVM and PVM should not
enter the paused state during that phase.
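
In other words, with this patch the start of the checkpoint transaction
looks roughly like the sketch below (heavily simplified; error handling and
the rest of colo_do_checkpoint_transaction() are omitted, and the exact
signatures may differ from the real code):

```
static int colo_do_checkpoint_transaction(MigrationState *s,
                                          QIOChannelBuffer *bioc,
                                          QEMUFile *fb)
{
    Error *local_err = NULL;

    /*
     * New step: push as many dirty pages as possible while the PVM and
     * SVM are still running and the COLO proxy keeps comparing packets.
     */
    colo_migrate_ram_before_checkpoint(s);

    /*
     * Only after that is the checkpoint announced to the SVM; the VMs
     * are paused later in the transaction, not at this point.
     */
    colo_send_message(s->to_dst_file, COLO_MESSAGE_CHECKPOINT_REQUEST,
                      &local_err);
    if (local_err) {
        error_report_err(local_err);
        return -EINVAL;
    }

    /* ... remainder of the checkpoint transaction is unchanged ... */
    return 0;
}
```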

Meanwhile, the packets to the PVM/SVM can still be compared, and an
inconsistency is reported if they mismatch, right?
Is it really possible for this to introduce extra network latency?

In my test (random writes to disk by fio with direct=0), the ping times
from another client to the PVM, with generic COLO and with COLO using
this patch, are shown below.
The network latency does not increase, as I expected.

Generic COLO:
```
64 bytes from 192.168.80.18: icmp_seq=87 ttl=64 time=28.109 ms
64 bytes from 192.168.80.18: icmp_seq=88 ttl=64 time=16.747 ms
64 bytes from 192.168.80.18: icmp_seq=89 ttl=64 time=2388.779 ms
<----checkpoint start
64 bytes from 192.168.80.18: icmp_seq=90 ttl=64 time=1385.792 ms
64 bytes from 192.168.80.18: icmp_seq=91 ttl=64 time=384.896 ms
<----checkpoint end
64 bytes from 192.168.80.18: icmp_seq=92 ttl=64 time=3.895 ms
64 bytes from 192.168.80.18: icmp_seq=93 ttl=64 time=1.020 ms
64 bytes from 192.168.80.18: icmp_seq=94 ttl=64 time=0.865 ms
64 bytes from 192.168.80.18: icmp_seq=95 ttl=64 time=0.854 ms
64 bytes from 192.168.80.18: icmp_seq=96 ttl=64 time=28.359 ms
64 bytes from 192.168.80.18: icmp_seq=97 ttl=64 time=12.309 ms
64 bytes from 192.168.80.18: icmp_seq=98 ttl=64 time=0.870 ms
64 bytes from 192.168.80.18: icmp_seq=99 ttl=64 time=2371.733 ms
64 bytes from 192.168.80.18: icmp_seq=100 ttl=64 time=1371.440 ms
64 bytes from 192.168.80.18: icmp_seq=101 ttl=64 time=366.414 ms
64 bytes from 192.168.80.18: icmp_seq=102 ttl=64 time=0.818 ms
64 bytes from 192.168.80.18: icmp_seq=103 ttl=64 time=0.997 ms
```

COLO with this patch:
```
64 bytes from 192.168.80.18: icmp_seq=72 ttl=64 time=1.417 ms
64 bytes from 192.168.80.18: icmp_seq=73 ttl=64 time=0.931 ms
64 bytes from 192.168.80.18: icmp_seq=74 ttl=64 time=0.876 ms
64 bytes from 192.168.80.18: icmp_seq=75 ttl=64 time=1184.034 ms
<----checkpoint start
64 bytes from 192.168.80.18: icmp_seq=76 ttl=64 time=181.297 ms
<----checkpoint end
64 bytes from 192.168.80.18: icmp_seq=77 ttl=64 time=0.865 ms
64 bytes from 192.168.80.18: icmp_seq=78 ttl=64 time=0.858 ms
64 bytes from 192.168.80.18: icmp_seq=79 ttl=64 time=1.247 ms
64 bytes from 192.168.80.18: icmp_seq=80 ttl=64 time=0.946 ms
64 bytes from 192.168.80.18: icmp_seq=81 ttl=64 time=0.855 ms
64 bytes from 192.168.80.18: icmp_seq=82 ttl=64 time=0.868 ms
64 bytes from 192.168.80.18: icmp_seq=83 ttl=64 time=0.749 ms
64 bytes from 192.168.80.18: icmp_seq=84 ttl=64 time=2.154 ms
64 bytes from 192.168.80.18: icmp_seq=85 ttl=64 time=1499.186 ms
64 bytes from 192.168.80.18: icmp_seq=86 ttl=64 time=496.173 ms
64 bytes from 192.168.80.18: icmp_seq=87 ttl=64 time=0.854 ms
64 bytes from 192.168.80.18: icmp_seq=88 ttl=64 time=0.774 ms
```

Thank you.

Regards,
Derek

> Hailang did have a patch to migrate RAM between checkpoints, which should
> help all workloads, but it wasn't merged back then. I think you can pick it
> up again, rebase it, and address David's and Eric's comments:
> https://lore.kernel.org/qemu-devel/20200217012049.22988-3-zhang.zhanghailiang@huawei.com/T/#u
>
> Hailang, are you ok with that?
>
> Regards,
> Lukas Straub


