[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH v6 10/13] migration: Respect postcopy request order in preemp
From: |
Dr. David Alan Gilbert |
Subject: |
Re: [PATCH v6 10/13] migration: Respect postcopy request order in preemption mode |
Date: |
Mon, 23 May 2022 11:56:14 +0100 |
User-agent: |
Mutt/2.2.1 (2022-02-19) |
* Peter Xu (peterx@redhat.com) wrote:
> With preemption mode on, when we see a postcopy request that was requesting
> for exactly the page that we have preempted before (so we've partially sent
> the page already via PRECOPY channel and it got preempted by another
> postcopy request), currently we drop the request so that after all the
> other postcopy requests are serviced then we'll go back to precopy stream
> and start to handle that.
>
> We dropped the request because we can't send it via postcopy channel since
> the precopy channel already contains partial of the data, and we can only
> send a huge page via one channel as a whole. We can't split a huge page
> into two channels.
>
> That's a very corner case and that works, but there's a change on the order
> of postcopy requests that we handle since we're postponing this (unlucky)
> postcopy request to be later than the other queued postcopy requests. The
> problem is there's a possibility that when the guest was very busy, the
> postcopy queue can be always non-empty, it means this dropped request will
> never be handled until the end of postcopy migration. So, there's a chance
> that there's one dest QEMU vcpu thread waiting for a page fault for an
> extremely long time just because it's unluckily accessing the specific page
> that was preempted before.
>
> The worst case time it needs can be as long as the whole postcopy migration
> procedure. It's extremely unlikely to happen, but when it happens it's not
> good.
>
> The root cause of this problem is because we treat pss->postcopy_requested
> variable as with two meanings bound together, as the variable shows:
>
> 1. Whether this page request is urgent, and,
> 2. Which channel we should use for this page request.
>
> With the old code, when we set postcopy_requested it means either both (1)
> and (2) are true, or both (1) and (2) are false. We can never have (1)
> and (2) to have different values.
>
> However it doesn't necessarily need to be like that. It's very legal that
> there's one request that has (1) very high urgency, but (2) we'd like to
> use the precopy channel. Just like the corner case we were discussing
> above.
>
> To differenciate the two meanings better, introduce a new field called
> postcopy_target_channel, showing which channel we should use for this page
> request, so as to cover the old meaning (2) only. Then we leave the
> postcopy_requested variable to stand only for meaning (1), which is the
> urgency of this page request.
>
> With this change, we can easily boost priority of a preempted precopy page
> as long as we know that page is also requested as a postcopy page. So with
> the new approach in get_queued_page() instead of dropping that request, we
> send it right away with the precopy channel so we get back the ordering of
> the page faults just like how they're requested on dest.
>
> Alongside, I touched up find_dirty_block() to only set the postcopy fields
> in the pss section if we're going through a postcopy migration. That's a
> very light optimization and shouldn't affect much.
>
> Reported-by: manish.mishra@nutanix.com
> Signed-off-by: Peter Xu <peterx@redhat.com>
So I think this is OK; getting a bit complicated!
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
> migration/ram.c | 69 +++++++++++++++++++++++++++++++++++++++----------
> 1 file changed, 55 insertions(+), 14 deletions(-)
>
> diff --git a/migration/ram.c b/migration/ram.c
> index 9d76db8491..fdcd9984fa 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -441,8 +441,28 @@ struct PageSearchStatus {
> unsigned long page;
> /* Set once we wrap around */
> bool complete_round;
> - /* Whether current page is explicitly requested by postcopy */
> + /*
> + * [POSTCOPY-ONLY] Whether current page is explicitly requested by
> + * postcopy. When set, the request is "urgent" because the dest QEMU
> + * threads are waiting for us.
> + */
> bool postcopy_requested;
> + /*
> + * [POSTCOPY-ONLY] The target channel to use to send current page.
> + *
> + * Note: This may _not_ match with the value in postcopy_requested
> + * above. Let's imagine the case where the postcopy request is exactly
> + * the page that we're sending in progress during precopy. In this case
> + * we'll have postcopy_requested set to true but the target channel
> + * will be the precopy channel (so that we don't split brain on that
> + * specific page since the precopy channel already contains partial of
> + * that page data).
> + *
> + * Besides that specific use case, postcopy_target_channel should
> + * always be equal to postcopy_requested, because by default we send
> + * postcopy pages via postcopy preempt channel.
> + */
> + bool postcopy_target_channel;
> };
> typedef struct PageSearchStatus PageSearchStatus;
>
> @@ -496,6 +516,9 @@ static QemuCond decomp_done_cond;
> static bool do_compress_ram_page(QEMUFile *f, z_stream *stream, RAMBlock
> *block,
> ram_addr_t offset, uint8_t *source_buf);
>
> +static void postcopy_preempt_restore(RAMState *rs, PageSearchStatus *pss,
> + bool postcopy_requested);
> +
> static void *do_data_compress(void *opaque)
> {
> CompressParam *param = opaque;
> @@ -1516,8 +1539,14 @@ retry:
> */
> static bool find_dirty_block(RAMState *rs, PageSearchStatus *pss, bool
> *again)
> {
> - /* This is not a postcopy requested page */
> - pss->postcopy_requested = false;
> + if (migration_in_postcopy()) {
> + /*
> + * This is not a postcopy requested page, mark it "not urgent", and
> + * use precopy channel to send it.
> + */
> + pss->postcopy_requested = false;
> + pss->postcopy_target_channel = RAM_CHANNEL_PRECOPY;
> + }
Do you need the 'if' here?
Dave
> pss->page = migration_bitmap_find_dirty(rs, pss->block, pss->page);
> if (pss->complete_round && pss->block == rs->last_seen_block &&
> @@ -2038,15 +2067,20 @@ static bool get_queued_page(RAMState *rs,
> PageSearchStatus *pss)
> RAMBlock *block;
> ram_addr_t offset;
>
> -again:
> block = unqueue_page(rs, &offset);
>
> if (block) {
> /* See comment above postcopy_preempted_contains() */
> if (postcopy_preempted_contains(rs, block, offset)) {
> trace_postcopy_preempt_hit(block->idstr, offset);
> - /* This request is dropped */
> - goto again;
> + /*
> + * If what we preempted previously was exactly what we're
> + * requesting right now, restore the preempted precopy
> + * immediately, boosting its priority as it's requested by
> + * postcopy.
> + */
> + postcopy_preempt_restore(rs, pss, true);
> + return true;
> }
> } else {
> /*
> @@ -2070,7 +2104,9 @@ again:
> * really rare.
> */
> pss->complete_round = false;
> + /* Mark it an urgent request, meanwhile using POSTCOPY channel */
> pss->postcopy_requested = true;
> + pss->postcopy_target_channel = RAM_CHANNEL_POSTCOPY;
> }
>
> return !!block;
> @@ -2324,7 +2360,8 @@ static bool postcopy_preempt_triggered(RAMState *rs)
> return rs->postcopy_preempt_state.preempted;
> }
>
> -static void postcopy_preempt_restore(RAMState *rs, PageSearchStatus *pss)
> +static void postcopy_preempt_restore(RAMState *rs, PageSearchStatus *pss,
> + bool postcopy_requested)
> {
> PostcopyPreemptState *state = &rs->postcopy_preempt_state;
>
> @@ -2332,8 +2369,15 @@ static void postcopy_preempt_restore(RAMState *rs,
> PageSearchStatus *pss)
>
> pss->block = state->ram_block;
> pss->page = state->ram_page;
> - /* This is not a postcopy request but restoring previous precopy */
> - pss->postcopy_requested = false;
> +
> + /* Whether this is a postcopy request? */
> + pss->postcopy_requested = postcopy_requested;
> + /*
> + * When restoring a preempted page, the old data resides in PRECOPY
> + * slow channel, even if postcopy_requested is set. So always use
> + * PRECOPY channel here.
> + */
> + pss->postcopy_target_channel = RAM_CHANNEL_PRECOPY;
>
> trace_postcopy_preempt_restored(pss->block->idstr, pss->page);
>
> @@ -2344,12 +2388,9 @@ static void postcopy_preempt_restore(RAMState *rs,
> PageSearchStatus *pss)
> static void postcopy_preempt_choose_channel(RAMState *rs, PageSearchStatus
> *pss)
> {
> MigrationState *s = migrate_get_current();
> - unsigned int channel;
> + unsigned int channel = pss->postcopy_target_channel;
> QEMUFile *next;
>
> - channel = pss->postcopy_requested ?
> - RAM_CHANNEL_POSTCOPY : RAM_CHANNEL_PRECOPY;
> -
> if (channel != rs->postcopy_channel) {
> if (channel == RAM_CHANNEL_PRECOPY) {
> next = s->to_dst_file;
> @@ -2505,7 +2546,7 @@ static int ram_find_and_save_block(RAMState *rs)
> * preempted precopy. Otherwise find the next dirty bit.
> */
> if (postcopy_preempt_triggered(rs)) {
> - postcopy_preempt_restore(rs, &pss);
> + postcopy_preempt_restore(rs, &pss, false);
> found = true;
> } else {
> /* priority queue empty, so just search for something dirty
> */
> --
> 2.32.0
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
- [PATCH v6 11/13] tests: Add postcopy tls migration test, (continued)
- [PATCH v6 11/13] tests: Add postcopy tls migration test, Peter Xu, 2022/05/17
- [PATCH v6 12/13] tests: Add postcopy tls recovery migration test, Peter Xu, 2022/05/17
- [PATCH v6 13/13] tests: Add postcopy preempt tests, Peter Xu, 2022/05/17
- [PATCH v6 06/13] migration: Add property x-postcopy-preempt-break-huge, Peter Xu, 2022/05/17
- [PATCH v6 04/13] migration: Postcopy recover with preempt enabled, Peter Xu, 2022/05/17
- [PATCH v6 10/13] migration: Respect postcopy request order in preemption mode, Peter Xu, 2022/05/17
- Re: [PATCH v6 10/13] migration: Respect postcopy request order in preemption mode,
Dr. David Alan Gilbert <=