From: Claudio Fontana
Subject: Re: [PATCH v3 10/14] migration/ram: Introduce 'fixed-ram' migration stream capability
Date: Mon, 20 Mar 2023 12:08:25 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.4.0
Added Fabiano to the thread,
Claudio
On 3/20/23 12:05, Claudio Fontana wrote:
> On 2/10/23 18:11, Daniel P. Berrangé wrote:
>> On Fri, Oct 28, 2022 at 01:39:10PM +0300, Nikolay Borisov wrote:
>>> Implement 'fixed-ram' feature. The core of the feature is to ensure that
>>> each ram page of the migration stream has a specific offset in the
>>> resulting migration stream. The reasons we'd want such behavior are
>>> twofold:
>>>
>>> - When doing a 'fixed-ram' migration the resulting file will have a
>>> bounded size, since pages which are dirtied multiple times will
>>> always go to a fixed location in the file, rather than constantly
>>> being added to a sequential stream. This eliminates cases where a VM
>>> with, say, 1G of RAM can result in a migration file that's tens of
>>> GBs if the workload constantly redirties memory.
>>>
>>> - It paves the way to implement DIO-enabled save/restore of the
>>> migration stream as the pages are ensured to be written at aligned
>>> offsets.
>>>
>>> The feature requires changing the stream format. First, a bitmap is
>>> introduced which tracks which pages have been written (i.e. are dirtied)
>>> during migration; it is subsequently written to the resulting file,
>>> again at a fixed location for every ramblock. Zero pages are ignored as
>>> they'd be zero on the destination as well. With the changed format the
>>> data looks like the following:
>>>
>>> |name len|name|used_len|pc*|bitmap_size|pages_offset|bitmap|pages|
>>>
>>> * pc - refers to the page_size/mr->addr members, so newly added members
>>> begin from "bitmap_size".
>>>
>>> This layout is initialized during ram_save_setup, so instead of a
>>> sequential stream of pages following the ramblock headers, the dirty
>>> pages for a ramblock follow its header. Since all pages have a fixed
>>> location, RAM_SAVE_FLAG_EOS is no longer generated on every migration
>>> iteration; there is effectively a single RAM_SAVE_FLAG_EOS right at
>>> the end.
>>>
>>> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
>>> ---
>>> include/exec/ramblock.h | 7 +++
>>> migration/migration.c | 51 +++++++++++++++++++++-
>>> migration/migration.h | 1 +
>>> migration/ram.c | 97 +++++++++++++++++++++++++++++++++--------
>>> migration/savevm.c | 1 +
>>> qapi/migration.json | 2 +-
>>> 6 files changed, 138 insertions(+), 21 deletions(-)
>>
>> This patch probably ought to extend the docs/devel/migration.rst
>> file. Specifically the text following the 'Stream structure'
>> heading.
>>
>>>
>>> diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
>>> index 6cbedf9e0c9a..30216c1a41d3 100644
>>> --- a/include/exec/ramblock.h
>>> +++ b/include/exec/ramblock.h
>>> @@ -43,6 +43,13 @@ struct RAMBlock {
>>> size_t page_size;
>>> /* dirty bitmap used during migration */
>>> unsigned long *bmap;
>>> + /* shadow dirty bitmap used when migrating to a file */
>>> + unsigned long *shadow_bmap;
>>> + /* offset in the file where pages belonging to this ramblock are saved,
>>> + * used only during migration to a file
>>> + */
>>> + off_t bitmap_offset;
>>> + uint64_t pages_offset;
>>> /* bitmap of already received pages in postcopy */
>>> unsigned long *receivedmap;
>>>
>>> diff --git a/migration/migration.c b/migration/migration.c
>>> index 11ceea340702..c7383845a5b4 100644
>>> --- a/migration/migration.c
>>> +++ b/migration/migration.c
>>> @@ -165,7 +165,8 @@
>>> INITIALIZE_MIGRATE_CAPS_SET(check_caps_background_snapshot,
>>> MIGRATION_CAPABILITY_XBZRLE,
>>> MIGRATION_CAPABILITY_X_COLO,
>>> MIGRATION_CAPABILITY_VALIDATE_UUID,
>>> - MIGRATION_CAPABILITY_ZERO_COPY_SEND);
>>> + MIGRATION_CAPABILITY_ZERO_COPY_SEND,
>>> + MIGRATION_CAPABILITY_FIXED_RAM);
>>>
>>> /* When we add fault tolerance, we could have several
>>> migrations at once. For now we don't need to add
>>> @@ -1326,6 +1327,27 @@ static bool migrate_caps_check(bool *cap_list,
>>> }
>>> #endif
>>>
>>> + if (cap_list[MIGRATION_CAPABILITY_FIXED_RAM]) {
>>> + if (cap_list[MIGRATION_CAPABILITY_MULTIFD]) {
>>> + error_setg(errp, "Directly mapped memory incompatible with multifd");
>>> + return false;
>>> + }
>>> +
>>> + if (cap_list[MIGRATION_CAPABILITY_XBZRLE]) {
>>> + error_setg(errp, "Directly mapped memory incompatible with xbzrle");
>>> + return false;
>>> + }
>>> +
>>> + if (cap_list[MIGRATION_CAPABILITY_COMPRESS]) {
>>> + error_setg(errp, "Directly mapped memory incompatible with compression");
>>> + return false;
>>> + }
>>> +
>>> + if (cap_list[MIGRATION_CAPABILITY_POSTCOPY_RAM]) {
>>> + error_setg(errp, "Directly mapped memory incompatible with postcopy ram");
>>> + return false;
>>> + }
>>> + }
>>>
>>> /* incoming side only */
>>> if (runstate_check(RUN_STATE_INMIGRATE) &&
>>> @@ -2630,6 +2652,11 @@ MultiFDCompression migrate_multifd_compression(void)
>>> return s->parameters.multifd_compression;
>>> }
>>>
>>> +int migrate_fixed_ram(void)
>>> +{
>>> + return migrate_get_current()->enabled_capabilities[MIGRATION_CAPABILITY_FIXED_RAM];
>>> +}
>>> +
>>> int migrate_multifd_zlib_level(void)
>>> {
>>> MigrationState *s;
>>> @@ -4190,6 +4217,21 @@ static void *bg_migration_thread(void *opaque)
>>> return NULL;
>>> }
>>>
>>> +static int
>>> +migrate_check_fixed_ram(MigrationState *s, Error **errp)
>>> +{
>>> + if (!s->enabled_capabilities[MIGRATION_CAPABILITY_FIXED_RAM])
>>> + return 0;
>>> +
>>> + if (!qemu_file_is_seekable(s->to_dst_file)) {
>>> + error_setg(errp, "Directly mapped memory requires a seekable transport");
>>> + return -1;
>>> + }
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +
>>> void migrate_fd_connect(MigrationState *s, Error *error_in)
>>> {
>>> Error *local_err = NULL;
>>> @@ -4265,6 +4307,12 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
>>> return;
>>> }
>>>
>>> + if (migrate_check_fixed_ram(s, &local_err) < 0) {
>>> + migrate_fd_cleanup(s);
>>> + migrate_fd_error(s, local_err);
>>> + return;
>>> + }
>>> +
>>> if (resume) {
>>> /* Wakeup the main migration thread to do the recovery */
>>> migrate_set_state(&s->state, MIGRATION_STATUS_POSTCOPY_PAUSED,
>>> @@ -4398,6 +4446,7 @@ static Property migration_properties[] = {
>>> DEFINE_PROP_STRING("tls-authz", MigrationState, parameters.tls_authz),
>>>
>>> /* Migration capabilities */
>>> + DEFINE_PROP_MIG_CAP("x-fixed-ram", MIGRATION_CAPABILITY_FIXED_RAM),
>>> DEFINE_PROP_MIG_CAP("x-xbzrle", MIGRATION_CAPABILITY_XBZRLE),
>>> DEFINE_PROP_MIG_CAP("x-rdma-pin-all", MIGRATION_CAPABILITY_RDMA_PIN_ALL),
>>> DEFINE_PROP_MIG_CAP("x-auto-converge", MIGRATION_CAPABILITY_AUTO_CONVERGE),
>>> diff --git a/migration/migration.h b/migration/migration.h
>>> index 96f27aba2210..9aab1b16f407 100644
>>> --- a/migration/migration.h
>>> +++ b/migration/migration.h
>>> @@ -410,6 +410,7 @@ bool migrate_zero_blocks(void);
>>> bool migrate_dirty_bitmaps(void);
>>> bool migrate_ignore_shared(void);
>>> bool migrate_validate_uuid(void);
>>> +int migrate_fixed_ram(void);
>>>
>>> bool migrate_auto_converge(void);
>>> bool migrate_use_multifd(void);
>>> diff --git a/migration/ram.c b/migration/ram.c
>>> index dc1de9ddbc68..4f5ddaff356b 100644
>>> --- a/migration/ram.c
>>> +++ b/migration/ram.c
>>> @@ -1261,9 +1261,14 @@ static int save_zero_page_to_file(RAMState *rs, QEMUFile *file,
>>> int len = 0;
>>>
>>> if (buffer_is_zero(p, TARGET_PAGE_SIZE)) {
>>> - len += save_page_header(rs, file, block, offset | RAM_SAVE_FLAG_ZERO);
>>> - qemu_put_byte(file, 0);
>>> - len += 1;
>>> + if (migrate_fixed_ram()) {
>>> + /* for zero pages we don't need to do anything */
>>> + len = 1;
>>> + } else {
>>> + len += save_page_header(rs, file, block, offset | RAM_SAVE_FLAG_ZERO);
>>> + qemu_put_byte(file, 0);
>>> + len += 1;
>>> + }
>>> ram_release_page(block->idstr, offset);
>>> }
>>> return len;
>>> @@ -1342,15 +1347,22 @@ static bool control_save_page(RAMState *rs, RAMBlock *block, ram_addr_t offset,
>>> static int save_normal_page(RAMState *rs, RAMBlock *block, ram_addr_t offset,
>>> uint8_t *buf, bool async)
>>> {
>>> - ram_transferred_add(save_page_header(rs, rs->f, block,
>>> - offset | RAM_SAVE_FLAG_PAGE));
>>> - if (async) {
>>> - qemu_put_buffer_async(rs->f, buf, TARGET_PAGE_SIZE,
>>> - migrate_release_ram() &&
>>> - migration_in_postcopy());
>>> - } else {
>>> - qemu_put_buffer(rs->f, buf, TARGET_PAGE_SIZE);
>>> - }
>>> +
>>> + if (migrate_fixed_ram()) {
>>> + qemu_put_buffer_at(rs->f, buf, TARGET_PAGE_SIZE,
>>> + block->pages_offset + offset);
>>> + set_bit(offset >> TARGET_PAGE_BITS, block->shadow_bmap);
>>> + } else {
>>> + ram_transferred_add(save_page_header(rs, rs->f, block,
>>> + offset | RAM_SAVE_FLAG_PAGE));
>>> + if (async) {
>>> + qemu_put_buffer_async(rs->f, buf, TARGET_PAGE_SIZE,
>>> + migrate_release_ram() &&
>>> + migration_in_postcopy());
>>> + } else {
>>> + qemu_put_buffer(rs->f, buf, TARGET_PAGE_SIZE);
>>> + }
>>> + }
>>> ram_transferred_add(TARGET_PAGE_SIZE);
>>> ram_counters.normal++;
>>> return 1;
>>> @@ -2683,6 +2695,8 @@ static void ram_save_cleanup(void *opaque)
>>> block->clear_bmap = NULL;
>>> g_free(block->bmap);
>>> block->bmap = NULL;
>>> + g_free(block->shadow_bmap);
>>> + block->shadow_bmap = NULL;
>>> }
>>>
>>> xbzrle_cleanup();
>>> @@ -3044,6 +3058,7 @@ static void ram_list_init_bitmaps(void)
>>> */
>>> block->bmap = bitmap_new(pages);
>>> bitmap_set(block->bmap, 0, pages);
>>> + block->shadow_bmap = bitmap_new(block->used_length >> TARGET_PAGE_BITS);
>>> block->clear_bmap_shift = shift;
>>> block->clear_bmap = bitmap_new(clear_bmap_size(pages, shift));
>>> }
>>> @@ -3226,12 +3241,34 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>>> qemu_put_buffer(f, (uint8_t *)block->idstr, strlen(block->idstr));
>>> qemu_put_be64(f, block->used_length);
>>> if (migrate_postcopy_ram() && block->page_size !=
>>> - qemu_host_page_size) {
>>> + qemu_host_page_size) {
>>> qemu_put_be64(f, block->page_size);
>>> }
>>> if (migrate_ignore_shared()) {
>>> qemu_put_be64(f, block->mr->addr);
>>> }
>>> +
>>> + if (migrate_fixed_ram()) {
>>> + long num_pages = block->used_length >> TARGET_PAGE_BITS;
>>> + long bitmap_size = BITS_TO_LONGS(num_pages) * sizeof(unsigned long);
>>> +
>>> +
>>> + /* Needed for external programs (think analyze-migration.py) */
>>> + qemu_put_be32(f, bitmap_size);
>>> +
>>> + /*
>>> + * Make pages offset aligned to TARGET_PAGE_SIZE to enable
>>> + * DIO in the future. Also add 8 to account for the page offset
>>> + * itself
>>> + */
>>> + block->bitmap_offset = qemu_get_offset(f) + 8;
>>> + block->pages_offset = ROUND_UP(block->bitmap_offset +
>>> + bitmap_size, TARGET_PAGE_SIZE);
>>
>> I'm not sure that rounding to TARGET_PAGE_SIZE is sufficient.
>>
>> If we've built QEMU for a 32-bit i686 target, but are running on a 64-bit
>> kernel, is TARGET_PAGE_SIZE going to offer sufficient alignment to keep
>> the kernel happy?
>>
>> Also O_DIRECT has alignment constraints beyond simply the page size. The
>> page size alignment is most important for the buffers being passed
>> back and forth. The on-disk alignment constraint is actually defined by
>> the filesystem and storage it is on, and on-disk alignment is what
>> matters when we decide on pages_offset.
>
> For comparison, in the libvirt iohelper the alignment used for O_DIRECT-friendly
> transfers is 64K, as per this libvirt commit:
>
> commit 12291656b135ed788e41dadbd2d15e613ddea9b5
> Author: Eric Blake <eblake@redhat.com>
> Date: Tue Jul 12 08:35:05 2011 -0600
>
> save: let iohelper work on O_DIRECT fds
>
> Required for a coming patch where iohelper will operate on O_DIRECT
> fds. There, the user-space memory must be aligned to file system
> boundaries (at least 512, but using page-aligned works better, and
> some file systems prefer 64k). Made tougher by the fact that
> VIR_ALLOC won't work on void *, but posix_memalign won't work on
> char * and isn't available everywhere.
>
> This patch makes some simplifying assumptions - namely, output
> to an O_DIRECT fd will only be attempted on an empty seekable
> file (hence, no need to worry about preserving existing data
> on a partial block, and ftruncate will work to undo the effects
> of having to round up the size of the last block written), and
> input from an O_DIRECT fd will only be attempted on a complete
> seekable file with the only possible short read at EOF.
>
> * configure.ac (AC_CHECK_FUNCS_ONCE): Check for posix_memalign.
> * src/util/iohelper.c (runIO): Use aligned memory, and handle
> quirks of O_DIRECT on last write.
>
>
>>
>> If we want to allow saved state to be copied across different filesystems/
>> hosts, we need to consider the worst case alignment that would exist on
>> any conceivable host deployment now or in the future.
>>
>> Given that RAM sizes are measured in 1000's of MBs in any scenario where
>> we care about save/restore speed enough to use the fixed-ram feature,
>> we can afford to waste a little space.
>>
>> IOW, we could round pages_offset up to the nearest 1 MB boundary,
>> which ought to be well enough aligned to cope with any constraint
>> we might imagine ? Or possibly even align to 4 MB, which is a common
>> size for small-ish huge pages.
>>
>>
>>> + qemu_put_be64(f, block->pages_offset);
>>> +
>>> + /* Now prepare offset for next ramblock */
>>> + qemu_set_offset(f, block->pages_offset + block->used_length, SEEK_SET);
>>> + }
>>> }
>>> }
>>>
>>> @@ -3249,6 +3286,17 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>>> return 0;
>>> }
>>>
>>> +static void ram_save_shadow_bmap(QEMUFile *f)
>>> +{
>>> + RAMBlock *block;
>>> +
>>> + RAMBLOCK_FOREACH_MIGRATABLE(block) {
>>> + long num_pages = block->used_length >> TARGET_PAGE_BITS;
>>> + long bitmap_size = BITS_TO_LONGS(num_pages) * sizeof(unsigned
>>> long);
>>> + qemu_put_buffer_at(f, (uint8_t *)block->shadow_bmap, bitmap_size, block->bitmap_offset);
>>> + }
>>> +}
>>> +
>>> /**
>>> * ram_save_iterate: iterative stage for migration
>>> *
>>> @@ -3358,9 +3406,15 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
>>> return ret;
>>> }
>>>
>>> - qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
>>> - qemu_fflush(f);
>>> - ram_transferred_add(8);
>>> + /*
>>> + * For fixed ram we don't want to pollute the migration stream with
>>> + * EOS flags.
>>> + */
>>> + if (!migrate_fixed_ram()) {
>>> + qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
>>> + qemu_fflush(f);
>>> + ram_transferred_add(8);
>>> + }
>>>
>>> ret = qemu_file_get_error(f);
>>> }
>>> @@ -3405,7 +3459,10 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
>>> pages = ram_find_and_save_block(rs);
>>> /* no more blocks to sent */
>>> if (pages == 0) {
>>> - break;
>>> + if (migrate_fixed_ram()) {
>>> + ram_save_shadow_bmap(f);
>>> + }
>>> + break;
>>> }
>>> if (pages < 0) {
>>> ret = pages;
>>> @@ -3428,8 +3485,10 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
>>> return ret;
>>> }
>>>
>>> - qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
>>> - qemu_fflush(f);
>>> + if (!migrate_fixed_ram()) {
>>> + qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
>>> + qemu_fflush(f);
>>> + }
>>>
>>> return 0;
>>> }
>>> diff --git a/migration/savevm.c b/migration/savevm.c
>>> index 44a222888306..847a8bdfb6ce 100644
>>> --- a/migration/savevm.c
>>> +++ b/migration/savevm.c
>>> @@ -240,6 +240,7 @@ static bool should_validate_capability(int capability)
>>> /* Validate only new capabilities to keep compatibility. */
>>> switch (capability) {
>>> case MIGRATION_CAPABILITY_X_IGNORE_SHARED:
>>> + case MIGRATION_CAPABILITY_FIXED_RAM:
>>> return true;
>>> default:
>>> return false;
>>> diff --git a/qapi/migration.json b/qapi/migration.json
>>> index 88ecf86ac876..6196617171e8 100644
>>> --- a/qapi/migration.json
>>> +++ b/qapi/migration.json
>>> @@ -485,7 +485,7 @@
>>> ##
>>> { 'enum': 'MigrationCapability',
>>> 'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
>>> - 'compress', 'events', 'postcopy-ram',
>>> + 'compress', 'events', 'postcopy-ram', 'fixed-ram',
>>> { 'name': 'x-colo', 'features': [ 'unstable' ] },
>>> 'release-ram',
>>> 'block', 'return-path', 'pause-before-switchover', 'multifd',
>>> --
>>> 2.34.1
>>>
>>
>> With regards,
>> Daniel
>
> Thanks,
>
> Claudio