qemu-block
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Na


From: Klaus Jensen
Subject: Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set
Date: Tue, 29 Sep 2020 12:46:33 +0200

On Sep 28 22:54, Damien Le Moal wrote:
> On 2020/09/29 6:25, Keith Busch wrote:
> > On Mon, Sep 28, 2020 at 08:36:48AM +0200, Klaus Jensen wrote:
> >> On Sep 28 02:33, Dmitry Fomichev wrote:
> >>> You are making it sound like the entire WDC series relies on this 
> >>> approach.
> >>> Actually, the persistency is introduced in the second to last patch in the
> >>> series and it only adds a couple of lines of code in the i/o path to mark
> >>> zones dirty. This is possible because of using mmap() and I find the way
> >>> it is done to be quite elegant, not ugly :)
> >>>
> >>
> >> No, I understand that your implementation works fine without
> >> persistance, but persistance is key. That is why my series adds it in
> >> the first patch. Without persistence it is just a toy. And the QEMU
> >> device is not just an "NVMe-version" of null_blk.
> > 
> > I really think we should be a bit more cautious of commiting to an
> > on-disk format for the persistent state. Both this and Klaus' persistent
> > state feels a bit ad-hoc, and with all the other knobs provided, it
> > looks too easy to have out-of-sync states, or just not being able to
> > boot at all if a qemu versions have different on-disk formats.
> > 
> > Is anyone really considering zone emulation for production level stuff
> > anyway? I can't imagine a real scenario where you'd want put yourself
> > through that: you are just giving yourself all the downsides of a zoned
> > block device and none of the benefits. AFAIK, this is provided as a
> > development vehicle, closer to a "toy".
> > 
> > I think we should consider trimming this down to a more minimal set that
> > we *do* agree on and commit for inclusion ASAP. We can iterate all the
> > bells & whistles and flush out the meta data's data marshalling scheme
> > for persistence later.
> 
> +1 on this. Removing the persistence also removes the debate on endianess. 
> With
> that out of the way, it should be straightforward to get agreement on a series
> that can be merged quickly to get developers started with testing ZNS software
> with QEMU. That is the most important goal here. 5.9 is around the corner, we
> need something for people to get started with ZNS quickly.
> 

Wait. What. No. Stop!

It is unmistakably clear that you are invalidating my arguments about
portability and endianness issues by suggesting that we just remove
persistent state and deal with it later, but persistence is the killer
feature that sets the QEMU emulated device apart from other emulation
options. It is not about using emulation in production (because yeah,
why would you?), but persistence is what makes it possible to develop
and test "zoned FTLs" or something that requires recovery at power up.
This is what allows testing of how your host software deals with opened
zones being transitioned to FULL on power up and the persistent tracking
of LBA allocation (in my series) can be used to properly test error
recovery if you lost state in the app.

Please, work with me on this instead of just removing such an essential
feature. Since persistence seems to be the only thing we are really
discussing, we should have plenty of time until the soft-freeze to come
up with a proper solution on that.

I agree that my version had a format that was pretty ad-hoc and that
won't fly - it needs magic and version capabilities like in Dmitry's
series, which incidentially looks a lot like what we did in the
OpenChannel implementation, so I agree with the strategy.

ZNS-wise, the only thing my implementation stores is the zone
descriptors (in spec-native little-endian format) and the zone
descriptor extensions. So there are no endian issues with those. The
allocation tracking bitmap is always stored in little endian, but
converted to big-endian if running on a big-endian host.

Let me just conjure something up.

    #define NVME_PSTATE_MAGIC ...
    #define NVME_PSTATE_V1    1

    typedef struct NvmePstateHeader {
        uint32_t magic;
        uint32_t version;

        uint64_t blk_len;

        uint8_t  lbads;
        uint8_t  iocs;

        uint8_t  rsvd18[3054];

        struct {
            uint64_t zsze;
            uint8_t  zdes;
        } QEMU_PACKED zns;

        uint8_t  rsvd3089[1007];
    } QEMU_PACKED NvmePstateHeader;

With such a header we have all we need. We can bail out if any
parameters do not match and similar to nvme data structures it contains
reserved areas for future use. I'll be posting a v2 with this. If this
still feels too ad-hoc, we can be inspired by QCOW2 and the "extension"
feature.

I can agree that we drop other optional features like zone excursions
and the reset/finish recommended limit simulation, but PLEASE DO NOT
remove persistence and upstream a half-baked version when we are so
close and have time to get it right.

Attachment: signature.asc
Description: PGP signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]