qemu-block
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

RE: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Na


From: Dmitry Fomichev
Subject: RE: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set
Date: Tue, 29 Sep 2020 15:43:21 +0000

> -----Original Message-----
> From: Qemu-block <qemu-block-
> bounces+dmitry.fomichev=wdc.com@nongnu.org> On Behalf Of Klaus
> Jensen
> Sent: Tuesday, September 29, 2020 6:47 AM
> To: Damien Le Moal <Damien.LeMoal@wdc.com>
> Cc: Fam Zheng <fam@euphon.net>; Kevin Wolf <kwolf@redhat.com>; qemu-
> block@nongnu.org; Niklas Cassel <Niklas.Cassel@wdc.com>; Klaus Jensen
> <k.jensen@samsung.com>; qemu-devel@nongnu.org; Alistair Francis
> <Alistair.Francis@wdc.com>; Keith Busch <kbusch@kernel.org>; Philippe
> Mathieu-Daudé <philmd@redhat.com>; Matias Bjorling
> <Matias.Bjorling@wdc.com>
> Subject: Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types
> and Zoned Namespace Command Set
> 
> On Sep 28 22:54, Damien Le Moal wrote:
> > On 2020/09/29 6:25, Keith Busch wrote:
> > > On Mon, Sep 28, 2020 at 08:36:48AM +0200, Klaus Jensen wrote:
> > >> On Sep 28 02:33, Dmitry Fomichev wrote:
> > >>> You are making it sound like the entire WDC series relies on this
> approach.
> > >>> Actually, the persistency is introduced in the second to last patch in 
> > >>> the
> > >>> series and it only adds a couple of lines of code in the i/o path to 
> > >>> mark
> > >>> zones dirty. This is possible because of using mmap() and I find the way
> > >>> it is done to be quite elegant, not ugly :)
> > >>>
> > >>
> > >> No, I understand that your implementation works fine without
> > >> persistance, but persistance is key. That is why my series adds it in
> > >> the first patch. Without persistence it is just a toy. And the QEMU
> > >> device is not just an "NVMe-version" of null_blk.
> > >
> > > I really think we should be a bit more cautious of commiting to an
> > > on-disk format for the persistent state. Both this and Klaus' persistent
> > > state feels a bit ad-hoc, and with all the other knobs provided, it
> > > looks too easy to have out-of-sync states, or just not being able to
> > > boot at all if a qemu versions have different on-disk formats.
> > >
> > > Is anyone really considering zone emulation for production level stuff
> > > anyway? I can't imagine a real scenario where you'd want put yourself
> > > through that: you are just giving yourself all the downsides of a zoned
> > > block device and none of the benefits. AFAIK, this is provided as a
> > > development vehicle, closer to a "toy".
> > >
> > > I think we should consider trimming this down to a more minimal set that
> > > we *do* agree on and commit for inclusion ASAP. We can iterate all the
> > > bells & whistles and flush out the meta data's data marshalling scheme
> > > for persistence later.
> >
> > +1 on this. Removing the persistence also removes the debate on
> endianess. With
> > that out of the way, it should be straightforward to get agreement on a
> series
> > that can be merged quickly to get developers started with testing ZNS
> software
> > with QEMU. That is the most important goal here. 5.9 is around the corner,
> we
> > need something for people to get started with ZNS quickly.
> >
> 
> Wait. What. No. Stop!
> 
> It is unmistakably clear that you are invalidating my arguments about
> portability and endianness issues by suggesting that we just remove
> persistent state and deal with it later, but persistence is the killer
> feature that sets the QEMU emulated device apart from other emulation
> options. It is not about using emulation in production (because yeah,
> why would you?), but persistence is what makes it possible to develop
> and test "zoned FTLs" or something that requires recovery at power up.
> This is what allows testing of how your host software deals with opened
> zones being transitioned to FULL on power up and the persistent tracking
> of LBA allocation (in my series) can be used to properly test error
> recovery if you lost state in the app.
> 
> Please, work with me on this instead of just removing such an essential
> feature. Since persistence seems to be the only thing we are really
> discussing, we should have plenty of time until the soft-freeze to come
> up with a proper solution on that.
> 
> I agree that my version had a format that was pretty ad-hoc and that
> won't fly - it needs magic and version capabilities like in Dmitry's
> series, which incidentially looks a lot like what we did in the
> OpenChannel implementation, so I agree with the strategy.

Are you insinuating that I somehow took stuff from OCSSD code and try
to claim priority this way? I am not at all that familiar with that code.
And I've already sent you the link to tcmu-runner code that served me
as an inspiration for implementing persistence in WDC patchset.
That code has been around for years, uses mmap, works great and has
nothing to do with you.

> 
> ZNS-wise, the only thing my implementation stores is the zone
> descriptors (in spec-native little-endian format) and the zone
> descriptor extensions. So there are no endian issues with those. The
> allocation tracking bitmap is always stored in little endian, but
> converted to big-endian if running on a big-endian host.
> 
> Let me just conjure something up.
> 
>     #define NVME_PSTATE_MAGIC ...
>     #define NVME_PSTATE_V1    1
> 
>     typedef struct NvmePstateHeader {
>         uint32_t magic;
>         uint32_t version;
> 
>         uint64_t blk_len;
> 
>         uint8_t  lbads;
>         uint8_t  iocs;
> 
>         uint8_t  rsvd18[3054];
> 
>         struct {
>             uint64_t zsze;
>             uint8_t  zdes;
>         } QEMU_PACKED zns;
> 
>         uint8_t  rsvd3089[1007];
>     } QEMU_PACKED NvmePstateHeader;
> 

Why conjure something that already exists in WDC patchset? And that part
has been published in the very first version of our patches, weeks before
your entire ZNS series was posted. Add an rsvd[] here and there and that code
can be as suitable to achieve the stated goals as what you have above.

> With such a header we have all we need. We can bail out if any
> parameters do not match and similar to nvme data structures it contains
> reserved areas for future use. I'll be posting a v2 with this. If this
> still feels too ad-hoc, we can be inspired by QCOW2 and the "extension"
> feature.
> 
> I can agree that we drop other optional features like zone excursions
> and the reset/finish recommended limit simulation, but PLEASE DO NOT
> remove persistence and upstream a half-baked version when we are so
> close and have time to get it right.

One important thing IMO is to reduce future need for metadata versioning.
This requires a really good effort to design and review the proper metadata
format that would become stable for some time. Think about portability.
To get out something "conjured up" now and then have to move to V2
metadata in the next release is even worse than no persistence at all.
So maybe is makes sense to go with Keith's suggestion.

reply via email to

[Prev in Thread] Current Thread [Next in Thread]