gnunet-developers
Re: Encoding for Robust Immutable Storage (ERIS)


From: Martin Schanzenbach
Subject: Re: Encoding for Robust Immutable Storage (ERIS)
Date: Fri, 11 Dec 2020 09:52:39 +0900
User-agent: Evolution 3.38.1 (3.38.1-1.fc33)

Hi pukkamustard,

I think this is very valuable and we should implement this in GNUnet.
Also, an LSD for it would be great, as grothoff said, especially since
you already started a technical specification.
From my brief glance over it, the spec is missing some crucial
information regarding wire formats that is needed to ensure
implementations are interoperable.

My question to you would be: Would you try to implement it in GNUnet
as well? Could you provide an RFC-style LSD document with test vectors
so that somebody else may be able to pick it up?

I would be happy to assist with both but, like grothoff, I currently
have other things on my plate.
We should definitely make sure this does not get lost on the mailing
list.

BR
Martin

On Mon, 2020-12-07 at 18:12 +0100, pukkamustard wrote:
> Hello Christian,
> Hello GNUnet,
> 
> Thanks again for the extremely valuable feedback on the initial
> version of ERIS.
> 
> I'd like to request feedback on a second version of the encoding:
> ERIS v0.2.0: http://purl.org/eris
> 
> The major change compared to the initial version from June: It is
> basically ECRS.
> 
> I have become convinced that the functionality offered by the
> "verification capability" - verify the integrity of all blocks
> without being able to decode content - can be implemented with a
> synchronization algorithm for blocks. Removing the verification
> capability from the encoding itself simplifies the encoding and
> increases the performance of the encoding process.
> 
> The differences to ECRS are (see also
> http://purl.org/eris#_previous_work):
> 
> - Use Blake2b/ChaCha20 and allow a "convergence secret"
> - Block size: ERIS allows a block size of either 1 KiB or 32 KiB.
>   The variety of use-cases (file sharing vs. robust storage of tiny
>   pieces of data) seems to make this necessary. For both use cases
>   this seems better than the 4 KiB compromise.
> - URN: A URN is defined independent of applications using the
>   encoding.
> - No namespace mechanism: This can be implemented with things such
>   as GNS.
> 
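
A minimal sketch of what a "convergence secret" could look like in
practice, assuming (this is not confirmed by the thread) that the
per-block key is derived as a keyed Blake2b hash of the plaintext
block, with the convergence secret as the hash key. The helper name
block_key and the sample values are hypothetical. The ChaCha20
encryption step is left out because it is not part of the Python
standard library.

```python
import hashlib


def block_key(block: bytes, convergence_secret: bytes) -> bytes:
    """Hypothetical helper: derive a 32-byte key from a plaintext block.

    Assumption: the key is the keyed Blake2b-256 hash of the block,
    keyed with the convergence secret.
    """
    return hashlib.blake2b(
        block, digest_size=32, key=convergence_secret
    ).digest()


# A 1 KiB sample block (arbitrary content, for illustration only).
block = bytes(range(256)) * 4

secret_a = bytes(32)       # hypothetical shared/public secret (all zeros)
secret_b = b"\x01" * 32    # hypothetical private secret

key_a1 = block_key(block, secret_a)
key_a2 = block_key(block, secret_a)
key_b = block_key(block, secret_b)

print("same secret, same key:     ", key_a1 == key_a2)
print("different secret, same key:", key_a1 == key_b)
```

Under this assumption, identical content encoded with the same secret
yields the same key (and thus deduplicates), while a different secret
yields an unrelated key.
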
> Other reasons for not just referring to the ECRS paper:
> 
> - Concise specification of the encoding. E.g. the ECRS paper does
>   not define the cryptographic primitives used or a URN.
> - Include test vectors
> 
> The hope is that a wide variety of applications can use ERIS-encoded
> content over a variety of transport and storage layers. Some
> third-party implementations (not by me) are already starting to pop
> up (http://purl.org/eris/#_implementations).
> 
> I'd be very happy for your insight, feedback and opinions on whether
> ERIS might find a place in the GNUnet filesharing application.
> 
> Thanks!
> -pukkamustard
> 
> 
> Christian Grothoff <grothoff@gnunet.org> writes:
> 
> > On 7/26/20 7:28 PM, pukkamustard wrote:
> > > 
> > > Hello Christian,
> > > 
> > > Thank you for your comments!
> > > 
> > > > For my taste, the block size is much too small. I understand 4k
> > > > can make sense for page tables and SATA, but looking at
> > > > benchmarks 4k is still too small to maximize SATA throughput. I
> > > > would also worry about 4k for a request size in any database or
> > > > network protocol. The overheads per request are still too big
> > > > for modern hardware.  You could easily go to 8k, which could be
> > > > justified with 9k jumbo frames for Ethernet and would at least
> > > > also utilize all of the bits in your paths.  The 32k of ECRS are
> > > > close to the 64k which are reportedly the optimum for modern M.2
> > > > media. IIRC Torrents even use 256k.
> > > 
> > > I agree that increasing block size makes sense for improving
> > > performance in storage and transport.
> > > 
> > > > The overhead from padding may be large for very small files if
> > > > you go beyond 4k, but you should also think in terms of absolute
> > > > overhead: even a 3100% overhead doesn't change the fact that the
> > > > absolute overhead is tiny for a 1k file.
> > > 
> > > The use-case I have in mind for ERIS is very small pieces of data
> > > (not even small files). Examples include ActivityStreams objects
> > > or OpenStreetMap nodes.
> > 
> > Ah, that's a different use case than file-sharing, so different
> > trade-offs certainly apply here.
> > 
> > > Apparently the average size of individual ActivityStreams objects
> > > is less than 1kB (unfortunately I don't have the data to back this
> > > up).
> > > 
> > > I agree that the overhead of 3100% for a single 1kB object is
> > > acceptable. But I would argue that an overhead of 3100% for very
> > > many 1kB objects is not. The difference might be a 32 GB database
> > > instead of a 1 GB database.
> > 
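
The 3100% figure and the 32 GB vs. 1 GB comparison can be reproduced
with a small calculation. This is a simplified sketch: it assumes each
object is padded up to a whole number of blocks and ignores any
padding marker bytes and tree/metadata blocks of the real encoding.
The function names are hypothetical.

```python
def padded_size(payload: int, block_size: int) -> int:
    """Bytes stored when the payload is padded to whole blocks."""
    blocks = max(1, -(-payload // block_size))  # ceiling division
    return blocks * block_size


def overhead_percent(payload: int, block_size: int) -> float:
    """Padding overhead relative to the payload size, in percent."""
    stored = padded_size(payload, block_size)
    return 100 * (stored - payload) / payload


KIB = 1024
GIB = 1024 ** 3

# One 1 KiB object under the block sizes discussed in the thread.
for block_size in (1 * KIB, 4 * KIB, 32 * KIB):
    pct = overhead_percent(1 * KIB, block_size)
    print(f"{block_size // KIB:>2} KiB blocks: {pct:.0f}% overhead")

# 2**20 objects of 1 KiB each: 1 GiB of payload in total.
objects = 2 ** 20
total_small = objects * padded_size(1 * KIB, 1 * KIB) / GIB
total_large = objects * padded_size(1 * KIB, 32 * KIB) / GIB
print(f"{total_small:.0f} GiB with 1 KiB blocks, "
      f"{total_large:.0f} GiB with 32 KiB blocks")
```
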
> > Sure, the only question is if it might not in this case make sense
> > to combine the tiny objects into larger ones, like merging all OSM
> > nodes in a region into one larger download. But of course, it again
> > depends on the use case you are shooting for.
> > 
> > > > Furthermore, you should consider a trick we use in GNUnet-FS,
> > > > which is that we share *directories*, and for small files, we
> > > > simply _inline_ the full file data in the meta data of the file
> > > > that is stored with the directory or search result. So you can
> > > > basically avoid having to ever download tiny files as separate
> > > > entities, so for files <32k we have zero overhead this way.
> > > 
> > > That makes a lot of sense.
> > > 
> > > But packing multiple objects into a single transport packet or
> > > grouping for storage on disk/in database works for small block
> > > sizes as well. The optimization just happens at a "different
> > > layer".
> > > 
> > > The key value I see in having small block sizes is that tiny
> > > pieces of data can be individually referenced and used (securely).
> > 
> > Sure, if that's your only use case, 4k could make sense.
> > 
> > > > I'd be curious to see how much the two pass encoding costs in
> > > > practice -- it might be less expensive than ECRS if you are
> > > > lucky (hashing one big block being cheaper than many small hash
> > > > operations), or much more expensive if you are unlucky (have to
> > > > actually read the data twice from disk). I am not sure that it
> > > > is worth it merely to reduce the number of hashes/keys in the
> > > > non-data blocks. Would be good to have some data on this, for
> > > > various file sizes and platforms (to judge IO/RAM caching
> > > > effects).  As I said, I can't tell for sure if the 2nd pass is
> > > > virtually free or quite expensive -- and that is an important
> > > > detail. Especially with a larger block size, the overhead of an
> > > > extra key in the non-data blocks could be quite acceptable.
> > > 
> > > I think the cost of the two-pass encoding in ERIS is quite high.
> > > Considering that the hash of the individual blocks also needs to
> > > be computed (as reference in parent nodes), I think ECRS will
> > > always win performance-wise.
> > > 
> > > Maybe the answer is not ECRS or ERIS but ECRS and ERIS. ECRS for
> > > large pieces of data, where it makes more sense to have large
> > > block size and single-pass encoding. And ERIS for (very many)
> > > small pieces of data where a 3100% overhead is too much but the
> > > performance penalty is acceptable and size of data is much smaller
> > > than memory.
> > > 
> > > There might be some heuristic that says: If data is larger than
> > > 2MB use ECRS, else use ERIS and you get the verification
> > > capability.
> > > 
> > > If using ECRS, you can add the verification capability by
> > > encoding a list of all the hash references to the ECRS blocks
> > > with ERIS. The ERIS read capability of this list of ECRS blocks is
> > > enough to verify the integrity of the original ECRS encoded
> > > content (without revealing the content).
> > > 
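
The idea of verifying encrypted blocks against a list of hash
references, without ever decrypting them, can be sketched as follows.
This is an illustration only: the function name find_bad_blocks, the
dictionary used as a block store, and the choice of SHA-512 as the
hash function are assumptions, and the blocks are stand-in byte
strings rather than real ECRS blocks.

```python
import hashlib
from typing import Callable, Dict, List


def find_bad_blocks(
    references: List[bytes],
    store: Dict[bytes, bytes],
    hash_fn: Callable = hashlib.sha512,
) -> List[bytes]:
    """Return the references whose block is missing or does not hash
    to the reference. Only ciphertext is hashed; no key is needed."""
    bad = []
    for ref in references:
        block = store.get(ref)
        if block is None or hash_fn(block).digest() != ref:
            bad.append(ref)
    return bad


# Stand-in encrypted blocks and the list of their hash references.
blocks = [b"ciphertext block 0", b"ciphertext block 1", b"ciphertext block 2"]
references = [hashlib.sha512(b).digest() for b in blocks]

# An intact store, and a copy in which one block has been corrupted.
store_ok = dict(zip(references, blocks))
store_tampered = dict(store_ok)
store_tampered[references[1]] = b"tampered"

print("bad blocks (intact store):  ", len(find_bad_blocks(references, store_ok)))
print("bad blocks (tampered store):", len(find_bad_blocks(references, store_tampered)))
```

Anyone holding the list of references can run this check, which is
why access to the list alone is enough for integrity verification.
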
> > > What do you think?
> > 
> > I don't know how important the verification capability is in
> > practice, or how much the block size trade-offs are relevant (vs.
> > grouping tiny objects into larger ones). If we can avoid
> > proliferating encodings and find one that fits all important use
> > cases, that would be ideal. I would not be _opposed_ to adopting
> > ERIS in GNUnet (even considering the possible increase in encoding
> > cost), _except_ for the tiny block size (which I know would be
> > terrible for our use-case).
> > 
> > > > For 3.4 Namespaces, I would urge you to look at the GNU Name
> > > > System (GNS). My plan is to (eventually, when I have way too
> > > > much time and could actually re-do FS...) replace SBLOCKS and
> > > > KBLOCKS of ECRS with basically only GNS.
> > > 
> > > I have been looking into it. It does seem to be a perfect
> > > application of GNS.
> > > 
> > > The crypto is way above my head and using readily available and
> > > already implemented primitives would make implementation much
> > > easier for me. But I understand the need for "non-standard" crypto
> > > and am following the ongoing discussions.
> > 
> > Great. Feel free to chime in or ask questions. Right now, we're
> > hoping to find the time to update the draft based on the feedback
> > already received, but of course constructive feedback is always
> > welcome.
> > 
> > Cheers!
> > 
> > Christian
> 
> 
