[lsd0004] 03/03: Try to disentangle Bloom filters

gnunet-svn
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[lsd0004] 03/03: Try to disentangle Bloom filters

From:	gnunet
Subject:	[lsd0004] 03/03: Try to disentangle Bloom filters
Date:	Fri, 23 Dec 2022 12:36:24 +0100
This is an automated email from the git hooks/post-receive script.

martin-schanzenbach pushed a commit to branch master
in repository lsd0004.

commit 3f70faf531ebd101004f5b93e1c80ee1ec5bcdd1
Author: Martin Schanzenbach <schanzen@gnunet.org>
AuthorDate: Fri Dec 23 20:36:03 2022 +0900

    Try to disentangle Bloom filters
---
 draft-schanzen-r5n.xml | 194 +++++++++++++++++++++++++++++--------------------
 1 file changed, 115 insertions(+), 79 deletions(-)

diff --git a/draft-schanzen-r5n.xml b/draft-schanzen-r5n.xml
index 98c46f9..ab5478d 100644
--- a/draft-schanzen-r5n.xml
+++ b/draft-schanzen-r5n.xml
@@ -349,7 +349,7 @@ Connectivity | |Underlay|  |Underlay|
       </figure>
     </section>
     <section anchor="overlay" numbered="true" toc="default">
-      <name>Overlay operations</name>
+      <name>Overlay Operations</name>
       <t>
         An implementation of this specification commonly exposes the two 
overlay
         operations "GET" and "PUT".
@@ -358,9 +358,9 @@ Connectivity | |Underlay|  |Underlay|
         picture of the protocol.
       </t>
       <section>
-        <name>The GET procedure</name>
+        <name>GET operation</name>
         <t>
-          A basic GET procedure may be exposed as:
+          A basic GET operation interface may be exposed as:
         </t>
         <t>
           <tt>GET(Query-Key, Block-Type) -> Results as List</tt>
@@ -462,9 +462,9 @@ Connectivity | |Underlay|  |Underlay|
         </dl>
       </section>
       <section>
-        <name>The PUT procedure</name>
+        <name>PUT operation</name>
         <t>
-          A PUT procedure may be exposed as:
+          A PUT operation interface may be exposed as:
         </t>
         <t>
           <tt>PUT(Key, Block-Type, Block-Expiration, Block-Data)</tt>
@@ -773,47 +773,6 @@ bchar = *(ALPHA / DIGIT)
         </t>
       </section>
     </section>
-    <section anchor="bloom_filters" numbered="true" toc="default">
-      <name>Bloom Filters</name>
-      <t>
-       R<sup>5</sup>N uses Bloom filters in several places.  This section
-       gives some general background on Bloom filters and defines functions
-       on this data structure shared by the various use-cases in 
R<sup>5</sup>N.
-      </t>
-      <t>
-        A Bloom filter (BF) is a space-efficient probabilistic datastructure
-        to test if an element is part of a set of elements.
-        Elements are identified by an element ID.
-        Since a BF is a probabilistic datastructure, it is possible to have
-        false-positives: when asked if an element is in the set, the answer 
from
-        a BF is either "no" or "maybe".
-      </t>
-      <t>
-        Bloom filters are defined as a string of <tt>L</tt> bits always
-        initially empty, consisting only of zeroes.
-        There are two functions which can be invoked on the Bloom filter:
-        BF-SET(bf, e) and BF-TEST(bf, e) where "e" is an element which is to
-        be added to the Bloom filter or queried against the set.
-        The mapping function <tt>M</tt> is defined as follows:
-      </t>
-      <t>
-        The element <tt>e</tt> is hashed using SHA-512.
-        The resulting byte string is interpreted as a string of 16
-        32-bit integers in network byte order.
-      </t>
-      <t>
-        When adding an element to the Bloom filter <tt>bf</tt> using
-        <tt>BF-SET(bf,e)</tt>, each integer <tt>n</tt> of the mapping
-        <tt>M(e)</tt> is interpreted as a bit offset <tt>n mod L</tt> within
-        <tt>bf</tt> and set to 1.
-      </t>
-      <t>
-        When testing if an element may be in the Bloom filter <tt>bf</tt> using
-        <tt>BF-TEST(bf,e)</tt>, each bit offset <tt>n mod L</tt> within
-        <tt>bf</tt> <bcp14>MUST</bcp14> have been set to 1.
-        Otherwise, the element is not considered to be in the Bloom filter.
-      </t>
-    </section>
     <section anchor="routing" numbered="true" toc="default">
       <name>Routing</name>
       <t>
@@ -920,9 +879,19 @@ bchar = *(ALPHA / DIGIT)
           traversed peers when searching for the next hops in the routing 
table.
         </t>
         <t>
-          The peer Bloom filter is always a 128 byte field. The elements
-         hashed into the Bloom filter are the 32 byte peer IDs.  We note
-         that the peer Bloom filter may exclude peers due to false-postive
+          The peer Bloom filter (see <xref target="bloom_filters"/>) consists 
of
+          L=1024 buckets.
+          The mapping function <tt>M</tt> is defined as follows:
+        </t>
+        <t>
+          The element <tt>e</tt> is a 32 byte peer ID and is hashed using
+          SHA-512.
+          The resulting byte string is interpreted as a string of k=16
+          32-bit integers in network byte order which are used to set the 
bucket bits
+          accordingly.
+        </t>
+        <t>
+         We note that the peer Bloom filter may exclude peers due to 
false-postive
          matches.  This is acceptable as routing should nevertheless
          terminate (with high probability) in close vicinity of the key.
         </t>
@@ -2395,10 +2364,33 @@ gnunet+tcp://12.3.4.5/ \
           <dd>
          <t>
            The RESULT_FILTER for HELLO blocks is implemented using a
-      Bloom filter.
-    </t>
-     <figure anchor="hello_rf" title="The HELLO Block Result Filter.">
-        <artwork name="" type="" align="left" alt=""><![CDATA[
+            Bloom filter (see <xref target="bloom_filters"/>).
+            The size <tt>S = L/8</tt> of the Bloom filter in bytes depends on
+           the number of elements <tt>F</tt> known to be filtered at the
+           initiator. If <tt>F</tt> is zero, the size <tt>S</tt> is just 8 
(bytes).
+           Otherwise, <tt>S</tt> is set to the minimum of
+           2<sup>15</sup> and the lowest power
+           of 2 that is strictly larger than <tt>K*F/4</tt>
+           (in bytes).
+          </t>
+          <t>
+            The <tt>k</tt>-value for the HELLO_BF
+            Bloom filter is always 16.
+            <tt>k</tt> is never transmitted.
+            The elements used in the Bloom filter
+            consist of an XOR between the <tt>H_ADDRS</tt> field (as computed 
using
+            SHA-512 over the <tt>ADDRESSES</tt>) and the SHA-512
+            hash of the <tt>MUTATOR</tt> field from a given HELLO block.
+            The mapping function M(<tt>H_ADDRS XOR MUTATOR</tt>) is an identity
+            function and returns the 512-bit XOR result unmodified.
+            This resulting byte string is interpreted as k=16 32-bit
+            integers in network byte order and used to set or check the bucket 
bits
+            accordingly.
+            This function returns an empty Bloom filter of length <tt>S= 
L/8</tt> as
+            part of RF:
+          </t>
+          <figure anchor="hello_rf" title="The HELLO Block Result Filter.">
+            <artwork name="" type="" align="left" alt=""><![CDATA[
 0     8     16    24    32    40    48    56
 +-----+-----+-----+-----+-----+-----+-----+-----+
 |        MUTATOR        |  HELLO_BF             /
@@ -2407,7 +2399,6 @@ gnunet+tcp://12.3.4.5/ \
 +-----+-----+-----+-----+-----+-----+-----+-----+
 ]]></artwork>
           </figure>
-
           <t>where:</t>
           <dl>
             <dt>MUTATOR</dt>
@@ -2416,24 +2407,16 @@ gnunet+tcp://12.3.4.5/ \
             </dd>
             <dt>HELLO_BF</dt>
             <dd>
-              The <tt>K</tt>-value for the HELLO_BF Bloom filter
-                   is always 16.  The size <tt>S</tt> of the Bloom filter in 
bytes depends on
-                   the number of elements <tt>F</tt> known to be filtered at 
the
-                   initiator. If <tt>F</tt> is zero, the size <tt>S</tt> is 
just 8 (bytes).
-                   Otherwise, <tt>S</tt> is set to the minimum of
-                   2<sup>15</sup> and the lowest power
-                   of 2 that is strictly larger than <tt>K*F/4</tt>
-                   (in bytes).  The wire format of HELLO_BF is the resulting 
byte
-              array. In particular, <tt>K</tt> is never transmitted.
+              The Bloom filter byte array.
             </dd>
           </dl>
-<t>
-            R<sup>5</sup>N HELLO blocks use a <tt>MUTATOR</tt> value
-       to additionally "randomize" the computation of the bloom filter while 
remaining
-           deterministic across peers.  The 32-bit <tt>MUTATOR</tt>
-           value is set by the peer initiating the GET request, and changed
-           every time the GET request is repeated by the initiator. Peers
-           forwarding GET requests <bcp14>MUST</bcp14> not change the
+          <t>
+           The <tt>MUTATOR</tt> value is used
+            to additionally "randomize" the computation of the Bloom filter 
while
+            remaining deterministic across peers.
+            It is only ever set by the peer initiating the GET
+            request, and changed every time the GET request is repeated.
+            Peers forwarding GET requests <bcp14>MUST</bcp14> not change the
            mutator value included in the <tt>RESULT_FILTER</tt> as they might 
not
            be able to recalculate the result filter with a different 
<tt>MUTATOR</tt>
            value.
@@ -2457,12 +2440,9 @@ gnunet+tcp://12.3.4.5/ \
           </dd>
             <dt>FilterResult(Block, Key, RF, XQuery) -&gt; 
(FilterEvaluationResult, RF')</dt>
             <dd>
-              To filter results of HELLO blocks using the Bloom filter, the
-             <tt>H_ADDRS</tt> field (as computed using SHA-512 over
-             the <tt>ADDRESSES</tt>) and XORed with the SHA-512
-             hash of the <tt>MUTATOR</tt> (in network byte order).
-             The resulting value is then used when hashing into the
-             Bloom filter as described in <xref target="bloom_filters" />.
+             The <tt>H_ADDRS</tt> field is XORed with the SHA-512
+             hash of the <tt>MUTATOR</tt> field from the HELLO block and the 
resulting
+             value is checked against the Bloom filter in RF.
              Consequently, HELLOs with completely identical sets of
              addresses will be filtered and FILTER_DUPLICATE is returned.
              Any small variation in the set of addresses will cause the block
@@ -2833,8 +2813,64 @@ Type    | Name            | References | Description
         </front>
       </reference>
     </references>
-    <!-- Change Log
-      v00 2017-07-23  MS   Initial version
-    -->
+    <section anchor="bloom_filters" numbered="true" toc="default">
+      <name>Bloom filters in R<sup>5</sup>N</name>
+      <t>
+       R<sup>5</sup>N uses Bloom filters in several places.  This section
+       gives some general background on Bloom filters and defines functions
+       on this data structure shared by the various use-cases in 
R<sup>5</sup>N.
+      </t>
+      <t>
+        A Bloom filter (BF) is a space-efficient probabilistic datastructure
+        to test if an element is part of a set of elements.
+        Elements are identified by an element ID.
+        Since a BF is a probabilistic datastructure, it is possible to have
+        false-positives: when asked if an element is in the set, the answer 
from
+        a BF is either "no" or "maybe".
+      </t>
+      <t>
+        Bloom filters are defined as a string of <tt>L</tt> bits called 
"buckets".
+        The buckets are initially always empty, meaning that the bits are set 
to
+        zero.
+        There are two functions which can be invoked on the Bloom filter "bf":
+        BF-SET(bf, e) and BF-TEST(bf, e) where "e" is an element that is to
+        be added to the Bloom filter or queried against the set.
+      </t>
+      <t>
+        A mapping function M is used to map each ID of each element from the 
set to a
+        subset of k buckets.
+        In the original proposal by Bloom, M is non-injective and can thus map 
the same
+        element multiple times to the same bucket.
+        The type of the mapping function can thus be described by the following
+        mathematical notation:
+      </t>
+      <figure anchor="figure_bf_func" title="Bloom filter mapping function.">
+        <artwork><![CDATA[
+        ------------------------------------
+        # M: E->B^k
+        ------------------------------------
+        # L = Number of buckets
+        # B = 0,1,2,3,4,...L-1 (the buckets)
+        # k = Number of buckets per element
+        # E = Set of elements
+        ------------------------------------
+        Example: L=256, k=3
+        M('element-data') = {4,6,255}
+]]>
+        </artwork>
+      </figure>
+      <t>
+        When adding an element to the Bloom filter <tt>bf</tt> using
+        <tt>BF-SET(bf,e)</tt>, each integer <tt>n</tt> of the mapping
+        <tt>M(e)</tt> is interpreted as a bit offset <tt>n mod L</tt> within
+        <tt>bf</tt> and set to 1.
+      </t>
+      <t>
+        When testing if an element may be in the Bloom filter <tt>bf</tt> using
+        <tt>BF-TEST(bf,e)</tt>, each bit offset <tt>n mod L</tt> within
+        <tt>bf</tt> <bcp14>MUST</bcp14> have been set to 1.
+        Otherwise, the element is not considered to be in the Bloom filter.
+      </t>
+    </section>
   </back>
 </rfc>

-- 
To stop receiving notification emails like this one, please contact
gnunet@gnunet.org.
[Prev in Thread]
Current Thread
[Next in Thread]
[lsd0004] branch master updated (db89ead -> 3f70faf), gnunet, 2022/12/23
- [lsd0004] 02/03: More language, gnunet, 2022/12/23
- [lsd0004] 03/03: Try to disentangle Bloom filters, gnunet <=
- [lsd0004] 01/03: Language, gnunet, 2022/12/23
Prev by Date: [lsd0004] 02/03: More language
Next by Date: [lsd0004] 01/03: Language
Previous by thread: [lsd0004] 02/03: More language
Next by thread: [lsd0004] 01/03: Language
Index(es):
- Date
- Thread