[RFC] Adding a real HashTable implementation to gnulib

bug-gnulib

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[RFC] Adding a real HashTable implementation to gnulib

From:	Tim Rühsen
Subject:	[RFC] Adding a real HashTable implementation to gnulib
Date:	Tue, 4 Dec 2018 20:40:41 +0100
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.3.1

Hi Bruno,

thank you so much for taking a look :-)

(sorry, unencrypted to the ML now)

On 04.12.18 03:32, Bruno Haible wrote:
> Hi Tim,
> 
>> We have 'hashmap' [1] in GNU Wget2.
> 
> Things I like about the implementation:
> 
>   - Collision resolution through linked list (a robust algorithm).
> 
>   - Reasonably generic API.
> 
>   - The user-definable destructor functions.
> 
> Things that could be improved:
> 
>   - The ability to set a growth policy > 0. This leads to quadratic
>     run time behaviour. I mean, you make such an effort to have O(1)
>     access on average, and then the fixed-size increment of the size
>     turns the insertion operation into an average-O(n) operation.

In Wget2 we use reasonable default sizes for the hashmap to make
resizing more or less a corner case.

Currently we have
- positive integer (N): resize by old value * N
- negative integer (N): resize by old value + (-N)

How can we do it better ?
- using a positive floating point value N ?
- for flexibility, allow to set a callback function to "calculate" the
new size ?

Or anything else ?

>   - Some functions take keysize and valuesize arguments, some don't.
>     I'm confused.

The "noalloc" functions just store the given pointers for key and value,
that means without allocation. The hashmap takes ownership of the
objects and has to release them when being removed/destroyed (or use a
NULL destructor when the values are static). This can be used to
generate a search index for an existing bunch of values (objects /
structures). Like a "unique index" for a field in a database table.
Example function: wget_hashmap_put_noalloc()

Then we have the functions that do (shallow) allocations/clones, like
wget_hashmap_put(). This is for using the hashmap as a container, the
original objects have to be (shallow) released outside of the container.

In Wget2 we use both types of functions.

Of course we could also have a function callback for cloning deeply,
maybe one for the keys and one for the values. Does that make sense ?

>   - NULL values are special. The documentation says "If there are NULL
>     values in the stringmap, you should use wget_stringmap_get_null() to
>     distinguish between 'not found' and 'found NULL value'." I would
>     prefer an API that does not require me to think about whether I have
>     null values in the map or not.

Agreed, it had some historic reasons that I currently don't remember.
I'll review that functions.

>   - To iterate through the table, you need to define an instance of the
>     wget_hashmap_browse_t function. I love functional programming, but
>     C is not the right language for that. If the inner part (the body
>     of the 'browse' function) needs to access outer variables, the
>     programmer needs to manually create a "context" struct and fill it -
>     things that the compiler would be doing in other programming languages.
>     Some people say that the solution to this problem are the nested
>     functions supported by GCC, but
>       1. This is not portable; only GCC supports this.
>       2. On many CPUs (including x86_64), the use of nested functions
>          requires to disable on the entire process a security feature
>          (namely, stacks without execute permission).
>     I therefore prefer the concept of "iterator" objects that allow
>     the programmer to step through the hash table without contorting
>     their code and without compromising on security.

Nested functions are really nice, I used to use them a lot. But the
concept of needing executable stack memory is truly crap. As you say,
the callback functions needs a context variable. But you don't have to
use it, e.g. when you just need to set a variable for all objects in the
hashmap. In that case a NULL value is allowed.

Of if you want to print out all objects into a file, your 'context'
could just be a FILE * or a file descriptor.

We have several use cases in Wget2 for wget_hashmap_browse() which makes
code more readable than using an explicit iterator.

Still I agree that for a general purpose hashmap, there must be an
iterator type/object + functions. I will add such and report back.

Regards, Tim

signature.asc
Description: OpenPGP digital signature

[Prev in Thread]

Current Thread

[Next in Thread]

Re: [RFC] Adding a real HashTable implementation to gnulib, Bruno Haible, 2018/12/02
- Re: [RFC] Adding a real HashTable implementation to gnulib, LRN, 2018/12/02
- Re: [RFC] Adding a real HashTable implementation to gnulib, Tim Rühsen, 2018/12/03
  - Re: [RFC] Adding a real HashTable implementation to gnulib, Bruno Haible, 2018/12/03
    - Re: [RFC] Adding a real HashTable implementation to gnulib, Tim Rühsen, 2018/12/04
    - [RFC] Adding a real HashTable implementation to gnulib, Tim Rühsen <=
    - Re: [RFC] Adding a real HashTable implementation to gnulib, Bruno Haible, 2018/12/04

Prev by Date: Re: [RFC] Adding a real HashTable implementation to gnulib
Next by Date: Re: [PATCH v2 2/2] maint.mk: Replace grep with $(GREP)
Previous by thread: Re: [RFC] Adding a real HashTable implementation to gnulib
Next by thread: Re: [RFC] Adding a real HashTable implementation to gnulib
Index(es):
- Date
- Thread