[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Efficiency and flexibility of hash-tables

From: Mikael Djurfeldt
Subject: Re: Efficiency and flexibility of hash-tables
Date: Mon, 10 Feb 2003 17:52:32 +0100
User-agent: Gnus/5.090008 (Oort Gnus v0.08) Emacs/21.2 (i386-pc-linux-gnu)

Roland Orre <address@hidden> writes:

(To Roland: Ledsen att jag inte hunnit svara tidigare.)

> My idea was to modify the routines 
> scm_c_make_hash_table in hashtab.c,
> scm_make_weak_key_hash_table and scm_make_weak_value_hash_table in 
> weaks.c so they allocate three extra items in the mallocated block
> but hides this to the gc by setting the actual vector length to the
> ordinary one. These extra items will contain the actual size in
> number of items stored in the table and current upper and lower
> thresholds for resizing the table.
> Then modify the routines:
> scm_hash_fn_create_handle_x and scm_hash_fn_remove_x in hashtab.c
> to do resizing when needed, using a table of prime numbers for next
> and previous size.
> The advantage with this solution is that it's minimalistic, i.e. all 
> routines working with hash tables will behave exactly the same. Only
> the routines changing the number of items in the table will be affected.


> Comments?

Firstly: People have suggested that we should re-use other
implementations of rehashing (resizable, adaptive, ...) hash-tables.

However, the kind of hash-tables we're using in Guile is basic stuff
described in any computer science textbook.  When I asked for a patch,
I asked for the Guile implemention.

Also, hash-tables are used throughout Guile so we need them in the
core, not as a loadable library.

Now a comment on Roland's suggestion:

If we want to store extra information in the hash-table object, in the
way you suggest, that information will be hidden from the user unless
we provide special selector functions which can extract it.  But this
defeats the very purpose of representing a hash-table as a simple
Scheme type, and we could just as well provide a "real" hash-table
object, isn't that so?

Also, how would a hash-function, when receiving a hash-table
represented as a vector, know that it safely can access memory
locations after it's last index?

I also think this would obfuscate the source.

What about this even more minimalistic solution:

The idea of having an upper threshold comes from the idea of
maintaining the load factor of the hash table in a reasonable
interval.  But the purpose of the load factor is to prevent to large
clusters forming in the same hash bucket.  What about simply counting
the pairs in a bucket during insertion and rehashing when we reach a
threshold?  Then no extra data needs to be stored in the table.  (We
would need to globally serialize rehashing with an external mutex,

One disadvantage of this solution is that since we're putting a hard
limit on the maximal length of a bucket list, the number of buckets
will be sensitive to things causing collisions, such as bad hash
functions or bad luck.

Another disadvantage is that the tables wouldn't shrink.  Is this an


reply via email to

[Prev in Thread] Current Thread [Next in Thread]