Hash Tables¶
Traditionally Scheme had no support for hash tables other than implementation-specific methods. There are several SRFIs to bridge that problem, notably SRFI-69, Basic Hash Tables.
Even SRFI-69 acknowledges that Scheme implementations almost certainly have some internal hash table functionality if only to support the notions of an environment (mapping variable names to values). Whilst SRFI-69 is not reliant on that – it uses vectors – you’re left with a sense of “to implement hash tables first implement hash tables.”
In the meanwhile I read around the subject of hash tables whereon I realised that I should have paid a bit more attention in school as open and closed addressing and coalesced hashing and any number of other concepts appeared before me.
Of course, dizzy with excitement about the prospects, I made a terrible mistake: I chose the wrong design. I picked on coalesced hashing as it seemed quite neat, which it is. What I didn’t do is read or internalise the simple sentence under Performance: deletion may be hard.
I have had no end of trouble with my implementation. Eventually I gave up and scrapped it in favour of the traditional bucket and chain mechanism.
Features¶
Dynamic¶
Hash tables are, of course, dynamic and will grow (and possibly shrink) over time.
Sentinel Values¶
We need some sentinel values:
#n
is the not-a-key sentinel valueIn other words you can’t use
#n
as a key in a hash.
Hashing and Equivalence¶
We should have (and, indeed, require) user-defined hashing and equivalence functions.
By default the hashing and equivalence functions are written in C, as you might expect. Internally, via some C macros, various hash tables are created with one of:
IDIO_HASH_EQP (size) IDIO_HASH_EQVP (size) IDIO_HASH_EQUALP (size)
which (should!) give the impression of setting the equivalence
function to (the C implementation of) eq?
, eqv?
or
equal?
and all of them using the default C hashing
function.
You can access those C functions from Idio by
passing the equivalent symbol as the equality function when you
create a hash table with make-hash [equiv-func [hash-func
[size]]]
. So make-hash 'eq? #n 10
is the Idio
version of IDIO_HASH_EQP (10)
in C – the #n
for
the hashing function means use the C default.
As a slight distraction, make-hash eq? #n 10
(ie. the function
value for eq?
is substituted as opposed to the symbol 'eq?
)
will ultimately perform the same thing except will require the
relatively costly invocation of the Idio primitive eq?
from C
However, for both the hashing and equivalence functions you can pass your own noting that:
the hashing function takes a single argument and should return a positive integer
the equivalence function should take two arguments and return
#f
or some other value
String Keys¶
Or, rather, C string keys. This is handy for symbols which are created from C strings but otherwise are not Idio values. We need a special flag to say “don’t treat this as an Idio value.”
Weak Keys¶
This is an important concept to avoid circular loops. It is also buggy, I’m sure.
The problem is for the garbage collector. Under normal circumstances, the GC iterates over the known root values and ticks off everything it sees. Anything left over is no longer referenced by anything in the program and can be “collected.”
However, many subsystems like to keep track of objects they have seen. For example, there is a properties hash table associating properties with values – the signature string associated with a function value, for example.
To prevent the properties table itself being deleted we need to add it to the list of known roots and with the key into the properties table being the function value that means that the function value can never be collected – something is always referring to it.
Hence the concept of weak keys which are an instruction to the GC to not flag the key as having been “seen” from the root. You can (read: must) flag the value associated with the key as being seen (otherwise we’d garbage collect the properties themselves!).
Subsequently, if the function value is still ultimately referred to from a root value then all is well and if the function value is no longer referred to from a root value then we don’t keep it around artificially.
This leaves a problem, however, in that the (now garbage collected) key is still in the properties hash. We need some special trickery in the GC to remove it from the properties hash at the time of collection (and, in a later cycle, collect the values previously associated with the key).
Implementation¶
How big a hash table?
We won’t be having negative indexes into a hash table so the type is
going to be a size_t
. We’ll call it an “idio hash index type” or
idio_hi_t
.
typedef size_t idio_hi_t;
typedef struct idio_hash_entry_s {
struct idio_hash_entry_s *next;
struct idio_s *key;
struct idio_s *value;
} idio_hash_entry_t;
#define IDIO_HASH_HE_NEXT(HE) ((HE)->next)
#define IDIO_HASH_HE_KEY(HE) ((HE)->key)
#define IDIO_HASH_HE_VALUE(HE) ((HE)->value)
#define IDIO_HASH_FLAG_NONE 0
#define IDIO_HASH_FLAG_STRING_KEYS (1<<0)
#define IDIO_HASH_FLAG_WEAK_KEYS (1<<1)
typedef struct idio_hash_s {
struct idio_s *grey;
idio_hi_t size;
idio_hi_t mask; /* bitmask for easy modulo arithmetic */
idio_hi_t count; /* (key) count */
int (*comp_C) (void *k1, void *k2); /* C equivalence function */
idio_hi_t (*hash_C) (struct idio_s *h, void *k); /* C hashing function */
struct idio_s *comp; /* user-supplied comparator */
struct idio_s *hash; /* user-supplied hashing function */
idio_hash_entry_t* *ha; /* a C array */
} idio_hash_t;
#define IDIO_HASH_GREY(H) ((H)->u.hash->grey)
#define IDIO_HASH_SIZE(H) ((H)->u.hash->size)
#define IDIO_HASH_MASK(H) ((H)->u.hash->mask)
#define IDIO_HASH_COUNT(H) ((H)->u.hash->count)
#define IDIO_HASH_COMP_C(H) ((H)->u.hash->comp_C)
#define IDIO_HASH_HASH_C(H) ((H)->u.hash->hash_C)
#define IDIO_HASH_COMP(H) ((H)->u.hash->comp)
#define IDIO_HASH_HASH(H) ((H)->u.hash->hash)
#define IDIO_HASH_HA(H,i) ((H)->u.hash->ha[i])
#define IDIO_HASH_FLAGS(H) ((H)->tflags)
Here we have:
size
is the nominal hash size, 2nmask
I wondered about the cost of doing modulo arithmetic and decided that I would set the nominal hash table size to be 2n for some n. The highest nominal index is then 2n-1. This means that the modulo arithmetic is simply the hashed value masked by a series of low-order ones: 0xf, 0x1f, 0x3f, 0x7f, 0xff, …
count
is the number of keys in the hash tablecomp_C
is the C equivalence function pointerIt must be
NULL
to use an Idio equivalence functionhash_C
is the C hashing function pointerIt must be
NULL
to use an Idio hashing functioncomp
is the Idio equivalence functionhash
is the Idio hashing function
Reading¶
A useful reader format for static hash tables uses the tradition braces syntax common to many programming languages.
#{ ... }
The key-value tuples are read as Pairs and Lists: (key &
value)
.
#{ (#\a & "apple") (#\p & "pear")}
The value-index operator, .
works with hash tables:
Idio> ht := make-hash #n #n 4
Idio> ht.#\a = "apple"
Idio> ht.#\p = "pear"
Idio> ht
#{ (#\a & "apple")(#\p & "pear")}
Writing¶
Hash tables will use the #{ ... }
reader format.
Operations¶
Above and beyond the normal hash table accessors there is a Scheme-ish functional feel.
- function hash? value¶
is value a hash table?
- function make-hash [equiv-func [hash-func [size]]]¶
equiv-func defines the equivalence function when comparing elements in the hash table and can be one of the following:
one of the symbols:
eq?
meaning use the C implementation ofeq?
eqv?
meaning use the C implementation ofeqv?
equal?
meaning use the C implementation ofequal?
#n
meaning use the C implementation ofequal?
(cf. the symbolequal?
above)make-hash #n
andmake-hash 'equal?
are equivalent.an Idio function which should be binary (ie. takes two arguments) and returns
#f
or some other value
hash-func defines the hashing function for placing elements in the hash table and can be one of the following:
#n
meaning use the C default hashing functionan Idio function which should be unary (ie. takes one argument) and returns an integer
size gives Idio a hint as to the size of the allocated internal array. The actual size is likely to be some function of the smallest 2**n - 1 that is greater than or equal to size.
The
eqv?
equivalence function is nearly useless, on reflection. The problem is not thateqv?
doesn’t work but rather that whilst1
and1.0
are equivalent according toeqv?
they will be hashed differently because they are different types, meaning they will most likely land in different buckets and therefore there is unlikely to be anything to beeqv?
against in the found chain.I’m leaving it in because 1) I can and b) someone might derive a hashing function where two numbers that would be
eqv?
generate the same hash value.
- function hash-equivalence-function hash¶
return the equivalence function being used for the hash table hash
- function hash-hash-function hash¶
return the hashing function being used for the hash table hash
This returns
#n
for the C default hashing function.
- function hash-size hash¶
return the number of elements in hash hash
There is no function to return the actual allocated size though idio-dump might disclose it.
- function hash-ref hash key [default]¶
return the value at key key of hash hash or default if it is not present
If no default is supplied then if key is not present the
^rt-hash-key-not-found
condition will be raised.
- function hash-set! hash key value¶
set the key key of hash hash to value
- function hash-delete! hash key¶
delete the key key from hash hash
- function hash-exists? hash key¶
return
#t
if the key key exists in hash hash or#f
otherwise
- function hash-keys hash¶
return a list of the keys of hash
- function hash-values hash¶
return a list of the values of hash
- function alist->hash assoc-list¶
return a hash table constructed from the entries in the association list assoc-list
- function hash-update! hash key func [default]¶
set the key key of hash hash to the result of calling func on the existing value associated with key in hash
func should be a unary function (takes 1 argument!)
- default serves the same purpose as for :ref:`hash-ref
<hash-ref>`
It is approximately:
hash-set! hash key (func (hash-ref hash key [default]))
- function hash-walk hash func¶
hash will be iterated over in some order and func will be called with each “key value” tuple
func should be a binary function (takes 2 arguments!)
func may modify or otherwise perturb hash
- function hash-fold hash func val¶
hash will be iterated over in some order and func will be called with each “key value val” tuple
func should be a ternary function (takes 3 arguments!)
The result of calling the function is the new val for the next iteration.
func may modify or otherwise perturb hash
The return value is the result of the final call to func.
- function copy-hash hash [depth]¶
copy hash hash
depth can be the symbol
'deep
or'shallow
where'deep
will recursively copy the elements of hash whereas'shallow
will just reference the same values as in hash
- function merge-hash! hash1 hash2¶
this is a destructive merge of the key-value tuples of hash2 into hash1
Conditions¶
- condition ``^rt-hash-error-key-not-found``¶
Last built at 2024-12-21T07:11:00Z+0000 from 463152b (dev)