Open-Addressing Hash Tables using Handmade-style memory

hotspur

#8720

September 26, 2016

Though Casey has used hash tables on stream plenty of times, I do not recall an instance where he used an open-addressing style hash table. I've been using chaining-style hashes to great effect, which in-house memory makes easy because you can just chain arbitrary locations when you get a collision.

However, I'm curious as to how one would go about implementing an open-addressing hash. As far as I can tell, the Handmade scheme for memory management does not allow for "realloc"-style operations. So when the table has to grow, how can it do so and remain contiguous in memory? My best guess is that you simply allocate a new table from the arena... but then all the memory from the smaller table is wasted, since it's in the "used" region and cannot be reclaimed!

I've rewatched Episode 55 when Hash Tables are introduced and although it explains open-addressing ("internal" chaining), it doesn't go into how one could implement this without realloc or without crazy wastage of space.

Interested in hearing your thoughts.

Mārtiņš Možeiko

#8721

September 26, 2016

We recently had discussion on this with insofaras on IRC channel.

The approach we discussed was to allocate virtual memory directly from OS and avoid arenas. To avoid doing realloc and copying elements first you reserve huge amount of virtual memory (100GB), but you commit only what you use. For example, first commit 1MB, then when table is getting close to full (~75% or something), commit next 2MB, after that 4MB, and so on...

This way there would be no need to copy elements and free old memory. Of course, when looking up elements a bit more work would be needed - first lookup in whole table, if not found lookup in first half of table, then first quarter and so on...

This would work only in 64-bit where there is huge amount of virtual memory available.

If you need to operate on smaller hash tables, then using OS alloc doesn't make sense. You'll need to implement custom memory management that would allow to release memory if you want to keep using arenas instead of C runtime malloc/free.

hotspur

#8722

September 26, 2016

Thanks for the informative answer Martins! I don't actually need such a hash table but I was curious about the more general limitation of not being able to realloc. I assume since HH has been able to work within this limitation so far that I will be able to as well!

ratchetfreak

#8723

September 26, 2016

and one of the things you can then do is move some of the values when you look for them; to break up the chains in the lower parts of the hash.

If you look form the largest size down then it's beneficial to spend some computing power to increase the chance that it will get found sooner.

and care must be taken when doing your probing and wrap around and insert and then a buffer increase happened. Things tend to get tricky then.

Alex Baines

#8725

September 27, 2016

After the IRC discussion I had with Mārtiņš, I ended up implementing a simpler open-addressed hash table that allocates a new table twice the size, and incrementally rehashes entries from the old to the new.

It only ever has 2 tables maximum in flight at once - if it needs to expand again then it'll rehash whatever is left in the old table immediately, but the incremental rehashing means that there will only be a maximum of a few entries left in there anyway.

I'm using malloc separately for each of the two tables to simplify freeing, but it would be possible to have them concatenated and use realloc, or a custom memory arena instead with some more work.

The code for it is here if you're interested: https://github.com/baines/insobot/blob/master/src/inso_ht.h

EDIT: I realize now that this doesn't really address the concern of having a "hole" in an arena where the old table used to be... I guess if that is a major concern then the previous idea of having an increasing number of tables without any rehashing could work, at the cost of increasing lookup time slightly.

Edited by Alex Baines on September 27, 2016, 8:06am Reason: Address the original topic

ratchetfreak

#8726

September 27, 2016

insofaras
After the IRC discussion I had with Mārtiņš, I ended up implementing a simpler open-addressed hash table that allocates a new table twice the size, and incrementally rehashes entries from the old to the new.

It only ever has 2 tables maximum in flight at once - if it needs to expand again then it'll rehash whatever is left in the old table immediately, but the incremental rehashing means that there will only be a maximum of a few entries left in there anyway.

I'm using malloc separately for each of the two tables to simplify freeing, but it would be possible to have them concatenated and use realloc, or a custom memory arena instead with some more work.

The code for it is here if you're interested: https://github.com/baines/insobot/blob/master/src/inso_ht.h

EDIT: I realize now that this doesn't really address the concern of having a "hole" in an arena where the old table used to be... I guess if that is a major concern then the previous idea of having an increasing number of tables without any rehashing could work, at the cost of increasing lookup time slightly.

sounds exactly like wikipedia's description of incremental resizing. They also mention that you can guarantee that the old table will be empty by the time you need to rehash again if you move at least 1 entry from the old table per insert given that you double the capacity each time.

ratchetfreak

#8760

September 29, 2016

One thing that may help is distributed hashing. It's originally meant for P2P systems where nodes (or allocated tables in this context) can enter and leave at any time.

Though many of the trade-offs they make for reliability, nodes leaving and fault tolerance don't apply to this use case.