...and one thing you might want to know.
Implementing hash tables is something that low-level programmers spend a nontrivial amount of time doing. Off-the-shelf hash table implementations (e.g. std::unordered_map in C++), or the hash tables built into dynamic languages like Python or Perl, tend to be okay for general use, but there are plenty of situations where a hash table is a performance-critical data structure and the built-in one just won't do.
This comment in HH raises some interesting questions:
| // TODO(casey): BETTER HASH FUNCTION!!!!
|
I'm not going to talk about how to design good hash functions. That's an interesting topic, but it involves more hairy abstract maths than we really need to overload our brains with right now. It's sufficient to know that a "good enough" hash function scrambles
your keys to the point that you get kind-of-random (but still deterministic) numbers.
When we're talking about random or pseudo-random numbers, that means we're talking about statistics. I promise this isn't going to get too hairy, but to understand the performance of hash tables, there are two snippets of statistical knowledge that everyone working with this data structure should know, and one that may be useful to know.
The variant that Casey has used is hashing with external chaining. The hash function assigns each element to a bucket or slot (i.e. an entry in the hash table), and then a secondary data structure (in this case a linked list) is used to store all elements which fall in the same bucket.
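To make that concrete, here is a minimal sketch of an externally chained table in C. This is not the HH code: the element type is made up, and the hash function is just a placeholder (exactly the sort of thing that TODO is complaining about); it's only here to show the shape of the data structure.
| #include <stdlib.h>
| #include <string.h>
|
| // A minimal externally chained hash table, for illustration only.
| typedef struct node {
|     const char *Key;
|     int Value;
|     struct node *Next;      // next element in the same bucket's chain
| } node;
|
| typedef struct {
|     node **Buckets;         // one chain head per slot
|     size_t BucketCount;
| } hash_table;
|
| // Placeholder hash function -- the part the TODO wants improved.
| static size_t HashKey(const char *Key, size_t BucketCount)
| {
|     size_t Hash = 5381;
|     for (; *Key; ++Key) Hash = Hash*33 + (unsigned char)*Key;
|     return Hash % BucketCount;
| }
|
| static void Insert(hash_table *Table, const char *Key, int Value)
| {
|     size_t Slot = HashKey(Key, Table->BucketCount);
|     node *New = malloc(sizeof(node));
|     New->Key = Key;
|     New->Value = Value;
|     New->Next = Table->Buckets[Slot];   // push onto the front of the chain
|     Table->Buckets[Slot] = New;
| }
|
| static node *Lookup(hash_table *Table, const char *Key)
| {
|     size_t Slot = HashKey(Key, Table->BucketCount);
|     for (node *N = Table->Buckets[Slot]; N; N = N->Next) {
|         if (strcmp(N->Key, Key) == 0) return N;   // linear walk of the chain
|     }
|     return 0;
| }
|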
Searching a linked list is linear, so the first obvious question is: how long are the chains?
The average length of a chain is the number of elements stored in the hash table divided by the number of buckets. This number is called the load factor, which we'll denote by λ. It is both the average number of elements per bucket and the average length of a hash chain.
However, this is true no matter how good the hash function is! Even for a trivial hash function (e.g. h(x) = 42), this is still the average hash chain length, even though the data structure has degenerated into a linked list.
So clearly that's an unhelpful number by itself, which brings me to the first piece of useful statistical knowledge: for a good hash function, the distribution of hash chain lengths is very well approximated by a Poisson distribution.
You don't need to know the details. Just know where to look up this formula:
| P(n;λ) = exp(-λ) λ^n / n!
|
For a hash table with load factor λ, the probability that a chain has length n is: e to the power of -λ (where e is
the base of natural logarithms), multiplied by λ to the power of n, divided by the factorial of n.
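If you want to play with these numbers yourself, the formula is a one-liner to evaluate. Here's a small self-contained sketch in C (the function names are mine, made up for this post):
| #include <math.h>
| #include <stdio.h>
|
| // P(n; lambda) = exp(-lambda) * lambda^n / n!
| // Computed iteratively so we never form a huge factorial explicitly.
| static double PoissonPMF(int n, double Lambda)
| {
|     double P = exp(-Lambda);
|     for (int i = 1; i <= n; ++i) {
|         P *= Lambda / i;
|     }
|     return P;
| }
|
| int main(void)
| {
|     // Expected chain-length distribution for a few load factors.
|     for (int LoadFactor = 1; LoadFactor <= 3; ++LoadFactor) {
|         printf("load factor %d:\n", LoadFactor);
|         for (int n = 0; n <= 6; ++n) {
|             printf("  P(%d) = %.3f\n", n, PoissonPMF(n, (double)LoadFactor));
|         }
|     }
|     return 0;
| }
|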
This has some interesting implications. Using λ=1 as an example, we find:
| P(0;1) = 1/e
| P(1;1) = 1/e
|
If you have a hash table with a load factor of 1, we expect 1/e (or a bit over a third) of the slots to have no elements, and 1/e (same) to have one element. That's a lot of unoccupied slots. The rest (1-2/e, or a bit over a quarter) of the slots have two or more elements.
Similarly, for a load factor of two, 13.5% of slots have no elements. For a load factor of three, it's 5%, which happens to be about the same as the fraction of slots holding exactly six elements.
This formula gives you a way to understand how you can trade off long hash chains against empty hash slots, and a quantitative way to think about how long hash chains can get, even with a good hash function.
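One way to put a number on "how long chains can get" is to sum the tail of the distribution. Here's a sketch, reusing the PoissonPMF function from above:
| // Probability that a chain has length k or more: one minus the
| // cumulative probability of lengths 0..k-1. Uses PoissonPMF from above.
| static double PoissonTail(int k, double Lambda)
| {
|     double Cumulative = 0.0;
|     for (int n = 0; n < k; ++n) {
|         Cumulative += PoissonPMF(n, Lambda);
|     }
|     return 1.0 - Cumulative;
| }
|
For example, PoissonTail(6, 1.0) is about 0.0006, so with a load factor of one only around one slot in 1700 has a chain of six or more elements; push the load factor up to three and PoissonTail(6, 3.0) rises to roughly 8%.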
The second piece of useful information is a restricted form of the inverse problem: How many elements do you need to insert into a hash table before you get a collision?
Clearly, if you have N hash slots, then to be guaranteed a collision you need to insert N+1 elements: with N or fewer, you could get lucky (or have a perfect hash function) and see every element land in its own slot. (This is known as the pigeonhole principle.) So that guarantee isn't particularly helpful. Instead, what we'll try to find out is how many elements you need to insert to get a 50% chance of a collision. Beyond this point, a collision is more likely than not.
You can work this out from the Poisson distribution. The details are unimportant, but the answer turns out to be very easy to remember: it's about the square root of N. (Strictly, the 50% point is closer to √(2N ln 2), roughly 1.18√N, and even that only holds in the limit as N tends towards infinity, but √N is the part worth remembering, and it's a reliable estimate for N > 10 or so, that is, for all real-sized hash tables.)
This is known as the birthday paradox or birthday problem, because the form it's usually posed in is: how likely is it that, in a random group of people, two of them share a birthday? The square root of 365 is about 19, and sure enough, once the group reaches 23 people the chance of a shared birthday passes 50%.
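If you'd rather check that 23 than take it on faith, the exact calculation is short enough to sketch (again in C, with made-up names):
| #include <stdio.h>
|
| // Number of random insertions into N slots before the probability of at
| // least one collision first exceeds 50%, computed from the probability
| // that all the elements inserted so far landed in distinct slots.
| static int InsertionsForEvenOddsOfCollision(double N)
| {
|     double PAllDistinct = 1.0;
|     int k = 0;
|     while (PAllDistinct > 0.5) {
|         // The next element must avoid the k slots already occupied.
|         PAllDistinct *= (N - k) / N;
|         ++k;
|     }
|     return k;
| }
|
| int main(void)
| {
|     printf("%d\n", InsertionsForEvenOddsOfCollision(365.0));    // prints 23
|     printf("%d\n", InsertionsForEvenOddsOfCollision(1 << 20));  // about 1200, i.e. roughly 1.18*sqrt(N)
|     return 0;
| }
|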
Incidentally, this is also a good rule of thumb for statistical sampling. If you have a universe of N individuals, then √N is usually a "good" random sample (assuming it's an unbiassed sample, of course), because if you sample with replacement, this is the point where you start getting duplicates.
Those are the two snippets of statistical knowledge you should know to work with hash tables. But there's one more piece that you might want to know, and that's how to tell whether your hash function is "good enough". All the standard advice is that hash functions should be tested on your data. But what should you actually test?
This is where it does get hairy if you don't know your statistics. We know that a "good" hash function will produce a Poisson distribution of hash chain lengths. So the most useful test is to try the hash function on your data, and check that the distribution actually is Poisson.
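In code, that check starts with a histogram of chain lengths. Here's a sketch against the illustrative hash_table from earlier; a real version would run your real keys through your real table:
| // Count how many slots have a chain of exactly i elements, for
| // i = 0 .. MaxLength-1; chains of MaxLength or more go in the last bin.
| static void ChainLengthHistogram(hash_table *Table, int *Counts, int MaxLength)
| {
|     for (int i = 0; i < MaxLength; ++i) Counts[i] = 0;
|     for (size_t Slot = 0; Slot < Table->BucketCount; ++Slot) {
|         int Length = 0;
|         for (node *N = Table->Buckets[Slot]; N; N = N->Next) ++Length;
|         if (Length >= MaxLength) Length = MaxLength - 1;
|         ++Counts[Length];
|     }
| }
|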
There is a principled statistical test for whether a set of observations is consistent with a given probability distribution (or whether two sets of data came from the same distribution), known as Pearson's χ² test. (That's pronounced "chi-squared".) The details are beyond the scope of this post, but at least you know what to search for if you want to go down that rabbit hole.
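For the curious, the core of the test is only a few lines: compare the observed histogram against the counts the Poisson distribution predicts, and sum the normalised squared differences. This is only a sketch; a proper test would also merge bins with very small expected counts (folding the tail into one bin) before looking the statistic up in a χ² table with the right number of degrees of freedom.
| // Pearson's chi-squared statistic: the sum over bins of
| // (Observed - Expected)^2 / Expected, where Expected is what a Poisson
| // distribution with lambda = load factor predicts. The bigger the value,
| // the less the observed chain lengths look like those of a "good" hash.
| static double ChiSquaredStatistic(int *Observed, int BinCount,
|                                   size_t BucketCount, double Lambda)
| {
|     double Statistic = 0.0;
|     for (int n = 0; n < BinCount; ++n) {
|         double Expected = BucketCount * PoissonPMF(n, Lambda);
|         double Diff = Observed[n] - Expected;
|         if (Expected > 0.0) Statistic += Diff*Diff / Expected;
|     }
|     return Statistic;
| }
|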