Day 64 - Wrap-around & masking low-order bits

Simple question, why do we do:

        sim_entity_hash *Entry = SimRegion->Hash +
            ((HashValue + Offset) & (ArrayCount(SimRegion->Hash) - 1));


instead of:

        sim_entity_hash *Entry = SimRegion->Hash +
            ((HashValue + Offset) % ArrayCount(SimRegion->Hash));


No need for the size of the array to be a power of two anymore.

On the other hand, the wrap-around operation is now clearly slower (I think by about a factor of 10 or so) since it involves a division. Is that really relevant here, though? Dereferencing the entry is likely to be much more expensive than that anyway.
norswap
Simple question, why do we do:

[...]

No need for the size of the array to be a power of two anymore.


Once upon a time, people used hash tables of prime size. The reason for this (well, there was some solid theory behind it too) was that you could get away with a cheap-but-reasonable hash function: since the table size was prime, it was unlikely that your hash function would produce values correlated with it, so the prime size papered over any weaknesses in the hash function.

GCC's libstdc++ still uses prime-sized hash tables. I had one program whose inner loop required a fast "set membership" test. I started with std::unordered_set, and instruction-level profiling revealed that 15% of the run time was spent in the division instruction. That's because...

On the other hand, the wraparound operation is now clearly slower (I think by about a factor or 10 or so) since it involves a division.
...it's more like a factor of 30-100. On Haswell, integer division takes up to 30 cycles for a 32-bit division, and 100 cycles for a 64-bit division.

(Incidentally, that raises another interesting point. A 64-bit integer addition or logic operation is rarely more expensive than 32-bit, but 64-bit division is much more expensive. So using 32-bit arithmetic if you're doing divisions may be worth it.)

For comparison, an L1 cache hit is around 4 cycles, L2 cache hit is around 10 cycles, and L3 cache hit is around 50 cycles.

That's why in recent years, a lot of research has gone into designing better hash functions. It's worth going to the trouble of using a better and more expensive hash function and avoiding the modulo operation.

One other recent trend is that thanks to SIMD instructions, you can compute two or four hash functions on the same key simultaneously, for very little extra cost. This has meant that multiple-hash-function data structures (e.g. double hashing, Bloom filters) have made a comeback, and new techniques such as Cuckoo hashing have appeared to take advantage of the hardware.
Yes - mod can be very expensive, and masking is almost always significantly cheaper. But, obviously, you have to make sure that you don't need the prime-hashing bonus, etc. - as long as you can get away with a power of two, you probably want to, since the reason you use a hash table is usually to make lookups fast, and paying for an integer mod isn't fabulous. If the hash table is usually out of cache, maybe it's fine, but if you are typically in cache at least a little, the mod would hurt, I think.

Again, as with all things perf-related, the only way to really know is to test it under real conditions and see whether mod-with-npot-sized or mask-with-pot-sized wins.

- Casey