In the 90s, it was (probably) faster to do linked-list / chaining as a collision resolution mechanism. But because L1 caches are so fast on today's machines (and DDR4 remains very high latency), linear probing seems to be the winner on modern machines.
Consider that DDR4 RAM is maybe 200-clock cycles to access (50ns on a 4GHz modern machine). L1 cache can be accessed within 4 clock cycles. Fetching an entire cache line is like guessing on 8-positions in linear probing.
The if/else statements will execute all within single-digit nanoseconds (Modern CPUs execute at 4GHz, and can execute more than 1 instruction per clock tick... up to 6-instructions per tick if all instructions are in uop cache and are properly branch-predicted), long before the 2nd location (ex: the 2nd member in a Linked List) can be even accessed.
Furthermore, modern CPUs will perform pre-fetching when they notice code traversing through memory linearly. As such, the 200-clock cycle latency is "pipelined" and performed in parallel in practice. Your "effective latency" drops significantly when you traverse linearly.
EDIT: And finally: staying on the same DDR4 row (Roughly 1024 bytes) means that you only have to perform a Column-read command over and over again. If you leave a DDR4 Row, the RAM needs to precharge, open the new row, and then perform a new column read. Takes roughly 3x longer than reading from an already open column.
Rust's implementation of Google's SwissTables is pretty trippy, as well: https://blog.waffles.space/2018/12/07/deep-dive-into-hashbro...
Whilst Clojure, Scala, Erlang, and Haskell use Hash Array Mapped Tries https://en.wikipedia.org/wiki/Hash_array_mapped_trie and its lock-free variant, Ctrie.
Speaking of exotic data-structures used widely: HAProxy's creator u/wtarreau invented Elastic-Binary tree that powers HAProxy's caches and schedulers.
Presentation: https://www.youtube-nocookie.com/embed/fXuYWqWsdFQ
Blog: http://wtarreau.blogspot.com/2011/12/elastic-binary-trees-eb...
It depends on the use case, for example if it's read or write heavy, the number of successful lookups and the load factor. LP is often a good choice but in some cases bucket chaining wins. Quadratic probing and Robin Hood hashing are also winning candidates, as is Cuckoo hashing.
See the flowchart at the end of this paper: Richter et al., 2005. A Seven-Dimensional Analysis of Hashing Methods and its Implications on Query Processing, https://dl.acm.org/citation.cfm?id=2850585
And I'm not talking about Webservers (which stay up for days, but the compute problem only happens over the course of ~1 second or so). Consider that when 3d movies are made, they hit the "Render" button and then wait 1 or 2 weeks for the movie to be made. Even if they have an entire cluster of computers calculating the scene... it takes a LOT of compute power to make movies.
If you eek out a 10% improvement, your render-time drops from 10-days to 9-days. Alternatively, you can have 10% fewer computers to get the same job done. Or in the case of scientists (ex: a Car engineer who is running a FEA computer-simulated car crash), your models can be more detailed. (Ex: Instead of modeling the engine as a perfectly elastic sphere, maybe you can model the engine as a more appropriate shape in the simulated car-crash)
There are a few exceptions. High-frequency traders also count milliseconds for their speed of execution. I see some HPC work coming out of HFT companies. But for the most part, highly-optimized code is specialized to the communities who still wait days (or even weeks) for the results from modern computers.
Thank you!
Secondly, it is possible for all of a linked list to in fact be in L1.
Thirdly, if by "linear probing" you are at all referring to "open addressing", then there are some disadvantages to consider there.
If we suspect that linked lists are performing badly, the reasonable thing would be to stick with the chained hash table, but replace the chains by little vectors. Essentially, "linear probing" but sideways out of the table, into little sub-tables.
Issues with open addressing are issues like clustering if linear probing is used. If we use quadratic probing, then the table size has to be prime, otherwise we may not be able to able to insert a key into the table even if it is nowhere near full. When we delete keys, we have to leave "tombstone" entries in their place, so that the linear probing can continue past them to find other items. Things of that sort.
Robin Hood hashing more or less solves that problem.
> When we delete keys, we have to leave "tombstone" entries in their place, so that the linear probing can continue past them to find other items.
Actually no. Even ignoring Robin Hood Hashing, you just swap a later element into place.
idx = hash(value);
delete(array[idx]);
for(int i=idx+1; array[i] != empty; i = (i+1) % TABLE_SIZE){
if(hash(array[i].value) <= idx){
array[idx] = array[i];
idx = i; // Repeat for the new location
}
}
I just typed the above in like 2 minutes, so there's probably a bug. But its probably "correct enough" that you can get the concept. There's no tombstones needed for linear probing (and only linear probing).Conceptually, you can see that array[i] is simply being "re-hashed" into the hash table. Donald Knuth in "The Art of Computer Programming" proves that the above procedure (well... the correct bug-free version at least) is equivalent to clearing out the hash table and rehashing all elements. Except of course, the above procedure is way faster.
--------
Robin Hood hashing solves the problem in a different way. See this other guy's blog post for details: https://www.sebastiansylvan.com/post/robin-hood-hashing-shou...
It's important to be wary of assuming your data will be in cache. It usually isn't. The problem isn't making lots of high density searches in a hash map, the problem is making a search every now and then. A hash map that makes three DRAM accesses will always be worse that one that makes one, regardless of how much work it needs to do, and this is for two reasons. First is there obvious: RAM is slow as balls, we get it. Second is more insidious. You just evicted two cache lines. It's not just the lookup itself that's slower, you also made the consumer of your data structure slower.
Here's a look at what open addressing looks like these days: https://youtu.be/ncHmEUmJZf4 (warning, it's an hour. Takes a while to build up, too. Still worth it.)
Edit: somehow I missed your mention of partially unrolled lists. Those can of course mitigate many of the downsides.
Usually I like to say that they convert an input into a uniform distribution of fixed length output, since reversibility isn’t really important if you using it for a non-cryptographic purpose such as this one.
Oh my! SipHash is by far the slowest of all practical hash functions. https://github.com/rurban/smhasher#smhasher
As it turns out, hash functions optimized for billions of bytes don't work so well when you use them on only a few bytes. There's just too much startup time and too many irrelevant branches and too much bloat for the icache to handle. That's why most hash tables use simpler algorithms like FNV1A or SipHash, which are faster for small data.
No, you misunderstood the purpose of the project. From the benchmarks, SipHash:
Small key speed test - 1-byte keys - 110.91 cycles/hash
Small key speed test - 2-byte keys - 110.32 cycles/hash
...
Small key speed test - 20-byte keys - 164.79 cycles/hash
wyhash: Small key speed test - 1-byte keys - 18.00 cycles/hash
Small key speed test - 2-byte keys - 18.00 cycles/hash
...
Small key speed test - 20-byte keys - 21.00 cycles/hash
The whole purpose of smhasher is to help choose hash functions for hash tables and alike implementations.