Hash tables (opens in new tab)

(data-structures-in-practice.com)

139 pointsmilsebg6y ago33 comments

33 comments

21 comments · 4 top-level

dragontamer6y ago· 11 in thread

High-performance Hash table design is pretty interesting, because modern CPU architecture has grossly changed their performance characteristics.

In the 90s, it was (probably) faster to do linked-list / chaining as a collision resolution mechanism. But because L1 caches are so fast on today's machines (and DDR4 remains very high latency), linear probing seems to be the winner on modern machines.

Consider that DDR4 RAM is maybe 200-clock cycles to access (50ns on a 4GHz modern machine). L1 cache can be accessed within 4 clock cycles. Fetching an entire cache line is like guessing on 8-positions in linear probing.

The if/else statements will execute all within single-digit nanoseconds (Modern CPUs execute at 4GHz, and can execute more than 1 instruction per clock tick... up to 6-instructions per tick if all instructions are in uop cache and are properly branch-predicted), long before the 2nd location (ex: the 2nd member in a Linked List) can be even accessed.

Furthermore, modern CPUs will perform pre-fetching when they notice code traversing through memory linearly. As such, the 200-clock cycle latency is "pipelined" and performed in parallel in practice. Your "effective latency" drops significantly when you traverse linearly.

EDIT: And finally: staying on the same DDR4 row (Roughly 1024 bytes) means that you only have to perform a Column-read command over and over again. If you leave a DDR4 Row, the RAM needs to precharge, open the new row, and then perform a new column read. Takes roughly 3x longer than reading from an already open column.

ignoramous6y ago

One of my favourite lock-free, hash-table talks is by Dr. Cliff Click, ex-Azul Systems, presently at H2O.ai, from 6 years ago: https://www.youtube-nocookie.com/embed/WYXgtXWejRM

Rust's implementation of Google's SwissTables is pretty trippy, as well: https://blog.waffles.space/2018/12/07/deep-dive-into-hashbro...

Whilst Clojure, Scala, Erlang, and Haskell use Hash Array Mapped Tries https://en.wikipedia.org/wiki/Hash_array_mapped_trie and its lock-free variant, Ctrie.

Speaking of exotic data-structures used widely: HAProxy's creator u/wtarreau invented Elastic-Binary tree that powers HAProxy's caches and schedulers.

Presentation: https://www.youtube-nocookie.com/embed/fXuYWqWsdFQ

Blog: http://wtarreau.blogspot.com/2011/12/elastic-binary-trees-eb...

njonsson6y ago

Nitpick: The talk by Dr Cliff Click is dated February 2007, closer to 13 years ago than six years ago.

patrec6y ago

Nitpick: Cliff left H20.ai more than 3 years ago – I agree that he's very worth while listening to.

hesk6y ago

> In the 90s, it was (probably) faster to do linked-list / chaining as a collision resolution mechanism. But because L1 caches are so fast on today's machines (and DDR4 remains very high latency), linear probing seems to be the winner on modern machines.

It depends on the use case, for example if it's read or write heavy, the number of successful lookups and the load factor. LP is often a good choice but in some cases bucket chaining wins. Quadratic probing and Robin Hood hashing are also winning candidates, as is Cuckoo hashing.

See the flowchart at the end of this paper: Richter et al., 2005. A Seven-Dimensional Analysis of Hashing Methods and its Implications on Query Processing, https://dl.acm.org/citation.cfm?id=2850585

agumonkey6y ago

How often are these things or even thinking used ? is it obvious work done in gcc/clang ? what about the jvm ? or kernels ?

dragontamer6y ago

High performance compute techniques are basically used whenever you have code that runs for 20+ hours at a time. Things like Raytracers, supercomputers, database-analysis, physics simulations, chemical simulations, weather simulations, etc. etc.

And I'm not talking about Webservers (which stay up for days, but the compute problem only happens over the course of ~1 second or so). Consider that when 3d movies are made, they hit the "Render" button and then wait 1 or 2 weeks for the movie to be made. Even if they have an entire cluster of computers calculating the scene... it takes a LOT of compute power to make movies.

If you eek out a 10% improvement, your render-time drops from 10-days to 9-days. Alternatively, you can have 10% fewer computers to get the same job done. Or in the case of scientists (ex: a Car engineer who is running a FEA computer-simulated car crash), your models can be more detailed. (Ex: Instead of modeling the engine as a perfectly elastic sphere, maybe you can model the engine as a more appropriate shape in the simulated car-crash)

There are a few exceptions. High-frequency traders also count milliseconds for their speed of execution. I see some HPC work coming out of HFT companies. But for the most part, highly-optimized code is specialized to the communities who still wait days (or even weeks) for the results from modern computers.

java-man6y ago

This is the most coherent and elucidating answer on the topic I've seen on HN in many years.

Thank you!

kazinator6y ago

Firstly, we have L2 caches, not only L1.

Secondly, it is possible for all of a linked list to in fact be in L1.

Thirdly, if by "linear probing" you are at all referring to "open addressing", then there are some disadvantages to consider there.

If we suspect that linked lists are performing badly, the reasonable thing would be to stick with the chained hash table, but replace the chains by little vectors. Essentially, "linear probing" but sideways out of the table, into little sub-tables.

Issues with open addressing are issues like clustering if linear probing is used. If we use quadratic probing, then the table size has to be prime, otherwise we may not be able to able to insert a key into the table even if it is nowhere near full. When we delete keys, we have to leave "tombstone" entries in their place, so that the linear probing can continue past them to find other items. Things of that sort.

dragontamer6y ago

> Issues with open addressing are issues like clustering if linear probing is used.

Robin Hood hashing more or less solves that problem.

> When we delete keys, we have to leave "tombstone" entries in their place, so that the linear probing can continue past them to find other items.

Actually no. Even ignoring Robin Hood Hashing, you just swap a later element into place.

    idx = hash(value);
    delete(array[idx]);

    for(int i=idx+1; array[i] != empty; i = (i+1) % TABLE_SIZE){
        if(hash(array[i].value) <= idx){
            array[idx] = array[i];
            idx = i; // Repeat for the new location
        }
    }

I just typed the above in like 2 minutes, so there's probably a bug. But its probably "correct enough" that you can get the concept. There's no tombstones needed for linear probing (and only linear probing).

Conceptually, you can see that array[i] is simply being "re-hashed" into the hash table. Donald Knuth in "The Art of Computer Programming" proves that the above procedure (well... the correct bug-free version at least) is equivalent to clearing out the hash table and rehashing all elements. Except of course, the above procedure is way faster.

--------

Robin Hood hashing solves the problem in a different way. See this other guy's blog post for details: https://www.sebastiansylvan.com/post/robin-hood-hashing-shou...

1 more reply

nwallin6y ago

I think you're underestimating the number of memory redirections that are necessary for chaining. It's more than one. That's the killer.

It's important to be wary of assuming your data will be in cache. It usually isn't. The problem isn't making lots of high density searches in a hash map, the problem is making a search every now and then. A hash map that makes three DRAM accesses will always be worse that one that makes one, regardless of how much work it needs to do, and this is for two reasons. First is there obvious: RAM is slow as balls, we get it. Second is more insidious. You just evicted two cache lines. It's not just the lookup itself that's slower, you also made the consumer of your data structure slower.

Here's a look at what open addressing looks like these days: https://youtu.be/ncHmEUmJZf4 (warning, it's an hour. Takes a while to build up, too. Still worth it.)

gpderetta6y ago

Even if the list is fully in L1 at best you'll be able to check an element every 5-6 cycles. With open addressing you can test at least 2 elements per clock cycle, many more with SIMD.

Edit: somehow I missed your mention of partially unrolled lists. Those can of course mitigate many of the downsides.

1 more reply

saagarjha6y ago· 4 in thread

> A hash function is a one-way function that produces the same output for a given input. It’s one-way in the sense that, given the hash function, it should be difficult to convert the output back to the input without trial and error.

Usually I like to say that they convert an input into a uniform distribution of fixed length output, since reversibility isn’t really important if you using it for a non-cryptographic purpose such as this one.

barbegal6y ago

Yeah sometimes the best hash functions for hash tables are easily reversible. And hash functions that can't be reversed faster than brute force can be really bad if their output is biased making collisions more likely.

AdamN6y ago

It's an important distinction that many people forget: * Hash tables need speed and uniformity * Cryptographic hashes need irreversibility

1 more reply

papermachete6y ago

Why would it necessarily be uniform?

PeCaN6y ago

so that not too many things get hashed to the same location. that's kind of the entire point, you want hashes to be distributed equally among the entire array to minimize collisions.

rurban6y ago· 2 in thread

> SipHash is a relatively fast hash function.

Oh my! SipHash is by far the slowest of all practical hash functions. https://github.com/rurban/smhasher#smhasher

pubby6y ago

Those benchmarks are for hashing data in the billions of bytes range. Hash tables typically use data in the 1-20 byte range. Pretty big difference!

As it turns out, hash functions optimized for billions of bytes don't work so well when you use them on only a few bytes. There's just too much startup time and too many irrelevant branches and too much bloat for the icache to handle. That's why most hash tables use simpler algorithms like FNV1A or SipHash, which are faster for small data.

zzzcpan6y ago

> Those benchmarks are for hashing data in the billions of bytes range. Hash tables typically use data in the 1-20 byte range.

No, you misunderstood the purpose of the project. From the benchmarks, SipHash:

   Small key speed test -    1-byte keys -   110.91 cycles/hash
   Small key speed test -    2-byte keys -   110.32 cycles/hash
   ...
   Small key speed test -   20-byte keys -   164.79 cycles/hash

wyhash:

   Small key speed test -    1-byte keys -    18.00 cycles/hash
   Small key speed test -    2-byte keys -    18.00 cycles/hash
   ...
   Small key speed test -   20-byte keys -    21.00 cycles/hash

The whole purpose of smhasher is to help choose hash functions for hash tables and alike implementations.

sdegutis6y ago

Also http://craftinginterpreters.com/hash-tables.html

j / k navigate · click thread line to collapse

33 comments

21 comments · 4 top-level

dragontamer6y ago· 11 in thread

High-performance Hash table design is pretty interesting, because modern CPU architecture has grossly changed their performance characteristics.

ignoramous6y ago

One of my favourite lock-free, hash-table talks is by Dr. Cliff Click, ex-Azul Systems, presently at H2O.ai, from 6 years ago: https://www.youtube-nocookie.com/embed/WYXgtXWejRM

Rust's implementation of Google's SwissTables is pretty trippy, as well: https://blog.waffles.space/2018/12/07/deep-dive-into-hashbro...

Whilst Clojure, Scala, Erlang, and Haskell use Hash Array Mapped Tries https://en.wikipedia.org/wiki/Hash_array_mapped_trie and its lock-free variant, Ctrie.

Speaking of exotic data-structures used widely: HAProxy's creator u/wtarreau invented Elastic-Binary tree that powers HAProxy's caches and schedulers.

Presentation: https://www.youtube-nocookie.com/embed/fXuYWqWsdFQ

Blog: http://wtarreau.blogspot.com/2011/12/elastic-binary-trees-eb...

njonsson6y ago

Nitpick: The talk by Dr Cliff Click is dated February 2007, closer to 13 years ago than six years ago.

patrec6y ago

Nitpick: Cliff left H20.ai more than 3 years ago – I agree that he's very worth while listening to.

hesk6y ago

See the flowchart at the end of this paper: Richter et al., 2005. A Seven-Dimensional Analysis of Hashing Methods and its Implications on Query Processing, https://dl.acm.org/citation.cfm?id=2850585

agumonkey6y ago

How often are these things or even thinking used ? is it obvious work done in gcc/clang ? what about the jvm ? or kernels ?

dragontamer6y ago

java-man6y ago

This is the most coherent and elucidating answer on the topic I've seen on HN in many years.

Thank you!

kazinator6y ago

Firstly, we have L2 caches, not only L1.

Secondly, it is possible for all of a linked list to in fact be in L1.

Thirdly, if by "linear probing" you are at all referring to "open addressing", then there are some disadvantages to consider there.

dragontamer6y ago

> Issues with open addressing are issues like clustering if linear probing is used.

Robin Hood hashing more or less solves that problem.

> When we delete keys, we have to leave "tombstone" entries in their place, so that the linear probing can continue past them to find other items.

Actually no. Even ignoring Robin Hood Hashing, you just swap a later element into place.

    idx = hash(value);
    delete(array[idx]);

    for(int i=idx+1; array[i] != empty; i = (i+1) % TABLE_SIZE){
        if(hash(array[i].value) <= idx){
            array[idx] = array[i];
            idx = i; // Repeat for the new location
        }
    }

--------

Robin Hood hashing solves the problem in a different way. See this other guy's blog post for details: https://www.sebastiansylvan.com/post/robin-hood-hashing-shou...

1 more reply

nwallin6y ago

I think you're underestimating the number of memory redirections that are necessary for chaining. It's more than one. That's the killer.

Here's a look at what open addressing looks like these days: https://youtu.be/ncHmEUmJZf4 (warning, it's an hour. Takes a while to build up, too. Still worth it.)

gpderetta6y ago

Even if the list is fully in L1 at best you'll be able to check an element every 5-6 cycles. With open addressing you can test at least 2 elements per clock cycle, many more with SIMD.

Edit: somehow I missed your mention of partially unrolled lists. Those can of course mitigate many of the downsides.

1 more reply

saagarjha6y ago· 4 in thread

barbegal6y ago

AdamN6y ago

It's an important distinction that many people forget: * Hash tables need speed and uniformity * Cryptographic hashes need irreversibility

1 more reply

papermachete6y ago

Why would it necessarily be uniform?

PeCaN6y ago

so that not too many things get hashed to the same location. that's kind of the entire point, you want hashes to be distributed equally among the entire array to minimize collisions.

rurban6y ago· 2 in thread

> SipHash is a relatively fast hash function.

Oh my! SipHash is by far the slowest of all practical hash functions. https://github.com/rurban/smhasher#smhasher

pubby6y ago

Those benchmarks are for hashing data in the billions of bytes range. Hash tables typically use data in the 1-20 byte range. Pretty big difference!

zzzcpan6y ago

> Those benchmarks are for hashing data in the billions of bytes range. Hash tables typically use data in the 1-20 byte range.

No, you misunderstood the purpose of the project. From the benchmarks, SipHash:

   Small key speed test -    1-byte keys -   110.91 cycles/hash
   Small key speed test -    2-byte keys -   110.32 cycles/hash
   ...
   Small key speed test -   20-byte keys -   164.79 cycles/hash

wyhash:

   Small key speed test -    1-byte keys -    18.00 cycles/hash
   Small key speed test -    2-byte keys -    18.00 cycles/hash
   ...
   Small key speed test -   20-byte keys -    21.00 cycles/hash

The whole purpose of smhasher is to help choose hash functions for hash tables and alike implementations.

sdegutis6y ago

Also http://craftinginterpreters.com/hash-tables.html

j / k navigate · click thread line to collapse