(Like how linear search wins for small N, or insertion sort helps with small subarrays in introsort: more steps, but each step is also a lot cheaper on modern hardware due to fewer branch mispredictions, better cache locality, or something else that better fits the hardware's strengths.)
(And before anybody gets pedantic: technically, this is the node's "outdegree" when the tree is represented as a directed graph. If it were an undirected graph, we would have to count the parent node as well.)
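(For concreteness, here's the index arithmetic that gives each node of an array-backed d-ary heap its d children — a sketch of the standard 0-based layout, not something from the article:)

```python
D = 4  # branching factor of the d-ary heap (d = 4 here, chosen arbitrarily)

def children(i):
    # Node i's up-to-D children are stored contiguously.
    return range(D * i + 1, D * i + D + 1)

def parent(i):
    # Every node except the root has exactly one parent.
    return (i - 1) // D

# The root's children for D = 4:
print(list(children(0)))  # → [1, 2, 3, 4]
```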
Things not taught in CS that you need to know on the job are legion. By far the most important of these is the use of invariants. Second might be the memory hierarchy and cache semantics. Third might be the use of bitmaps and bitwise operations.
So, it's nice to see a detailed analysis of the structure like this! Perhaps if it becomes more popular, I will find more places to use it.
You don't need C++20. Even C++98 had std::bitset<>::count(), which has a nice conversion from unsigned long, and which, when compiled for a target with SSE4.2 or better (e.g. Nehalem), uses a POPCNT instruction. It is pretty simple to produce various other results, including countl_zero, from a popcount, with just a couple of additional bitwise ops.
Modern compilers are happy to keep a bitset<32> in the same register as an unsigned long, so the conversion takes exactly zero cycles. POPCNT takes three cycles, and the extra bitwise ops another couple.
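To make the "couple of additional bitwise ops" concrete: here's the usual smear-then-count derivation of countl_zero from popcount, sketched in Python for brevity (the same shift sequence works on a uint32_t in C++; the function names are mine):

```python
def popcount32(x):
    # Stand-in for POPCNT / std::bitset<32>::count().
    return bin(x & 0xFFFFFFFF).count("1")

def countl_zero32(x):
    # Smear the highest set bit into every position below it...
    x &= 0xFFFFFFFF
    x |= x >> 1
    x |= x >> 2
    x |= x >> 4
    x |= x >> 8
    x |= x >> 16
    # ...then the popcount equals the bit-width of the value,
    # so the leading-zero count is just the remainder out of 32.
    return 32 - popcount32(x)

print(countl_zero32(1))           # → 31
print(countl_zero32(0x80000000))  # → 0
```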
So in that case I would use this data structure even if it weren't faster. I can't count the number of times I have had to mess with inserting negative priorities into a min heap to create a max heap! We should just have one data structure that does both.
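For anyone who hasn't had to do it, the negation dance looks like this with Python's heapq (a standard workaround, not from the parent comment):

```python
import heapq

nums = [5, 1, 9, 3]

# heapq only provides a min-heap, so negate everything
# to make the smallest stored value the largest real one.
max_heap = [-x for x in nums]
heapq.heapify(max_heap)

largest = -heapq.heappop(max_heap)  # → 9
second = -heapq.heappop(max_heap)   # → 5
```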
(though taking this idea to its logical extreme means we should just use an order statistic tree for everything, since it not only gives you O(log n) min/max, but also O(log n) find-kth and rank-of-x)
That was waaay at odds with the mindset of the STL, which seemed to me, and I think to a lot of folks, like the state of the art in containers at the time: with the STL, if you prepend/insert much you need to use a deque/linked list, you decide pointer vs. value each time you declare a type, and if you want to know about the constant factor on an insert in the middle of the list, you probably already lost.
(Qt has/had some other fun pragmatic things, like copy-on-write types in a lot of places. I didn't really do nearly enough with it to discover the pitfalls, but was fun to read.)
So I think about that, about how there are likely some trees out there that could just as well be sorted arrays, about adaptive structures that change their behavior based on dynamically observing how they're used (one recently discussed here: http://blog.erlang.org/the-new-scalable-ets-ordered_set/), even about tricks in JS string implementations and VMs. I don't have answers, but I sometimes wonder if we should be thinking more in terms of tools that may not be perfect on all axes but soften or eliminate performance cliffs we take for granted. Not that the zero-overhead fine-tuned types are going away, just that they aren't the only imaginable approach.
I wonder if we'll see progress in that sort of imperfect-but-resilient tool over time. I think the long-run trend is towards some form of programming being accessible to more people more easily (like how, as you note, folks do real work in Python now). That certainly fits with trying to make your tools resilient to "misuse"--or, better, properly supporting more patterns so they're not even misuse. No conclusions, just hand-waving, but I do wonder if anyone working in software construction tools will eventually more usefully wave their hands in that direction :P
It supports O(1) push_front/pop_front/push_back/pop_back/operator[]/at (like you would expect from a deque) but also O(sqrt(N)) insert and remove from the middle!!!
The paper was from 1999 and 2001 but I only learned about it from a recent HN post[3] where some guy rediscovered it (20 years too late). I still wonder why this design lost to the std::deque we have in STL today.
[1] http://www.cphstl.dk/Report/Deque-first-attempt/cphstl-repor...
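The core tiered-vector idea fits in a few lines. Here's a toy Python version with fixed-size blocks, my own simplification of the design in the paper (names are mine, and it omits the per-block circular buffers the real structure uses): indexing stays O(1) because every block except the last is exactly full, and an insert pays O(B) inside one block plus an O(1) pop/appendleft ripple through the O(N/B) blocks to its right, which is O(sqrt(N)) total when B ~ sqrt(N).

```python
from collections import deque

class TieredSeq:
    """Toy tiered sequence: fixed block size B; all blocks full except the last."""

    def __init__(self, items, block=4):
        self.B = block
        items = list(items)
        self.blocks = [deque(items[i:i + block])
                       for i in range(0, len(items), block)] or [deque()]

    def __len__(self):
        return (len(self.blocks) - 1) * self.B + len(self.blocks[-1])

    def __getitem__(self, i):
        # O(1): the invariant makes the block index pure arithmetic.
        return self.blocks[i // self.B][i % self.B]

    def insert(self, i, x):
        b, off = divmod(i, self.B)
        if b == len(self.blocks):      # inserting right past the end
            self.blocks.append(deque())
            off = 0
        self.blocks[b].insert(off, x)  # O(B) shuffle inside one block
        # Ripple the overflow rightward: one pop/appendleft per block.
        j = b
        while len(self.blocks[j]) > self.B:
            if j + 1 == len(self.blocks):
                self.blocks.append(deque())
            self.blocks[j + 1].appendleft(self.blocks[j].pop())
            j += 1

t = TieredSeq(range(10), block=4)
t.insert(2, 99)
print([t[i] for i in range(len(t))])  # → [0, 1, 99, 2, 3, 4, 5, 6, 7, 8, 9]
```

push_front and push_back are just insert(0, x) and insert(len(t), x); the front insert is the worst case here, since the ripple crosses every block.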
V8's Array does different things depending on whether the array is sparse or packed and what data types it contains.
The real problem is when you need min and max at the same time. Then you not only need a min heap and a max heap, you also need to track what was already popped from either heap! This is because Python's heapq can't delete an element that isn't at the root. Tracking "tombstones" lets you pop from one heap, then know to ignore that element the next time it surfaces in the other heap.
This is of course super complicated to maintain, and I get it wrong all the time. Luckily it only shows up in algo interview problems.
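A minimal sketch of that two-heap tombstone pattern with heapq, for reference (the class name and the per-heap deletion counters are my own way of writing it down):

```python
import heapq
from collections import Counter

class MinMaxBag:
    """Pop-min and pop-max over one multiset: two heaps plus tombstone counters."""

    def __init__(self):
        self.lo = []                 # min-heap
        self.hi = []                 # max-heap, stored negated
        self.dead_lo = Counter()     # values pop_max removed, not yet purged from lo
        self.dead_hi = Counter()     # values pop_min removed, not yet purged from hi

    def push(self, x):
        heapq.heappush(self.lo, x)
        heapq.heappush(self.hi, -x)

    def pop_min(self):
        # Purge roots that pop_max already claimed (raises IndexError if empty).
        while self.dead_lo[self.lo[0]]:
            self.dead_lo[self.lo[0]] -= 1
            heapq.heappop(self.lo)
        x = heapq.heappop(self.lo)
        self.dead_hi[x] += 1         # tombstone for this value's copy in hi
        return x

    def pop_max(self):
        while self.dead_hi[-self.hi[0]]:
            self.dead_hi[-self.hi[0]] -= 1
            heapq.heappop(self.hi)
        x = -heapq.heappop(self.hi)
        self.dead_lo[x] += 1         # tombstone for this value's copy in lo
        return x
```

The counters (rather than a set) matter for duplicates, and each side's tombstones must only be consumed by the *other* heap — exactly the kind of detail that's easy to get wrong under interview pressure.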
Taking 27 milliseconds just to start the interpreter.
Very useful when you need it!
https://en.wikipedia.org/wiki/D-ary_heap
(Also I didn't know they were called d-heaps, thanks!)