A Comparison of Cache Aware and Cache Oblivious Static Search Trees Using Program Instrumentation
The solution (at least for some problems) is just to lay your data out in memory better - optimal is a binary search tree in an array, like the way heaps are implemented.
That's what I did for bcache: http://evilpiepirate.org/git/linux-bcache.git/tree/drivers/m...
I have some really ugly things to say to a new poster who cannot be bothered to read an article before trying to show off, but I'll refrain. In the future, read first, think, then comment. You'll be doing everyone else a favor.
One of the first things to learn about caches is how dependent modern processors are on prefetching, and a binary search just completely breaks prefetching.
The basic idea is to allow your compiler to use predicated instructions. Do do this, you have to transform control dependencies (e.g., `if (a > b)`) to data dependencies on index computations (e.g., `j = 2*j + (a > b)`).
For a complete explanation, see http://algo2.iti.kit.edu/sanders/papers/ssss.ps.gz
Not quite. The heap layout has bad spatial locality. Think about the last row of leaves - it occupies the last 50% of the array. The next-to-last-row (all their parents) is in the middle 25%-50%. All those leaves are far away from their parents, and thus the heap is not optimal.
Better is to cluster parents and children into small groups. With clusters of three, two thirds of the time a child will adjacent to its parent.
The beautiful thing about the heap layout is that you can prefetch more than one level ahead. If your keys are 4 bytes, 16 of them fit on a cacheline, and you can prefetch 4 levels ahead with a single cacheline.
That means that if nothing is in cache, worst case each loop iteration will take < 20 ns.
With any kind of clustering (really, it's just a variation on B-trees) you touch less memory but you can't prefetch as far ahead.
In practice I suspect a 4-ary heap would perform better than a binary heap for my application, but there's a bunch of math I'd have to work out again that was hard enough for the binary tree version.
Not quite sure how what you're describing would work, but I can't see offhand how you'd implement it without losing the prefetching the heap layout gives you.