undefined | Better HN

0 pointskoverstreet12y ago0 comments

There's two reasons a textbook b+ tree is never going to perform very well:

* Binary searches suck

The memory accesses of a binary search look pretty much like complete random access, so prefetching is useless. Once your b+ tree is big enough that most of it won't fit in L2, you've got a tight inner loop that's waiting on DRAM at every single iteration.

This one is at least more or less solvable without changing the rest of the design, but then if you also want to solve the other main issue it gets harder.

* With small btree nodes, the overhead of looking up and locking btree nodes (even assuming they're all cached in memory) kills you.

Trouble is if you make your btree nodes big enough to fix that overhead, they're now so big that rewriting them when you insert a single key is just ridiculous.

So where you end up is to get anywhere near the highest possible performance you're forced into log structured btree nodes. Which actually has its advantages for solving the first problem, but now what you've got is really a hybrid b+ tree/compacting data structure.

Source: bcache author.

0 comments

4 comments · 2 top-level

elbee12y ago· 2 in thread

I'm not talking about a b-tree that won't fit in L2, I'm talking about a B-tree that won't even fit in main memory. In those cases, even with SSDs, the cost of pulling a page off disk dominates the in-memory cost of the binary search so the number of I/Os needed to find something becomes the dominant factor in performance. That leads to a focus on data density and cache replacement algorithms.

"Binary searches suck"

That is a valid point. Cache misses when doing the search inside of the b-tree can be painful. I know that there is some research into structuring the b-tree nodes to be more cache-friendly (David Lomet mentions this in his "Evolution of Effective B-Tree" paper: http://research.microsoft.com/pubs/77583/p64-lomet.pdf) but haven't actually seen any in production.

Part of the problem here is that page density is so critical that people are often wary of having a less-efficient, but more cache-friendly data representaton. It is normally better to have 10% more records in main memory than being able to search the nodes a touch faster. Various forms of key prefix compression are an example of this tradeoff.

Another consideration is the actual cost of doing the key comparison. I'm most familiar with relational database systems and they all support multi-column keys (e.g string+integer+boolean+date), which normally lead to an expensive, branch-laden comparison function because the comparison has to be type-aware and support various collation options. The cost of doing that branching means the CPU cache misses are a smaller overall part of the overall search cost. One exception to that is ESENT, which uses memcmp-able normalized keys, but doing that creates its own set of problems (Unicode is a pain, lack of denormalization means storing data twice, etc.).

"Rewriting them when you insert a single key is just ridiculous."

I don't think anyone does this. Most implementations have settled on a system that stores the keys in the node in an ad-hoc fashion and have an array of 'pointers' to the keys which starts at the end of the page and grows towards the front (except for Postgres, which appears to do it the other way around). Inserting a new key means moving the pointers, but not the actual keys. Some systems try to avoid doing even that -- for example the InnoDB engine in MySQL links together the keys and the page directory only points to every 4th-8th key. I believe that this is an attempt to balance the speed of the binary search against the cost of update/delete.

In addition, with write-ahead-logging the node isn't flushed when updated, instead the log records describing the update are flushed and the page is lazily flushed in the background by a checkpointing process, hopefully after several other updates have been made to the same node.

Microsoft SQL Server, Microsoft ESENT, MySQL's InnoDB and Postgres' b-tree index all use minor variations on the basic textbook b-tree for their data storage.

Source: Many years working on the ESENT database engine, a brief stint in Microsoft SQL Server, some hacking done on the MySQL InnoDB engine.

cdavid12y ago

I am looking into B+-trees for some sparse array implementations, but little experience with b+-trees (or more advanced features). Do you have recommendation of code and papers to read besides the one already cited ?

elbee12y ago

This reading list will cover a lot the current generation of b-tree techniques, but not the cutting-edge stuff (e.g. Tokutek):

* Ubiquitous B-Tree (Douglas Comer): http://doi.acm.org/10.1145/356770.356776

[A good review of the state-of-the-art in 1979]

* Prefix B-trees (Bayer, Rudolf and Unterauer, Karl): http://doi.acm.org/10.1145/320521.320530

[This paper introduces the idea of what I'm used to calling suffix compression -- the internal page pointers in a B-tree don't need to store the entire key, just enough of it to uniquely identify the node compared to its neighbor. This can be a very useful optimization for some use cases. The idea of reassembling a key as you seek down the tree is cool, but I don't think anyone does it in practice.]

* B-tree indexes, interpolation search, and skew (Goetz Graefe): http://doi.acm.org/10.1145/1140402.1140409

[A lot of interesting ideas around doing search inside of a node, including cache awareness. Interpolated search looks very interesting, but I was never able to get it to go faster than binary search. That could have been my fault though.]

* A survey of B-tree locking techniques (Goetz Graefe): http://doi.acm.org/10.1145/1806907.1806908

[I really like Graefe's database survey papers. They are easy to read and practical/implementation focused. This one discusses a lot of important concepts: the difference between latches and locks, latch coupling, range locks and increment locks.]

nkurz12y ago

The memory accesses of a binary search look pretty much like complete random access

Why would this be? I would think that even in a naive implementation the fan-out is high enough that top levels of the B+ tree will always be in cache, so that you are only hitting RAM for the last level or two and the leaf. For in memory use, I thought the usual argument against binary search was (as 'elbee' mentions) the poor branch predictability, not the caching.

j / k navigate · click thread line to collapse