It would also be good to know more about the skip list implementation they compared against; their description in VI.A doesn't sound like any concurrent skip list I'm aware of (e.g., Java's java.util.concurrent.ConcurrentSkipListMap). They don't say what all their BW-tree implementation includes, but if it's just the data structure, 10k lines of C++ is an order of magnitude larger than even pretty complex concurrent skip lists.
[1] http://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTI...
When walking a B+ tree index, the usual code often uses latches to have exclusive access to the index pages being walked from root on the way down so that the pages won't be split due to overflow caused by other threads since the walking thread itself can cause a split and needs to updates the walked pages. Smarter implementation would shorten the list of locked pages when it finds a page with enough room that won't split even if its child pages split. This shortens the scope of the locking of the walk but there are still pages being locked.
This can be a single point of contention when a lot of index walking happen. Pretty much most db operations touch the index. This is especially problematic in modern memory rich systems where most hot data are paged into memory so the locking of the index during walking stuck out like a sore thumb.
A latch-free B+ tree would allow multiple threads to walk the index at the same time, thus removing the single point of contention and allowing massive scaling with more threads added.
The terminology clash in this case is unfortunate, because I'd wager that 90% of the people active in the field of concurrent data structures will use 'lock' rather than 'latch'.
It also was part of the conceptual idea of the very cool demo language ANIC[2].
[1] http://research.microsoft.com/en-us/news/features/hekaton-12...
[1] http://www.cs.technion.ac.il/~erez/Papers/lfbtree-full.pdf
Those benchmarks seem too good to be true. And reading the associated information they do look like that they are too good to be true. They claim zero-copy access to the database, which is great. But that probably means that in their read benchmarks, they are just returning a pointer to a record, and are probably not reading the actual record from disk (or for databases that live entirely in memory: into CPU cache). This gives an unfair view of the performance compared to databases that do read the data into memory (or CPU cache). While it's great that the database itself doesn't read the record, lets face it, most clients will need the actual record and not just a pointer to it. That is after all the point of a database. This also explains the unreal performance for large records. They do 30 million reads of 100 kilobyte records per second. If they were actually reading the records that would mean that their disk is doing 3 terabytes per second throughput. I want that disk!!! The hard disk and SSD also have exactly the same performance, so that means that they aren't even hitting the disk at all. So yes, they are cheating.
It was almost like MS completely missed the technology shift due to their glacial release cycles. But maybe I was wrong.
The key feature of B+Trees is that they are optimized to allow sequential scans through the index - I suppose SSDs don't "need" the sequential scan property, but it doesn't hurt, and pragmatically would still reduce the number of disk reads required to perform a scan of the index.
“We had an ‘aha’ moment,” Lomet recalls, “when we realized that a single table that maps page identifiers to page locations would enable both latch-free page updating and log-structured page storage on flash memory. The other highlight, of course, was when we got back performance results that were stunningly good.”
The Bw-tree team first demonstrated its work in March 2011 during TechFest 2011, Microsoft Research’s annual showcase of cutting-edge projects. The Bw-tree performance results were dazzling enough to catch the interest of the SQL Server product group.
“When they learned about our performance numbers, that was when the Hekaton folks started paying serious attention to us,” researcher Justin Levandoski says. “We ran side-by-side tests of the Bw-tree against another latch-free technology they were using, which was based on ‘skiplists.’ The Bw-tree was faster by several factors. Shortly after that, we began engaging with the Hekaton team, mainly Diaconu and Zwilling.”
“A skiplist is often the first choice for implementing an ordered index, either latch-free or latch-based, because it is perceived to be more lightweight than a full B-tree implementation”, says Sudipta Sengupta, senior researcher in the Communication and Storage Systems Group. “An important contribution of our work is in dispelling this myth and establishing that latch-free B-trees can perform way better than latch-free skiplists. The Bw-tree also performs significantly better than a well-known latch-based B-tree implementation—BerkeleyDB—that is widely regarded in the community for its good performance.”
[1] http://research.microsoft.com/en-us/news/features/hekaton-12...