The Bw-Tree: A B-tree for New Hardware

(research.microsoft.com)

143 points · motter · 13y ago · 81 comments

sparky · 13y ago · 7 in thread

Lock-free B+ trees seem like a natural and good idea. However, it's hard to evaluate how this one compares to previous work, in part because this paper uses completely different terminology than any paper I've ever read on lock-free or wait-free data structures. For starters, it uses 'latch' and 'latch-free' probably a hundred times, in lieu of the ubiquitous 'lock' and 'lock-free'. I gather from Google that this is an Oracle thing [1]; they call spinlocks 'latches' and more complicated queueing locks 'enqueues'.

It would also be good to know more about the skip list implementation they compared against; their description in VI.A doesn't sound like any concurrent skip list I'm aware of (e.g., Java's java.util.concurrent.ConcurrentSkipListMap). They don't say what all their BW-tree implementation includes, but if it's just the data structure, 10k lines of C++ is an order of magnitude larger than even pretty complex concurrent skip lists.

[1] http://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTI...

ww520 · 13y ago

Latch is a popular term in the database world. It's used to avoid confusion with the term lock in a database. A lock in an RDBMS is usually associated with transactions, data in the tables, data consistency, etc. It has long-term semantics, and deadlock can be a problem due to user actions. A latch is the same as a lock in the traditional CS sense. It's a mutex or semaphore in memory that protects a shared in-memory data structure, e.g. a paged-in index page. The locked duration is usually very short, and deadlock is not a problem due to user actions.

When walking a B+ tree index, the usual code often uses latches to get exclusive access to the index pages being walked from the root on the way down, so that other threads cannot split those pages on overflow; the walking thread itself can cause a split and then needs to update the walked pages. A smarter implementation shortens the list of locked pages when it finds a page with enough room that it won't split even if its child pages split. This narrows the scope of the locking during the walk, but some pages are still locked.

This can be a single point of contention when a lot of index walking happens. Pretty much every db operation touches the index. This is especially problematic in modern memory-rich systems where most hot data is paged into memory, so the locking of the index during walking sticks out like a sore thumb.

A latch-free B+ tree would allow multiple threads to walk the index at the same time, thus removing the single point of contention and allowing massive scaling with more threads added.
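The latch-based walk described above (hold latches on the way down, drop the ancestors once a page has room to absorb a child split) can be sketched roughly as follows. This is only an illustration under assumptions: `Node`, `descend_for_insert`, `release_all`, and the fixed `capacity` are hypothetical names, and a real engine would latch buffer-pool pages rather than Python objects.

```python
import threading
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)
class Node:
    """Hypothetical in-memory index page guarded by one latch."""
    keys: List[int]
    children: List["Node"] = field(default_factory=list)
    capacity: int = 4
    latch: threading.Lock = field(default_factory=threading.Lock)

    def is_leaf(self) -> bool:
        return not self.children

    def is_safe_for_insert(self) -> bool:
        # A page with spare room can absorb a child split without splitting.
        return len(self.keys) < self.capacity


def descend_for_insert(root: Node, key: int) -> List[Node]:
    """Walk from the root to the leaf for `key`; return the pages still latched."""
    root.latch.acquire()
    held = [root]
    node = root
    while not node.is_leaf():
        # Keys >= a separator go to the child on its right.
        idx = sum(1 for k in node.keys if key >= k)
        child = node.children[idx]
        child.latch.acquire()
        if child.is_safe_for_insert():
            # A split cannot propagate above `child`, so release the ancestors.
            for ancestor in held:
                ancestor.latch.release()
            held = []
        held.append(child)
        node = child
    return held


def release_all(held: List[Node]) -> None:
    """Release every latch returned by descend_for_insert."""
    for page in held:
        page.latch.release()
```

Inserting into a full leaf keeps the root latched for the whole operation, which is the contention point the comment describes; inserting into a leaf with room leaves only that leaf latched.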

sparky · 13y ago

Thanks for clarifying! It makes sense to use a different word in the DB community if Lock has some other previous meaning. It's easy to forget that RDBMS terminology has been around longer than most areas of CS.

The terminology clash in this case is unfortunate, because I'd wager that 90% of the people active in the field of concurrent data structures will use 'lock' rather than 'latch'.

pnathan · 13y ago

As a note, latch is also used in the hardware world[1]. It's used to implement memory and is often connected to a clock.

It also was part of the conceptual idea of the very cool demo language ANIC[2].

[1] https://en.wikipedia.org/wiki/Latch_%28electronics%29

[2] http://code.google.com/p/anic/wiki/Tutorial

sparky · 13y ago

That was the primary reason I was confused; "WTF does it mean for software to not use latches?" :)

NatW · 13y ago

re "skiplist", from my other comment: "“A skiplist is often the first choice for implementing an ordered index, either latch-free or latch-based, because it is perceived to be more lightweight than a full B-tree implementation”, says Sudipta Sengupta, senior researcher in the Communication and Storage Systems Group. “An important contribution of our work is in dispelling this myth and establishing that latch-free B-trees can perform way better than latch-free skiplists. The Bw-tree also performs significantly better than a well-known latch-based B-tree implementation—BerkeleyDB—that is widely regarded in the community for its good performance.”

[1] http://research.microsoft.com/en-us/news/features/hekaton-12...

continuations · 13y ago

Are there any open source implementations of lock-free B+ trees? I'd love to play with one.

sparky · 13y ago

This is the closest I've found [1]; it's an expanded version of a SPAA paper by the same name. It's not executable, but it's pretty detailed and C-like as paper pseudocode goes :-/

[1] http://www.cs.technion.ac.il/~erez/Papers/lfbtree-full.pdf

jburgueno · 13y ago · 5 in thread

20x faster than BerkeleyDB is quite impressive. Would love to see an implementation of this.

bigtones · 13y ago

You will, it's built into the next version of SQL Server.

alpb · 13y ago

Umm, how do you know? Do you work at Microsoft?

snaky · 13y ago

BerkeleyDB is actually not so fast compared to modern alternatives:

http://symas.com/mdb/microbench/

jules · 13y ago

Lies, damned lies, statistics, and benchmarks.

Those benchmarks seem too good to be true. And reading the associated information, they do look too good to be true. They claim zero-copy access to the database, which is great. But that probably means that in their read benchmarks, they are just returning a pointer to a record, and are probably not reading the actual record from disk (or for databases that live entirely in memory: into CPU cache). This gives an unfair view of the performance compared to databases that do read the data into memory (or CPU cache). While it's great that the database itself doesn't read the record, let's face it, most clients will need the actual record and not just a pointer to it. That is, after all, the point of a database. This also explains the unreal performance for large records. They do 30 million reads of 100 kilobyte records per second. If they were actually reading the records, that would mean that their disk is doing 3 terabytes per second throughput. I want that disk!!! The hard disk and SSD also have exactly the same performance, so that means that they aren't even hitting the disk at all. So yes, they are cheating.
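The throughput figure in the comment above is easy to check. A small sketch, assuming decimal units (1 kilobyte = 1,000 bytes, 1 terabyte = 10^12 bytes); the function name is made up for illustration:

```python
def implied_throughput_tb_per_s(reads_per_second: int, record_bytes: int) -> float:
    """Bytes that must move per second if every read touches the whole record."""
    bytes_per_second = reads_per_second * record_bytes
    return bytes_per_second / 1e12  # decimal terabytes


# 30 million reads per second of 100-kilobyte records, as quoted in the comment.
print(implied_throughput_tb_per_s(30_000_000, 100 * 1000))  # 3.0
```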

ldng · 13y ago

"MDB write performance when using its writable mmap option." Not really comparing apples to apples, is it? It doesn't seem fair to me.

ttrreeww · 13y ago · 4 in thread

I wonder how many patents Microsoft filed on this tree.

gwern · 13y ago

A search in Google Patents for "BW-tree" or "BW tree" turns up nothing (but I don't know how fast their database is updated or how long patents can be hidden or delayed).

caf · 13y ago

It's likely that if it is patented, it's filed under an anodyne name like "A system and method for indexing data".

ww520 · 13y ago

That's very good.

alpb · 13y ago

What makes you think Google, Oracle or any other corporation that funds its research division with a couple of billion dollars wouldn't file a patent?

davvid · 13y ago · 3 in thread

Does anyone have any idea about how this compares to Google's btree?

https://code.google.com/p/cpp-btree/

ww520 · 13y ago

BTree and B+ Tree are different animals.

lvh · 13y ago

But both this Bw-tree and the implementation the parent linked to claim to be "B-trees" (variants thereof), so I'm not sure why that's relevant. (Apart from the fact that B+ trees are just another B-tree variant themselves.)

matt4711 · 13y ago

Looks like this does not support concurrent lock-free operations.

CoolGuySteve · 13y ago · 2 in thread

I hope this makes its way into ReFS or some other Windows filesystem. A friend who used to work on the NTFS team told me ReFS was B-tree based, which disappointed me as B-Trees are ill-suited to SSDs.

It was almost like MS completely missed the technology shift due to their glacial release cycles. But maybe I was wrong.

etrain · 13y ago

Since when are B-Trees ill-suited to SSDs? The big idea behind B-Trees is to store pages of keys, and SSDs still operate on pages.

The key feature of B+Trees is that they are optimized to allow sequential scans through the index - I suppose SSDs don't "need" the sequential scan property, but it doesn't hurt, and pragmatically would still reduce the number of disk reads required to perform a scan of the index.

CoolGuySteve · 13y ago

B-Trees are all fine and nice, and perfectly adequate for SSDs, but log structured filesystems provide better wear leveling and garbage collection, even with TRIM support.

NatW · 13y ago · 1 in thread

More context: "Adhering to the “latch-free” philosophy, the Bw-tree delivered far better processor-cache performance than previous efforts.

“We had an ‘aha’ moment,” Lomet recalls, “when we realized that a single table that maps page identifiers to page locations would enable both latch-free page updating and log-structured page storage on flash memory. The other highlight, of course, was when we got back performance results that were stunningly good.”

The Bw-tree team first demonstrated its work in March 2011 during TechFest 2011, Microsoft Research’s annual showcase of cutting-edge projects. The Bw-tree performance results were dazzling enough to catch the interest of the SQL Server product group.

“When they learned about our performance numbers, that was when the Hekaton folks started paying serious attention to us,” researcher Justin Levandoski says. “We ran side-by-side tests of the Bw-tree against another latch-free technology they were using, which was based on ‘skiplists.’ The Bw-tree was faster by several factors. Shortly after that, we began engaging with the Hekaton team, mainly Diaconu and Zwilling.”

“A skiplist is often the first choice for implementing an ordered index, either latch-free or latch-based, because it is perceived to be more lightweight than a full B-tree implementation”, says Sudipta Sengupta, senior researcher in the Communication and Storage Systems Group. “An important contribution of our work is in dispelling this myth and establishing that latch-free B-trees can perform way better than latch-free skiplists. The Bw-tree also performs significantly better than a well-known latch-based B-tree implementation—BerkeleyDB—that is widely regarded in the community for its good performance.”

[1] http://research.microsoft.com/en-us/news/features/hekaton-12...
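To make the quoted mapping-table idea a bit more concrete, here is a rough sketch of a table from page identifiers to page state, where an update is installed by one compare-and-swap on the table slot instead of under a latch. It is not the Bw-tree implementation: `MappingTable`, `Delta`, `upsert`, and `lookup` are hypothetical names, consolidation, splits, and flash storage are left out, and since Python exposes no user-level CAS, a small internal lock stands in for the hardware instruction.

```python
import threading
from typing import Any, Dict, Optional


class Delta:
    """One update record; `next` points at the older state of the page."""

    def __init__(self, key: Any, value: Any, next: Optional["Delta"]) -> None:
        self.key = key
        self.value = value
        self.next = next


class MappingTable:
    """Maps a page id to the head of that page's chain of delta records."""

    def __init__(self) -> None:
        self._slots: Dict[int, Optional[Delta]] = {}
        # Assumption: this lock only emulates an atomic compare-and-swap.
        self._cas_guard = threading.Lock()

    def read(self, pid: int) -> Optional[Delta]:
        return self._slots.get(pid)

    def compare_and_swap(self, pid: int, expected: Optional[Delta],
                         new: Optional[Delta]) -> bool:
        with self._cas_guard:
            if self._slots.get(pid) is expected:
                self._slots[pid] = new
                return True
            return False


def upsert(table: MappingTable, pid: int, key: Any, value: Any) -> None:
    """Prepend a delta to the page; retry if another writer got there first."""
    while True:
        head = table.read(pid)
        if table.compare_and_swap(pid, head, Delta(key, value, head)):
            return


def lookup(table: MappingTable, pid: int, key: Any) -> Optional[Any]:
    """Scan newest to oldest, so the most recent write for `key` wins."""
    node = table.read(pid)
    while node is not None:
        if node.key == key:
            return node.value
        node = node.next
    return None
```

Writers never block each other on the page itself: a writer that loses the race simply re-reads the slot and retries, and readers walk whatever chain was current when they looked up the slot.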

hyc_symas · 13y ago

Performance of skiplists vs Btrees was already debunked 7 years ago, at least. So nice try M$ but as usual you're late to the party, not advancing the state of the art.

http://resnet.uoregon.edu/~gurney_j/jmpc/skiplist.html

jmgrosen · 13y ago · 1 in thread

I'm glad that Microsoft Research publishes their studies for free like this instead of having to pony up for it through the IEEE -- this certainly looks intriguing!

raccer · 13y ago

Seriously, anytime I find a company that freely shares the details of a newer faster way, I'm more a fan. Though with the negative points MS has earned, they're still in the red in my book.
