Edit: I don't mean to sound snarky, I'm definitely impressed.
What I can see from their code, is that they rely on a compaction strategy that is very similar to stock LevelDB (in terms of selecting the next uncompacted SSTs within a level), that they haven't done anything to improve multi-threaded performance, and that they've invested a lot of work into computing if/when writes should be delayed within the LevelDB code. We've drastically changed the compaction strategy, begun to improve concurrency (I know for a fact that we have opportunities to improve it further), and we believe that throttling writes at the storage layer is not the correct level to make such decisions, so we've removed the code which does so.
I guess the most fair thing to say is that we've taken complementary approaches, and nothing from either approach is not portable to the other.
The write multi-threading does not help Basho's Riak 1.x series. We parallelize by using multiple leveldb database (Riak term vnodes) and do NOT parallelize individual database until 2.0. Therefore, unfair to measure that feature agains Riak.
Our compaction code is adjusted with an emphasis on running multiple compactions of varied priority. Our use of multiple databases got hung up on leveldb's single compaction thread. Again, this complete difference of environments would be unfair in a Hyperleveldb direct comparison.
And the most difficult issue for dropping hyper into Riak is that our leveldb performs the write throttling, hyperleveldb leaves that to HyperDex. This is yet again an environment design decision ... but says coding is require to make hyperleveldb "just work" with Riak. That will be a while.
I therefore do not claim that Basho's leveldb would be better with HyperDex and suspect that today's hyperleveldb would not be better with Basho's Riak. We optimized to our different pain points.
Matthew
1. Multi-threaded synchronized write ordering. I'm interested in the internal synchronization mechanism...
"LevelDB uses very coarse-grained synchronization which forces all writes to proceed in an ordered, first-come-first-served fashion, effectively reducing throughput to that of a single thread. HyperLevelDB increases concurrency by allowing multiple threads to agree on the order of their respective writes, and then independently apply the writes in a manner consistent with the agreed-upon order."
2. Tuned write delay on compaction. What external instrumentation/markers are they passing into hyperleveldb to tune write delay?
"HyperLevelDB removes this artificial delay, allowing the application (in our case, HyperDex) to independently decide to delay writes, using information available outside the scope of LevelDB."
3. Tuned intra-level re-writes.
"LevelDB's compaction algorithm is not efficient, and in the "fillrand" benchmark will, on average, rewrite 3MB of data in the upper level for every 1MB of data in the lower level. HyperLevelDB avoids this waste by selecting the compaction with the smallest overhead."
1. Our internal synchronization mechanism is a simple change. The stock LevelDB does the following:
place our current on the back of wait_queue
wait for (our thread to be head of wait_queue || thread ahead of us to do our work)
if work done: exit
possibly build a batch of our writes and
append data to the log
insert data into the memtable
signal the next writer, and any writer whose work we finished
HyperLevelDB does this a little differently. We made the log and the memtable concurrent datastructures, so that multiple threads can write to each one at a time. We then do a little synchronization to ensure that we don't reveal the writes to readers in the wrong order. get a ticket, indicating the order of our writes
insert the data into the log
insert the data into the memtable
wait for writes with a lower token to complete
For the actual implementations, check out the code for LevelDB (Lines 1135-1196 of https://github.com/rescrv/HyperLevelDB/blob/28dad918f2ffb80f...) and HyperLevelDB (Line 1307-1428 of https://github.com/rescrv/HyperLevelDB/blob/master/db/db_imp...).Effectively, this change moves from a model where there is exactly one writer at a time, to one where the bulk of the work (inserting into log/memtable) is done in parallel by writer threads.
2. LevelDB provides a GetProperty call. We can inspect the number of files in Level-0 and back-off where appropriate. There is no write delay in LevelDB itself. By the end-to-end principle, the storage server is in a better position to decide whether to delay writes, or just keep pushing them into the database.
What alternatives did you test that performed better than LevelDB?
libtoolize
aclocal
autoheader
automake --add-missing
autoconf
./configure
make
No performance numbers 'cause I did it on a slow machine though. autoreconf -I
./configure
makeUnder the compaction section, a chart does show a dramatic improvement on the fillrand benchmark. But since HyperLevelDB removes the LevelDB write delay when compaction falls behind, writes in HyperLevelDB will continue at full speed even as compaction falls extremely far behind. So, yes, the data for the benchmark was technically written into the db, but at the costs of making reads insanely expensive.
Most reads need to touch every level-0 file. If compaction moving data out of level-0 fell 10GB behind, then there would be hundreds of level-0 files, and most reads would need to check all of them.
If the benchmark had mixed in one random read for every random write, I assume HyperLevelDB would have been dramatically slower than LevelDB.
But if the changes in what to pick for compaction made HyperLevelDB non-trivially faster, that would be quite interesting. But I can't tell if they did.