undefined | Better HN

0 pointsgavinray2y ago0 comments

The argument is that:

- Queries can trigger blocking page faults when accessing (transparently) evicted pages, causing unexpected I/O stalls

- mmap() complicates transactionality and error-handling

- Page table contention, single-threaded page eviction, and TLB shootdowns become bottlenecks

0 comments

1 - for reading any uncached data, the I/O stalls are unavoidable. Whatever client requested that data is going to have to wait regardless.

2 - complexity? this is simply false. LMDB's ACID txns using MVCC are much simpler than any "traditional" approach.

3 - contention is a red herring since this approach is already single-writer, as is common for most embedded k/v stores these days. You lose more perf by trying to make the write path multi-threaded, in lock contention and cache thrashing.

RaisingSpear2y ago

> for reading any uncached data, the I/O stalls are unavoidable.

Excuse me for a silly question, but whilst an I/O stall may be unavoidable, wouldn't a thread stall be avoidable if you're not using mmap?

Assuming that you're not swapping, you'll generally know if you've loaded something into memory or not, whilst mmap doesn't help you know if the relevant page is cached. If the data isn't in memory, you can send the I/O request to a thread to retrieve it, and the initiating thread can then move onto the next connection. I suspect this isn't doable under mmap based access?

pclmulqdq2y ago

It's kind of disingenuous to talk about how great your concurrency system is when you only allow a single writer. RCU (which I imagine your system is isomorphic to) is pretty simple compared to what many DB engines use to do ACID transactions that involve both reads and writes.

hyc_symas2y ago

You don't need more than single-writer concurrency if your write txns are fast enough.

Our experience with OpenLDAP was that multi-writer concurrency cost too much overhead. Even though you may be writing primary records to independent regions of the DB, if you're indexing any of that data (which all real DBs do, for query perf) you wind up getting a lot of contention in the indices. That leads to row locking conflicts, txn rollbacks, and retries. With a single writer txn model, you never get conflicts, never need rollbacks.

jandrewrogers2y ago

> You don't need more than single-writer concurrency if your write txns are fast enough.

This only works on systems with sufficiently slow storage. If your server has a bunch of NVMe, which is a pretty normal database config these days, you will be hard-pressed to get anywhere close to the theoretical throughput of the storage with a single writer. That requires 10+ GB/s sustained. It is a piece of cake with multiple writers and a good architecture.

Writes through indexing can be sustained at this rate (assuming appropriate data structures), most of the technical challenge is driving the network at the necessary rate in my experience.

1 more reply

pclmulqdq2y ago

That's probably because your OpenLDAP benchmarks used a tiny database. If you have multi-terabyte databases, you will start to see huge gains from a multi-writer setup because you will be regularly be loading pages from disk, rather than keeping almost all of your btree/LSM tree in RAM.

1 more reply

jeffffff2y ago

take a look at http://nms.csail.mit.edu/~stavros/pubs/OLTP_sigmod08.pdf - the overhead of coordinating multiple writers often makes multi-writer databases slower than single-writer databases. remember, everything has to be serialized when it goes to the write ahead log, so as long as you can do the database updates as fast as you can write to the log then concurrent writers are of no benefit.

pclmulqdq2y ago

This is another cool example of a toy database that is again very small:

> The database size for one warehouse is approximately 100 MB (we experiment with five warehouses for a total size of 500MB).

It is not surprising that when your database basically fits in RAM, serializing on one writer is worth doing, because it just plainly reduces contention. You basically gain nothing in a DB engine from multi-writer transactions when this is the case. A large part of a write (the vast majority of write latency) in many systems with a large database comes from reading the index up to the point where you plan to write. If that tree is in RAM, there is no work here, and you instead incur overhead on consistency of that tree by having multiple writers.

I'm not suggesting that these results are useless. They are useful for people whose databases are small because they are meaningfully better than RocksDB/LevelDB which implicitly assume that your database is a *lot* bigger than RAM.

2 more replies

AdamProut2y ago

Yeah for workloads with any long running write transactions a single writer design is a pretty big limitation. Say some long running data load (or a big bulk deletion) running along with some faster high throughput key value writes - the big data load would block all the faster key-value writes when it runs.

No "mainstream" database I'm aware of has a global single writer design.

j / k navigate · click thread line to collapse

0 comments

hyc_symas2y ago

1 - for reading any uncached data, the I/O stalls are unavoidable. Whatever client requested that data is going to have to wait regardless.

2 - complexity? this is simply false. LMDB's ACID txns using MVCC are much simpler than any "traditional" approach.

RaisingSpear2y ago

> for reading any uncached data, the I/O stalls are unavoidable.

Excuse me for a silly question, but whilst an I/O stall may be unavoidable, wouldn't a thread stall be avoidable if you're not using mmap?

pclmulqdq2y ago

hyc_symas2y ago

You don't need more than single-writer concurrency if your write txns are fast enough.

jandrewrogers2y ago

> You don't need more than single-writer concurrency if your write txns are fast enough.

Writes through indexing can be sustained at this rate (assuming appropriate data structures), most of the technical challenge is driving the network at the necessary rate in my experience.

1 more reply

pclmulqdq2y ago

1 more reply

jeffffff2y ago

pclmulqdq2y ago

This is another cool example of a toy database that is again very small:

> The database size for one warehouse is approximately 100 MB (we experiment with five warehouses for a total size of 500MB).

2 more replies

AdamProut2y ago

No "mainstream" database I'm aware of has a global single writer design.

j / k navigate · click thread line to collapse