> throughput increased by an order of magnitude almost immediately
But right near the start is the real story: the sync version had
> the classic fsync() call after every write to the log for durability
They are not comparing performance of sync APIs vs io_uring. They're comparing using fsync vs not using fsync! They even go on to say that a problem with async API is that
> you lose the durability guarantee that makes databases useful. ... the data might still be sitting in kernel buffers, not yet written to stable storage.
No! That's because you stopped using fsync. It's nothing to do with your code being async.
If you just removed the fsync from the sync code you'd quite possibly get a speedup of an order of magnitude too. Or if you put the fsync back in the async version (I don't know io_uring well enough to be sure, but it appears to be possible with "io_uring_prep_fsync") then that speedup would surely shrink back. Would the io_uring version still be faster either way? Quite possibly, but because they made an apples-to-oranges comparison, we can't know from this article.
(As other commenters have pointed out, their two-phase commit strategy also fails to provide any guarantee. There's no getting around fsync if you want to be sure that your data is really on the storage medium.)
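The difference doesn't need io_uring at all to demonstrate. Here's a minimal, hypothetical Python sketch (all names invented) comparing an fsync-per-write log against the identical log with fsync removed:

```python
import os
import tempfile
import time


def append_log(path, records, durable):
    """Append records to a log; fsync after each one only if durable=True."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        start = time.perf_counter()
        for rec in records:
            os.write(fd, rec)
            if durable:
                os.fsync(fd)  # wait for the record to reach stable storage
        return time.perf_counter() - start
    finally:
        os.close(fd)


records = [b"record %06d\n" % i for i in range(1000)]
with tempfile.TemporaryDirectory() as tmp:
    t_sync = append_log(os.path.join(tmp, "wal-fsync"), records, durable=True)
    t_buf = append_log(os.path.join(tmp, "wal-nofsync"), records, durable=False)
    print("fsync per write: %.3fs, no fsync: %.3fs" % (t_sync, t_buf))
```

On typical hardware the no-fsync run is one or more orders of magnitude faster, which is exactly the kind of gap the article attributes to io_uring; the buffered log simply provides no durability if the machine loses power.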
> No! That's because you stopped using fsync. It's nothing to do with your code being async.
From that section, it sounds like OP was tossing data into the io_uring submission queue and calling it "done" at that point (i.e. not waiting for the io_uring completion queue to indicate completion). So yes, fsync is needed, but they weren't even waiting for the kernel to start the write before indicating success.
I think to some extent things have been confused because io_uring has a completion concept, but OP also has a separate completion concept in their dual wal design (where the second WAL they call the "completion" WAL).
But I'm not sure if OP really took away the right understanding from their issues with ignoring io_uring completions, as they then create a 5 step procedure that adds one check for an io_uring completion, but still omits another.
> 1. Write intent record (async)
> 2. Perform operation in memory
> 3. Write completion record (async)
> 4. Wait for the completion record to be written to the WAL
> 5. Return success to client
Note the lack of waiting for the io_uring completion of the intent record (and yes, there's still no reference to fsync or an alternative, which is also wrong). There is no ordering guarantee between independent io_urings (OP states they're using separate io_uring instances for each WAL), and even within the same io_uring there is limited ordering around completions. IOSQE_IO_LINK exists, but it doesn't traverse submission boundaries, so it won't work here because OP submits the work in separate batches. They'd need IOSQE_IO_DRAIN, which would effectively serialize their writes, which is why it seems OP would need to actually wait for completion of the intent write.
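In plain blocking-syscall terms, the durable version of those five steps looks something like this hypothetical sketch (class and record formats invented); the two os.fsync calls stand in for waiting on the corresponding durability notifications, e.g. the CQEs of io_uring fsync operations:

```python
import os


class DualWAL:
    """Hypothetical dual-WAL store: durability comes from flushing BOTH
    logs before acknowledging, not from the writes being async or not."""

    def __init__(self, intent_path, completion_path):
        flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND
        self.intent_fd = os.open(intent_path, flags, 0o644)
        self.completion_fd = os.open(completion_path, flags, 0o644)
        self.memory = {}

    def put(self, key, value):
        op_id = len(self.memory)
        # 1. Write intent record
        os.write(self.intent_fd, b"intent %d %s=%s\n" % (op_id, key, value))
        # 2. Perform operation in memory
        self.memory[key] = value
        # 3. Write completion record
        os.write(self.completion_fd, b"done %d\n" % op_id)
        # 4. Wait for BOTH records to reach stable storage --
        #    the intent-side wait is the step the post omits.
        os.fsync(self.intent_fd)
        os.fsync(self.completion_fd)
        # 5. Only now is it safe to return success to the client.
        return True
```

The point of the sketch is only the ordering: if step 4 covers just the completion WAL, a crash can leave a durable completion record pointing at an intent record that never made it to disk.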
Just to emphasize again that this blog post here is really quite different, since it does not fsync and breaks durability.
Not what we do in TigerBeetle or would recommend or encourage.
As I said, I don't know anything about fsync in io_uring. Maybe that has more control?
An article that did a fair comparison, by someone who actually knows what they're talking about, would be pretty interesting.
To deal with the risk of data loss, multiple such servers are used, in the hope that if one server dies before syncing, another server to which the data was replicated performs an fsync without failure.
That's not correct; io_uring supports O_DIRECT write requests just fine. Obviously bypassing the cache isn't the same as just flushing it (which is what fsync does), so there are design impacts.
But database engines are absolutely the target of io_uring's feature set and they're expected to be managing this complexity.
io_uring includes an fsync opcode (with range support). When folks talk about fsync generally here, they're not saying the io_uring is unusable, they're saying that they'd expect the fsync to be used whether it's via the io_uring opcode, the system call, or some other mechanism yet to be created.
My point was really: you can't magically get the performance benefits of omitting fsync (or functional equivalent) while still getting the durability guarantees it gives.
For example, we never externalize commits without full fsync, to preserve durability [0].
Further, the motivation for why TigerBeetle has both a prepare WAL plus a header WAL is different, not performance (we get performance elsewhere, through batching) but correctness, cf. “Protocol-Aware Recovery for Consensus-Based Storage” [1].
Finally, TigerBeetle's recovery is more intricate, we do all this to survive TigerBeetle's storage fault model. You can read the actual code here [2] and Kyle Kingsbury's Jepsen report on TigerBeetle also provides an excellent overview [3].
[0] https://www.youtube.com/watch?v=tRgvaqpQPwE
[1] https://www.usenix.org/system/files/conference/fast18/fast18...
[2] https://github.com/tigerbeetle/tigerbeetle/blob/main/src/vsr...
> During recovery, I only apply operations that have both intent and completion records. This ensures consistency while allowing much higher throughput.
Does this mean that a client could receive a success for a request, which if the system crashed immediately afterwards, when replayed, wouldn’t necessarily have that request recorded?
How does that not violate ACID?
Yup. OP says "the intent record could just be sitting in a kernel buffer", but then the exact same issue applies to the completion record. So confirmation to the client cannot be issued until the completion record has been written to durable storage. Not really seeing the point of this blogpost.
So I fail to see how the two async writes are any guarantee at all. It sounds like they just happen to provide better consistency than the one async write because it forces an arbitrary amount of time to pass.
Seems like OP’s async approach removes that, so there’s no durability guarantee, so why even maintain a WAL to begin with?
Presumably the intent record is large (containing the key-value data) while the completion record is tiny (containing just the index of the intent record). Is the point that the completion record write is guaranteed to be atomic because it fits in a disk sector, while the intent record doesn't?
Write intent record (async)
Perform operation in memory
Write completion record (async)
*Wait for intent and completion to be flushed to disk*
Return success to client
> if you wait for both to complete, then how can it be faster than doing a single IO?

I don't think this is necessarily the case, because the operations may have completed in a different order to how they are recorded in the intent log.
During recovery, since the server applies only the operations which have both records, you will not recover a record which was successful to the client.
-----------------
So the protocol ends up becoming:
Write intent record (async)
Perform operation in memory
Write completion record (async)
Return success to client
-----------------
In other words, the client only knows it's a success when both WAL files have been written.
The goal is not to provide faster responses to the client on the first intent record, but to ensure that the system is not stuck waiting on fsync requests.
When you write a ton of data to a database, you often see that it's not the core writes but the fsync I/O that eats a ton of your resources. Cutting back on that mess means you can push more performance out of a write-heavy server.
Second, the durability is the same as fsync. The client only gets reported a success if both WAL writes have been done.
It's the same guarantee as fsync, but you bypass the fsync bottleneck, which in turn lets you actually use the capabilities of your NVMe drives (and shifts resources away from I/O-blocking fsync).
Yes, it involves more management, because now you need to maintain two states instead of one with the synchronous fsync operation. But that's the thing about parallel programming: it's more complex, but you get a ton of benefits from it by bypassing synchronous bottlenecks.
I think this database doesn't have durability at all.
When you write data asynchronously, you do not need to wait for this confirmation. So by double-writing with two async requests, you make better use of all your CPU cores, since they are not stalled waiting for that I/O response. Seeing a 10x performance gain is not uncommon with a method like this.
Yes, you do need to check that both records are written and then report back to the client. But that is a non-fsync request and does not tax your system the way fsync writes do.
It has literally the same durability as an fsync write. You need to take into account that most databases were written 30, 40+ years ago, in a time when HDDs ruled and things like NVMe drives were a pipe dream. But most DBs still work the same, and treat NVMe drives as if they were HDDs.
Doing the above operation on an HDD will cost you 2x the performance, because you barely have 80 to 120 IOPS. But a cheap NVMe drive easily does 100,000 like it's nothing.
If you monitor an NVMe drive under database write load, you will notice that it is simply underutilized. This is why you see a lot more work on new storage layers for databases that better exploit NVMe capabilities (and try to bypass HDD-era bottlenecks).
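The established way to reclaim that NVMe headroom without giving up durability is batching (group commit): amortize one fsync over many queued writes rather than skipping it. A hypothetical sketch (function and names invented):

```python
import os


def group_commit(fd, pending):
    """Flush a batch of queued records with a single fsync.

    The durability cost becomes one fsync per batch instead of one per
    record, so throughput scales with batch size while every record
    acknowledged afterwards is genuinely on stable storage.
    """
    for rec in pending:
        os.write(fd, rec)
    os.fsync(fd)          # one flush covers the whole batch
    return len(pending)   # all of these may now be acknowledged
```

This is the kind of batching the TigerBeetle comments above attribute their performance to: the per-record fsync cost drops to roughly fsync_cost / batch_size, with no weakening of the guarantee.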
I don't think we can ensure this without knowing what fsync() maps to in the NVMe standard, and somehow replicating that. Just reading back is not enough, e.g. the hardware might be reading from a volatile cache that will be lost in a crash.
What mechanism can be used to check that the writes are complete if not fsync (or adjacent fdatasync)? What specific io_uring operation or system call?
These are the steps described in the post:
1. Write intent record (async)
2. Perform operation in memory
3. Write completion record (async)
4. Wait for the completion record to be written to the WAL
5. Return success to client
If 4 is done correctly then 3 is not needed - it can just wait for the intent to be durable before replying to the client. Perhaps there's a small benefit to speculatively executing the operation before the WAL is committed - but I'm skeptical and my guess is that 4 is not being done correctly. The author added an update to the article:

> This is tracked through io_uring's completion queue - we only send a success response after receiving confirmation that the completion record has been persisted to stable storage
This makes it sound like he's submitting write operations for the completion record and then misinterpreting the completion queue for those writes as "the record is now in durable storage".
I always use this approach for crash-resistance:
- Append to the data (WAL) file normally.
- Have a separate small file that is like a hash + length for WAL state.
- First append to WAL file.
- Start fsync call on the WAL file, create a new hash/length file with different name and fsync it in parallel.
- Rename the length file onto the real one for making sure it is fully atomic.
- Update in-memory state to reflect the files and return from the write function call.
Curious if anyone knows tradeoffs between this and doing double WAL. Maybe doing fsync on everything is too slow to maintain fast writes?
I learned about append/rename approach from this article in case anyone is interested:
- https://discuss.hypermode.com/t/making-badger-crash-resilien...
- https://research.cs.wisc.edu/adsl/Publications/alice-osdi14....
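A hypothetical sketch of that recipe (file names and record format invented; the two fsyncs are sequential here where the comment does them in parallel, and a real implementation would also fsync the containing directory after the rename):

```python
import hashlib
import os


def commit(wal_path, state_path, record):
    # 1. Append to the WAL file and fsync it.
    wal_fd = os.open(wal_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(wal_fd, record)
        os.fsync(wal_fd)
        length = os.fstat(wal_fd).st_size
    finally:
        os.close(wal_fd)

    # 2. Write hash + length to a sidecar file under a different name
    #    and fsync it.
    digest = hashlib.sha256(record).hexdigest().encode()
    tmp_path = state_path + ".tmp"
    st_fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(st_fd, b"%s %d\n" % (digest, length))
        os.fsync(st_fd)
    finally:
        os.close(st_fd)

    # 3. Atomically rename the sidecar onto the real state file, so a
    #    crash leaves either the old state or the new one, never a mix.
    os.rename(tmp_path, state_path)
```

On recovery, the WAL is trusted only up to the length recorded in the state file, and only if the hash checks out.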
The problem with naive async I/O in a database context at least, is that you lose the durability guarantee that makes databases useful. When a client receives a success response, their expectation is the data will survive a system crash. But with async I/O, by the time you send that response, the data might still be sitting in kernel buffers, not yet written to stable storage.
Shouldn't you just tie the successful response to a successful fsync?
Async or sync, I'm not sure what's different here.
While restoring:
1. Ignore all intents.
2. Use only operations with corresponding intents.
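A hypothetical replay sketch of that rule, pairing completion records back to intents (record formats invented for illustration):

```python
def recover(intent_lines, completion_lines):
    """Rebuild state from the two WALs: apply only operations that have
    both an intent record and a matching completion record."""
    intents = {}
    for line in intent_lines:            # e.g. b"intent 3 key=value\n"
        _, op_id, kv = line.split(None, 2)
        key, value = kv.rstrip(b"\n").split(b"=", 1)
        intents[int(op_id)] = (key, value)

    completed = set()
    for line in completion_lines:        # e.g. b"done 3\n"
        _, op_id = line.split()
        completed.add(int(op_id))

    # Intent records with no completion record are ignored.
    return {k: v for op_id, (k, v) in intents.items() if op_id in completed}
```

Note that this replay is only as trustworthy as the records themselves: if neither WAL was flushed before the ack, an operation reported successful can still be missing from both.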
I think this article introduces so much chaos that it joins the pile of "almost helpful" info on io_uring, and ultimately hurts the tech. io_uring IMHO lacks clean and simple examples, and here we again get some badly explained theories instead of meat.
The gains are from batching and doing work in-between. io_uring does "batching at a distance", and the DB can write to memory and perform operations in between. When io_uring checks the queues (intent/operation), it will find more than one operation, and do them all at once.
You don't lose durability with this setup -- you just do more speculative work (if you got the worst possible crash at the worst possible time), and if a bunch of things completed (because io_uring did them all at once) you get more confirmations you can send back faster.
Latency MIGHT suffer, but throughput would (and does) increase.
I updated the post based on the conversation below. I wholly missed an important callout about performance, and wasn't super clear that you do need to wait for the completion record to be written before responding to the client. That was implicit in the completion-record write coming before the response, but I made it explicit to avoid confusion.
Also, the dual WAL approach is worse for latency unless you can amortize the double write over multiple async writes, so the cost is spread across the batch; when the batch size is closer to 1, the cost is higher.
> This is tracked through io_uring's completion queue - we only send a success response after receiving confirmation that the completion record has been persisted to stable storage.
Which completion queue event(s) are you examining here? I ask because the way this is worded makes it sound like you're waiting solely for the completion queue event for the _write_ to the "completion wal".
Doing that (waiting only on the "completion wal" write CQE)
1. doesn't ensure that the "intent wal" has been written (because the "intent wal" write goes through a different io_uring, via a different submission queue entry, than the "completion wal" write), and
2. doesn't indicate the "intent wal" data or the "completion wal" data has made it to durable storage (one needs fsync for that, the completion queue events for writes don't make that promise. The CQE for an fsync opcode would indicate that data has made it to durable storage if the fsync has the right ordering wrt the writes and refers to the appropriate fd and data ranges. Alternatively, there are some flags that have the effect of implying an fsync following a write that could be used, but those aren't mentioned)
If anyone else feels like doing this survey and publishing the results I'd love to see it.
But yes, this specific case seems to be a misunderstanding in what io_uring write completion means.
You would expect that they would have tested recovery by at least simulating system stops immediately after the I/O completion notification.
Unless they are truly using asynchronous O_SYNC writes and are just bad at explaining it.
(We use it at work in a network object storage service, in order to use the underlying NVMe T10-DIF[1], which isn't exposed nicely by conventional POSIX/Linux interfaces.)
Ultimately, having a full, ~normal Linux stack around makes system management / orchestration easier. And programs other than our specialized storage software can still access other partitions, etc.