Even if you look only at hardware failure rates, you get unrecoverable I/O errors (data corruption) at roughly one in 10^15 bits read, disk failures at roughly 1% per year, and so on. People usually want better guarantees than those numbers give you with just a plain fsync anyway; so you are probably forced to analyse the whole system if you want to provide good durability guarantees and be able to explain where the guarantees come from.
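Back-of-the-envelope, taking the quoted 10^15 figure at face value: a full read of a 12 TB disk is about 12 × 8 × 10^12 ≈ 10^14 bits, so you'd expect an unrecoverable error on roughly one in ten full-disk reads. No amount of fsync helps with that; only redundancy does.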
Durability is a knob. If you have enough data, or turn the knob too far in the direction of durability, you will simply bankrupt yourself or maybe drown your service in latency. It makes sense that you would have storage services that provide different levels of durability.
And I wouldn't assume they meant that number to be per record in the first place.
If you’re building a data storage system and are using the term “durable” to mean “it’s in RAM on three virtual machines”, for example, I don’t think it’s unfair to say that you are lying to your customers, because you are intentionally misusing a well-established term.
That was very helpful when choosing durability levels.
AFRs (annual failure rates) and discussion of different failure scenarios are the bare minimum, and the minimum set of scenarios is disk loss, total machine loss, and data-center loss. This is just my take on things, but I don't care whether something is "on disk" or not. I do care what happens when a sector on disk goes bad, when a faulty power supply destroys all the disks in a machine, or when a data center floods.
That forces you to think about things like whether you want to turn on synchronous replication.
I don't see how a virtualised NVMe disk is different from a physical one.
Especially if you don't have control over the underlying hardware (so you don't know whether it has power-loss-protection (PLP) SSDs), you should send the FUA.
> O_DATA_SYNC
You mean `O_DSYNC`?
Why would you need `O_DSYNC` on-premise, but not on cloud VMs? (Or are you saying you'd include it everywhere?) Similar to my point above, surely it is the job of the virtualisation layer to pass any FUA commands the guest issues through to the actual storage?
Further: Is `O_DSYNC` actually substantially different from writing and then `fdatasync()`ing yourself?
My understanding is that no, it's the same. In particular, the same amount of data gets written. So if you avoid `fdatasync()` in the belief that this sidesteps the "can trigger an order of magnitude more I/O" problem, you just re-introduce that I/O with `O_DSYNC`.
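To make the equivalence concrete, here is a minimal sketch (error handling trimmed, paths made up) of the two patterns; on Linux the second should give the same durability as the first:

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    char buf[4096];
    memset(buf, 'x', sizeof buf);

    /* Variant A: plain write, then explicitly flush the data (and only
       the metadata needed to read that data back). */
    int fd = open("/tmp/variant_a", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) return 1;
    if (write(fd, buf, sizeof buf) != (ssize_t)sizeof buf) return 1;
    if (fdatasync(fd) != 0) return 1;   /* data is durable here */
    close(fd);

    /* Variant B: O_DSYNC makes every write() behave as if it were
       followed by fdatasync(), saving the extra syscall. */
    fd = open("/tmp/variant_b", O_WRONLY | O_CREAT | O_DSYNC, 0644);
    if (fd < 0) return 1;
    if (write(fd, buf, sizeof buf) != (ssize_t)sizeof buf) return 1;
    /* data is durable when write() returns */
    close(fd);
    return 0;
}
```

If you add `O_DIRECT` to variant B, the kernel can (on devices that support it, and when no metadata needs flushing) express each write as a FUA write rather than a write plus cache flush, which is what the FUA discussion above is about.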
However, I suspect that that whole consideration is pointless:
The only thing that makes your O_DIRECT + preallocated-blocks-only-overwrites scheme safe is enterprise SSDs with Power Loss Protection (PLP), usually capacitors.
On those SSDs, NVMe Flush/FUA are no-ops [1]. So you might as well `fdatasync()`/`O_DSYNC`, always. This is simpler, and also better because you do not need to assume/hope that your underlying SSDs have PLP: Doing the safe thing is fast on PLP [2], and safe on non-PLP.
[1] https://news.ycombinator.com/item?id=46532675
[2] https://tanelpoder.com/posts/using-pg-test-fsync-for-testing-low-latency-writes/
So the only remaining benefit of `O_DSYNC` over `fdatasync()` is that you save a syscall. That's an OK optimisation given they are equivalent, but it would surprise me if it had any noticeable impact at the latencies you are reporting ("413 us"), because [2] reports the difference being about 6 us. Let me know if I got anything wrong.
The only remaining question is: Why do you then see any difference in your benchmark?
Configuration              Throughput (obj/s)
---------------------------------------------
ext4 + O_DIRECT + fsync               116,041
Our engine                            190,985
That is what I'd find very valuable to investigate. The first suspicion I have: shouldn't you be measuring `+ fdatasync` instead?
So I'd be interested in the following (rough micro-benchmark sketch after the list):
ext4 + O_DIRECT + fdatasync
ext4 + O_DIRECT + O_DSYNC
Our engine + O_DSYNC (which you're suggesting above)
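Not your engine's I/O path, obviously, but for the ext4 side this is roughly the comparison I'd run — a minimal sketch, where the path (`/mnt/test/bench.dat`), object size, and iteration count are all made-up placeholders:

```c
#define _GNU_SOURCE           /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define OBJ_SIZE 4096         /* made-up object size */
#define N_OBJS   10000        /* made-up iteration count */

static void run(const char *label, int extra_flags, int (*sync_fn)(int)) {
    /* point the (made-up) path at the device under test */
    int fd = open("/mnt/test/bench.dat",
                  O_WRONLY | O_CREAT | O_DIRECT | extra_flags, 0644);
    if (fd < 0) { perror(label); exit(1); }

    void *buf;
    if (posix_memalign(&buf, 4096, OBJ_SIZE)) exit(1); /* O_DIRECT needs alignment */
    memset(buf, 'x', OBJ_SIZE);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N_OBJS; i++) {
        /* overwrite the same preallocated block, matching the article's scheme */
        if (pwrite(fd, buf, OBJ_SIZE, 0) != OBJ_SIZE) { perror("pwrite"); exit(1); }
        if (sync_fn && sync_fn(fd) != 0) { perror("sync"); exit(1); }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%-24s %9.0f obj/s\n", label, N_OBJS / secs);
    close(fd);
    free(buf);
}

int main(void) {
    run("O_DIRECT + fsync",     0,       fsync);
    run("O_DIRECT + fdatasync", 0,       fdatasync);
    run("O_DIRECT + O_DSYNC",   O_DSYNC, NULL);
    return 0;
}
```

If the `fdatasync` and `O_DSYNC` rows come out close to each other but well above `fsync`, that would support the metadata-journaling suspicion above.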
Also, I don't fully understand what the remaining difference between "ext4 + O_DIRECT + O_DSYNC" and "Our engine + O_DSYNC" would be.

As for the benchmark results: they were mainly due to metadata management. We have implemented our own KV store (see the internals here [1]), which is more efficient than ext4 namespace management, even after very aggressive fs tuning on the ext4 side [2] (plus 65536 sharding for each leveled dir).
[1] https://fractalbits.com/blog/metadata-engine-for-our-object-...
[2] https://github.com/fractalbits-labs/fractalbits/commit/12109...
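Aside for other readers: the "65536 sharding" in the ext4 baseline is, I assume, the classic trick of fanning objects out across hash-named subdirectories (256 × 256 = 65536) so no single directory grows huge and lookups stay fast. A sketch of the general shape, certainly not their exact layout:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* FNV-1a: a simple stand-in hash; any well-mixing hash works. */
static uint64_t fnv1a(const char *s) {
    uint64_t h = 1469598103934665603ULL;
    while (*s) { h ^= (unsigned char)*s++; h *= 1099511628211ULL; }
    return h;
}

/* Map an object key to one of 65536 shard directories (256 x 256),
   keeping every directory small. */
static void shard_path(const char *key, char *out, size_t outlen) {
    uint64_t h = fnv1a(key);
    snprintf(out, outlen, "objs/%02x/%02x/%s",
             (unsigned)(h & 0xff), (unsigned)((h >> 8) & 0xff), key);
}

int main(void) {
    char path[512];
    shard_path("photo-42.jpg", path, sizeof path);
    printf("%s\n", path);   /* e.g. objs/3f/a1/photo-42.jpg */
    return 0;
}
```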
That is where the asymmetry lies. Reading the data back after the device reports it written offers little additional assurance that it really is on stable media. But if you report writes as successful without syncing, it is a near certainty that you'll lose data on every power loss.