We need an API that is dead simple and hard to misuse, with clearly defined semantics and guarantees, but that still lets seasoned developers exploit the hardware to its fullest with additional work. Hope dies last, I guess :)
My vision for the operating system of the future is to build a microkernel that exclusively uses a modified ZFS that has a transactional API, that can run Linux drivers in userspace with zero or few changes, and that uses a uniform event API like Windows' handles or Plan 9's file descriptors.
But that modified ZFS is perhaps the most important part, and I would want to make it so `fsync()` on that platform behaves as people would want it to: transactional such that if it succeeds, the data was written, and if it doesn't, that data was not written.
The ability to run Linux drivers is so there isn't a chicken-and-egg problem with drivers. (It would also be nice to implement the POSIX API to solve that part of the chicken-and-egg problem.) The uniform event API is because the current OS APIs are difficult to work with. OSes are based around resources and events. The resource APIs are pretty good, but the event APIs (select(), poll(), epoll(), io_uring, kqueue, WaitForMultipleObjects(), etc.) are still artificially constrained.
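For a rough idea of what "one wait call for everything" looks like from userland today, here is a minimal sketch using Python's standard `selectors` module, which wraps select/poll/epoll/kqueue behind one interface. The helper name `wait_for_readable` is made up for illustration. It also shows the constraint: this works for sockets and pipes, but on most Unix systems regular files can't usefully be registered this way.

```python
import selectors
import socket


def wait_for_readable(sources, timeout=1.0):
    """Return the names of the sources that became readable.

    `sources` maps a name to a socket or pipe; the name is carried
    along as the selector's user data. (Hypothetical helper.)
    """
    with selectors.DefaultSelector() as sel:
        for name, source in sources.items():
            sel.register(source, selectors.EVENT_READ, data=name)
        # select() blocks until at least one source is ready or the timeout expires.
        return [key.data for key, _events in sel.select(timeout)]


if __name__ == "__main__":
    left, right = socket.socketpair()
    right.send(b"ping")
    print(wait_for_readable({"left": left}))
    left.close()
    right.close()
```

Anything that isn't a readiness-style descriptor (child processes, timers, file I/O completion) needs its own mechanism on most platforms, which is roughly the complaint above.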
Any end-to-end guarantee pretty much requires control over the hardware, which of course would be very interesting to a lot of people, but isn't really what mainstream operating systems concern themselves with.
Already the block device hides too much of the hardware, and any VFS layer will be even worse. Just because a write is done doesn't mean it's actually on the spinning rust. And even if it is, it could have been done in several ways: maybe it was relocated, or parity hasn't been computed, or it's in the queue. Any attempt at abstracting different types of storage will pretty much have to resort to the least bad common denominator here.
I'm not saying it's a bad idea, just that it's not the same problem that operating systems try to solve. But in the post-Optane world maybe block devices are too low level anyway and we'll finally see higher level storage systems.
If you already have a microkernel capable of running Linux hardware drivers in userspace, it shouldn't be hard to also run userspace ABI/personality interfaces as userspace drivers; NetBSD rump drivers more or less officially support what you're doing, and I suspect you could modify User-mode Linux (UML) to provide ABI compatibility good enough to run unmodified Linux binaries.
https://docs.microsoft.com/en-us/windows/win32/fileio/about-...
> In 2013 Bill Gates cited WinFS as his greatest disappointment at Microsoft and that the idea of WinFS was ahead of its time, which will re-emerge
If you discard a requirement to look like a standard Linux filesystem to arbitrary applications, writing a purpose-built filesystem for a specific application often requires less code than trying to make off-the-shelf filesystems consistently deliver desired behavior and performance. And even then it is often difficult to guarantee that the off-the-shelf filesystem will really do what you expect in all cases. Installing a custom filesystem isn't always a practical option operationally but it is sometimes done for applications like high-scale data infrastructure because of the headaches it eliminates. I've built a few systems designed to be deployed either way.
I am not sure there is a "one size fits all" solution for this. Alternative filesystem implementations tend to make very application-specific and hardware-aware design choices.
Most FS are not even transactional, so they can't expose the necessary interfaces (hell, most individual FS calls are not guaranteed to be transactional).
I assume you could build an application-level transactional system on ZFS, but I don't know that it exposes any such APIs either.
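For what it's worth, the closest thing most applications have to a transaction on an ordinary POSIX filesystem is the write-temp-file, fsync, rename dance. A minimal sketch, assuming a POSIX system where rename within a directory is atomic (the function name `atomic_write` is made up here). It gives all-or-nothing replacement of a single file, which is much weaker than a real transactional API, and how durable it is after a crash still depends on the filesystem and the drive.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace `path` with `data` so readers see the old or the new content, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the file contents out before renaming
        os.replace(tmp_path, path)  # atomic swap on POSIX
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # fsync the directory so the rename itself is persisted (POSIX only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Note that this only covers one file at a time; anything spanning multiple files needs an application-level log on top.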
Other filesystems may also be headed that way.
Surely there are some motivations behind these behaviors and it's not a bug that was implemented in all 3 filesystems, right?
The primary motivation is probably that it's an annoying case to handle, pretty hard to test, and very uncommon.
It's also the original behaviour (IIRC from the fsyncgate reports, FreeBSD had added keeping the buffers dirty in the early aughts, but the other BSDs had inherited the ur-behaviour of marking them clean).
A second motivation I can think of (with more historical relevance) is that I don't think there's a userland API to tell the kernel to discard dirty pages, so if you can't mark the pages clean there's a good chance you've leaked them. In our modern world it's common for writeback errors to be transient (e.g. a USB key that's not ready yet or somesuch), but 40 years back I'm not sure keeping them dirty made much sense. Though I guess network drives were always a thing and could always have issues.
Except none of that is true: the USB storage could be reattached and the write could succeed. The system has no way to know whether the failure is transient or permanent.
> any subsequent read or write should fail.
That’s a different solution from keeping the dirty pages. Instead you discard the pages and lock to failure. IIRC that’s what OpenBSD implemented in the wake of fsyncgate.
That makes it seem like an immediate abort might be the best action in most cases? Handling it wrong and then chugging along might amplify any corruption that has happened.
It obviously depends on the application and use case, but I'd like to think projects like pgsql put a lot of effort into getting this right after fsyncgate. I read quite a bit about it after that incident, but ultimately decided I'm too stupid to get that right and have gone the "log error and bail out" route ever since.
I recommend reading the paper or watching the video, it is very interesting.
That's exactly where postgres ended up: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit...
> PANIC on fsync() failure.
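The "log error and bail out" approach is short enough to sketch. This is a minimal illustration of the idea, not what postgres actually does internally; the name `fsync_or_abort` and the injectable `abort` parameter are made up here (the latter only so the behaviour can be exercised without killing the process). The point is to never retry fsync() on the same descriptor, since a later success says nothing about the pages that failed earlier.

```python
import os
import sys


def fsync_or_abort(fd, abort=os.abort):
    """fsync `fd`; on any failure, log and abort instead of retrying.

    After a failed fsync the kernel may already have dropped or marked
    clean the pages that could not be written, so a retry that "succeeds"
    proves nothing. Aborting forces recovery from the application's own log.
    """
    try:
        os.fsync(fd)
    except OSError as exc:
        print(f"fsync failed: {exc}; aborting", file=sys.stderr, flush=True)
        abort()  # default: os.abort(), which kills the process with SIGABRT
```

This only makes sense if the application can rebuild its state on restart, e.g. from a write-ahead log.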
From the macOS fsync manpage:
> fsync() causes all modified data and attributes of fildes to be moved to a permanent storage device. This normally results in all in-core modified copies of buffers for the associated file to be written to a disk.
> Note that while fsync() will flush all data from the host to the drive (i.e. the "permanent storage device"), the drive itself may not physically write the data to the platters for quite some time and it may be written in an out-of-order sequence.
> Specifically, if the drive loses power or the OS crashes, the application may find that only some or none of their data was written. The disk drive may also re-order the data so that later writes may be present, while earlier writes are not.
> This is not a theoretical edge case. This scenario is easily reproduced with real world workloads and drive power failures.
> For applications that require tighter guarantees about the integrity of their data, Mac OS X provides the F_FULLFSYNC fcntl. The F_FULLFSYNC fcntl asks the drive to flush all buffered data to permanent storage. Applications, such as databases, that require a strict ordering of writes should use F_FULLFSYNC to ensure that their data is written in the order they expect. Please see fcntl(2) for more detail.
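In practice that means portable code ends up with a small wrapper like the one below. A minimal sketch, assuming a Unix system: Python's `fcntl` module only defines `F_FULLFSYNC` where the platform provides it (macOS), so the wrapper falls back to plain `fsync()` elsewhere. The function name `full_sync` is made up for illustration.

```python
import fcntl
import os


def full_sync(fd):
    """Flush `fd` as far down as the platform allows; return the mechanism used."""
    full_fsync = getattr(fcntl, "F_FULLFSYNC", None)
    if full_fsync is not None:
        # macOS: ask the drive itself to flush its write cache.
        fcntl.fcntl(fd, full_fsync)
        return "F_FULLFSYNC"
    # Elsewhere: plain fsync(); what that guarantees depends on the OS and drive.
    os.fsync(fd)
    return "fsync"
```

Errors are deliberately not caught here: quietly falling back to fsync() after a failed F_FULLFSYNC would hide exactly the kind of failure this thread is about.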
Distributed systems are the closest we've gotten to resilient, durable storage. Redundancy, external verification, quorum. Sometimes the distributed system lives in a single box on your desk.
To assume that any newbie has hit upon a potential failure condition that we didn't already anticipate and account for in LMDB is frankly laughable.