Can Applications Recover from Fsync Failures?

(usenix.org)

59 points · simonz05 · 3y ago · 46 comments

eis · 3y ago · 10 in thread

After decades of issues with the storage layer, and with even some of the most popular programs written by top-notch developers having bugs due to the problematic nature of the APIs and filesystems involved, I wish a completely new storage API would emerge. Something that exposes an asynchronous API (with a synchronous one built upon it) with ACID semantics. Filesystems are nothing more than specialized databases but they don't expose the necessary interface to use them as such.

We need an API that is dead simple and hard to misuse with clearly defined semantics and guarantees but lets seasoned developers still exploit the hardware to its fullest with additional work. Hope dies last I guess :)

ghoward · 3y ago

I agree with you.

My vision for the operating system of the future is to build a microkernel that exclusively uses a modified ZFS that has a transactional API, that can run Linux drivers in userspace with zero or few changes, and that uses a uniform event API like Windows' handles or Plan 9's file descriptors.

But that modified ZFS is perhaps the most important part, and I would want to make it so `fsync()` on that platform behaves as people would want it to: transactional such that if it succeeds, the data was written, and if it doesn't, that data was not written.

The ability to run Linux drivers is so there isn't a chicken-and-egg problem with drivers. (It would also be nice to implement the POSIX API to solve that part of the chicken-and-egg problem.) The uniform event API is because the current OS APIs are difficult to work with. OSes are based around resources and events. The resource APIs are pretty good, but the event APIs (select(), poll(), epoll(), io_uring, kqueue, WaitForMultipleObjects(), etc.) are still artificially constrained.

xorcist · 3y ago

That's one of those things that sounds good in theory, but in practice has more to do with hardware than software semantics.

Any end-to-end guarantee pretty much requires control over hardware, which of course would be very interesting to a lot of people, but isn't really what mainstream operating systems concern themselves with.

Already the block device hides too much of the hardware, and any VFS layer will be even worse. Just because a write is done doesn't mean it's actually on the spinning rust. And even if it is, it could have been done in several ways: maybe it was relocated, or parity hasn't been computed, or it's in the queue. Any attempt at abstracting different types of storage will pretty much have to resort to the least bad common denominator here.

I'm not saying it's a bad idea, just that it's not the same problem that operating systems try to solve. But in the post-Optane world maybe block devices are too low level anyway and we'll finally see higher level storage systems.

yjftsjthsd-h · 3y ago

> It would also be nice to implement the POSIX API to solve that part of the chicken-and-egg problem.

If you already have a microkernel capable of running Linux hardware drivers in userspace, it shouldn't be hard to also run userspace ABI/personality interfaces as userspace drivers; NetBSD rump drivers more or less officially support what you're doing, and I suspect you could modify User Mode Linux (UML) to provide ABI compatibility good enough to run unmodified Linux binaries.

GordonS · 3y ago

Windows used to have an almost unknown "transactional file system API" that sounds similar to what you're asking for. I think it was recently deprecated, but I don't know why, or what the history of this API is. Might be interesting to read up on it!

ectopod · 3y ago

Transactional NTFS:

https://docs.microsoft.com/en-us/windows/win32/fileio/about-...

eis · 3y ago

You are probably thinking of WinFS https://en.wikipedia.org/wiki/WinFS

> In 2013 Bill Gates cited WinFS as his greatest disappointment at Microsoft and that the idea of WinFS was ahead of its time, which will re-emerge

jandrewrogers · 3y ago

There is an impedance mismatch between what is required out of filesystems in terms of backward compatibility and the features/capabilities that would be useful for applications like databases.

If you discard a requirement to look like a standard Linux filesystem to arbitrary applications, writing a purpose-built filesystem for a specific application often requires less code than trying to make off-the-shelf filesystems consistently deliver desired behavior and performance. And even then it is often difficult to guarantee that the off-the-shelf filesystem will really do what you expect in all cases. Installing a custom filesystem isn't always a practical option operationally but it is sometimes done for applications like high-scale data infrastructure because of the headaches it eliminates. I've built a few systems designed to be deployed either way.

I am not sure there is a "one size fits all" solution for this. Alternative filesystem implementations tend to make very application-specific and hardware-aware design choices.

masklinn · 3y ago

> Filesystems are nothing more than specialized databases but they don't expose the necessary interface to use them as such.

Most FS are not even transactional, so they can't expose the necessary interfaces (hell, most individual FS calls are not guaranteed to be transactional).

I assume you could build an application-level transactional system on ZFS, but I don't know that it exposes any such APIs either.

eis · 3y ago

My point is that we need a new storage interface that is transactional. It's extremely hard to build a sound interface on the existing ones. And it would not be built on the existing FS because those are already built for the broken interfaces.
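
(Editor's illustration, not part of the original comment.) To make "hard to build on the existing ones" concrete, here is a minimal sketch of the usual POSIX workaround for an all-or-nothing file update: write a temporary file, fsync it, rename it over the target, then fsync the directory. The function name `atomic_replace` is hypothetical, and the sketch assumes a POSIX system where a directory can be opened and fsynced. It only gives old-or-new visibility for a single file; per the paper discussed here, an fsync failure along the way can still leave state you cannot trust.

```python
import os
import tempfile


def atomic_replace(path: str, data: bytes) -> None:
    """Replace `path` so that readers see either the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # raises OSError (e.g. EIO) on failure
        os.replace(tmp_path, path)  # atomic rename on POSIX
    except BaseException:
        # Leave the original file untouched and remove the temp file.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # fsync the directory so the rename itself reaches storage.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Four steps and two fsync calls for one file, with no way to group several files into one transaction, which is roughly the gap being described.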

anamax · 3y ago

ReiserFS was headed that way before Reiser went to prison for murdering his wife.

Other filesystems may also be headed that way.

CGamesPlay · 3y ago · 4 in thread

> all three file systems mark pages clean after fsync fails, rendering techniques such as application-level retry ineffective. However, the content in said clean pages varies depending on the file system; ext4 and XFS contain the latest copy in memory while Btrfs reverts to the previous consistent state. Failure reporting is varied across file systems; for example, ext4 data mode does not report an fsync failure immediately in some cases, instead (oddly) failing the subsequent call. Failed updates to some structures (e.g., journal blocks) during fsync reliably lead to file-system unavailability. And finally, other potentially useful behaviors are missing; for example, none of the file systems alert the user to run a file-system checker after the failure.

Surely there's some motivations behind these behaviors and it's not a bug that was implemented in all 3 filesystems, right?

masklinn · 3y ago

> Surely there's some motivations behind these behaviors

The primary motivation is probably that it's an annoying case to handle, pretty hard to test, and very uncommon.

It's also the original behaviour (IIRC from the fsyncgate reports, FreeBSD had added keeping the buffers dirty in the early aughts but other BSDs had inherited the ur-behaviour of marking them clean).

A second motivation I can think of (with more historical relevance) is that I don't think there's a userland API to tell the kernel to discard dirty pages, so if you can't mark the pages clean there's a good chance you've leaked them. In our modern world it's common for writeback errors to be transient (e.g. a USB key that's not ready yet or somesuch) but 40 years back I'm not sure it made much sense. Though I guess network drivers were always a thing and could always have issues.

username223 · 3y ago

I’d go for historically very uncommon. fsync() meant “write RAM buffers to hard drive,” and if that failed, you were in a world of hurt, and should probably shut down while doing the least amount of additional harm. With NFS, the situation probably changed to “keep trying for a bit, but don’t do anything dramatic.”

eis · 3y ago

One justification for marking dirty pages clean after an fsync failure, which the authors say filesystem developers gave them, is a USB stick that has been pulled: keeping the dirty pages around would leak memory in that case. But to me that argument falls flat, they should free the cache if the underlying storage got removed and any subsequent read or write should fail.

masklinn · 3y ago

> But to me that argument falls flat, they should free the cache if the underlying storage got removed

Except none of that is true, the USB storage could be reattached and the write successful. The system has no way to know whether the failure is transient or permanent.

> any subsequent read or write should fail.

That’s a different solution than keeping the dirty pages. Instead you discard the pages and lock to failure. IIRC that’s what OpenBSD implemented in the wake of fsyncgate.

iforgotpassword · 3y ago · 4 in thread

> Our findings show that although applications use many failure-handling strategies, none are sufficient: fsync failures can cause catastrophic outcomes such as data loss and corruption.

That makes it seem like an immediate abort might be the best action in most cases? Handling it wrong and then chugging along might amplify any corruption that has happened.

It might obviously depend on the application and use case, but I'd like to think projects like pgsql put a lot of effort into getting this right after fsyncgate. I've read quite a bit about it after that incident, but ultimately decided I'm too stupid to get that right and roll the "log error and bail out" route ever since.

eis · 3y ago

The paper explains that even if you crash after fsync fails, you can end up with lost or corrupted data, because after the next start you might get the wrong data from the page cache, which survives the crash as it's not part of the program's memory but lives in the kernel.

I recommend reading the paper or watching the video, it is very interesting.

formerly_proven · 3y ago

If you see EIO, 99.999% of the time you want to do just that: stop anything you're doing, don't attempt any further writes. Chances are they'll fail anyway.

krkoch · 3y ago

The storage just bought the farm, EIEIO.

masklinn · 3y ago

> ultimately decided I'm too stupid to get that right and roll the "log error and bail out" route ever since.

That's exactly where postgres ended up: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit...

> PANIC on fsync() failure.
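
(Editor's illustration, not part of the original comment and not PostgreSQL's code.) A minimal sketch of the "log error and bail out" policy described above. The helper name `fsync_or_panic` and its `fsync`/`panic` parameters are hypothetical; the parameters exist only so the failure path can be exercised without a failing disk. By default it calls `os.fsync` and, on error, `os.abort`.

```python
import os
import sys


def fsync_or_panic(fd: int, fsync=os.fsync, panic=os.abort) -> None:
    """Flush `fd`; on any failure, stop immediately instead of retrying."""
    try:
        fsync(fd)
    except OSError as exc:
        # After a failed fsync the filesystem may have marked the dirty pages
        # clean, so a retry could report success without the data being written.
        print(f"fsync failed: {exc!r}; aborting", file=sys.stderr)
        panic()
```

As eis notes elsewhere in this thread, aborting is not a complete answer: after a restart the application may still read stale or unwritten data back from the page cache, so recovery has to be careful about what it trusts.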

simonz05 (OP) · 3y ago · 3 in thread

The paper analyzes how file systems and PostgreSQL, LMDB, LevelDB, SQLite, and Redis react to fsync failures. It shows that although applications use many failure-handling strategies, none are sufficient: fsync failures can cause catastrophic outcomes such as data loss and corruption.

XorNot · 3y ago

This sounds a lot like we need to come up with the correct API for this and switch to it.

GordonS · 3y ago

I wonder if such an API would require hardware support in order to remain performant?

hyc_symas · 3y ago

The paper's analysis of LMDB is wrong.

https://news.ycombinator.com/item?id=32462537

formerly_proven · 3y ago · 2 in thread

IIRC Linux itself has only been reporting asynchronous writeback errors via fsync for a few short years, meaning before that basically any database that wasn't using O_DIRECT would miss I/O errors under memory pressure (or from out-of-process writebacks in general, e.g. root invoking sync). I looked into this stuff before postgres's fsyncgate, before "how are I/O errors actually handled in Linux, anyhow?" got attention, and walked away with the notion that anything other than O_DIRECT is best-effort-probably-works-most-of-the-time on a good day, and O_DIRECT's semantics are basically an unknowable opaque mixture of what drivers and hardware do and expect. There were some papers looking at error handling within Linux file systems at the time and they found a large number of issues in pretty much all of them. As far as I know, all efforts in the area of durable I/O are still focused on the notion of synchronizing I/O (fsync/fdatasync and equivalent), while many databases don't actually care about that too much and would rather want barriers instead. The kicker is of course that hardware (when honest) actually uses barriers and not block synchronization, and the databases that are journaling filesystems of course also use barriers and not synchronization to implement journaling. It struck me as a distinctly classic API-to-real-world mismatch.

the8472 · 3y ago

io_uring already has IOSQE_IO_DRAIN which sounds like it's currently implemented as stalling the IO pipeline, but maybe it could be translated to hardware barriers instead in some circumstances (e.g. when the surrounding IO is all O_DIRECT).

formerly_proven · 3y ago

That seems to me like it's not on the right abstraction layer, the drain flag sounds like it's just a barrier for the kernel's threadpool.

chrsig · 3y ago

On macOS, most likely not[0].

from the macOS fsync manpage:

> fsync() causes all modified data and attributes of fildes to be moved to a permanent storage device. This normally results in all in-core modified copies of buffers for the associated file to be written to a disk.

> Note that while fsync() will flush all data from the host to the drive (i.e. the "permanent storage device"), the drive itself may not physically write the data to the platters for quite some time and it may be written in an out-of-order sequence.

> Specifically, if the drive loses power or the OS crashes, the application may find that only some or none of their data was written. The disk drive may also re-order the data so that later writes may be present, while earlier writes are not.

> This is not a theoretical edge case. This scenario is easily reproduced with real world workloads and drive power failures.

> For applications that require tighter guarantees about the integrity of their data, Mac OS X provides the F_FULLFSYNC fcntl. The F_FULLFSYNC fcntl asks the drive to flush all buffered data to permanent storage. Applications, such as databases, that require a strict ordering of writes should use F_FULLFSYNC to ensure that their data is written in the order they expect. Please see fcntl(2) for more detail.

[0] https://twitter.com/marcan42/status/1494213855387734019
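
(Editor's illustration, not part of the original comment.) A minimal sketch of how an application might follow the manpage's advice: use the F_FULLFSYNC fcntl where the platform defines it and fall back to plain fsync elsewhere. The helper name `flush_to_media` is hypothetical. The sketch assumes Python's `fcntl` module exposes `F_FULLFSYNC` only on platforms that have it (macOS), hence the `getattr` probe.

```python
import fcntl
import os


def flush_to_media(fd: int) -> None:
    """Ask for the strongest flush the platform offers for `fd`."""
    full_fsync = getattr(fcntl, "F_FULLFSYNC", None)  # absent outside macOS
    if full_fsync is not None:
        # Asks the drive to flush its own buffers, per the manpage quoted above.
        fcntl.fcntl(fd, full_fsync)
    else:
        os.fsync(fd)
```

Either branch raises `OSError` on failure; what to do with that error is the subject of the rest of this thread.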

xyzzy_plugh · 3y ago

I said this elsewhere, but in isolation there will always be failure scenarios where recovery is impossible. There are plenty of verification strategies to detect failures, and combined with redundancy, you can reduce the probability of application failure in the face of fsync failures or other similar failures. But you can never eliminate failures. If your storage gives up the ghost, it's game over.

Distributed systems are the closest we've gotten to resilient, durable storage. Redundancy, external verification, quorum. Sometimes the distributed system lives in a single box on your desk.
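
(Editor's illustration, not part of the original comment.) One simple example of a verification strategy is to frame every record with its length and a CRC32, so that a torn or corrupted write is detected on read instead of being silently consumed. The record layout and the names `encode_record` and `decode_record` are assumptions made for this sketch, not something taken from the paper or from any of the databases discussed.

```python
import struct
import zlib

# Hypothetical record layout: 4-byte payload length, 4-byte CRC32, then payload.
HEADER = struct.Struct("<II")


def encode_record(payload: bytes) -> bytes:
    """Prefix the payload with its length and checksum."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def decode_record(buf: bytes) -> bytes:
    """Return the payload, or raise ValueError if the record is damaged."""
    if len(buf) < HEADER.size:
        raise ValueError("truncated header")
    length, crc = HEADER.unpack_from(buf)
    payload = buf[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated payload")
    if zlib.crc32(payload) != crc:
        raise ValueError("checksum mismatch")
    return payload
```

Detection alone does not recover anything; it only tells you when to fall back on a redundant copy, which is where the redundancy and quorum mentioned above come in.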

hyc_symas · 3y ago

The description of LMDB's behavior and subsequent analysis are flat wrong. https://twitter.com/hyc_symas/status/1558909442737012736

To assume that any newbie has hit upon a potential failure condition that we didn't already anticipate and account for in LMDB is frankly laughable.
