> When they did this, they found that every single piece of software they tested except for SQLite in one particular mode had at least one bug. This isn't a knock on the developers of this software or the software -- the programmers who work on things like LevelDB, LMDB, etc., know more about filesystems than the vast majority of programmers, and the software has more rigorous tests than most software. But they still can't use files safely every time! A natural follow-up to this is the question: why is the file API so hard to use that even experts make mistakes?
I think the short answer is that the APIs are bad. The POSIX fs APIs and associated semantics are so deeply entrenched in the software ecosystem (both at the OS level, and at the application level) that it's hard to move away from them.
Well, I think that's the actual problem: POSIX gives you an abstract interface, but it essentially does not enforce any particular semantics on those interfaces.
Sounds like Worse Is Better™: operating systems that tried to present safer abstractions were at a disadvantage compared to operating systems that shipped whatever was easiest to implement.
(I'm not an expert in the history, just observing the surface similarity and hoping someone with more knowledge can substantiate it.)
What about the Windows API? Windows is a pretty successful OS with a less leaky FS abstraction. I know it's a totally different deal than POSIX (files can't be devices, etc.), and the FS function calls require a seemingly absurd number of arguments, but it does seem safer and clearer what's going to happen.
> They report on a single "vulnerability" in LMDB, in which LMDB depends on the atomicity of a single sector 106-byte write for its transaction commit semantics. Their claim is that not all storage devices may guarantee the atomicity of such a write. While I myself filed an ITS on this very topic a year ago, http://www.openldap.org/its/index.cgi/Incoming?id=7668 the reality is that all storage devices made in the past 20+ years actually do guarantee atomicity of single-sector writes. You would have to rewind back to 30 years at least, to find a HDD where this is not true.
So this is a case where the programmers of LMDB thought about the "incorrect" use and decided that it was a calculated risk to take because the incorrectness does not manifest on any recent hardware.
This is analogous to the case where someone complains some C code has undefined behavior, and the developer responds by saying they have manually checked the generated assembler to make sure the assembler is correct at the ISA level even though the C code is wrong at the abstract C machine level, and they commit to checking this in the future.
Furthermore both the LMDB issue and the Postgres issue are noted in the paper to be previously known. The paper author states that Postgres documents this issue. The paper mentions pg_control so I'm guessing it's referring to this known issue here: https://wiki.postgresql.org/wiki/Full_page_writes
> We rely on 512 byte blocks (historical sector size of spinning disks) to be power-loss atomic, when we overwrite the "control file" at checkpoints.
Yeah, that sounds about right for quite a lot of C programmers, except for the "they commit to checking this in the future" part. I've gotten responses like "well, don't upgrade your compiler; I'm gonna put 'Clang >= 9.0 is unsupported' in the README as a fix".
Because it was poorly designed, and there is a high resistance to change, those design mistakes from decades ago continue to bite.
Evaluating correctness without that consideration is too high of a bar.
Safety and correctness can't mean “impossible to misuse”.
It is totally acceptable for applications to say "I do not support X conditions". Swap out the file halfway through a read? Sorry, don't support that. Remove power to the storage device in the middle of a sync operation? Sorry, don't support that.
For vital applications, for example databases, this is a known problem and risks of the API are accounted for. Other applications don't have nearly that level of risk associated with them. My music tagging app doesn't need to be resistant to the SSD being struck by lightning.
It is perfectly acceptable to design APIs for 95% of use cases and leave extremely difficult leaks to be solved by the small number of practitioners that really need to solve those leaks.
"If auto_da_alloc is enabled, ext4 will detect the replace-via-rename and replace-via-truncate patterns and [basically save your ass]"[0]
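The replace-via-rename pattern that quote refers to looks roughly like this. A minimal Python sketch, assuming POSIX semantics; the `atomic_replace` helper name is mine, error handling is omitted, and the directory fsync at the end (to make the rename itself durable) is the step most code forgets:

```python
import os

def atomic_replace(path: str, data: bytes) -> None:
    """Replace `path` with `data` via the write-temp-then-rename pattern."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # push the new contents to the device, not just the page cache
    os.replace(tmp, path)     # atomic rename on POSIX filesystems
    # Persist the directory entry so the rename survives power loss.
    dirfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)
```

Readers either see the old file or the new one, never a half-written mix -- which is exactly the pattern ext4's auto_da_alloc heuristic tries to detect and rescue when applications skip the fsync.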
This is why whenever I need to persist any kind of state to disk, SQLite is the first tool I reach for. Filesystem APIs are scary, but SQLite is well-behaved.
Of course, it doesn't always make sense to do that, like the Dropbox use case.
In practice I believe I've seen SQLite databases corrupted due to what I suspect are two main causes:
1. The device powering off during the middle of a write, and
2. The device running out of space during the middle of a write.
https://lists.openldap.org/hyperkitty/list/openldap-devel@op...
I'm pretty sure that's not where I originally saw his comments. I remember his criticisms being a little more pointed. Although I guess "This is a bunch of academic speculation, with a total absence of real world modeling to validate the failure scenarios they presented" is pretty pointed.
Hopefully in whichever particular mode is referenced!
I wonder what is easy.
I kinda think, and I could be wrong, that SQLite rollback would not have any vulnerabilities with `synchronous=EXTRA` (and `fullfsync=F_FULLFSYNC` on macOS [2]).
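For reference, setting those pragmas from Python's stdlib sqlite3 might look like this (the `open_db` helper is my own; `PRAGMA fullfsync` only has an effect on platforms with F_FULLFSYNC, i.e. macOS, and is a no-op elsewhere):

```python
import sqlite3

def open_db(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    # DELETE (rollback-journal) mode is the default; named here for clarity.
    conn.execute("PRAGMA journal_mode = DELETE")
    # EXTRA also syncs the directory containing the rollback journal,
    # guarding the commit against power loss during the journal delete.
    conn.execute("PRAGMA synchronous = EXTRA")
    # On macOS, request F_FULLFSYNC so writes reach stable storage.
    conn.execute("PRAGMA fullfsync = ON")
    return conn
```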
The post supports its points with extensive references to prior research - research which hasn't been done in the Microsoft environment. For various reasons (NDAs, etc.) it's likely that no such research will ever be published, either. Basically it's impossible to write a post this detailed about safety issues in Microsoft file systems unless you work there. If you did, it would still take you a year or two of full-time work to do the background stuff, and when you finished, marketing and/or legal wouldn't let you actually tell anyone about it.
I can't say the Win32 File API is "pretty", but it's also an abstraction, like the .NET File Class is. And if you touch the NT API, you're naughty.
On Linux and macOS you use the same API, just the backends are different if you want async (epoll [blocking async] on Linux, kqueue on macOS).
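Python's selectors module is one small illustration of that same-API/different-backend arrangement: DefaultSelector resolves to epoll on Linux and kqueue on macOS/BSD, but the calling code is identical (a minimal sketch using a socketpair as the readiness source):

```python
import selectors
import socket

# DefaultSelector picks the best platform backend behind one API:
# epoll on Linux, kqueue on macOS/BSD, select() as a fallback elsewhere.
sel = selectors.DefaultSelector()

a, b = socket.socketpair()
a.setblocking(False)
sel.register(a, selectors.EVENT_READ)

b.send(b"ping")
events = sel.select(timeout=1.0)  # blocks until `a` is readable (or timeout)
data = a.recv(4) if events else b""
```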
ZFS fsync will not fail, although it could end up waiting forever when a pool faults due to hardware failures:
https://papers.freebsd.org/2024/asiabsdcon/norris_openzfs-fs...
https://github.com/openzfs/zfs/issues/9130#issuecomment-2614...
That said, there are many others who stress ZFS on a regular basis and ZFS handles the stress fine. I do not doubt that there are bugs in the code, but I feel like there are other things at play in that report. Messages saying that the txg_sync thread has hung for 120 seconds typically indicate that disk IO is running slowly due to reasons external to ZFS (and sometimes, reasons internal to ZFS, such as data deduplication).
I will try to help everyone in that issue. Thanks for bringing that to my attention. I have been less active over the past few years, so I was not aware of that mega issue.
> In conclusion, computers don't work (but I guess you already know this...
Just not all the time.
https://archive.wikiwix.com/cache/index2.php?rev_t=&url=http...
The closest I come to working with files is localStorage, but that's thread safe.
It's not a real problem for most modern developers.
pwrite? wtf?
Not one mention of fopen.
Granted, some of the fine-detail discussion is interesting, but it hasn't had much practical relevance since about 1990.
"fopen"? That is outdated stuff from a shitty ecosystem, and how do you think it's implemented?
Meanwhile you can read plenty of stories of others having the exact opposite experience.
If you keep losing data to power losses or crashes, perhaps fix the cause of that? It doesn't make sense to try to work around it.
Ponder this notion for a moment: there are problems within one's control and problems outside of one's control.
For example, we can't control the weather. If it snows three feet overnight you simply have to deal with the fact that you're not getting to work today.
Since we can't simply stop hardware from failing, we have to deal with the fact that hardware fails. Your seventeen redundant UPSes might experience a one in a trillion cascade failure. It might take the utility ten minutes longer to restore your power than you have onsite generation.
This is not a class of problem we can control or prevent. We fix these problems by building systems which withstand failures. You can't just will electrons out of the wall socket, but you can build a better disk or FS that corrupts less data when the electrons stop.
b7/b74a/b74a56
where the digits are derived from a hash of the file name, but lately I've had some NTFS volumes with a 1M-file directory that seem to be OK.

Hardware problems also manifest in mysterious ways. On both Windows and macOS I had computers that seemed to be OK until I did an OS update which caused enough I/O that a failing HDD was pushed over the edge and the update failed; in one case I was able to roll back the update but not apply it, in another case the machine was trashed. Careful investigation (like taking the disk out and inspecting it on another computer) revealed a hard drive error, although there was no clear indication of this in the UI, and the average person would blame the software update.
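The hash-prefix directory sharding mentioned above can be sketched as follows (the prefix lengths and the choice of SHA-1 are illustrative assumptions, not from the comment; the idea is just to bound how many entries land in any one directory):

```python
import hashlib
import os

def sharded_path(root: str, name: str, levels=(2, 4)) -> str:
    """Map a file name to a nested path like root/b7/b74a/<name>.

    Nesting by hash prefix keeps each directory small, which matters on
    filesystems that degrade with millions of entries per directory.
    """
    digest = hashlib.sha1(name.encode()).hexdigest()
    parts = [digest[:n] for n in levels]
    return os.path.join(root, *parts, name)
```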
I keep telling my users to make sure to plug their phones in before the battery dies, but for some reason they keep forgetting...