On the other hand, I've heard people recommend running Postgres on ZFS so you can enable on-the-fly compression. This increases CPU utilization on the Postgres server by quite a bit and read latency of uncached data a bit, but it decreases the necessary write IOPS a lot. And as long as the compression happens largely in parallel (which it should, if your database has many parallel queries), it's much easier to throw more compute threads at it than to speed up the write speed of a drive.
And after a certain size, you start to need atomic filesystem snapshots to be able to get a backup of a very large and busy database without everything exploding. We already see the more efficient replica-based backup strategies struggle on some systems, and we're at our wits' end as to how to create proper backups and archives without reducing the backup frequency to weeks. ZFS has mature mechanisms, including zfs-send, to move this data around with limited impact on the production dataflow.
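To make the CPU-for-IOPS trade concrete, here is a rough sketch of how much smaller a repetitive database page can get once compressed. Everything in it is an assumption for illustration: the page contents are synthetic, `make_page` and `compressed_size` are hypothetical helpers, and `zlib` stands in for whatever algorithm the filesystem would actually use. Real ratios depend entirely on your data.

```python
import zlib

PAGE_SIZE = 8192  # Postgres' default block size


def make_page(row_count: int = 100) -> bytes:
    """Build a synthetic, repetitive page (hypothetical data, not a real heap page)."""
    rows = b"".join(
        f"{i:08d}|customer_{i % 10}|status=active|".encode()
        for i in range(row_count)
    )
    # Pad with zeros up to the page size, as a half-empty page would be.
    return rows.ljust(PAGE_SIZE, b"\x00")[:PAGE_SIZE]


def compressed_size(page: bytes, level: int = 1) -> int:
    """Bytes that would need to be written if the page were compressed first."""
    return len(zlib.compress(page, level))


page = make_page()
raw = len(page)
packed = compressed_size(page)
print(f"raw={raw} bytes, compressed={packed} bytes, ratio={raw / packed:.1f}x")
```

The compression call is the extra CPU work; the smaller output is what the drive no longer has to write.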
For Postgres specifically you may also want to look at using hot_standby_feedback, as described in this recent HN article: https://news.ycombinator.com/item?id=44633933
EDIT: It seems they're opt-in for PostgreSQL, too: https://www.postgresql.org/docs/current/checksums.html
Bad news is, most databases don't do checksums by default.
Say more? I've heard people say that ZFS is somewhat slower than, say, ext4, but I've personally had zero issues running postgres on zfs, nor have I heard any well-reasoned reasons not to.
> What filesystems in the wild typically provide for this is weaker than what is advisable for a database, so databases should bring their own implementation.
Sorry, what? Just yesterday matrix.org had a post about how they (using ext4 + postgres) had disk corruption which led to postgres returning garbage data: https://matrix.org/blog/2025/07/postgres-corruption-postmort...
The corruption was likely present for months or years, and postgres didn't notice.
ZFS, on the other hand, would have noticed during a weekly scrub and complained loudly, letting you know a disk had an error, letting you attempt to repair it if you used RAID, etc.
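Conceptually, that detection boils down to storing a checksum next to each block and recomputing it on every read or scrub. A minimal sketch, assuming `zlib.crc32` as a stand-in (ZFS uses its own checksum algorithms) and with `store`, `verify` and `flip_bit` as hypothetical helper names:

```python
import zlib


def store(page: bytes) -> tuple[bytes, int]:
    """Return the block together with the checksum recorded at write time."""
    return page, zlib.crc32(page)


def verify(page: bytes, checksum: int) -> bool:
    """Recompute the checksum, as a read or scrub would, and compare."""
    return zlib.crc32(page) == checksum


def flip_bit(page: bytes, bit_index: int) -> bytes:
    """Simulate silent corruption by flipping a single bit."""
    buf = bytearray(page)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


page, checksum = store(b"row data " * 900)
damaged = flip_bit(page, 12345)
print(verify(page, checksum))     # the intact block passes
print(verify(damaged, checksum))  # the flipped bit is caught
```

Without a stored checksum there is nothing to compare against, so the damaged block would just be handed back to the database as if it were valid.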
It's stuff like in that post that are exactly why I run postgres on ZFS.
If you've got specifics about what you mean by "databases should bring their own implementation", I'd be happy to hear it, but I'm having trouble thinking of any sorta technically sound reason for "databases actually prefer it if filesystems can silently corrupt data lol" being true.
Btrfs is a better choice for SQLite.
Ext4 uses 16-/32-bit CRCs, which is very weak for storage integrity in 2025. Many popular filesystems for databases are similarly weak. Even if they have a strong option, the strong option is not enabled by default. In real-world Linux environments, the assumption that the filesystem has weak checksums is usually true.
Postgres has (IIRC) 32-bit CRCs but they are not enabled by default. That is also much weaker than you would expect from a modern database. Open source databases generally do not have a good track record of providing robust corruption detection, and neither do the filesystems they often run on. It is a systemic problem.
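As a back-of-the-envelope illustration of why checksum width matters, here is a sketch under a simplifying assumption: corruption scrambles a block so thoroughly that an n-bit checksum matches by chance with probability 2^-n. CRCs do better than that on short bursts, so treat this as a rough model rather than a measurement. The function names are hypothetical.

```python
def undetected_probability(bits: int) -> float:
    """Chance that a randomly corrupted block still matches an n-bit checksum."""
    return 2.0 ** -bits


def expected_misses(corrupted_blocks: int, bits: int) -> float:
    """Expected number of corrupted blocks that slip through unnoticed."""
    return corrupted_blocks * undetected_probability(bits)


for bits in (16, 32, 64, 256):
    misses = expected_misses(10**6, bits)
    print(f"{bits:>3}-bit: expected misses per 1e6 corrupted blocks = {misses:.3g}")
```

Under this model a 16-bit checksum lets roughly one in 65,536 corrupted blocks through, which adds up quickly on a large, long-lived dataset.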
ZFS doesn't support features that high-performance database kernels use and is slow, particularly on high-performance storage. Postgres does not use any of those features, so it matters less if that is your database. XFS has traditionally been the preferred filesystem for databases on Linux and Ext4 will work. Increasingly, databases don't use external filesystems at all.
One possible instance of that is a database providing its own data checksumming, but another perfectly valid one is running a database that does not on top of a lower layer with a sufficiently low data corruption rate.
Btrfs is a better choice for SQLite; I haven't seen that issue there.
The latest comment seems to be a nice summary of the root cause, with earlier comments in the thread pointing to ftruncate, rather than fsync, as the trigger:
>amotin
>I see. So ZFS tries to drop some data from pagecache, but there seems to be some dirty pages, which are held by ZFS till them either written into ZIL, or to disk at the end of TXG. And if those dirty page writes were asynchronous, it seems there is nothing that would nudge ZFS to actually do something about it earlier than zfs_txg_timeout. Somewhat similar problem was recently spotted on FreeBSD after #17445, which is why newer version of the code in #17533 does not keep references on asynchronously written pages.
Might be worth testing zfs_txg_timeout=1 or 0
Which you can do on a per dataset ('directory') basis very easily:
zfs set sync=disabled mydata/mydb001
* https://openzfs.github.io/openzfs-docs/man/master/7/zfsprops...
Meanwhile all the rest of your pools / datasets can keep the default POSIX behaviour.
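For context on what that property changes: a database commit normally ends with a write followed by an fsync, and the fsync is what makes the commit wait for stable storage. The sketch below shows that pattern in Python; `durable_write` is a hypothetical helper and the temp-file path is only for illustration. With sync=disabled on the dataset, ZFS acknowledges the fsync without waiting, so the call returns faster but the last few seconds of commits can be lost on a crash or power failure.

```python
import os
import tempfile
import time


def durable_write(path: str, payload: bytes) -> float:
    """Write payload and fsync it; return the elapsed time in seconds."""
    start = time.perf_counter()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, payload)
        os.fsync(fd)  # this is the request that sync=disabled stops honouring
    finally:
        os.close(fd)
    return time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "commit.bin")
    elapsed = durable_write(target, b"x" * 4096)
    print(f"write+fsync took {elapsed * 1000:.2f} ms")
```

Timing this on a dataset with the default setting and again with sync=disabled is a quick way to see how much of your commit latency is the wait for stable storage.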
You cannot have SQLite keep your data and run well on ZFS unless you make a zvol and format it as btrfs or ext4 so they solve the problem for you.
What you're describing sounds like a bug specific to whichever OS you're using that has a port of ZFS.
I've encountered this bug both on illumos, specifically OpenIndiana, and Linux (Arch Linux).