0 points · hinkley · 11mo ago · 0 comments

For instance, running on ZFS or one of its peers.

19 comments · 2 top-level

jandrewrogers · 11mo ago · 10 in thread

Apropos this use case, ZFS is usually not recommended for databases. Competent database storage engines have their own strong corruption detection mechanisms regardless. What filesystems in the wild typically provide for this is weaker than what is advisable for a database, so databases should bring their own implementation.

tetha · 11mo ago

Hm.

On the other hand, I've heard people recommend running Postgres on ZFS so you can enable on the fly compression. This increases CPU utilization on the postgres server by quite a bit, read latency of uncached data a bit, but it decreases necessary write IOPS a lot. And as long as the compression is happening a lot in parallel (which it should, if your database has many parallel queries), it's much easier to throw more compute threads at it than to speed up the write-speed of a drive.

And after a certain size, you start to need atomic filesystem snapshots to be able to get a backup of a very large and busy database without everything exploding. We already have the more efficient backup strategies from replicas struggling on some systems, and we are at our wits' end about how to create proper backups and archives without reducing the backup frequency to weeks. ZFS has mature mechanisms and zfs-send to move this data around with limited impact on the production dataflow.

supriyo-biswas · 11mo ago

Is an incremental backup of the database not possible? Pgbackrest etc. can do this by creating a full backup followed by incremental backups from the WAL.

For Postgres specifically you may also want to look at using hot_standby_feedback, as described in this recent HN article: https://news.ycombinator.com/item?id=44633933

hinkley (OP) · 11mo ago

This was my understanding as well, color me also confused.

wahern · 11mo ago

But what ZFS provides isn't weaker, and in SQLite page checksums are opt-in: https://www.sqlite.org/cksumvfs.html

EDIT: It seems they're opt-in for PostgreSQL, too: https://www.postgresql.org/docs/current/checksums.html

avinassh · 11mo ago

you might like my other post - https://avi.im/blag/2024/databases-checksum/

bad news is, most databases don't do checksums by default.

TheDong · 11mo ago

> ZFS is usually not recommended for databases

Say more? I've heard people say that ZFS is somewhat slower than, say, ext4, but I've personally had zero issues running postgres on zfs, nor have I heard any well-reasoned reasons not to.

> What filesystems in the wild typically provide for this is weaker than what is advisable for a database, so databases should bring their own implementation.

Sorry, what? Just yesterday matrix.org had a post about how they (using ext4 + postgres) had disk corruption which led to postgres returning garbage data: https://matrix.org/blog/2025/07/postgres-corruption-postmort...

The corruption was likely present for months or years, and postgres didn't notice.

ZFS, on the other hand, would have noticed during a weekly scrub and complained loudly, letting you know a disk had an error, letting you attempt to repair it if you used RAID, etc.

Stuff like what's in that post is exactly why I run postgres on ZFS.

If you've got specifics about what you mean by "databases should bring their own implementation", I'd be happy to hear it, but I'm having trouble thinking of any sorta technically sound reason for "databases actually prefer it if filesystems can silently corrupt data lol" being true.
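To make the scrub argument concrete, here is a toy sketch of the idea (Python, not ZFS code): every block has a recorded checksum, a periodic pass re-reads all copies, and a copy that fails verification is rewritten from a mirror copy that still verifies. The names `Mirror` and `scrub` and the choice of SHA-256 are assumptions for illustration only; real ZFS keeps checksums in parent block pointers and uses fletcher4 by default.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    """Checksum used by the toy model (SHA-256 is an illustrative choice)."""
    return hashlib.sha256(data).digest()


class Mirror:
    """Toy two-way mirror: each side holds its own copy of every block."""

    def __init__(self, blocks):
        self.sides = [list(blocks), list(blocks)]
        # Checksums are recorded once, at write time.
        self.sums = [checksum(b) for b in blocks]


def scrub(mirror):
    """Re-read every copy of every block and repair bad copies from a good one.

    Returns (repaired, unrecoverable): repaired is a list of
    (side, block_index) pairs, unrecoverable is a list of block indexes
    for which no copy verified.
    """
    repaired, unrecoverable = [], []
    for i, expected in enumerate(mirror.sums):
        good = next(
            (side[i] for side in mirror.sides if checksum(side[i]) == expected),
            None,
        )
        if good is None:
            unrecoverable.append(i)
            continue
        for side_no, side in enumerate(mirror.sides):
            if checksum(side[i]) != expected:
                side[i] = good
                repaired.append((side_no, i))
    return repaired, unrecoverable


m = Mirror([b"page-0", b"page-1", b"page-2"])
m.sides[0][1] = b"page-X"  # simulate silent corruption on one disk
print(scrub(m))  # ([(0, 1)], [])
```

Without the second copy, the same pass can still report the bad block but has nothing to repair it from, which is the "if you used RAID" caveat above.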

zaarn · 11mo ago

SQLite on ZFS needs the fsync behaviour to be off; otherwise SQLite will randomly hang the application, as the fsync will wait for the txg to commit. This can take a minute or two, in my experience.

Btrfs is a better choice for SQLite.

jandrewrogers · 11mo ago

The point is that a database cannot rely on being deployed on a filesystem with proper checksums.

Ext4 uses 16-/32-bit CRCs, which is very weak for storage integrity in 2025. Many popular filesystems for databases are similarly weak. Even if they have a strong option, the strong option is not enabled by default. In real-world Linux environments, the assumption that the filesystem has weak checksums is usually true.

Postgres has (IIRC) 32-bit CRCs but they are not enabled by default. That is also much weaker than you would expect from a modern database. Neither open source databases nor the filesystems they often run on have a good track record of providing robust corruption detection. It is a systemic problem.

ZFS doesn't support features that high-performance database kernels use and is slow, particularly on high-performance storage. Postgres does not use any of those features, so it matters less if that is your database. XFS has traditionally been the preferred filesystem for databases on Linux and Ext4 will work. Increasingly, databases don't use external filesystems at all.
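A rough way to see why checksum width matters: if corruption leaves a page looking effectively random, an n-bit checksum fails to flag it with probability of about 2^-n. The sketch below seals a page with a 32-bit CRC (`zlib.crc32`), verifies it on read, and computes that miss probability for a few widths. The page layout, `PAGE_SIZE`, and the function names are hypothetical and do not mirror any particular database's on-disk format.

```python
import struct
import zlib

PAGE_SIZE = 4096                 # hypothetical page size
TRAILER = struct.Struct("<I")    # 32-bit checksum stored at the end of the page


def seal_page(payload: bytes) -> bytes:
    """Pad the payload to a full page body and append its CRC32."""
    body_len = PAGE_SIZE - TRAILER.size
    if len(payload) > body_len:
        raise ValueError("payload does not fit in one page")
    body = payload.ljust(body_len, b"\0")
    return body + TRAILER.pack(zlib.crc32(body))


def verify_page(page: bytes) -> bool:
    """Recompute the CRC32 of the body and compare it with the stored value."""
    body, trailer = page[:-TRAILER.size], page[-TRAILER.size:]
    (stored,) = TRAILER.unpack(trailer)
    return zlib.crc32(body) == stored


def miss_probability(bits: int) -> float:
    """Chance that a random corrupted page still matches an n-bit checksum."""
    return 2.0 ** -bits


page = seal_page(b"row data")
damaged = bytearray(page)
damaged[10] ^= 0x01              # flip a single bit in the body
print(verify_page(page), verify_page(bytes(damaged)))  # True False
for bits in (16, 32, 64):
    print(bits, miss_probability(bits))
```

A per-page miss rate that looks tiny gets multiplied by the number of pages read over the lifetime of a large deployment, which is the usual argument for wider checksums.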

lxgr · 11mo ago

No, competent systems just need to have something that, taken together, prevents data corruption.

One possible instance of that is a database providing its own data checksumming, but another perfectly valid one is running one that does not on a lower layer with a sufficiently low data corruption rate.

johncolanduoni · 11mo ago

ZFS is not great for databases that do updates in place. Log-structured merge databases (which most newer DB engines are) work fine with its copy-on-write semantics.

zaarn · 11mo ago · 7 in thread

ZFS isn’t viable for SQLite unless you turn off fsync’s in ZFS, because otherwise you will have the same experience I had for years: SQLite may randomly hang for up to a few minutes with no visible cause if there aren’t sufficient writes to fill up the txg in the background. If your app depends on SQLite, it’ll randomly die.

Btrfs is a better choice for sqlite, haven’t seen that issue there.

Modified3019 · 11mo ago

Interesting. Found a GitHub issue that covers this bug: https://github.com/openzfs/zfs/issues/14290

The latest comment seems to be a nice summary of the root cause, with earlier in the thread pointing to ftruncate instead of fsync being a trigger:

>amotin

>I see. So ZFS tries to drop some data from pagecache, but there seems to be some dirty pages, which are held by ZFS till them either written into ZIL, or to disk at the end of TXG. And if those dirty page writes were asynchronous, it seems there is nothing that would nudge ZFS to actually do something about it earlier than zfs_txg_timeout. Somewhat similar problem was recently spotted on FreeBSD after #17445, which is why newer version of the code in #17533 does not keep references on asynchronously written pages.

Might be worth testing zfs_txg_timeout=1 or 0
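For anyone who wants to check whether they are hitting this stall before and after changing a tunable, a small probe that times individual SQLite commits may help. This is only a sketch using the standard library `sqlite3` module; `commit_latencies`, the `probe` table, and the commit count are made up for illustration, and the database path should point at a file on the dataset under test rather than the temporary directory used here.

```python
import os
import sqlite3
import tempfile
import time


def commit_latencies(db_path, commits=50):
    """Time each single-row commit; returns a list of latencies in seconds."""
    conn = sqlite3.connect(db_path)
    # FULL asks SQLite to sync to storage at every commit.
    conn.execute("PRAGMA synchronous=FULL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS probe (id INTEGER PRIMARY KEY, ts REAL)"
    )
    conn.commit()
    latencies = []
    for _ in range(commits):
        start = time.perf_counter()
        conn.execute("INSERT INTO probe (ts) VALUES (?)", (time.time(),))
        conn.commit()
        latencies.append(time.perf_counter() - start)
    conn.close()
    return latencies


# Demonstration against a temporary directory; replace with a real path
# on the filesystem you want to measure.
with tempfile.TemporaryDirectory() as tmp:
    lat = commit_latencies(os.path.join(tmp, "probe.db"))
    print(f"max {max(lat):.4f}s  median {sorted(lat)[len(lat) // 2]:.4f}s")
```

Occasional commits that take seconds while the median stays in the millisecond range would be consistent with the hang described above; a flat distribution suggests the problem lies elsewhere.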

throw0101b · 11mo ago

> ZFS isn’t viable for SQLite unless you turn off fsync’s in ZFS

Which you can do on a per dataset ('directory') basis very easily:

    zfs set sync=disabled mydata/mydb001

* https://openzfs.github.io/openzfs-docs/man/master/7/zfsprops...

Meanwhile all the rest of your pools / datasets can keep the default POSIX behaviour.

ezekiel68 · 11mo ago

You know what's even easier than doing that? Neglecting to do it or meaning to do it then getting pulled in to some meeting (or other important distraction) and then imagining you did it.

zaarn · 11mo ago

Disabling sync corrupts SQLite databases on power loss. I've personally experienced this after disabling sync, which I did because leaving it enabled causes SQLite to hang.

You cannot have SQLite keep your data and run well on ZFS unless you make a zvol and format it as btrfs or ext4 so they solve the problem for you.

kentonv · 11mo ago

Doesn't turning off sync mean you can lose confirmed writes in a power failure?
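For context on that question: SQLite's own durability knob is `PRAGMA synchronous`, which controls how often it asks the OS to sync, and it can only be as durable as the layer underneath. If the dataset is set to `sync=disabled`, sync requests return before the data is on stable storage, so even `synchronous=FULL` cannot guarantee that an acknowledged commit survives a power failure. The sketch below only shows how to set and read the SQLite side; `open_full_sync` and `sync_level` are hypothetical helper names.

```python
import os
import sqlite3
import tempfile

# Integer values reported by "PRAGMA synchronous".
SYNC_LEVELS = {0: "OFF", 1: "NORMAL", 2: "FULL", 3: "EXTRA"}


def open_full_sync(path):
    """Open a database and ask SQLite to sync at every commit."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA synchronous=FULL")
    return conn


def sync_level(conn):
    """Return the current synchronous setting as a readable name."""
    (level,) = conn.execute("PRAGMA synchronous").fetchone()
    return SYNC_LEVELS[level]


with tempfile.TemporaryDirectory() as tmp:
    conn = open_full_sync(os.path.join(tmp, "app.db"))
    print(sync_level(conn))  # FULL
    conn.close()
```

Checking the pragma tells you what SQLite is requesting, not what the filesystem delivers, so it complements rather than replaces looking at the dataset's `sync` property.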

jclulow · 11mo ago

This isn't an inherent property of ZFS at all. I have made heavy use of SQLite for years (on illumos systems) without ever hitting this, and I would never counsel anybody to disable sync writes: it absolutely can lead to data loss under some conditions and is not safe to do unless you understand what it means.

What you're describing sounds like a bug specific to whichever OS you're using that has a port of ZFS.

zaarn · 11mo ago

I wouldn't recommend SQLite on ZFS (or in general for other reasons), for the precise reason that it either lags or is unsafe.

I've encountered this bug both on illumos, specifically OpenIndiana, and Linux (Arch Linux).
