Also, impressive work!
I've been using ZFS for my NAS-like thing since then. It's been rock solid (*).
(*): I know about the block cloning bug, and the encryption bug. Luckily I avoided those (I don't tend to enable new features like block cloning, and I didn't have an encrypted dataset at the time). Still, all in all it's been really good in comparison to btrfs.
The btrfs solution has a mixed history, and has had a lot of the same issues DRBD could get. They are great until some hardware/kernel-mod eventually goes sideways, and then the auto-heal cluster filesystems start to make a lot more sense. Note, with cluster-based complete-file copy/repair object features the damage is localized to single files at worst, and folks don't have to wait 3 days to bring up the cluster after a crash.
Best of luck, =3
I wonder how a requirement like that could possibly arise. Especially with an obvious exception for zfs.
I suspect this is still vulnerable to the write hole problem.
You can add LVM to get snapshots, but this is still not the end-to-end copy-on-write solution that btrfs and ZFS should provide.
(50TB+ on ext4 and xfs, and no, no bit rot. Yes, I've checked most of it against separate sha256sum files now and then. As long as you have ECC RAM, disks just magically corrupting your data is largely a myth.)
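For anyone wanting to run the same kind of spot check, here is a minimal sketch of verifying files against a stored checksum list. It assumes the manifest uses the plain `sha256sum` line layout (64 hex characters, a two-character separator, then a relative path) and ignores the escaped-filename variant; the function names `sha256_of` and `verify_manifest` are made up for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path, root):
    """Return the paths that are missing or whose hash no longer matches.

    Assumed line layout: '<64 hex chars><2-char separator><relative path>'.
    Lines too short to fit that layout are skipped.
    """
    mismatches = []
    for line in Path(manifest_path).read_text().splitlines():
        if len(line) < 67:
            continue
        expected, name = line[:64].lower(), line[66:]
        target = Path(root) / name
        if not target.is_file() or sha256_of(target) != expected:
            mismatches.append(name)
    return mismatches
```

An empty list means every file listed in the manifest still hashes to the recorded value. It says nothing about files that were never recorded.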
RAID and logical block redundancy have scaled to petabytes for years in serious production use, since before btrfs was even developed.
Please don't be btrfs please don't be btrfs please don't be btrfs...
https://www.reddit.com/r/bcachefs/comments/1rblll1/the_blog_...
But no. It was btrfs.
As a side note, it's somewhat impressive that an LLM agent was able to produce a suite of custom tools that were apparently successfully used to recover some data from a corrupted btrfs array, even ad-hoc.
And on btrfs anything above raid1 (5, 6 etc.) has had very serious bugs. I actually read an opinion somewhere (don't remember where) that raid5/6 on btrfs cannot work because the on-disk format is just bad for the case. I guess this is why raid1c3/c4 is being promoted and worked on now?
Edit: found some comments below: ZFS on Linux has had many bugs over the years, notably with ZFS-native encryption and especially sending/receiving encrypted volumes. Another issue is that using swap on ZFS is still guaranteed to hang the kernel in low memory scenarios, because ZFS needs to allocate memory to write to swap.
Er, I appreciate trying to be constructive, but in what possible situation is it not a bug that a power cycle can lose the pool? And if it's not technically a "bug" because BTRFS officially specifies that it can fail like that, why is that not in big bold text at the start of any docs on it? 'Cuz that's kind of a big deal for users to know.
EDIT: From the longer write-up:
> Initial damage. A hard power cycle interrupted a commit at generation 18958 to 18959. Both DUP copies of several metadata blocks were written with inconsistent parent and child generations.
Did the author disable safety mechanisms for that to happen? I'm coming from being more familiar with ZFS, but I would have expected BTRFS to also use a CoW model where it wasn't possible to have multiple inconsistent metadata blocks in a way that didn't just revert you to the last fully-good commit. If it does that by default but there's a way to disable that protection in the name of improving performance, that would significantly change my view of this whole thing.
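To make the expectation above concrete, here is a toy sketch of the copy-on-write commit idea: new blocks are written to fresh locations, and only a final superblock update publishes them. This is an illustration of the general model only, not btrfs's or ZFS's actual on-disk logic; `ToyCowStore` and all of its members are hypothetical names.

```python
class ToyCowStore:
    """Toy copy-on-write store: blocks are never overwritten in place."""

    def __init__(self):
        self.blocks = {}        # address -> (generation, payload)
        self.next_address = 0
        self.superblock = None  # (root address, generation) of the last complete commit
        self.generation = 0

    def _write_block(self, generation, payload):
        # Always allocate a fresh address; old blocks stay untouched.
        address = self.next_address
        self.next_address += 1
        self.blocks[address] = (generation, payload)
        return address

    def commit(self, payload, crash_before_superblock=False):
        """Write a new root block, then publish it via the superblock."""
        new_generation = self.generation + 1
        root = self._write_block(new_generation, payload)
        if crash_before_superblock:
            return  # simulated power loss: the new block exists but is unreachable
        self.superblock = (root, new_generation)
        self.generation = new_generation

    def read_root(self):
        """Follow the superblock and check the block carries the expected generation."""
        address, expected = self.superblock
        found, payload = self.blocks[address]
        if found != expected:
            raise IOError(f"generation mismatch: wanted {expected}, found {found}")
        return payload


store = ToyCowStore()
store.commit("state A")
store.commit("state B", crash_before_superblock=True)
recovered = store.read_root()  # still the last fully published commit
```

In this model an interrupted commit simply leaves the previous root in place, which is the behaviour the question assumes. The report's description of both DUP copies carrying inconsistent generations is exactly the case this toy model cannot produce.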
I suspect that the author's intent is less "I do not view this as a bug" and more "I do not think it's useful to get into angry debates over whether something is a bug". I do not know whether this is a common thing on btrfs discussions, but I have certainly seen debates to that effect elsewhere.
(My personal favorite remains "it's not a data loss bug if someone could technically theoretically write something to recover the data". Perhaps, technically, that's true, but if nobody is writing such a tool, nobody is going to care about the semantics there.)
Agreed, and I appreciate the attempt to channel things into a productive conversation.
Well that he recovered the disks is amazing in itself. I would have given up and just pulled a backup.
However, I would like to see a dev saying: why didn't you use the --<flag> which we created for this use case?
TLDR: The user got his filesystem corrupted on a forced reboot; native btrfs tools made the failure worse; the user asked Claude to autonomously debug and fix the problem; after multiple days of debugging, Claude wrote a set of custom low-level C scripts to recover 99.9% of the data; the user was impressed and asked Claude to submit an issue describing the whole thing.
Changing the metadata profile to at least raid1 (raid1, raid1c3, raid1c4) is a good idea, especially for anyone, against recommendations, using raid5 or raid6 for a btrfs array (raid1c3 is more appropriate for raid6). That would make it very difficult for metadata to get corrupted, which is the lion's share of the higher-impact problems with raid5/6 btrfs.
check: btrfs fi df <mountpoint>
convert metadata: btrfs balance start -mconvert=raid1c3,soft <mountpoint>
(make sure it's -mconvert, where m is for metadata, not -dconvert, which would switch profiles for data, messing up your array)
I want to be clear that losing (meta)data in flight during a power loss is expected. But a broken filesystem after that is definitely not acceptable.
Some PostgreSQL db ended up soft corrupted. PostgreSQL could not replay its log because btrfs threw IO errors on fsync. That's just plain not acceptable.
With the same configuration this can happen with ZFS, bcachefs etc just as well.
Most filesystems just get a few files/directories damaged though. ZFS is famous for handling totally crazy things like broken hardware which damages data in-transit. ext4 has no checksum, but at least fsck will drop things into lost+found directory.
The "making all data inaccessible" part is pretty unique to btrfs, and let's not pretend nothing can be done about this.
As a ZFS wrangler by day:
People in this thread seem to happily shit on btrfs here, but this very much does not look like a sane, resilient configuration no matter the FS. Just something to keep in mind.
* Data single obviously means losing a single drive will cause data loss, but no drive was actually lost, right?
* Metadata DUP (not sure if it's across 2 disks or all 3) should be expected to be robust, I'd expect?
* I certainly eye DM-SMR disks with suspicion in general, but it doesn't sound like they were responsible for the damage: "Both DUP copies of several metadata blocks were written with inconsistent parent and child generations."
No. DUP will happily put both copies on the same disk. You would need to use RAID1 (or RAID1c3 for a copy on all disks) if you wanted a guarantee of the metadata being on multiple disks.
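If you want to check this on your own array, a rough sketch of classifying the metadata profile from `btrfs fi df` output follows. The `SAMPLE` text is illustrative, not captured from the system in the report, and the line layout is assumed to be `Kind, PROFILE: total=..., used=...`; the helper names are made up. Only the mirrored profiles named in this thread are treated as multi-device, so other multi-device profiles would be reported as not mirrored.

```python
import re

# Profiles from this thread that place copies on different devices.
# DUP keeps two copies but may put both on the same disk.
MIRRORED_PROFILES = {"RAID1", "RAID1C3", "RAID1C4"}

# Assumed line layout: "<Kind>, <PROFILE>: total=..., used=..."
LINE_PATTERN = re.compile(r"^(?P<kind>\w+),\s*(?P<profile>\w+):")

# Illustrative sample only, not real output from the affected machine.
SAMPLE = """\
Data, single: total=1.00GiB, used=512.00MiB
System, DUP: total=8.00MiB, used=16.00KiB
Metadata, DUP: total=1.00GiB, used=112.00MiB
GlobalReserve, single: total=16.00MiB, used=0.00B
"""


def profiles_by_kind(df_output):
    """Map each block group kind (Data, Metadata, ...) to its set of profiles."""
    result = {}
    for line in df_output.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:
            profile = match.group("profile").upper()
            result.setdefault(match.group("kind"), set()).add(profile)
    return result


def metadata_is_mirrored_across_devices(df_output):
    """True only if every metadata profile found is one of the mirrored ones."""
    profiles = profiles_by_kind(df_output).get("Metadata", set())
    return bool(profiles) and profiles <= MIRRORED_PROFILES


sample_ok = metadata_is_mirrored_across_devices(SAMPLE)
```

A filesystem can briefly show two metadata profiles in the middle of a balance, which is why the sketch collects a set per kind and requires all of them to be mirrored.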
I think what happened was that the machine ran out of battery in suspend, but an unclean shutdown shouldn't cause such a deep corruption.
That’s the only real reason. There are some papercuts, but they don’t compare to the risks described in this article.
Post-migration, a complete disk image of the original ext4 disk will exist within the new filesystem, using no additional disk space due to the magic of copy-on-write.
Why isn't the repair process the same? Fix the filesystem to get everything online asap, and leave a complete disk image of the old damaged filesystem so other recovery processes can be tried if necessary.
He keeps repeating btrfs check --repair. This command is dangerous and is flagged everywhere as a last resort: if you try to execute it you get a warning; the documentation has a warning; any guide from Google tells you not to run it unless all else fails; ChatGPT/Le Chat do not mention it, or note it as a last resort. So I'm not sure why he keeps repeating it without any note.
> Use these tools ONLY if btrfs check --repair segfaults, enters an infinite loop, or leaves the filesystem in worse shape than before.
> Timeline of events ... First repair attempts. btrfs check --repair
The guy is recommending people brick their volumes permanently as a first resort without any warning.
Between the use of a DUP profile and this, I would not be surprised if a btrfs dev just disregarded it all as slop.
> Pool only mounts with rescue=all,ro, fails to mount RW
Also, this is important: the data was not lost, even though it was only accessible read-only.
I don't think I would run this code. Still, it would be interesting for a btrfs dev to have a look and comment on whether there is any value in the generated code, as it would definitely be interesting to be able to repair more issues in the pool safely in place.