The results range from annoying (needing to restore / “resilver” with no redundancy until it’s done; massively increased risk of data loss in the meantime due to heavy IO load without redundancy, plus pointless loss of the redundancy that already exists) to catastrophic (outright corruption). The corollary is that RAID invariably works poorly with disks connected over an interface that enumerates slowly or unreliably.
Yet most competent active-active database systems have no problems with this scenario!
I would love to see a RAID system that thinks of disks as nodes, properly elects leaders, and can efficiently fast-forward a disk that’s behind. A pile of USB-connected drives would work perfectly, would come up when a quorum was reached, and would behave correctly when only a varying subset of disks is available. Bonus points for also being able to run an array that spans multiple computers efficiently, but that would just be icing on the cake.
I'm not sure what you expect?
RAID1 is a simple data copy, and you made sure the two disks contain different data. So there are two possible outcomes: either the system notices this and copies A to B or B to A to re-establish redundancy, or it fails to notice and you get corruption.
Linux MD allows for partial sync with the bitmap. If the system knows something in the first 5% of the disk changed, it can limit itself to only syncing that 5%.
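The mechanism can be sketched as a toy model (my illustration of how a write-intent bitmap limits resync work; not the actual MD kernel code):

```python
# Toy write-intent bitmap: the disk is divided into chunks, and a bit is
# set for any chunk touched by a write. After a disk drops out and is
# re-added, only chunks with their bit set need to be resynced.

CHUNK_SIZE = 64 * 1024  # bytes covered by one bitmap bit (illustrative)

class WriteIntentBitmap:
    def __init__(self, disk_size):
        self.dirty = set()                       # chunk indices with unsynced writes
        self.nchunks = -(-disk_size // CHUNK_SIZE)  # ceiling division

    def mark_dirty(self, offset, length):
        first = offset // CHUNK_SIZE
        last = (offset + length - 1) // CHUNK_SIZE
        for chunk in range(first, last + 1):
            self.dirty.add(chunk)

    def resync_fraction(self):
        # Fraction of the disk that actually needs copying on re-add.
        return len(self.dirty) / self.nchunks

bm = WriteIntentBitmap(disk_size=100 * CHUNK_SIZE)
bm.mark_dirty(0, 5 * CHUNK_SIZE)   # writes touched the first 5% of the disk
print(bm.resync_fraction())        # -> 0.05
```

With the bitmap, the "sync the whole disk" brute-force step shrinks to just the regions known to have changed, which is exactly the 5% case described above.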
> Yet most competent active-active database systems have no problems with this scenario!
Because they're not RAID. The whole point of RAID is that it's extremely simple. This means it's a brute force method with some downsides, but in exchange it's extremely easy to reason about.
RAID is overkill for home use. It also does not solve backups and snapshots. I use one way syncthing with unlimited history, plus usb-sata adapter.
I mean, sounds like a house of cards but, should be possible?
I have a ZFS mirror, where I have taken one disk out, added files to it elsewhere, returned it and reimported.
The pool immediately resilvered the new content onto the untouched drive.
Doing this on btrfs will require a rebalance, forcing all data on the disks to be rewritten.
I believe btrfs replace will copy only the data that had a replica on the failing drive.
You would mount degraded on the remote system and copy in the new files.
After returning the new drive, you would mount normally, but getting the new content mirrored requires a rebalance.
Replace is for a blank drive, and it hasn't worked very well for me, as status/usage reported some data that was not mirrored; a rebalance fixed this.
That's a weird argument. Even if it's true, it is now stable, and has been for a long time. btrfs has long been my default, and I'd be wary of switching to something newer just because someone was mad that development took a long time.
This includes plenty of random power losses.
The people on IRC tend to default to "unless you're using an enterprise drive, it's probably buggy and doesn't respect write barriers", which shouldn't have mattered because there was no system crash involved.
Yes, I did test my RAM; I know it's fine. For comparison, I've (unintentionally) run a ZFS system with bad RAM for years and it only manifested as an occasional checksum error.
Just luck. Software can't defend itself against bad RAM. There's always the possibility that bad RAM will cause ZFS to corrupt itself in some way it can't recover from.
Everything is in RAM. The kernel, the ZFS code, everything. All of that is vulnerable to corruption. No matter how fancy ZFS is, it can't stop its own code from being corrupted. It's just luck that it didn't happen.
Be careful though. If the data to be written got corrupted early enough, i.e. before ZFS ever saw it, ZFS happily wrote corrupted data to disk with a matching checksum and you're none the wiser. But yes, it didn't blow up the entire filesystem, unlike btrfs likes to do.
Just out of curiosity: is there a specific reason you're not using plain-vanilla filesystems which _are_ stable?
Personal anecdote: i've only ever had serious corruption twice, 20-ish years ago, once with XFS and once with ReiserFS, and have primarily used the extN family of filesystems for most of the past 30 years. A filesystem only has to go corrupt on me once before i stop using it.
Edit to add a caveat: though i find the ideas behind ZFS, btrfs, etc., fascinating, i have no personal need for them so have never used them on personal systems (but did use ZFS on corporate Solaris systems many years ago). ext4 has always served me well, and comes with none of the caveats i regularly read about for any of the more advanced filesystems. Similarly, i've never needed an LVM or any such complexity. As the age-old wisdom goes, "complexity is your enemy," and keeping to simple filesystem setups has always served my personal systems/LAN well. i've also never once seen someone recover from filesystem corruption in a RAID environment by simply swapping out a disk (there's always been much more work involved), so i've never bought into the "RAID is the solution" camp.
- ZStandard compression is a performance boost on crappy spinning rust
- Snapshots are amazing, and I love being able to quickly send and store them using send and receive
- I like not having to partition the disk at all, and still be able to have multiple datasets that share the same underlying storage. LVM2 has way too many downsides for me to still consider it, like the fact that thin provisioning was quite problematic (i.e. ext4 and the like have no idea they're thin provisioned, ...)
- I like not having to bother with fstab anymore. I have all of my (complex) datasets under multiple boot roots, and I can mount pools from a live system with an altroot and immediately get all directories properly mounted
- AFAIK only ZFS and Btrfs support checksums out of the box. I hate the fact that most filesystems can in fact bitrot and silently corrupt files. Even with ZFS and Btrfs you can't always easily restore your data, but at least you'll know it got corrupted and can restore it from a backup
- I like ZVOL; I appreciate being able to use them as sparse disks for VMs that can be easily mounted without using loopback devices (you get all partitions under /dev/zvol/pool/zvol-partN)
- If you have a lot of RAM, the ZFS ARC can speed things up a lot. ZFS is somewhat slower than "simpler" filesystems most of the time, but with 10+ GB available to the ARC it's been faster in my experience than any other FS
I do use "classic" filesystems for other applications, like random USB disks and stuff. I just prefer ZFS because the feature set is so good and it's been nothing but stable in day to day use. I've literally had ZERO issues with it in 8+ years - even when using the -git version it's way more stable than Btrfs ever was.
I'd guess that it is the classic case of figuring out if something works without using it being a lot harder than giving it a go and seeing what happens. I've accidentally taken out my own home folder in the past with ill-advised setups and it is an educational experience. I wouldn't recommend it professionally, but I can see the joy in using something unusual on a personal system. Keep backups of anything you really can't afford to lose.
And one bad experience isn't enough to get a feel for how reliable something is. It is better to stick with it even if it fails once or twice.
$ grep bar foo.txt | tr A-Z a-z > foo.txt
is much more common than losing a disk.

My personal reasons are RAID + compression.
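For anyone unfamiliar with why that one-liner destroys data: the shell truncates `foo.txt` for the output redirect before `grep` ever opens it. A safe variant (my sketch, using a temporary file) looks like this:

```shell
set -e
printf 'one BAR two\nthree\n' > foo.txt

# Broken: '>' truncates foo.txt before grep reads it, emptying the file:
#   grep BAR foo.txt | tr A-Z a-z > foo.txt

# Safe: write to a temporary file, then replace the original.
grep BAR foo.txt | tr 'A-Z' 'a-z' > foo.txt.tmp && mv foo.txt.tmp foo.txt
cat foo.txt   # prints: one bar two
```

(Some systems also have `sponge` from moreutils for exactly this pattern.)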
I've personally had drive failures, fs corruption due to power loss (which is not supposed to happen on a COW filesystem), fs and file corruption due to RAM bitflips, etc. Every time, btrfs handled the situation perfectly, with the caveat that I needed help from the btrfs developers. And they were very helpful!
So yeah, btrfs has a bad rep, but it is not as bad as the common feeling makes it look.
(note that I still run btrfs raid 1, as I did not find real-world feedback regarding raid 5 or 6)
ZFS lovers need to stop this CoW against CoW violence.
Try CachyOS (or at least the ZFS-Kernel) it has excellent ZFS integration.
For two devices, 1x redundancy (so 2x copies of everything) will always limit your storage to the size of the smaller device, since otherwise it is not possible to have two copies of everything you need to store. As soon as you add a third device of at least 100GB (or replace the 100GB device with one of at least 200GB), the other 100GB of your second device will immediately come into play.
Uneven device size support is most useful when:
- You have three or more devices, or plan to grow onto three or more from an initial pair.
- You want flexibility wrt array growth (support for uneven devices usually, but not always, comes with better support for dynamic array reshaping).
- You want better quick-repair flexibility: if a 4TB drive fails, you can replace it with 2x2TB if you don't have a working 4TB unit on hand.
- You want variable redundancy (support for uneven devices sometimes comes with support for variable redundancy: keeping 3+ copies of important data, or of data you want to read fastest via striping of reads; 2 copies of other permanent data; and 1 copy of temporary storage, all in the same array). In this instance the “wasted” part of the 200GB drive in your example could be used for scratch data designated as not needing redundancy.
e.g. you have 3 100GB drives, total capacity in raid 1 is 150GB.
If you replace a broken one with a 200GB one, the total capacity will be increased to 200GB.
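The numbers above fall out of a simple rule: with two copies of every extent on different devices, usable capacity is bounded both by half the raw space and by the space available outside the largest device. A sketch of that heuristic (my approximation, matching the examples in this thread):

```python
def raid1_usable(devices_gb):
    """Approximate usable raid1 capacity (two copies of everything,
    each copy on a different device)."""
    total = sum(devices_gb)
    largest = max(devices_gb)
    # Every chunk on the largest device needs a mirror elsewhere, so at most
    # (total - largest) can be paired with it; and overall capacity can never
    # exceed half the raw space.
    return min(total / 2, total - largest)

print(raid1_usable([100, 200]))       # 100: limited by the smaller device
print(raid1_usable([100, 100, 100]))  # 150.0: three 100GB drives
print(raid1_usable([100, 100, 200]))  # 200.0: after swapping in a 200GB drive
```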
1. The scheduler doesn't really exist. IIRC it is PID % num disks.
2. The default balancing policy is super basic. (IIRC always write to the disk with the most free space).
3. Erasure coding is still experimental.
4. Replication can only be configured at the FS level. bcachefs can configure this per-file or per-directory.
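Points 1 and 2 above can be sketched roughly like this (illustrative approximations of the described behavior, not the actual btrfs kernel code):

```python
def btrfs_pick_mirror(pid, num_mirrors):
    # Read "scheduling" in btrfs raid1, as described above: the reading
    # process's PID modulo the number of mirrors. No load awareness at all.
    return pid % num_mirrors

def btrfs_pick_devices(free_space_gb, copies=2):
    # Chunk allocation, as described above: prefer the devices with the
    # most free space (one slot per copy, on distinct devices).
    ranked = sorted(range(len(free_space_gb)),
                    key=lambda i: free_space_gb[i], reverse=True)
    return ranked[:copies]

print(btrfs_pick_mirror(pid=12345, num_mirrors=2))  # -> 1
print(btrfs_pick_devices([500, 100, 900]))          # -> [2, 0]
```

Both policies are trivially simple, which is the trade-off discussed elsewhere in this thread: easy to reason about, but blind to actual device load.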
bcachefs is still early, but it shows that it is serious about multi-disk. You can lump any collection of disks together and it mostly does the right thing. It tracks the performance of different drives to place requests optimally, and balances writes to gradually even out the drives (rather than focusing all writes on a newly added disk).
IMHO there is really no comparison. If it wasn't for the fact that bcachefs ate my data I would be using it.
Or the complexity is such that when a new bug is found, it may take a long time to fix, or it is fixed fast and the fix has unexpected knock-on effects, even on the common path.
Something that takes a long time to be declared stable/reliable because of its complexity, needs to spend a long time after that declaration without significant issues before I'll actually trust it. Things like btrfs definitely live in this category.
Even bcachefs won't be something I use for important storage until it has been battle-tested a bit longer, though at this point it is much more likely to take over from my current simple ext4-on-RAID arrangement (and when/if it does, my backups might stay on ext4-on-RAID even longer).
Given the rather cheap price of durable storage these days, I would favour rock-solid, high-quality code for storing my data, at the expense of some optimisations. Then again, I still like RAID, instantaneous snapshots, COW, encryption, xattr, resizable partitions, CRC... Is it possible to have all this with acceptable performance and simple code bricks combined and layered on top of each other?
yeah, a feature-rich/complete fs is complicated; that's why we have very few of them.
ZFS does something smarter here, it keeps track of the queue length for each drive in a mirror, and picks the one with the lowest number of pending requests.
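A toy version of that selection (my illustration, not the actual OpenZFS vdev queue code):

```python
def zfs_pick_mirror(pending_requests):
    # Pick the mirror member with the shortest pending I/O queue,
    # so a slow or busy drive naturally gets fewer reads.
    return min(range(len(pending_requests)),
               key=lambda i: pending_requests[i])

# Drive 0 has 7 requests in flight, drive 1 has 2: read from drive 1.
print(zfs_pick_mirror([7, 2]))  # -> 1
```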
Personally, I was one of those people. Very excited about the prospects of btrfs, switched several machines over to it to test, ended up with filesystem corruption and had to revert to ext. Now, whenever I peek at btrfs, I never see anything that's compelling over running ZFS, which I've run for close to 15+ years, and run hard, and have never had data loss. Even in the early days with zfs+fuse, where I could regularly crash the zfs fuse; the zfs+fuse developers quickly addressed every crash I ran into, once I put together a stress test.
Is it really? I must have missed the news. Back when it was released completely raw as a default for many distros, there were fundamental design level issues (e.g. "unbound internal fragmentation" reported by Shishkin). Plus all the reports and personal experiences of getting and trying to recover exotically shaped bricks when volume fills to 100% (which could happen at any time with btrfs). Is it all good now? Where can I read about btrfs behaving robustly when no free space is left?
BTW: even SLES (SUSE Linux Enterprise Server) says to use XFS for data and btrfs just for the OS. I wonder why.
Because XFS is far quicker for server-related software such as databases and virtual machines, which are weak points on btrfs due to its COW model.
Do you know how ZFS handles that?
You can do a replace, but then you need to buy a new drive.
- I never agreed with the btrfs default of a raid1 root system not booting up if a device is missing. I think the point of raid1 is to minimize downtime when losing a device, and if you lose the other device before returning the array to a good state, that's 100% on you.
- Poor management tools compared to md (though bcachefs might be in the same boat). Some tools are poorly thought out, e.g. there is a tool for defragmentation, but it undoes sharing (so snapshots and deduped files get expanded).
- If a drive in raid1 drops but then later comes back, btrfs is still quite happy.
- Need of using btrfs balance, and in a certain way as well: https://github.com/kdave/btrfsmaintenance/blob/master/btrfs-... .
- At least it used to be difficult to recover when your filesystem becomes full. It helps if you have it on an LVM volume with extra space.
- Snapshotting or having a clone of a btrfs volume is dangerous (due to the uuid-based volume participant scanning)
- I believe raid5/6 is still experimental?
- I've lost a filesystem to btrfs raid10 (but my backups are good).
- I have also rendered my bcachefs in a state where I could no longer write to the filesystem, but I was still able to read it. So I'm inclined to keep using bcachefs for the time being.
Overall I just have the impression that btrfs ended up complicated and in a design dead-end, making improvements anywhere from hard to impossible, and I hope that bcachefs has made different base design choices, making future improvements easier.
Yes, the number of developers for bcachefs is smaller, but frankly, as long as it's possible for a project to advance with a single developer, that is going to be the most effective way to go. At the same time, I hope this situation improves in the future.
Add "degraded" to default mount options. Solved.
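The suggestion amounts to something like this hypothetical fstab entry (placeholder UUID; device paths and options depend on your setup):

```
UUID=<fs-uuid>  /  btrfs  defaults,degraded  0  1
```

Note this only changes mount behavior; the array still needs to be brought back to a fully redundant state afterwards.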
That's been implemented; in Linux 6.11 bcachefs will correct errors on read. See
> - Self healing on read IO/checksum error
in https://lore.kernel.org/linux-bcachefs/73rweeabpoypzqwyxa7hl...
Making it possible to scrub from userspace by walking and reading everything (tar -c /mnt/bcachefs >/dev/null).
Repro: the supposedly only good copy is read into RAM, RAM corrupts a bit, the CRC is recalculated over the corrupted buffer, and the corrupted copy is written back to disk(s).
Why would it need to recalculate the CRC? The correct CRC (or other hash) for the data is already stored in the metadata trees; it's how it discovered that the data was corrupted in the first place. If it writes back corrupted data, it will be detected as corrupted again the next time.
That's how bcachefs is designed right now.
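In miniature, the difference between the two writeback behaviors discussed above (a toy sketch using CRC32; not the actual bcachefs code):

```python
import zlib

stored_data = b"important block"
stored_crc = zlib.crc32(stored_data)   # checksum kept in the metadata tree

# A bit flips in RAM after the block was read and verified:
buf = bytearray(stored_data)
buf[0] ^= 0x01

# Broken scrub: recompute the CRC over the in-RAM buffer and write both
# back. The corruption now carries a "valid" checksum and is undetectable
# on any later read.
bad_crc = zlib.crc32(bytes(buf))
assert bad_crc != stored_crc           # matches the bad data, not the good data

# Correct scrub: re-verify against the *stored* CRC just before writeback,
# and refuse to write (or re-read a good copy) on mismatch.
ok_to_write = zlib.crc32(bytes(buf)) == stored_crc
print(ok_to_write)  # -> False: the corrupted buffer is caught, not persisted
```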
Our RAM should all be ECC and our OSes should all be on self-healing filesystems.
0 problems in 2.5 months is not necessarily better than 1-2 problems in ~3 years, though. If we're just talking about the single partition boot drive use case, I think I'd go with the option that's had vastly more time to find and eliminate bugs. (If you're conservative about this stuff that probably means ext4, actually.)
- Stability but also
- Constant refactorings
and later
"Disclaimer, my personal data is stored on ZFS"
A bit troubling, I find
"RAID0 behavior is default when using multiple disks" - never have I had the need for RAID0, nor have I seen a customer using it. I think it was at one time popular with gamers, before SSDs became popular and cheap.
"RAID 5/6 (experimental)
This is referred to as erasure coding and is listed as “DO NOT USE YET”."
Well, you got to start somewhere, but a comparison with btrfs and ZFS seems premature.

> A bit troubling, I find
I appreciated the candor
The approach of the bcachefs developers is that they will only recommend its usage once it's absolutely, 100% stable and won't eat your data. Bcachefs isn't in that state yet and the developers don't pretend it is.
This avoids the kind of trust issues that btrfs has
> The RAID56 feature provides striping and parity over several devices, same as the traditional RAID5/6. There are some implementation and design deficiencies that make it unreliable for some corner cases and the feature should not be used in production, only for evaluation or testing. The power failure safety for metadata with RAID56 is not 100%.
AFAIK ZFS has had deduplication support for a very long time (2009) and now even does opportunistic block cloning with much less overhead.
The new block cloning still had data corruption bugs quite recently.
But it has de-duplication; by your logic, no non-CoW FS should be in that list because they are not comparable.
The chart should have separate block-dedup and file-dedup columns if it is deemed not comparable.
In theory full file deduplication exists in every filesystem that has cow/reflink support
fclones group . | fclones dedupe
btrfs doesn't have built-in encryption.
> ZFS Encryption Y
I cannot find the discussion right now, but I remember reading that they were considering adding a warning when enabling encryption, because it was not really stable and people were running into crashes.
https://github.com/openzfs/zfs/issues?q=is%3Aissue+label%3A%...
I see it more as an administrative problem than an issue with ZFS encryption.
I bought a new SSD and HDD for my desktop this year and looked into running bcachefs, because it offers caching as well as native encryption and COW. I also determined that it is not production-ready yet for my use case; my filesystem is the last thing I want to be a beta tester of. I investigated using bcache again, but opted for LVM caching, as it offers better tooling and saves one layer of block devices (with LUKS and btrfs on top). Performance is great and partition manipulations also worked flawlessly.
Hopefully bcachefs gains more traction and will be ready for production use, as it combines several useful features. My current setup still feels like making compromises.
why is this a bad thing?
Never again.
I eagerly await bcachefs reaching maturity!
I have a USB stick with btrfs + LUKS on Arch Linux and it never had a problem like this
Tried again with btrfs and hard freezes again.
None of those filesystems are comparable to BTRFS, since they're not COW. BTRFS isn't for crappy USB drives, since it has a lot more overhead than EXT4 and XFS, which the controllers and flash chips in junky USB drives can't handle.
Btrfs is also far more reliable than ZFS in my view, because it has far far more real world testing, and is also much more actively developed.
Magical perfect elegant code isn't what makes a good filesystem: real world testing, iteration, and bugfixing is. BTRFS has more of that right now than anything else ever has.
On the other hand, while I haven't used it for /, dipping my toes in bcachefs with recoverable data has been a pleasant experience. Compression, encryption, checksumming, deduplication, easy filesystem resizing, SSD acceleration, ease of adding devices… it's good to have it all in one place.
That's not really true: it's deployed across a wide variety of workloads. Not databases, obviously, but reliability concerns have nothing to do with that.
My point isn't "they use it, it must be good": that's silly. My point is that they employ multiple full time engineers dedicated to finding and fixing the bugs in upstream Linux, and because of that, BTRFS is more well tested in practice than anything else out there today.
It doesn't matter how well thought out or "elegant" bcachefs or ZFS are: they don't have a team of full time engineers with access to thousands upon thousands of machines running the filesystem actively fixing bugs. That's what actually matters.
> Compression, encryption, checksumming, deduplication, easy filesystem resizing, SSD acceleration, ease of adding devices... it's good to have it all in one place.
BTRFS does all of that today.
ZFS has corruption bugs, this one was far worse than anything I've seen in btrfs recently: https://lists.freebsd.org/archives/freebsd-stable/2023-Novem...
I myself would never run a file server without ECC and a UPS configured for a graceful shutdown. I have also never had any issues, but I only have about 10tb of data.
If you're really conservative with these things, as some of us are, you currently don't really have a single safe COW pick. (Smug FreeBSD users incoming.) I have most trust in bcachefs over the long term.
I have both bcachefs and ext4 filesystems on the same machine, for different uses.