The results range from annoying (needing to restore / “resilver” with no redundancy until it’s done; massively increased risk of data loss in the meantime due to heavy IO load without redundancy, plus pointless loss of the redundancy that already exists) to catastrophic (outright corruption). The corollary is that RAID invariably works poorly with disks connected over an interface that enumerates slowly or unreliably.
Yet most competent active-active database systems have no problems with this scenario!
I would love to see a RAID system that thinks of disks as nodes, properly elects leaders, and can efficiently fast-forward a disk that’s behind. A pile of USB-connected drives would work perfectly, would come up when a quorum was reached, and would behave correctly when only a varying subset of disks is available. Bonus points for also being able to run an array that spans multiple computers efficiently, but that would just be icing on the cake.
I'm not sure what you expect?
RAID1 is a simple data copy, and you made sure the two disks contain different data. So there are two possible outcomes: either the system notices this and copies A to B or B to A to re-establish redundancy, or it fails to notice and you get corruption.
Linux MD allows for partial sync with the bitmap. If the system knows something in the first 5% of the disk changed, it can limit itself to only syncing that 5%.
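The mechanism can be sketched as a toy model (my illustration of how a write-intent bitmap limits resync work; not the actual MD kernel code):

```python
# Toy write-intent bitmap: the disk is divided into chunks, and a bit is
# set for any chunk touched by a write. After a disk drops out and is
# re-added, only chunks with their bit set need to be resynced.

CHUNK_SIZE = 64 * 1024  # bytes covered by one bitmap bit (illustrative)

class WriteIntentBitmap:
    def __init__(self, disk_size):
        self.dirty = set()                       # chunk indices with unsynced writes
        self.nchunks = -(-disk_size // CHUNK_SIZE)  # ceiling division

    def mark_dirty(self, offset, length):
        first = offset // CHUNK_SIZE
        last = (offset + length - 1) // CHUNK_SIZE
        for chunk in range(first, last + 1):
            self.dirty.add(chunk)

    def resync_fraction(self):
        # Fraction of the disk that actually needs copying on re-add.
        return len(self.dirty) / self.nchunks

bm = WriteIntentBitmap(disk_size=100 * CHUNK_SIZE)
bm.mark_dirty(0, 5 * CHUNK_SIZE)   # writes touched the first 5% of the disk
print(bm.resync_fraction())        # -> 0.05
```

With the bitmap, the "sync the whole disk" brute-force step shrinks to just the regions known to have changed, which is exactly the 5% case described above.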
> Yet most competent active-active database systems have no problems with this scenario!
Because they're not RAID. The whole point of RAID is that it's extremely simple. This means it's a brute force method with some downsides, but in exchange it's extremely easy to reason about.
RAID is overkill for home use. It also does not solve backups and snapshots. I use one way syncthing with unlimited history, plus usb-sata adapter.
I mean, sounds like a house of cards but, should be possible?
I have a ZFS mirror, where I have taken one disk out, added files to it elsewhere, returned it and reimported.
The pool immediately resilvered the new content onto the untouched drive.
Doing this on btrfs will require a rebalance, forcing all data on the disks to be rewritten.
I believe btrfs replace will copy only the data that had a replica on the failing drive.
You would mount degraded on the remote system and copy in the new files.
After returning the new drive, you would mount normally, but getting the new content mirrored requires a rebalance.
Replace is for a blank drive, and it hasn't worked very well for me, as status/usage reported some data that was not mirrored; a rebalance fixed this.
That's a weird argument. Even if it's true, it is now stable, and has been for a long time. btrfs has long been my default, and I'd be wary of switching to something newer just because someone was mad that development took a long time.
This includes plenty of random power losses.
The people on IRC tend to default to "unless you're using an enterprise drive, it's probably buggy and doesn't respect write barriers", which shouldn't have mattered because there was no system crash involved.
Yes, I did test my RAM; I know it's fine. For comparison, I've (unintentionally) run a ZFS system with bad RAM for years and it only manifested as an occasional checksum error.
Just luck. Software can't defend itself against bad RAM. There's always the possibility that bad RAM will cause ZFS to corrupt itself in some way it can't recover from.
Everything is in RAM. The kernel, the ZFS code, everything. All of that is vulnerable to corruption. No matter how fancy ZFS is, it can't stop its own code from being corrupted. It's just luck that it didn't happen.
Be careful though. If the data to be written got corrupted early enough, i.e. before ZFS ever saw it, ZFS happily wrote corrupted data to disk with a matching checksum and you're none the wiser. But yes, it didn't blow up the entire filesystem, unlike btrfs likes to do.
Just out of curiosity: is there a specific reason you're not using plain-vanilla filesystems which _are_ stable?
Personal anecdote: i've only ever had serious corruption twice, 20-ish years ago, once with XFS and once with ReiserFS, and have primarily used the extN family of filesystems for most of the past 30 years. A filesystem only has to go corrupt on me once before i stop using it.
Edit to add a caveat: though i find the ideas behind ZFS, btrfs, etc., fascinating, i have no personal need for them so have never used them on personal systems (but did use ZFS on corporate Solaris systems many years ago). ext4 has always served me well, and comes with none of the caveats i regularly read about for any of the more advanced filesystems. Similarly, i've never needed an LVM or any such complexity. As the age-old wisdom goes, "complexity is your enemy," and keeping to simple filesystem setups has always served my personal systems/LAN well. i've also never once seen someone recover from filesystem corruption in a RAID environment by simply swapping out a disk (there's always been much more work involved), so i've never bought into the "RAID is the solution" camp.
- ZStandard compression is a performance boost on crappy spinning rust
- Snapshots are amazing, and I love being able to quickly send and store them using send and receive
- I like not having to partition the disk at all, and still be able to have multiple datasets that share the same underlying storage. LVM2 has way too many downsides for me to still consider it, like the fact that thin provisioning was quite problematic (i.e. ext4 and the like have no idea they're thin provisioned, ...)
- I like not having to bother with fstab anymore. I have all of my (complex) datasets under multiple boot roots, and I can mount pools from a live system with an altroot and immediately get all directories properly mounted
- AFAIK only ZFS and Btrfs support checksums out of the box. I hate the fact that most filesystems can in fact bitrot and silently corrupt files. Even with ZFS and Btrfs you can't always easily restore your data, but at least you'll know it got corrupted and can restore it from a backup
- I like ZVOL; I appreciate being able to use them as sparse disks for VMs that can be easily mounted without using loopback devices (you get all partitions under /dev/zvol/pool/zvol-partN)
- If you have a lot of RAM, the ZFS ARC can speed things up a lot. ZFS is somewhat slower than "simpler" filesystems most of the time, but with 10+ GB available to the ARC it's been faster in my experience than any other FS
I do use "classic" filesystems for other applications, like random USB disks and stuff. I just prefer ZFS because the feature set is so good and it's been nothing but stable in day to day use. I've literally had ZERO issues with it in 8+ years - even when using the -git version it's way more stable than Btrfs ever was.
I'd guess that it is the classic case of figuring out if something works without using it being a lot harder than giving it a go and seeing what happens. I've accidentally taken out my own home folder in the past with ill-advised setups and it is an educational experience. I wouldn't recommend it professionally, but I can see the joy in using something unusual on a personal system. Keep backups of anything you really can't afford to lose.
And one bad experience isn't enough to get a feel for how reliable something is. It is better to stick with it even if it fails once or twice.
$ grep bar foo.txt | tr A-Z a-z > foo.txt
is much more common than losing a disk.

My personal reasons are RAID + compression.
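For anyone unfamiliar with why that one-liner destroys data: the shell truncates `foo.txt` for the output redirect before `grep` ever opens it. A safe variant (my sketch, using a temporary file) looks like this:

```shell
set -e
printf 'one BAR two\nthree\n' > foo.txt

# Broken: '>' truncates foo.txt before grep reads it, emptying the file:
#   grep BAR foo.txt | tr A-Z a-z > foo.txt

# Safe: write to a temporary file, then replace the original.
grep BAR foo.txt | tr 'A-Z' 'a-z' > foo.txt.tmp && mv foo.txt.tmp foo.txt
cat foo.txt   # prints: one bar two
```

(Some systems also have `sponge` from moreutils for exactly this pattern.)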
I've personally had drive failures, fs corruption due to power loss (which is not supposed to happen on a COW filesystem), fs and file corruption due to RAM bitflips, etc. Every time, btrfs handled the situation perfectly, with the caveat that I needed help from the btrfs developers. And they were very helpful!
So yeah, btrfs has a bad rep, but it is not as bad as the common feeling makes it look.
(note that I still run btrfs raid 1, as I did not find real-world feedback regarding raid 5 or 6)
ZFS lovers need to stop this CoW against CoW violence.
Try CachyOS (or at least the ZFS-Kernel) it has excellent ZFS integration.
For two devices, 1x redundancy (so 2x copies of everything) will always limit your storage to the size of the smaller device, since otherwise it is not possible to have two copies of everything you need to store. As soon as you add a third device of at least 100GB (or replace the 100GB device with one of at least 200GB), the other 100GB of your second device will immediately come into play.
Uneven device size support is most useful when:
- You have three or more devices, or plan to grow onto three or more from an initial pair.
- You want flexibility wrt array growth (support for uneven devices usually, but not always, comes with better support for dynamic array reshaping).
- You want better quick-repair flexibility: if a 4TB drive fails, you can replace it with 2x2TB if you don't have a working 4TB unit on hand.
- You want variable redundancy (support for uneven devices sometimes comes with support for variable redundancy: keeping 3+ copies of important data, or of data you want to read fastest via striping of reads; 2 copies of other permanent data; and 1 copy of temporary storage, all in the same array). In this instance the “wasted” part of the 200GB drive in your example could be used for scratch data designated as not needing redundancy.
e.g. you have 3 100GB drives, total capacity in raid 1 is 150GB.
If you replace a broken one with a 200GB one, the total capacity will be increased to 200GB.
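The numbers above fall out of a simple rule: with two copies of every extent on different devices, usable capacity is bounded both by half the raw space and by the space available outside the largest device. A sketch of that heuristic (my approximation, matching the examples in this thread):

```python
def raid1_usable(devices_gb):
    """Approximate usable raid1 capacity (two copies of everything,
    each copy on a different device)."""
    total = sum(devices_gb)
    largest = max(devices_gb)
    # Every chunk on the largest device needs a mirror elsewhere, so at most
    # (total - largest) can be paired with it; and overall capacity can never
    # exceed half the raw space.
    return min(total / 2, total - largest)

print(raid1_usable([100, 200]))       # 100: limited by the smaller device
print(raid1_usable([100, 100, 100]))  # 150.0: three 100GB drives
print(raid1_usable([100, 100, 200]))  # 200.0: after swapping in a 200GB drive
```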
1. The scheduler doesn't really exist. IIRC it is PID % num disks.
2. The default balancing policy is super basic. (IIRC always write to the disk with the most free space).
3. Erasure coding is still experimental.
4. Replication can only be configured at the FS level. bcachefs can configure this per-file or per-directory.
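Points 1 and 2 above can be sketched roughly like this (illustrative approximations of the described behavior, not the actual btrfs kernel code):

```python
def btrfs_pick_mirror(pid, num_mirrors):
    # Read "scheduling" in btrfs raid1, as described above: the reading
    # process's PID modulo the number of mirrors. No load awareness at all.
    return pid % num_mirrors

def btrfs_pick_devices(free_space_gb, copies=2):
    # Chunk allocation, as described above: prefer the devices with the
    # most free space (one slot per copy, on distinct devices).
    ranked = sorted(range(len(free_space_gb)),
                    key=lambda i: free_space_gb[i], reverse=True)
    return ranked[:copies]

print(btrfs_pick_mirror(pid=12345, num_mirrors=2))  # -> 1
print(btrfs_pick_devices([500, 100, 900]))          # -> [2, 0]
```

Both policies are trivially simple, which is the trade-off discussed elsewhere in this thread: easy to reason about, but blind to actual device load.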
bcachefs is still early, but it shows that it is serious about multi-disk. You can lump any collection of disks together and it mostly does the right thing. It tracks the performance of different drives to place requests optimally, and balances writes to gradually even out the drives (rather than focusing all writes on a newly added disk).
IMHO there is really no comparison. If it wasn't for the fact that bcachefs ate my data I would be using it.
Or the complexity is such that when a new bug is found, it may take a long time to fix, or it is fixed fast and the fix has unexpected knock-on effects, even on the common path.
Something that takes a long time to be declared stable/reliable because of its complexity, needs to spend a long time after that declaration without significant issues before I'll actually trust it. Things like btrfs definitely live in this category.
Even bcachefs won't be something I use for important storage until it has been battle-tested a bit longer, though at this point it is much more likely to take over from my current simple ext4-on-RAID arrangement (and when/if it does, my backups might stay on ext4-on-RAID even longer).
Given the rather cheap price of durable storage these days, I would favour rock-solid, high-quality code for storing my data, at the expense of some optimisations. Then again, I still like RAID, instantaneous snapshots, COW, encryption, xattr, resizable partitions, CRC... Is it possible to have all this with acceptable performance and simple code bricks combined and layered on top of each other?
yeah, a feature-rich/complete fs is complicated; that's why we have very few of them.
ZFS does something smarter here, it keeps track of the queue length for each drive in a mirror, and picks the one with the lowest number of pending requests.
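A toy version of that selection (my illustration, not the actual OpenZFS vdev queue code):

```python
def zfs_pick_mirror(pending_requests):
    # Pick the mirror member with the shortest pending I/O queue,
    # so a slow or busy drive naturally gets fewer reads.
    return min(range(len(pending_requests)),
               key=lambda i: pending_requests[i])

# Drive 0 has 7 requests in flight, drive 1 has 2: read from drive 1.
print(zfs_pick_mirror([7, 2]))  # -> 1
```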
Personally, I was one of those people. Very excited about the prospects of btrfs, switched several machines over to it to test, ended up with filesystem corruption and had to revert to ext. Now, whenever I peek at btrfs, I never see anything that's compelling over running ZFS, which I've run for close to 15+ years, and run hard, and have never had data loss. Even in the early days with zfs+fuse, where I could regularly crash the zfs fuse; the zfs+fuse developers quickly addressed every crash I ran into, once I put together a stress test.
Is it really? I must have missed the news. Back when it was released completely raw as a default for many distros, there were fundamental design level issues (e.g. "unbound internal fragmentation" reported by Shishkin). Plus all the reports and personal experiences of getting and trying to recover exotically shaped bricks when volume fills to 100% (which could happen at any time with btrfs). Is it all good now? Where can I read about btrfs behaving robustly when no free space is left?
BTW: even SLES (SUSE Linux Enterprise Server) says to use XFS for data and btrfs just for the OS. I wonder why.
Because XFS is far quicker for server-related software such as databases and virtual machines, which are weak points on btrfs due to its COW model.
Do you know how ZFS handles that?
You can do a replace, but then you need to buy a new drive.
- I never agreed with the btrfs default of a raid1 root system not booting up if a device is missing. I think the point of raid1 is to minimize downtime when losing a device, and if you lose the other device before returning the array to a good state, that's 100% on you.
- Poor management tools compared to md (though bcachefs might be in the same boat). Some tools are poorly thought out, e.g. there is a tool for defragmentation, but it undoes sharing (so snapshots and deduped files get expanded).
- If a drive in raid1 drops but then later comes back, btrfs is still quite happy.
- Need of using btrfs balance, and in a certain way as well: https://github.com/kdave/btrfsmaintenance/blob/master/btrfs-... .
- At least it used to be difficult to recover when your filesystem becomes full. It helps if you have it on an LVM volume with extra space.
- Snapshotting or having a clone of a btrfs volume is dangerous (due to the uuid-based volume participant scanning)
- I believe raid5/6 is still experimental?
- I've lost a filesystem to btrfs raid10 (but my backups are good).
- I have also rendered my bcachefs in a state where I could no longer write to the filesystem, but I was still able to read it. So I'm inclined to keep using bcachefs for the time being.
Overall I just have the impression that btrfs ended up complicated and in a design dead-end, making improvements anywhere from hard to impossible, and I hope that bcachefs has made different base design choices, making future improvements easier.
Yes, the number of developers for bcachefs is smaller, but frankly, as long as it's possible for a project to advance with a single developer, that is going to be the most effective way to go. At the same time, I hope this situation improves in the future.
Add "degraded" to default mount options. Solved.
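The suggestion amounts to something like this hypothetical fstab entry (placeholder UUID; device paths and options depend on your setup):

```
UUID=<fs-uuid>  /  btrfs  defaults,degraded  0  1
```

Note this only changes mount behavior; the array still needs to be brought back to a fully redundant state afterwards.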
That's been implemented; in Linux 6.11 bcachefs will correct errors on read. See
> - Self healing on read IO/checksum error
in https://lore.kernel.org/linux-bcachefs/73rweeabpoypzqwyxa7hl...
Making it possible to scrub from userspace by walking and reading everything (tar -c /mnt/bcachefs >/dev/null).
Repro: the supposedly only good copy is read into RAM, RAM corrupts a bit, the CRC is recalculated over the corrupted buffer, and the corrupted copy is written back to disk(s).
Why would it need to recalculate the CRC? The correct CRC (or other hash) for the data is already stored in the metadata trees; it's how it discovered that the data was corrupted in the first place. If it writes back corrupted data, it will be detected as corrupted again the next time.
That's how bcachefs is designed right now.
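In miniature, the difference between the two writeback behaviors discussed above (a toy sketch using CRC32; not the actual bcachefs code):

```python
import zlib

stored_data = b"important block"
stored_crc = zlib.crc32(stored_data)   # checksum kept in the metadata tree

# A bit flips in RAM after the block was read and verified:
buf = bytearray(stored_data)
buf[0] ^= 0x01

# Broken scrub: recompute the CRC over the in-RAM buffer and write both
# back. The corruption now carries a "valid" checksum and is undetectable
# on any later read.
bad_crc = zlib.crc32(bytes(buf))
assert bad_crc != stored_crc           # matches the bad data, not the good data

# Correct scrub: re-verify against the *stored* CRC just before writeback,
# and refuse to write (or re-read a good copy) on mismatch.
ok_to_write = zlib.crc32(bytes(buf)) == stored_crc
print(ok_to_write)  # -> False: the corrupted buffer is caught, not persisted
```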
Our RAM should all be ECC and our OSes should all be on self-healing filesystems.
0 problems in 2.5 months is not necessarily better than 1-2 problems in ~3 years, though. If we're just talking about the single partition boot drive use case, I think I'd go with the option that's had vastly more time to find and eliminate bugs. (If you're conservative about this stuff that probably means ext4, actually.)
- Stability but also
- Constant refactorings
and later
"Disclaimer, my personal data is stored on ZFS"
A bit troubling, I find
"RAID0 behavior is default when using multiple disks" - never have I had the need for RAID0, nor have I seen a customer using it. I think it was at one time popular with gamers, before SSDs became popular and cheap.
"RAID 5/6 (experimental)
This is referred to as erasure coding and is listed as “DO NOT USE YET”."
Well, you got to start somewhere, but a comparison with btrfs and ZFS seems premature.

> A bit troubling, I find
I appreciated the candor
The approach of the bcachefs developers is that they will only recommend its usage once it's absolutely, 100% stable and won't eat your data. Bcachefs isn't in that state yet and the developers don't pretend it is.
This avoids the kind of trust issues that btrfs has
> The RAID56 feature provides striping and parity over several devices, same as the traditional RAID5/6. There are some implementation and design deficiencies that make it unreliable for some corner cases and the feature should not be used in production, only for evaluation or testing. The power failure safety for metadata with RAID56 is not 100%.
AFAIK ZFS has had deduplication support for a very long time (2009) and now even does opportunistic block cloning with much less overhead.
The new block cloning still had data corruption bugs quite recently.
But it has de-duplication; by your logic, no non-CoW FS should be in that list because they are not comparable.
The chart should have separate block-dedup and file-dedup columns if it is deemed not comparable.
In theory full file deduplication exists in every filesystem that has cow/reflink support
fclones group . | fclones dedupe
btrfs doesn't have built-in encryption.
> ZFS Encryption Y
I cannot find the discussion right now, but I remember reading that they were considering adding a warning when enabling encryption, because it was not really stable and people were running into crashes.
https://github.com/openzfs/zfs/issues?q=is%3Aissue+label%3A%...
I see it more as an administrative problem than an issue with ZFS encryption.
I bought a new SSD and HDD for my desktop this year and looked into running bcachefs, because it offers caching as well as native encryption and COW. I also determined that it is not production-ready yet for my use case; my filesystem is the last thing I want to be a beta tester of. I investigated using bcache again, but opted for LVM caching, as it offers better tooling and saves one layer of block devices (with LUKS and btrfs on top). Performance is great and partition manipulations also worked flawlessly.
Hopefully bcachefs gains more traction and will be ready for production use, as it combines several useful features. My current setup still feels like making compromises.
why is this a bad thing?
Never again.
I eagerly await bcachefs reaching maturity!
I have a USB stick with btrfs + LUKS on Arch Linux and it never had a problem like this
Tried again with btrfs and hard freezes again.
None of those filesystems are comparable to BTRFS, since they're not COW. BTRFS isn't for crappy USB drives, since it has a lot more overhead than EXT4 and XFS, which the controllers and flash chips in junky USB drives can't handle.
Btrfs is also far more reliable than ZFS in my view, because it has far far more real world testing, and is also much more actively developed.
Magical perfect elegant code isn't what makes a good filesystem: real world testing, iteration, and bugfixing is. BTRFS has more of that right now than anything else ever has.
On the other hand, while I haven't used it for /, dipping my toes in bcachefs with recoverable data has been a pleasant experience. Compression, encryption, checksumming, deduplication, easy filesystem resizing, SSD acceleration, ease of adding devices… it's good to have it all in one place.
That's not really true: it's deployed across a wide variety of workloads. Not databases, obviously, but reliability concerns have nothing to do with that.
My point isn't "they use it, it must be good": that's silly. My point is that they employ multiple full time engineers dedicated to finding and fixing the bugs in upstream Linux, and because of that, BTRFS is more well tested in practice than anything else out there today.
It doesn't matter how well thought out or "elegant" bcachefs or ZFS are: they don't have a team of full time engineers with access to thousands upon thousands of machines running the filesystem actively fixing bugs. That's what actually matters.
> Compression, encryption, checksumming, deduplication, easy filesystem resizing, SSD acceleration, ease of adding devices... it's good to have it all in one place.
BTRFS does all of that today.
ZFS has corruption bugs, this one was far worse than anything I've seen in btrfs recently: https://lists.freebsd.org/archives/freebsd-stable/2023-Novem...
I myself would never run a file server without ECC and a UPS configured for a graceful shutdown. I have also never had any issues, but I only have about 10tb of data.
If you're really conservative with these things, as some of us are, you currently don't really have a single safe COW pick. (Smug FreeBSD users incoming.) I have most trust in bcachefs over the long term.
I have both bcachefs and ext4 filesystems on the same machine, for different uses.