bcachefs


patreon


bcachefs posts

More expensive on disk format upgrades

6.11 was an expensive forced on disk format upgrade, for the disk accounting rewrite: while bcachefs is still marked as experimental, I'm making on disk format changes that would not be feasible if we needed compatibility code so that new versions could work on old filesystems without upgrading them.

(Note that - aside from the odd bug, like when 6.9 wasn't reading the downgrade table correctly - old kernels can still mount upgraded filesystems, they just have to downgrade, which is...


Update: no bcachefs updates for 6.13 confirmed

Fixes will have to come from my repository.

This came just as the CoC board seemed to be relenting and we had started to have a public conversation. That's now been cut off, and I think that, after the private correspondence we had, their action was dishonest.

Here's what I wrote to Michal way back in September (because yes, I was out of line, and yes, that did need to be addressed; but I don't think public mea culpas are the best way to do that).

I do hope something good comes o...


Trouble in the kernel

TLDR: the future of bcachefs in the kernel is uncertain, and lots of things aren't looking good.

Linus has said he isn't accepting my 6.13 pull request, per "an open issue with the CoC board", and at this point I have no idea what's going on with the CoC board. I, for my part, have felt for quite some time that there are issues about our culture and the way we do work that need to be raised, and that hasn't been going anywhere - hence this post.

What follows will be an account of ...


Repeat after me: you do not fuck around in filesystem land

So, kernel programming is known for being a high stakes, high pressure, toxic environment to work in. There's a reason for that: the consequences for screwing up are really high, and you can make your co-workers and a lot of users really miserable if you screw things up.

I've seen plenty of bugs that have taken weeks or months to track down, and bugs can take down the entire machine and waste a lot of people's time - not to mention if you're writing code that's included i...


Online fsck

Initial support for online fsck is merged - it's in my master branch, and will be in Linux 6.8. To use it, just run the normal fsck command; if the filesystem is mounted, it'll use the online codepath.

Not all fsck passes are safe to use while the filesystem is in use yet: online fsck only runs the subset that are safe to run. Right now that's most of the passes for checking allocation info, subvolumes and snapshots; soon that will start to include the fsck.c passes for checking high le...


Telemetry

Recently, I added a new superblock section for tracking counts of every distinct filesystem error (i.e. fsck error) since filesystem creation, as well as the date of the most recent error.

The idea is that inconsistencies that fsck is able to repair often don't go reported - but they still need to be fixed. And I won't know to go bug hunting if I don't know they're happening.
So, I'd like to add some telemetry - opt in, of course. I'm thinking a weekly cron job to upload the superblo...
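To make the idea concrete, here is a minimal sketch of per-error counters with a "most recent error" timestamp. It is only an illustration in Python, not the actual superblock layout: the class name `ErrorCounters`, its methods, and the error names are all hypothetical, and it assumes a single timestamp for the whole filesystem rather than one per error type.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, Optional


@dataclass
class ErrorCounters:
    """Hypothetical stand-in for a section that counts distinct fsck errors."""
    counts: Dict[str, int] = field(default_factory=dict)
    last_error: Optional[datetime] = None  # assumption: one date overall

    def record(self, error_name: str, when: Optional[datetime] = None) -> None:
        # Bump the counter for this error type and remember when it happened.
        self.counts[error_name] = self.counts.get(error_name, 0) + 1
        self.last_error = when or datetime.now(timezone.utc)

    def report(self) -> Dict[str, object]:
        # Roughly what an opt-in upload could contain: counts plus latest date.
        return {
            "counts": dict(self.counts),
            "last_error": self.last_error.isoformat() if self.last_error else None,
        }
```

Because the counters persist since filesystem creation, a report like this would show errors that fsck silently repaired long ago, which is the whole point of collecting it.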


Out west



Note on the phoronix numbers

Phoronix recently posted some bcachefs benchmarks, and their results looked a bit... off.
Here's what I just got, testing 4k random writes with a similar fio configuration. SSD is a Samsung 970 Evo plus. Default mkfs options for all three filesystems.

xfs: 1 job 456k iops, 8 jobs 548k iops

btrfs: 1 job 112k iops, 8 jobs 113k iops

bcachefs: 1 job 161k iops, 8 jobs 538k iops

So comparing bcachefs to xfs - we see much higher CPU usage because we're COW (every write has to al...
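One way to read the numbers above is to look at how each filesystem scales from 1 job to 8 jobs. The short Python snippet below just does that division on the figures quoted in this post; the `results` table and the `scaling` helper are names made up for this example.

```python
# 4k random write results quoted above, in thousands of IOPS: (1 job, 8 jobs).
results = {
    "xfs": (456, 548),
    "btrfs": (112, 113),
    "bcachefs": (161, 538),
}


def scaling(one_job: float, eight_jobs: float) -> float:
    """Throughput ratio when going from 1 job to 8 jobs."""
    return eight_jobs / one_job


for fs, (one, eight) in results.items():
    print(f"{fs}: {scaling(one, eight):.2f}x from 1 job to 8 jobs")
```

On these figures xfs scales about 1.20x, btrfs about 1.01x, and bcachefs about 3.34x, ending up close to xfs at 8 jobs.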


Recent work

Some recent features that have landed:

 * Rebalance work btree: Rebalance no longer has to scan for extents that need the background_target or background_compression option applied. Instead, there's a new bitset btree, updated by the extent trigger, that rebalance uses to find extents that need work done on them.

This is a big scalability improvement for multi device tiered setups, and we might extend this in the future for applying other IO path option changes in the backgro...
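The difference between scanning and keeping a work set can be sketched in a few lines. This is a toy Python model, not the bcachefs code: `RebalanceWork`, `extent_trigger` and `next_work` are hypothetical names, and a plain `set` of positions stands in for the bitset btree.

```python
class RebalanceWork:
    """Toy model: positions of extents that still need background work."""

    def __init__(self) -> None:
        self.pending = set()

    def extent_trigger(self, pos: int, needs_work: bool) -> None:
        # Runs on every extent update, so the set never goes stale.
        if needs_work:
            self.pending.add(pos)
        else:
            self.pending.discard(pos)

    def next_work(self) -> list:
        # Rebalance visits only flagged positions, in order, with no full scan.
        return sorted(self.pending)
```

The cost of finding work now depends on how many extents need work, not on how many extents exist, which is why this matters for large tiered setups.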


Your irregular status update: upstreaming, growing the team, and a funding situation update

So, the upstreaming process has been rocky - but bcachefs is in linux-next! Fingers crossed for 6.7. Since being merged into the linux-next tree we've also been getting a lot more assorted bugfixes and bug reports, mostly from static analysis - every bit helps.

Some other recent work:

 - Logged operations

Some operations are too big and complicated to be done in a single btree transaction, but we still want them to be atomic in the event of a crash. We now have code for ...


Your irregular status update

Let's see, what is there to talk about since the last:

Mostly, it's been a whole lot of bug fixing and stabilizing, grinding away at test failures in the CI: https://evilpiepirate.org/~testdashboard/ci?branch=bcachefs

We're currently at ~50 test failures, which at first doesn't seem like progress compared to 6 months ago - except that since then I've added the xfstests-nocow tests, and also added new things for our...


Thanksgiving update

There's so much that goes into developing a real filesystem. Especially one that's intended to be good enough to replace our existing filesystems, codebases that have had decades of refinement by teams of engineers.

Some days it can feel a bit overwhelming.

A filesystem has to be fast. But performance isn't just a matter of taking a codepath and optimizing it until it's fast - though there is certainly a lot of that work; looking at a profile to identify what needs to be looked at...


New test dashboard

We _finally_ have a server continuously running tests and outputting to a nice dashboard. I'm pretty excited - this is going to make my life a lot easier, and it's another thing people can look at to see the current status.

If anyone skilled in web development is interested in helping out, I'd love to make some simple frontend-only improvements - it'd be nice to have an option to see tests of a given status in the list for a given commit.

2022-07-03 19:26:07 +0000 UTC

Backpointers has been merged!

New on disk format, required upgrade. Your existing filesystem will be automagically upgraded when you upgrade kernel & tools to the new version.

What this means: much more efficient copygc. In the future we'll be using it for other things too, like possibly accelerating rebalance. The new backpointers fsck code may eventually replace a good chunk of our other fsck code, too.

Scrub is coming soon, too - I've started refactoring the data update path to prepare for that.

N...


.plan file

Reorganized my todo list last night, now automatically synced from my home directory:

https://evilpiepirate.org/~kent/.plan.txt

(Patreon's post submission tool doesn't pick up the link correctly, you'll have to edit it)

It's long, but not actually that insane. It doesn't cover the tests that are still failing and the outstanding bugs that are still filed, but that list isn't too insane either. Spiffy.


Zoned device support

I've been starting to work on support for zoned devices, laying out what needs to be done. This will get us native support for SMR (shingled magnetic recording) hard drives, and (even more exciting!) fancy new ZNS SSDs, which strip away most of the FTL: by folding that into the filesystem, we'll get better performance, especially latency, better write amplification, and more capacity. Now that the allocator rewrite has landed we have support for much larger buckets - up to a terabyte - which ...


New allocator has been merged!


It's a mandatory disk format upgrade; when switching to the new version on an existing filesystem you'll see it initialize the freespace btree when you mount.

What's changed: we've got some new persistent data structures that replace code that used to periodically walk all the buckets in the filesystem, kept in an in memory array - and now that we don't need to do that anymore, the in-memory bucket array is gone, too. Specifically, we've got...


Status update

Bacon and eggs, tea, sitting down to work on finishing the BTREE_ITER_WITH_JOURNAL patch, but before I can get to the interesting and necessary algorithmic work I've got like half a dozen bug reports and problems to respond to. And people wonder why I haven't upstreamed yet..

 - this is something I'm pretty excited about, it teaches the btree iterator code how to overlay the keys from journal replay over the btree, which means we'll be able to use all of the standard btree interfac...
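The overlay idea is essentially a merge of two sorted key streams where one shadows the other. Here is a rough Python sketch of that, under simplifying assumptions: plain dicts stand in for the btree and the journal keys, `overlay_iter` is a made-up name, and deletions (whiteouts) are not modeled.

```python
def overlay_iter(btree_keys: dict, journal_keys: dict):
    """Yield (key, value) in key order; journal entries shadow btree entries."""
    b = iter(sorted(btree_keys.items()))
    j = iter(sorted(journal_keys.items()))
    bk = next(b, None)
    jk = next(j, None)
    while bk is not None or jk is not None:
        if jk is None or (bk is not None and bk[0] < jk[0]):
            # Only the btree has a key at this position.
            yield bk
            bk = next(b, None)
        else:
            if bk is not None and bk[0] == jk[0]:
                # Same position in both: skip the older btree key.
                bk = next(b, None)
            yield jk
            jk = next(j, None)
```

A caller iterating over `overlay_iter(...)` sees one consistent, ordered view and doesn't need to know which keys have been replayed yet.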


User manual

Bcachefs now has a user manual - check it out!

https://bcachefs.org/bcachefs-principles-of-operation.pdf

Still need to expand the sections on sysfs internals, and the on disk format.


Short update

Just finished fixing a whole bunch of i_sectors accounting bugs - the count of how many sectors are in a given inode/file. It turns out our accounting on disk, in the btree, was completely fine - fsck would've noticed if it wasn't - but the in memory accounting is different due to dirty data in the page cache, and that code was completely missing assertions and few tests exercise it, so we never noticed.

Except that quotas do hang off of that accounting, and the quota code did have assertion...


Snapshots have been merged

They work similarly to btrfs snapshots. There are new bcachefs subcommands for creating subvolumes and snapshots.

We've got writable snapshots, snapshots of snapshots, and snapshot and subvolume deletion. Create as many snapshots as you want - you're only limited by the amount of disk space you have, and they're much more space efficient than btrfs snapshots; no internal fragmentation problems. I'd love to hear if someone can get to a million snapshots.

Anything but the core functi...


Status update - new on disk format change

There's a new on disk format change, and it's a required upgrade - you'll want to upgrade your kernel and tools at the same time to stay in sync.

The new update closes a hole in our ability to verify metadata that goes back to bcache: since btree nodes are log structured, historically we've had no way to detect lost btree writes (unless they were the first write to a given node).

Naturally, this eventually bit us (after 10 years of this design being in use!). The one bug I'...


Quick note

Please report it _any_ time errors are reported on a filesystem, fsck errors or otherwise. There are a couple of nasty bugs I'm trying to track down right now (one of them appears to be btree writes getting lost, but so far the reports are only from filesystems that have replication enabled, and it's always just one of the replicas, so we're able to retry from another replica when we notice it).

Even if I'm not able to fully diagnose every single report, the data is still extremely helpfu...


Disk format change for snapshots

Turns out there was a hiccup after all in the on disk format changes for snapshots: we were generating packed bkey formats where the snapshot field was too big, and this breaks the lookup code when we start using the snapshot field.

I just pushed out code to scan the btree for btree nodes that have these bad bkey formats and rewrite them, and flip a compatible feature bit once that's been done.

In order to get a smooth upgrade process, you'll need to upgrade your ker...


Snapshots are working!

They're still in a very early state, and nothing but the bare minimum is complete - but they're up, for people to look at and try out. You'll need to use the snapshots branch from both the kernel and tools repositories.

Caveats:

 - I haven't finished the compat code, you'll have to make a new filesystem (using the snapshots branch from tools). Also, I am definitely going to be adding things to the subvolume and snapshots keys and I'm not going to write compat code for that - ...


New snapshots design doc

Snapshots are really coming together - the whole thing is starting to look shockingly elegant and simple (you wouldn't be able to guess from the design doc how many years it took to get to that point...)

Comments welcome:

https://bcachefs.org/Snapshots/


Status update

Snapshots are coming :)


Upcoming changes

Recently, Dave Chinner's been helping out with some performance testing of bcachefs on his big dual socket machine, which has been great for finding scalability issues. He's also been able to provide really useful comparisons and microbenchmarks where bcachefs is still behind xfs, and I've been working on a bunch of improvements in response - in particular to the journalling code.

 - When allocating new inodes, we now use the index of the core we're running on for the high bit...


Status update - erasure coding

Erasure coding is ready for wider testing!

I just finished reworking disk space accounting for parity blocks - they're broken out from user data now, so bcachefs fs usage will show you that overhead. And I've been fixing bugs; there shouldn't be any gaping holes, and I don't expect it to catch on fire or eat data at this point. But there will definitely be bugs and performance issues to find and fix.

That's the main news, aside from a whole bunch of bug fixing. Xfstests runs are look...


Erasure coding, bcache2

Finally been making some good progress on erasure coding: currently debugging the new code to update existing stripes with new blocks, which is the main piece that was missing - this is needed to deal with internal fragmentation across erasure coded stripes and avoid running out of space.
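For anyone wondering what "updating an existing stripe with a new block" involves, here is a deliberately simplified Python sketch. It assumes a single XOR parity block per stripe, which is a swapped-in stand-in for illustration and not a description of the actual bcachefs erasure coding; `xor_blocks` and `update_parity` are hypothetical helper names.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must be the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def update_parity(parity: bytes, old_block: bytes, new_block: bytes) -> bytes:
    # XOR out the old block's contribution, then XOR in the new block's.
    return xor_blocks(xor_blocks(parity, old_block), new_block)
```

The point of the incremental form is that replacing one block only needs the old block, the new block and the parity, not a re-read of the whole stripe.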

Also - I started working on a bcache -> bcachefs layer: it lets you attach existing bcache backing devices to a bcachefs filesystem and use that to store the cached data instead of the bcache cache...
