bcachefs

bcachefs posts

Status update

 - Interior btree node updates are now journalled, removing the need for btree writes to be FUA

 - Interior btree node updates are now fully transactional; we no longer have to do any metadata scanning after unclean shutdown

 - Btree key cache code has been merged

 - Major rework of journal replay finally finished

 - Lots of bug fixing

So, some background:

Historically, the btree and the journal in bcache/bcachefs have been fairly s...

View Post

Towards snapshots

Just finished a major rework that gets us a step closer to snapshots: the btree code is incrementally being changed to handle extents like regular keys.

Previously, when reading in a btree node we'd have to check for and handle partially overwritten extents, as part of the mergesort we do (btree nodes are log structured). But the plan for snapshots will break this algorithm; storing extents from different snapshots in the same position in the btree breaks the ordering requirements for t...
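A rough sketch of what "handling partially overwritten extents" means, in Python rather than the real C code. This is a naive model and not the mergesort itself: the names `overwrite` and `merge_sets`, and the `(start, end, value)` tuple layout, are made up for illustration. It assumes the sets in a node are applied oldest first and that a newer extent wins wherever it overlaps an older one.

```python
def overwrite(existing, new):
    """Insert `new` into a sorted, non-overlapping extent list.

    Extents are (start, end, value) tuples covering [start, end).
    Older extents that overlap `new` are trimmed or split around it.
    """
    new_start, new_end, _ = new
    out = []
    for start, end, value in existing:
        if end <= new_start or start >= new_end:
            out.append((start, end, value))      # no overlap, keep as-is
            continue
        if start < new_start:
            out.append((start, new_start, value))  # keep the front piece
        if end > new_end:
            out.append((new_end, end, value))      # keep the back piece
    out.append(new)
    out.sort()
    return out


def merge_sets(sets):
    """Combine several sets of extents, oldest set first."""
    result = []
    for extents in sets:
        for extent in extents:
            result = overwrite(result, extent)
    return result


older = [(0, 10, "a")]
newer = [(4, 6, "b")]
print(merge_sets([older, newer]))
```

Here the older extent is split into two pieces around the newer one, which is the kind of fixup that only works if all extents at a given position belong to the same ordering.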

View Post

Status update

There is now a (very work-in-progress) fuse port!

The fuse port isn't intended to ever be for serious use - but I do expect it to be useful for debugging in the future; if someone is hitting a repeatable bug in the bcachefs code, debugging it via the fuse version (with gdb) should be much easier for most people than collecting kernel oopses. We've also already found at least one real bug by running the fuse version of bcachefs under valgrind (thanks to Justin Husted for that).

Als...

View Post

At long last - reflink is done

For those who aren't familiar with the idea - reflink means using shared, reference counted extents to do "shallow copies" - copies that share data transparently on disk, but are copy on write (unlike hardlinked files).
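To make the idea concrete, here is a toy Python model of shared, reference counted extents with copy on write. It is only a sketch: `ExtentStore`, `File` and their methods are hypothetical names, and nothing here reflects the actual on-disk format.

```python
class ExtentStore:
    """Holds extent payloads plus a reference count for each one."""

    def __init__(self):
        self.data = {}   # extent id -> bytes
        self.refs = {}   # extent id -> number of files pointing at it
        self.next_id = 0

    def alloc(self, payload):
        eid = self.next_id
        self.next_id += 1
        self.data[eid] = payload
        self.refs[eid] = 1
        return eid

    def get(self, eid):
        self.refs[eid] += 1

    def put(self, eid):
        self.refs[eid] -= 1
        if self.refs[eid] == 0:
            # Last reference dropped: the space can be reclaimed.
            del self.data[eid]
            del self.refs[eid]


class File:
    def __init__(self, store, extents):
        self.store = store
        self.extents = list(extents)

    @classmethod
    def create(cls, store, chunks):
        return cls(store, [store.alloc(c) for c in chunks])

    def reflink_copy(self):
        # Shallow copy: share every extent and bump its refcount.
        for eid in self.extents:
            self.store.get(eid)
        return File(self.store, self.extents)

    def write(self, index, payload):
        # Copy on write: point at a fresh extent, drop our old reference.
        old = self.extents[index]
        self.extents[index] = self.store.alloc(payload)
        self.store.put(old)

    def read(self):
        return b"".join(self.store.data[e] for e in self.extents)


store = ExtentStore()
original = File.create(store, [b"aa", b"bb"])
copy = original.reflink_copy()
copy.write(1, b"cc")
print(original.read(), copy.read())
```

After the write, the two files still share the first extent but no longer the second, and the original is untouched.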

To use it, just use cp --reflink. It's great for virtual machine images, and you can also use it like snapshots - e.g. "cp -a --reflink foo foo-$(date -I)". It's not as good as real snapshots because creating the pseudo-snapshot isn't atomic and is more expensive than cr...

View Post

Still hacking away at reflink

It's pretty close to done, but working through the last of the xfstests failures has been tedious.

But - I just pushed out a bunch of prep work patches, and something else cool is now done - we're exporting the actual filesystem blocksize to the Linux VFS, instead of pretending the filesystem blocksize is actually PAGE_SIZE. This was needed to get one of the reflink tests in xfstests to pass, but it was also the biggest blocker for supporting variable size pages (i.e. compound pages) in...

View Post

Notes on Phoronix benchmarks

Phoronix posted some bcachefs benchmarks: https://www.phoronix.com/scan.php?page=article&item=bcachefs-linux-2019

The results are actually pretty encouraging, even if they might not look it on the surface - they're about what you'd expect at this point. Given a large enough codebase, if 95% of it is thoroughly optimized, but there's a couple fastpaths that have perform...

View Post

Fully persistent allocation info is finally done

Finally! It was a huge effort, but it's done and pushed out.

This means that when mounting a filesystem - even after an unclean shutdown - we don't have to walk all the metadata anymore, because it's always updated in a transactional manner and kept fully consistent in the b-tree.

There may be a performance regression for now on multithreaded write workloads, due to lock contention on the alloc btree. But, that will go away when I implement the new btree key cache code (it'll gen...

View Post

Status update

5.0 rebase is up

And, more importantly - fully persistent allocation info is finally just about done! It's passing the tests, not much left before I can push it out...

View Post

Status update - persistent alloc info

So, first some background:

Fully persistent allocation info is going to require updating the alloc btree every time we update the extents btree - one key in the alloc btree for every pointer in an extent being inserted or overwritten.

That introduces a bit of a difficulty, in that extents can overwrite an unbounded number of existing extents (though we can trim the extent being inserted and not insert it all at once, and let it be merged again later) - and for every update being d...

View Post

More on fully persistent allocation information

So, to recap: bcachefs now persists allocation information on clean shutdown, so mounting after a clean shutdown doesn't require walking any metadata. However, we're not yet keeping allocation information updated as it's modified - that's my current project.

There are two main components to this. Firstly, there's the filesystem wide sector counts, which are now broken out by replica sets (i.e. "number of sectors of data replicated across drives x, y and z"). Since typically every write ope...
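As an illustration of counters "broken out by replica sets", here is a small Python sketch. The class and method names are hypothetical, and how the real counters are laid out or updated is not described here; the only point is that the key is the set of devices holding the data, not a single device.

```python
from collections import Counter


class UsageCounters:
    """Sector counts keyed by the set of devices the data lives on."""

    def __init__(self):
        self.sectors = Counter()

    def account(self, devices, sectors):
        # frozenset makes {x, y, z} and {z, y, x} the same counter.
        # A negative `sectors` models data being overwritten or deleted.
        self.sectors[frozenset(devices)] += sectors

    def sectors_on(self, device):
        # Per-device usage can be derived by summing over replica sets.
        return sum(n for devs, n in self.sectors.items() if device in devs)


usage = UsageCounters()
usage.account(["x", "y", "z"], 128)
usage.account(["x", "y"], 64)
usage.account(["z", "y", "x"], 8)
print(usage.sectors[frozenset("xyz")], usage.sectors_on("x"))
```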

View Post

Fast mounts update

Persistent alloc info for clean shutdowns is finally done - this means when mounting after a clean shutdown, we don't have to scan metadata anymore, and mounting should be just as fast or faster than other filesystems.

We do still run fsck by default on every mount, so to see any change you'll have to turn that off with the nofsck mount option:

mount -o nofsck /dev/sda1 /mnt

As always, keep the bug reports coming.

View Post

bcachefs at FOSDEM

I'll be at FOSDEM. I'm not planning on giving a talk or anything, but if anyone else is interested and is going to be there, send a message and I'd love to meet up.

View Post

Status update - quotas and option handling

Option handling improvements: There's a single master list of options in opts.h, and that list is now used by bcachefs format as well, including for bcachefs format --help. This is a nice usability improvement - it means options are always specified the same way anywhere they can be used, and it means the helptext is always going to be consistent with the actual options.
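The same "one table, many consumers" pattern can be sketched in Python. This is not how opts.h works internally; the option names, defaults and help strings below are made up for illustration, and `build_parser` / `parse_mount_opts` are hypothetical helpers.

```python
import argparse

# Single master list: (name, type, default, help text).
# Entries are illustrative only.
OPTIONS = [
    ("block_size", int, 4096, "filesystem block size in bytes"),
    ("compression", str, "none", "compression algorithm"),
    ("data_replicas", int, 1, "number of data replicas"),
]


def build_parser(prog):
    """Command-line parser (and its --help) generated from OPTIONS."""
    parser = argparse.ArgumentParser(prog=prog)
    for name, typ, default, helptext in OPTIONS:
        parser.add_argument(f"--{name}", type=typ, default=default, help=helptext)
    return parser


def parse_mount_opts(optstr):
    """Parse a 'name=value,name=value' string using the same table."""
    table = {name: (typ, default) for name, typ, default, _ in OPTIONS}
    opts = {name: default for name, (_, default) in table.items()}
    for item in filter(None, optstr.split(",")):
        name, _, value = item.partition("=")
        if name not in table:
            raise ValueError(f"unknown option: {name}")
        opts[name] = table[name][0](value)
    return opts


parser = build_parser("format")
print(parser.parse_args(["--data_replicas", "2"]))
print(parse_mount_opts("compression=zstd,data_replicas=2"))
```

Because both entry points read the same list, an option cannot be accepted in one place and missing from the help text in another.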

The next thing we should do is use opts.h for generating a man page - if anyone wants to take that one on that would b...

View Post

Status update - fast mount times, reflink

So for now, I'm leaving off the remaining parts of erasure coding - the important part was getting everything done that impacts both the on disk format, and the rest of the design. There's some commonality between erasure coding and some of the other upcoming features, so getting erasure coding mostly done now was very useful because it was a good angle for working on that common functionality.

What I really want to be working on next is reflink, but it turns out in order to do reflink I prett...

View Post

Erasure coding has been pushed

It's not production ready yet - stripe level copygc isn't implemented yet, so disk fragmentation could lead to your filesystem getting filled with partially empty stripes and getting stuck. But, aside from that it should be functional.

To use it, just enable the erasure_code option, either at mount time

mount -o erasure_code=true

or via sysfs

echo 1 > /sys/fs/bcachefs/*/options/erasure_code

or, just for a certain file or directory

setfattr -n bcachefs.erasure_c...

View Post

Erasure coding is coming!

First off, sorry for the slow progress lately - I've been dealing with some health issues that have been making it incredibly difficult to work. But, the good news is that we may have finally figured out what's going on and *fingers crossed* aforementioned issues seem to finally, slowly be getting better.

The other good news, though: with the work I have managed to get done lately, erasure coding is finally seeing some major progress, and when it's done it's going to be _slick_. You'll be ab...

View Post

Bcachefs extents - compression, checksumming

One topic that was asked about recently was compression in bcachefs, so I thought I'd write a bit about how extents are represented, as a bunch of stuff falls out of that.

In bcachefs, checksumming and compression are done per extent, not per block or per page. This means we store one checksum per extent and if the data is compressed, it'll be compressed all at once instead of being broken up into page size chunks (like btrfs does).
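A minimal Python sketch of per-extent compression and checksumming, under stated assumptions: zlib and CRC32 are stand-ins for whatever algorithms are configured, the checksum is taken over the compressed payload (an assumption for this model), and `write_extent` / `read_range` are hypothetical names.

```python
import zlib


def write_extent(data):
    """Compress the whole extent at once and store a single checksum."""
    payload = zlib.compress(data)
    return {"payload": payload, "crc": zlib.crc32(payload), "size": len(data)}


def read_range(extent, offset, length):
    """Read part of an extent.

    In this model even a small read has to verify and decompress the
    entire extent, because there is only one checksum and one
    compressed stream for all of it.
    """
    if zlib.crc32(extent["payload"]) != extent["crc"]:
        raise IOError("checksum mismatch")
    data = zlib.decompress(extent["payload"])
    return data[offset:offset + length]


extent = write_extent(b"abc" * 1000)
print(len(extent["payload"]), read_range(extent, 3, 3))
```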

This means we're making some tradeoffs. Whenever we read...

View Post

Vote for the next deep dive topic!

I've gotten a few comments that people have been enjoying my technical deep dives into things I'm working on.


There's a lot of other things I could write about as well, not just bcachefs but perhaps also other kernel and storage topics. I'd like to hear what people are interested in, though. If you've got an idea of something you'd like to learn more about, post it below.

View Post

Filesystem metadata operations are now all fully atomic

In the last post, I wrote about some new transaction infrastructure I was working on that would make it practical to make all the high level filesystem operations (e.g. create, link, unlink) fully atomic - that work is now finished and merged in.

The main benefit from this work is that now, on unclean shutdown, we don't have to walk the filesystem hierarchy (i.e. all the dirents and inodes) to recalculate every inode's link count. Also, there were a few other operations, unrelated to i_nlink a...

View Post

Progress towards faster mount times - new transaction infrastructure

I've talked a bit before about the new transaction infrastructure I've been working on, but to recap:

bcachefs has, for quite some time, had the ability to use multiple btree iterators simultaneously, and to do multiple btree updates atomically - the main btree update function takes a list of (iterator, new key) pairs and does all the updates atomically by grabbing write locks on all the relevant btree nodes simultaneously (and in the correct order) and putting all the updates in one journal r...
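The locking scheme described above can be sketched in Python. This is a simplified model, not the real btree code: an "iterator" is reduced to a reference to the node it points at, `Node` and `commit` are hypothetical names, and "correct order" is assumed to mean a fixed global order (here, by node id) so that two concurrent commits cannot deadlock.

```python
import threading


class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.lock = threading.Lock()
        self.keys = {}


def commit(updates, journal):
    """Apply a list of (node, key, value) updates atomically.

    All relevant nodes are write-locked in a fixed order, the whole
    batch goes into the journal as one entry, and only then are the
    nodes modified.
    """
    unique = {node.node_id: node for node, _, _ in updates}
    nodes = [unique[i] for i in sorted(unique)]
    for node in nodes:
        node.lock.acquire()
    try:
        journal.append([(node.node_id, k, v) for node, k, v in updates])
        for node, key, value in updates:
            node.keys[key] = value
    finally:
        for node in reversed(nodes):
            node.lock.release()


a, b = Node(1), Node(2)
journal = []
commit([(b, "dirent", "foo"), (a, "inode", 42)], journal)
print(journal)
```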

View Post

Btree unit tests

Been spending a surprising amount of time lately on the core btree - in a good way, as in "oh, here's some good and useful improvements I can easily make", not "oh crap, this thing is broken and I have to fix it".

Some of this was motivated by the truncate bug and needing to implement BTREE_INSERT_NOUNLOCK, and more has been motivated by some more advanced transaction functionality I'm working on (that I should write another post about), but the actual work has mostly revolved around clarifying and...

View Post

The bug squashing continues...

Been squashing quite a few bugs lately, but this latest one has been quite a trip down the rabbit hole...

Initial symptom was that on xfstest generic/475, very occasionally we'd see an extent past the end of a file's current i_size (the test runs a filesystem stress test while injecting IO errors and then checks that the filesystem is consistent, it's quite the torture test).

After several days of head scratching and shotgunning assertions all over the place trying to figure out where the...

View Post

Status update

definitely not drunk debugging right now


I know I've been shit at posting updates, so ask your questions now - about what's going on with upstreaming or anything else you can think of

View Post

New feature: specify a device's durability

Just pushed a new feature (only lightly tested so far): when formatting, you can specify a "durability" for each device: the effect of this is that data on that device will be counted as being replicated that many times.

So if you've got a filesystem with two SSDs and a big hardware RAID array: you probably want all your data to be replicated - you don't want to lose data if one of the SSDs dies - but you don't want bcachefs replicating it if it's on the hardware RAID array. With this feat...
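One way to picture the accounting, as a Python sketch. The durability values below (1 for each SSD, 2 for the RAID array), the target of 2 replicas, and the function names are all assumptions chosen for the example, not values taken from the tools.

```python
def total_durability(devices_with_copy, durability):
    """Each copy counts as many times as its device's durability."""
    return sum(durability[dev] for dev in devices_with_copy)


def needs_more_replicas(devices_with_copy, durability, target):
    return total_durability(devices_with_copy, durability) < target


# Assumed configuration for the example.
durability = {"ssd0": 1, "ssd1": 1, "raid": 2}
target = 2

print(needs_more_replicas(["ssd0"], durability, target))          # one SSD copy
print(needs_more_replicas(["ssd0", "ssd1"], durability, target))  # both SSDs
print(needs_more_replicas(["raid"], durability, target))          # RAID only
```

With these numbers, a single copy on the RAID array already meets the target, while a single copy on one SSD does not.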

View Post

Tiering is dead; long live disk groups

The new disk groups-based code for configuring data placement has been merged, and the notion of configuring disks into "tiers" has been removed. If you have an existing filesystem that uses tiering, you'll have to configure the new interfaces.

The reasoning behind the change was that a "disk tier" wasn't really a thing - it was just a hint to a couple different parts of the IO subsystem as to where they should put data and how they should move it around. Instead of having one hint - that...

View Post

Just pushed support for zstd compression

Please test (and don't assume it won't eat all your data)

View Post

ktest

The test framework I use for bcachefs - ktest - has been getting various cleanups and fixes to make it easier for other people to use - in particular, it works on non debian distributions now.

For anyone who's been interested in getting started with kernel development or bcachefs development, ktest makes it really easy to get started: no messing with virtual machines to set up a test environment, it'll build a kernel and launch a VM to run your tests with a single command.

Check it out...

View Post

Initramfs support for root on encrypted bcachefs

I just pushed initramfs hooks/scripts for handling a bcachefs encrypted root filesystem - after you make install in bcachefs-tools, they'll be picked up next time you generate an initramfs, and if your root filesystem is encrypted you'll be prompted for the passphrase to unlock it when booting up.

I've only tested it on debian. It could also be prettier, too - patches welcome.

View Post

New rereplicate tool; replication ready for testing

Replication support is finally feature complete; it should have everything implemented that's needed for handling and recovering from device failure.

If replication is enabled on a filesystem, a device can fail and be removed while the filesystem is in use without returning any IO errors to userspace - reads/writes will be retried as needed, including on checksum error.
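The retry behaviour can be sketched in Python as follows. This is a simplified model with hypothetical names: each replica is a `(name, read_function)` pair, CRC32 stands in for the configured checksum, and an error only reaches the caller once every replica has failed.

```python
import zlib


def read_with_retry(replicas, expected_crc):
    """Try each replica in turn; fall through on IO or checksum errors."""
    errors = []
    for name, read in replicas:
        try:
            data = read()
        except OSError as err:
            errors.append((name, str(err)))      # device failed or missing
            continue
        if zlib.crc32(data) != expected_crc:
            errors.append((name, "checksum mismatch"))
            continue
        return data
    raise OSError(f"all replicas failed: {errors}")


def failed_device():
    raise OSError("device gone")


good = b"hello"
replicas = [
    ("sda", failed_device),        # device has died
    ("sdb", lambda: b"hellp"),     # silently corrupted copy
    ("sdc", lambda: good),         # healthy copy
]
print(read_with_retry(replicas, zlib.crc32(good)))
```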

By default, mount is only allowed if all RW and RO devices are present. You can mount with devices missing with mount -o...

View Post

Migrate tool

Just fixed some bugs in the migrate tool; it should be working again.

View Post