- Interior btree node updates are now journalled, removing the need for btree writes to be FUA
- Interior btree node updates are now fully transactional, so we no longer have to do any metadata scanning after unclean shutdown
- Btree key cache code has been merged
- Major rework of journal replay finally finished
- Lots of bug fixing
So, some background:
Historically, the btree and the journal in bcache/bcachefs have been fairly s...
2020-06-18 23:06:26 +0000 UTC
Just finished a major rework that gets us a step closer to snapshots: the btree code is incrementally being changed to handle extents like regular keys.
Previously, when reading in a btree node we'd have to check for and handle partially overwritten extents, as part of the mergesort we do (btree nodes are log structured). But the plan for snapshots will break this algorithm; storing extents from different snapshots in the same position in the btree breaks the ordering requirements for t...
2019-12-29 17:36:19 +0000 UTC
There is now a (very work-in-progress) fuse port!
The fuse port isn't intended to ever be for serious use - but I do expect it to be useful for debugging in the future; if someone is hitting a repeatable bug in the bcachefs code, debugging it via the fuse version (with gdb) should be much easier for most people than collecting kernel oopses. We've also already found at least one real bug by running the fuse version of bcachefs under valgrind (thanks to Justin Husted for that).
Als...
2019-11-06 20:07:53 +0000 UTC
For those who aren't familiar with the idea - reflink means using shared, reference counted extents to do "shallow copies" - copies that share data transparently on disk, but are copy on write (unlike hardlinked files).
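As a rough illustration of the idea (not the bcachefs implementation), shared reference counted extents with copy on write can be sketched in a few lines of Python. The `Extent` and `File` classes and their methods are hypothetical names made up for this sketch, and a real filesystem tracks many extents per file rather than one.

```python
class Extent:
    """A run of data on disk, shared between files via a reference count."""

    def __init__(self, data: bytes):
        self.data = data
        self.refcount = 1


class File:
    """Toy file that points at exactly one extent (an assumption of this sketch)."""

    def __init__(self, extent: Extent):
        self.extent = extent

    def reflink_copy(self) -> "File":
        # Shallow copy: share the extent and bump its reference count.
        self.extent.refcount += 1
        return File(self.extent)

    def write(self, data: bytes) -> None:
        # Copy on write: never modify an extent another file still references.
        if self.extent.refcount > 1:
            self.extent.refcount -= 1
            self.extent = Extent(data)
        else:
            self.extent.data = data

    def read(self) -> bytes:
        return self.extent.data


original = File(Extent(b"vm image contents"))
copy = original.reflink_copy()
shared_before_write = copy.extent is original.extent
copy.write(b"modified contents")
```

After the write, the copy has its own extent and the original is untouched, which is the difference from a hardlink, where both names would see the modification.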
To use it, just use cp --reflink. It's great for virtual machine images, and you can also use it like snapshots - e.g. "cp -a --reflink foo foo-$(date -I)". It's not as good as real snapshots because creating the pseudo-snapshot isn't atomic and is more expensive than cr...
2019-08-21 17:28:01 +0000 UTC
It's pretty close to done, but working through the last of the xfstests failures has been tedious.
But - I just pushed out a bunch of prep work patches, and something else cool is now done - we're exporting the actual filesystem blocksize to the Linux VFS, instead of pretending the filesystem blocksize is actually PAGE_SIZE. This was needed to get one of the reflink tests in xfstests to pass, but it was also the biggest blocker for supporting variable size pages (i.e. compound pages) in...
2019-08-07 15:03:34 +0000 UTC
Phoronix posted some bcachefs benchmarks: https://www.phoronix.com/scan.php?page=article&item=bcachefs-linux-2019
The results are actually pretty encouraging, even if they might not look it on the surface - they're about what you'd expect at this point. Given a large enough codebase, if 95% of it is thoroughly optimized, but there's a couple fastpaths that have perform...
2019-06-26 19:04:54 +0000 UTC
Finally! It was a huge effort, but it's done and pushed out.
This means that when mounting a filesystem - even after an unclean shutdown - we don't have to walk all the metadata anymore, because it's always updated in a transactional manner and kept fully consistent in the b-tree.
There may be a performance regression for now on multithreaded write workloads, due to lock contention on the alloc btree. But, that will go away when I implement the new btree key cache code (it'll gen...
2019-04-20 03:39:45 +0000 UTC
5.0 rebase is up
And, more importantly - fully persistent allocation info is finally just about done! It's passing the tests, not much left before I can push it out...
2019-04-04 02:14:11 +0000 UTC
So, first some background:
Fully persistent allocation info is going to require updating the alloc btree every time we update the extents btree - one key in the alloc btree for every pointer in an extent being inserted or overwritten.
That introduces a bit of a difficulty, in that extents can overwrite an unbounded number of existing extents (though we can trim the extent being inserted and not insert it all at once, and let it be merged again later) - and for every update being d...
2019-03-04 20:30:31 +0000 UTC
So, to recap: bcachefs now persists allocation information on clean shutdown, so mounting after a clean shutdown doesn't require walking any metadata. However, we're not yet keeping allocation information updated as it's modified - that's my current project.
There's two main components to this. Firstly, there's the filesystem wide sector counts, which are now broken out by replica sets (i.e. "number of sectors of data replicated across drives x, y and z"). Since typically every write ope...
2019-02-18 17:55:01 +0000 UTC
Persistent alloc info for clean shutdowns is finally done - this means when mounting after a clean shutdown, we don't have to scan metadata anymore, and mounting should be just as fast or faster than other filesystems.
We do still run fsck by default on every mount, so to see any change you'll have to turn that off with the nofsck mount option:
mount -o nofsck /dev/sda1 /mnt
As always, keep the bug reports coming.
2019-02-10 00:59:52 +0000 UTC
I'll be at FOSDEM. I'm not planning on giving a talk or anything, but if anyone else is interested and is going to be there, send a message and I'd love to meet up.
2019-01-12 19:33:54 +0000 UTC
Option handling improvements: There's a single master list of options in opts.h, and that list is now used by bcachefs format as well, including for bcachefs format --help. This is a nice usability improvement - it means options are always specified the same way anywhere they can be used, and it means the helptext is always going to be consistent with the actual options.
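The "one master list, many consumers" pattern can be sketched in Python. This is only an analogy for what opts.h does in C: the `OPTIONS` table, its entries and types, and both helper functions are assumptions made up for illustration, not the real contents of opts.h.

```python
import argparse

# Hypothetical master list: (name, type, default, help text).
OPTIONS = [
    ("erasure_code", bool, False, "enable erasure coding"),
    ("nofsck", bool, False, "skip fsck at mount time"),
    ("durability", int, 1, "times data on this device counts as replicated"),
]


def build_parser(prog: str) -> argparse.ArgumentParser:
    """Generate a command line parser (and its --help text) from the table."""
    parser = argparse.ArgumentParser(prog=prog)
    for name, kind, default, help_text in OPTIONS:
        if kind is bool:
            parser.add_argument(f"--{name}", action="store_true",
                                default=default, help=help_text)
        else:
            parser.add_argument(f"--{name}", type=kind,
                                default=default, help=help_text)
    return parser


def parse_mount_options(optstr: str) -> dict:
    """Parse a 'mount -o' style string against the same table."""
    table = {name: (kind, default) for name, kind, default, _ in OPTIONS}
    result = {name: default for name, (kind, default) in table.items()}
    for item in filter(None, optstr.split(",")):
        name, _, value = item.partition("=")
        if name not in table:
            raise ValueError(f"unknown option: {name}")
        kind, _ = table[name]
        if kind is bool:
            result[name] = value in ("", "1", "true")
        else:
            result[name] = kind(value)
    return result


format_args = build_parser("format").parse_args(["--durability", "2"])
mount_opts = parse_mount_options("erasure_code=true,nofsck")
help_text = build_parser("format").format_help()
```

Because both consumers read the same table, an option added there shows up in the parser, the help text and the mount string handling at once, which is the consistency property described above.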
The next thing we should do is use opts.h for generating a man page - if anyone wants to take that one on that would b...
2018-12-27 15:06:10 +0000 UTC
So for now, I'm leaving off the remaining parts of erasure coding - the important part was getting everything done that impacts both the on disk format, and the rest of the design. There's some commonality between erasure coding and some of the other upcoming features, so getting erasure coding mostly done now was very useful because it was a good angle for working on that common functionality.
What I really want to be working on next is reflink, but it turns out in order to do reflink I prett...
2018-11-30 19:04:29 +0000 UTC
It's not production ready yet - stripe level copygc isn't implemented yet, so disk fragmentation could lead to your filesystem getting filled with partially empty stripes and getting stuck. But, aside from that it should be functional.
To use it, just enable the erasure_code option, either at mount time
mount -o erasure_code=true
or via sysfs
echo 1 > /sys/fs/bcachefs/*/options/erasure_code
or, just for a certain file or directory
setfattr -n bcachefs.erasure_c...
2018-11-14 05:15:47 +0000 UTC
First off, sorry for the slow progress lately - I've been dealing with some health issues that have been making it incredibly difficult to work. But, the good news is that we may have finally figured out what's going on and *fingers crossed* aforementioned issues seem to finally, slowly be getting better.
The good news is though - with the work I have managed to get done lately, erasure coding is finally seeing some major progress, and when it's done it's going to be _slick_. You'll be ab...
2018-10-12 17:34:44 +0000 UTC
One topic that was asked about recently was compression in bcachefs, so I thought I'd write a bit about how extents are represented, as a bunch of stuff falls out of that.
In bcachefs, checksumming and compression are done per extent, not per block or per page. This means we store one checksum per extent and if the data is compressed, it'll be compressed all at once instead of being broken up into page size chunks (like btrfs does).
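To make the per extent versus per page distinction concrete, here is a minimal Python sketch. zlib and CRC32 are stand-ins chosen for convenience, not necessarily the algorithms bcachefs uses, and `PAGE_SIZE` and all function names are assumptions for this example.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size for the comparison


def store_per_extent(data: bytes):
    """One compressed blob and one checksum for the whole extent."""
    return zlib.compress(data), zlib.crc32(data)


def store_per_page(data: bytes):
    """One compressed blob and one checksum per page size chunk."""
    pages = [data[i:i + PAGE_SIZE] for i in range(0, len(data), PAGE_SIZE)]
    return [(zlib.compress(p), zlib.crc32(p)) for p in pages]


def read_extent(blob: bytes, checksum: int) -> bytes:
    """Reading any part means decompressing and verifying the whole extent."""
    data = zlib.decompress(blob)
    if zlib.crc32(data) != checksum:
        raise IOError("checksum mismatch")
    return data


extent = b"bcachefs extent payload " * 2048  # 48 KiB of compressible data
blob, checksum = store_per_extent(extent)
pages = store_per_page(extent)
```

The per extent layout carries a single checksum and lets the compressor see all of the data at once, at the cost that `read_extent` always has to process the entire extent.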
This means we're making some tradeoffs. Whenever we read...
2018-08-13 21:55:07 +0000 UTC
I've gotten a few comments that people have been enjoying my technical deep dives into things I'm working on.
There's a lot of other things I could write about as well, not just bcachefs but perhaps also other kernel and storage topics. I'd like to hear what people are interested in, though. If you've got an idea of something you'd like to learn more about, post it below.
2018-08-06 22:30:21 +0000 UTC
In the last post, I wrote about some new transaction infrastructure I was working on that would make it practical to make all the high level filesystem operations (e.g. create, link, unlink) fully atomic - that work is now finished and merged in.
The main benefit from this work is that now, on unclean shutdown, we don't have to walk the filesystem hierarchy (i.e. all the dirents and inodes) to recalculate every inode's link count. Also, there were a few other operations, unrelated to i_nlink a...
2018-07-17 14:18:18 +0000 UTC
I've talked a bit before about the new transaction infrastructure I've been working on, but to recap:
bcachefs has, for quite some time, had the ability to use multiple btree iterators simultaneously, and to do multiple btree updates atomically - the main btree update function takes a list of (iterator, new key) pairs and does all the updates atomically by grabbing write locks on all the relevant btree nodes simultaneously (and in the correct order) and putting all the updates in one journal r...
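The locking pattern described here (take every lock in one fixed order, apply all updates, write one journal entry) can be sketched in Python. This is a toy model, not the bcachefs code: `Node`, `journal` and `update_atomically` are hypothetical names, each iterator is modelled simply as the node it points at, and ordering by `node_id` is an assumed stand-in for "the correct order".

```python
import threading


class Node:
    """Toy btree node: an id used for lock ordering, a lock, and its keys."""

    def __init__(self, node_id: int):
        self.node_id = node_id
        self.lock = threading.Lock()
        self.keys = {}


journal = []  # each entry records one atomic batch of updates


def update_atomically(updates):
    """updates: list of (node, key, value) tuples applied as one unit."""
    # Deduplicate nodes and sort them so every caller locks in the same
    # order, which avoids deadlock between concurrent multi-node updates.
    nodes = sorted({n.node_id: n for n, _, _ in updates}.values(),
                   key=lambda n: n.node_id)
    for node in nodes:
        node.lock.acquire()
    try:
        # All updates go into a single journal entry...
        journal.append([(n.node_id, k, v) for n, k, v in updates])
        # ...and are applied while every relevant node is write locked.
        for n, k, v in updates:
            n.keys[k] = v
    finally:
        for node in reversed(nodes):
            node.lock.release()


inodes, dirents = Node(1), Node(2)
update_atomically([(dirents, "foo", "inode 42"), (inodes, 42, "nlink=1")])
```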
2018-07-06 23:21:14 +0000 UTC
Been spending a surprising amount of time lately on the core btree - in a good way, as in "oh, here's some good and useful improvements I can easily make", not "oh crap, this thing is broken and I have to fix it".
Some of this was motivated by the truncate bug and needing to implement BTREE_INSERT_NOUNLOCK, and more has been motivated by some more advanced transaction functionality I'm working on (that I should write another post about), but the actual work has mostly revolved around clarifying and...
2018-06-08 00:46:34 +0000 UTC
Been squashing quite a few bugs lately, but this latest one has been quite a trip down the rabbit hole...
Initial symptom was that on xfstest generic/475, very occasionally we'd see an extent past the end of a file's current i_size (the test runs a filesystem stress test while injecting IO errors and then checks that the filesystem is consistent, it's quite the torture test).
After several days of head scratching and shotgunning assertions all over the place trying to figure out where the...
2018-06-01 18:07:57 +0000 UTC
definitely not drunk debugging right now
I know I've been shit at posting updates, so ask your questions now - about what's going on with upstreaming or anything else you can think of
2018-05-25 05:11:36 +0000 UTC
Just pushed a new feature (only lightly tested so far): when formatting, you can specify a "durability" for each device: the effect of this is that data on that device will be counted as being replicated that many times.
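A small Python sketch of how such an accounting rule could work. It assumes the durabilities of the devices holding a copy are simply summed and compared against the wanted replica count; the function names, device names and values are all hypothetical.

```python
def effective_replicas(devices, durability):
    """Sum the durability of every device holding a copy of the data."""
    return sum(durability[dev] for dev in devices)


def needs_more_copies(devices, durability, replicas_wanted):
    """True if the data still counts as under-replicated."""
    return effective_replicas(devices, durability) < replicas_wanted


# Hypothetical device names and durability values.
durability = {"ssd0": 1, "ssd1": 1, "raid0": 2}

on_one_ssd = needs_more_copies(["ssd0"], durability, 2)
on_raid = needs_more_copies(["raid0"], durability, 2)
on_both_ssds = needs_more_copies(["ssd0", "ssd1"], durability, 2)
```

Under this assumed rule, a single copy on a durability 2 device already satisfies a request for two replicas, while a single copy on a durability 1 device does not.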
So if you've got a filesystem with two SSDs and a big hardware RAID array: you probably want all your data to be replicated - you don't want to lose data if one of the SSDs dies - but you don't want bcachefs replicating it if it's on the hardware RAID array. With this feat...
2018-03-13 20:16:46 +0000 UTC
The new disk groups-based code for configuring data placement has been merged, and the notion of configuring disks into "tiers" has been removed. If you have an existing filesystem that uses tiering, you'll have to configure the new interfaces.
The reasoning behind the change was that a "disk tier" wasn't really a thing - it was just a hint to a couple different parts of the IO subsystem as to where they should put data and how they should move it around. Instead of having one hint - that...
2018-02-20 21:03:09 +0000 UTC
Please test (and don't assume it won't eat all your data)
2018-02-17 00:16:13 +0000 UTC
The test framework I use for bcachefs - ktest - has been getting various cleanups and fixes to make it easier for other people to use - in particular, it works on non-Debian distributions now.
For anyone who's been interested in getting started with kernel development or bcachefs development, ktest makes it really easy to get started: no messing with virtual machines to set up a test environment, it'll build a kernel and launch a VM to run your tests with a single command.
Check it out...
2018-02-13 21:01:55 +0000 UTC
I just pushed initramfs hooks/scripts for handling a bcachefs encrypted root filesystem - after you make install in bcachefs-tools, they'll be picked up next time you generate an initramfs, and if your root filesystem is encrypted you'll be prompted for the passphrase to unlock it when booting up.
I've only tested it on Debian. It could be prettier, too - patches welcome.
2018-02-11 19:32:56 +0000 UTC
Replication support is finally feature complete; it should have everything implemented that's needed for handling and recovering from device failure.
If replication is enabled on a filesystem, a device can fail and be removed while the filesystem is in use without returning any IO errors to userspace - reads/writes will be retried as needed, including on checksum error.
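The retry behaviour can be illustrated with a short Python sketch. It is a simplified model rather than the actual IO path: replicas are modelled as plain callables, CRC32 stands in for whatever checksum is configured, and `read_with_retry` and `ReadError` are hypothetical names.

```python
import zlib


class ReadError(Exception):
    """Raised only when no replica could supply valid data."""


def read_with_retry(replicas, expected_crc):
    """Try each replica in turn; skip ones that fail or return bad data."""
    for read in replicas:
        try:
            data = read()
        except OSError:
            continue  # device failed or was removed: try the next copy
        if zlib.crc32(data) == expected_crc:
            return data
        # Checksum error: treat it like a failed read and try the next copy.
    raise ReadError("no replica returned valid data")


good = b"file contents"


def failed_device():
    raise OSError("device removed")


def corrupt_device():
    return b"file c0ntents"


def healthy_device():
    return good


result = read_with_retry([failed_device, corrupt_device, healthy_device],
                         zlib.crc32(good))
```

The caller only sees an error when every replica has failed, which mirrors the point that userspace gets no IO errors as long as one good copy remains.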
By default, mount is only allowed if all RW and RO devices are present. You can mount with devices missing with mount -o...
2018-02-08 21:02:52 +0000 UTC
just fixed some bugs in the migrate tool, should be working again
2018-02-07 16:15:41 +0000 UTC