bcachefs

What's cooking

Added 2017-12-13 16:07:29 +0000 UTC

I've been back to work for several months but procrastinating and neglecting the updates...

The biggest thing I've been spending my time on lately has been improving the test infrastructure and test suite and chasing bugs. It seems I was putting this off for far too long, relying just on xfstests - this made sense while I was focusing on completing and debugging the filesystem interface, but more recently work has focused on the core IO path - the part of the code that handles things like checksumming, compression, encryption, replication, multiple devices, tiering, et cetera. Xfstests isn't useful for testing that stuff, and there are many different combinations of bcachefs features that all need testing.

But, much progress has been made - all new tests are passing with the exception of some of the tests with replication enabled. That includes in particular a new copygc torture test, where we fill up a filesystem (writing sequentially to a file until we get -ENOSPC), and then do 4k random writes within that file - that test uncovered some embarrassing bugs in the disk reservation machinery.
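As a rough illustration of the shape of that copygc torture test (not the actual test from the bcachefs test suite), here is a minimal sketch. The function names, the `max_bytes` safety cap, and the chunk size are assumptions made for the example; the real test simply writes until the filesystem returns -ENOSPC.

```python
import errno
import os
import random

BLOCK = 4096  # 4k random writes, as in the test described above


def fill_sequentially(path, max_bytes, chunk=1 << 20):
    """Write sequentially to `path` until ENOSPC or until `max_bytes`
    (a safety cap added for this sketch) is reached. Returns bytes written."""
    written = 0
    buf = os.urandom(chunk)
    with open(path, "wb", buffering=0) as f:
        while written < max_bytes:
            n = min(chunk, max_bytes - written)
            try:
                written += f.write(buf[:n])
            except OSError as e:
                if e.errno == errno.ENOSPC:
                    break  # filesystem is full: this is the expected exit
                raise
    return written


def random_overwrites(path, size, count, seed=0):
    """Do `count` block-aligned 4k overwrites at random offsets within the
    first `size` bytes of an existing file."""
    nblocks = size // BLOCK
    if nblocks == 0:
        return
    rng = random.Random(seed)
    fd = os.open(path, os.O_WRONLY)
    try:
        for _ in range(count):
            offset = rng.randrange(nblocks) * BLOCK
            os.pwrite(fd, os.urandom(BLOCK), offset)
        os.fsync(fd)
    finally:
        os.close(fd)
```

Because every overwrite lands inside an already full file, the filesystem has to find space for the new data by garbage collecting partially live buckets, which is what puts pressure on copygc and on the disk reservation code.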

There have also been some major (and rather cool) improvements to the core IO path. Whenever we're rewriting existing data (e.g. cache promotion, writeback, copygc) and data checksumming is enabled, we often have to generate a new checksum that covers just the currently live data (as checksums cover extents, not blocks, and an extent may have been partially overwritten). The recent change is that now, whenever we generate this new checksum, we verify it against the existing one: we checksum the existing data in several pieces that together cover all of it - one of which covers just the data we're keeping - then merge those checksums and verify that the result equals the original checksum.
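To make the merge-and-verify idea concrete, here is a minimal sketch using CRC32 from Python's zlib as a stand-in for whichever checksum type is configured. The function names are made up for the example, and the merge relies on the linearity of CRCs; it feeds zeros through the CRC for simplicity, where a real implementation would use a faster combine.

```python
import zlib


def crc32_merge(crc_a, crc_b, len_b):
    """Return crc32(A + B) given crc32(A), crc32(B) and len(B),
    without looking at the bytes of A or B again."""
    zeros = bytes(len_b)
    # Extending A with len(B) zero bytes, then cancelling the contribution
    # of the zeros alone, leaves room to XOR in B's own checksum.
    return zlib.crc32(zeros, crc_a) ^ zlib.crc32(zeros) ^ crc_b


def checksum_live_range(buf, old_crc, start, end):
    """Checksum buf[start:end] (the data being kept), verifying along the
    way that buf as a whole still matches old_crc."""
    parts = [buf[:start], buf[start:end], buf[end:]]
    crcs = [zlib.crc32(p) for p in parts]
    merged = crcs[0]
    for part, crc in zip(parts[1:], crcs[1:]):
        merged = crc32_merge(merged, crc, len(part))
    if merged != old_crc:
        raise IOError("checksum mismatch while rewriting extent")
    return crcs[1]
```

The checksum returned for the live range is computed in the same pass as the pieces that were verified against the old checksum, so there is no window in which the data could change between "verify old" and "compute new".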

This means that even if you don't have ECC memory, memory corruption can't lead to silent corruption (bitrot) of existing data - we can't save you if your data is corrupted before bcachefs generates a checksum for it, but once it's checksummed no matter how many times data is moved around or rewritten, if there's corruption anywhere we'll detect it.

Actually, the immediate motivation for this change wasn't primarily bitrot avoidance (though that is highly desirable) - it was to guard against data corruption bugs and make sure they're detected. As the core IO path has had to support more and more features - data checksumming, compression, encryption, replication, promotion - each of these has added its own special cases, and the IO path has gotten rather hairy. The idea with this change was that if every checksum we calculate has to be verified against the previous checksum, it should be very hard for a silent data corruption bug to slip through, since it's hard to see how we would generate a checksum for corrupt data that still verified against the existing checksum.

The IO path changes also helped a lot to make that code more sane, by making it more structured in a way that significantly reduced the hairiness and edge cases there. Previously, the core IO path was probably the sketchiest code in bcachefs (certainly it had become the biggest source of bugs and regressions) - I'm feeling much better about that code now.

Upcoming features:

Tiering is going away soon! It's going to be replaced by a new mechanism, with the goal of making the IO path much more flexible, and enabling per-inode policy.

Currently, assigning a disk to a tier causes multiple things to happen:

- In the read path, if the extent we're reading from does not have a replica in the fastest tier, we cache it in the fastest tier.

- In the write path, we prefer to allocate from the fastest tier (unless it's full).

- In the background, we look for dirty data in the faster tier and copy it to the slower tier, and then mark the faster copy cached.

Instead of having one setting for each disk that controls all of these, they'll be broken out into separate settings, and they won't be per disk.

The new settings will be:

- promote_target: If set, reads will write a cached copy to that target, if the extent doesn't already have a replica in that target.

- foreground_write_target: Foreground writes will prefer to allocate from this target.

- background_write_target: In the background, data will be moved to this target (leaving data in the original target, but marking it as cached).

What's a target? A target is either a single disk or a disk group - a disk group is a new mechanism I'm adding. A disk can be in at most one disk group.
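Here is a small sketch of how the three settings and disk groups might fit together. Only the three setting names come from the list above; the class and function names, the representation of a target as a set of disk names, and the fallback to any disk when the preferred target has no space are all assumptions made for illustration, not the actual bcachefs design.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

Target = FrozenSet[str]  # a single disk is modelled as a one-element set


def build_groups(groups):
    """groups: {group_name: [disk, ...]}. Enforces that a disk is in at most
    one disk group and returns {group_name: frozenset(disks)}."""
    owner = {}
    for name, disks in groups.items():
        for disk in disks:
            if disk in owner:
                raise ValueError(f"{disk} is already in group {owner[disk]}")
            owner[disk] = name
    return {name: frozenset(disks) for name, disks in groups.items()}


@dataclass(frozen=True)
class IoOpts:
    promote_target: Optional[Target] = None
    foreground_write_target: Optional[Target] = None
    background_write_target: Optional[Target] = None


def should_promote(opts, replica_disks):
    """Read path: cache a copy if a promote target is set and the extent has
    no replica in it yet."""
    t = opts.promote_target
    return t is not None and not (t & set(replica_disks))


def pick_write_disks(opts, disks_with_space):
    """Write path: prefer disks in the foreground write target; otherwise
    (assumed fallback) allocate from any disk with space."""
    t = opts.foreground_write_target
    preferred = sorted(d for d in disks_with_space if t and d in t)
    return preferred or sorted(disks_with_space)


def needs_background_move(opts, replica_disks):
    """Background: data not yet in the background write target gets moved
    there, with the original copy then marked as cached."""
    t = opts.background_write_target
    return t is not None and not (t & set(replica_disks))
```

With a group of SSDs as both promote and foreground write target and a group of hard drives as background write target, this reproduces roughly what assigning disks to two tiers does today, but each of the three behaviours can now be switched on or off independently.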

One thing these changes will make possible is using one of the disks in a filesystem as a writethrough cache - currently, there's no way to specify that a disk should only be used for cached data, not dirty data. You'll also be able to use a disk as a writearound cache, by specifying it as a promote target and nothing else.

The cooler thing about this set of changes though is that by having the actual policy settings be filesystem settings and not disk settings, we can have per inode settings that override the filesystem settings (and are inherited on creation, so you can effectively have settings for a directory tree).

This means that you could configure one file or directory to reside on one set of devices, and another directory to live on another set of devices - or have a directory (e.g. your big media files) that doesn't use the cache.

Additionally, while I'm adding per-inode IO path settings, I'll be adding settings for everything else that controls IO path policy - e.g. data checksum type, compression type, number of replicas. This means you could have compression disabled for most of your data, but enable it for just your (highly compressible) source code - or enable compression by default, but disable it for directories containing data you know is already compressed. Fun stuff.
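The override and inheritance rules could look something like the following sketch. The option names and values, the `Inode` class, and `effective_opts` are hypothetical and chosen only for illustration; the point is the lookup order (inode setting first, then filesystem setting) and that a new inode copies its parent's overrides at creation time.

```python
# Hypothetical filesystem-wide defaults; names and values are illustrative.
FS_OPTS = {
    "compression": "none",
    "data_checksum": "crc32c",
    "replicas": 1,
    "promote_target": "ssd",
}


class Inode:
    def __init__(self, parent=None, **overrides):
        # Inherited on creation: start from a copy of the parent's
        # overrides, then apply any given explicitly for this inode.
        self.opts = dict(parent.opts) if parent is not None else {}
        self.opts.update(overrides)


def effective_opts(inode, fs_opts):
    """Per-inode settings override the filesystem settings."""
    merged = dict(fs_opts)
    merged.update(inode.opts)
    return merged


# Compression off by default, on for the source tree.
src = Inode(compression="lz4")
main_c = Inode(parent=src)

# Big media files that should not be promoted to the cache.
media = Inode(promote_target=None)
```

Here `effective_opts(main_c, FS_OPTS)` picks up the compression setting from `src`, while everything else falls through to the filesystem defaults, and files created under `media` never get a promote target.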

Also coming down the pipeline is quota support - but that's been the cause of much wailing and gnashing of teeth, so for now the less said about quotas the better. Hoping to finish it after I finish all the disk groups stuff, though.