bcachefs

More expensive on disk format upgrades

Added 2024-11-29 22:26:43 +0000 UTC

6.11 was an expensive forced on disk format upgrade, for the disk accounting rewrite: while bcachefs is still marked as experimental, I'm making on disk format changes that would not be feasible if we needed compatibility code so that new versions could work on old filesystems without upgrading them.

(Note that - aside from the odd bug, like when 6.9 wasn't reading the downgrade table correctly - old kernels can still mount upgraded filesystems; they just have to downgrade, which is also expensive.)

We've got a couple more forced upgrades coming - and these should be the last, as once the experimental label comes off in six months or so I'll stop doing these.

But these are well worth it, so I want to give everyone a heads up:

Improvement to backpointers (they'll now include the bucket generation number)
This one makes it possible to check for missing backpointers without walking every extent and looking up its backpointer. Instead, we'll be able to sum up backpointers within a bucket, check it against the bucket sector counts, and only look for missing backpointers if the counts are off - and then we'll only be looking for missing backpointers in specific buckets.
We need the bucket generation number to avoid counting stale backpointers, since the backpointers btree uses the btree write buffer (updates are much faster, but reads are always potentially stale), and we don't want to make backpointers bigger to add that field since backpointers are one of the biggest btrees (tied with extents for first place, on my laptop). So fitting it in requires reclaiming an obsolete field, hence the incompatible upgrade.
Since the two backpointers passes (backpointers -> extents and extents -> backpointers) are by far the most expensive fsck passes, this will be well worth the pain. The backpointers -> extents pass is also going away except in debug mode (runtime self healing makes it unnecessary), so this is a major scalability improvement.
Petabyte sized filesystems might be practical soon.
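To make the counting check described above concrete, here is a toy sketch in Python. It is not bcachefs code: the `Backpointer` and `Bucket` types, their field names, and the helper function are all hypothetical, and it assumes a bucket is described by just a generation number and a sector count. The idea it illustrates is the one from the text: sum the backpointers per bucket, skip any whose generation does not match the bucket's current generation, and flag only the buckets where the totals disagree.

```python
from collections import defaultdict
from typing import NamedTuple


class Backpointer(NamedTuple):   # hypothetical, simplified layout
    bucket: int       # bucket the pointed-to data lives in
    bucket_gen: int   # bucket generation when the backpointer was written
    sectors: int      # sectors of data this backpointer covers


class Bucket(NamedTuple):        # hypothetical, simplified layout
    gen: int            # current generation of the bucket
    dirty_sectors: int  # sector count recorded for the bucket


def buckets_with_missing_backpointers(buckets, backpointers):
    """Return the buckets whose live backpointers do not add up to the
    bucket's recorded sector count."""
    summed = defaultdict(int)
    for bp in backpointers:
        b = buckets.get(bp.bucket)
        # A generation mismatch means the backpointer is stale: skip it.
        if b is None or bp.bucket_gen != b.gen:
            continue
        summed[bp.bucket] += bp.sectors
    return sorted(i for i, b in buckets.items() if summed[i] != b.dirty_sectors)


buckets = {
    0: Bucket(gen=3, dirty_sectors=128),
    1: Bucket(gen=7, dirty_sectors=64),
}
backpointers = [
    Backpointer(bucket=0, bucket_gen=3, sectors=128),
    Backpointer(bucket=0, bucket_gen=2, sectors=64),  # stale, must not count
    Backpointer(bucket=1, bucket_gen=7, sectors=32),  # 32 sectors unaccounted
]

suspect = buckets_with_missing_backpointers(buckets, backpointers)
```

In this example only bucket 1 ends up in `suspect`. Without the generation check, the stale backpointer would push bucket 0's total to 192 sectors and flag it as well, which is why the generation number has to be stored in the backpointer.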
Sort order change for disk accounting keys
The disk accounting rewrite switched to storing accounting in btree keys, where the 160 bit multi word integer of a btree key is used as a tagged union: this means it's now much easier to add new accounting counters and we can have as many as we want. That's how we got the new accounting for compression type and ratio, and there's also per snapshot ID accounting and per-file fragmentation accounting that isn't exposed yet.
But I made an error when deciding how to translate from the disk accounting tagged union to a btree key (bpos): I made it a simple memcpy on little endian (byte swabbing on big endian) - that seemed the natural choice, given that most of us are running little endian machines these days.
Except - oops - that puts the type tag in the low bits of bpos. This is a situation where big endian is actually more natural than little endian (note that string sort order, if you pad the strings out to the same length, is the same as integer comparison if you treat them as big endian integers).
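The effect is easy to reproduce with a toy example. The sketch below is not the real layout: it assumes a 3-byte record (one type tag byte followed by a 2-byte payload) instead of the 160 bit bpos, and the helper name is made up. It packs each record into an integer both ways and looks at which type tags end up next to each other in sort order.

```python
def tags_in_sort_order(records, byteorder):
    """Pack (tag, payload) records into integers using the given byte order,
    sort by that integer, and return the type tags in the resulting order."""
    packed = [bytes([tag]) + payload for tag, payload in records]
    ordered = sorted(packed, key=lambda raw: int.from_bytes(raw, byteorder))
    return [raw[0] for raw in ordered]


# Toy records: the type tag is the first byte of the in-memory layout.
records = [
    (1, b"\x00\x01"),
    (2, b"\x00\x02"),
    (1, b"\x00\x03"),
    (2, b"\x00\x04"),
]

# Little endian: the first byte (the tag) lands in the low bits, so the
# payload dominates the sort and the two types are interleaved.
little = tags_in_sort_order(records, "little")   # [1, 2, 1, 2]

# Big endian: the tag lands in the high bits, so keys group by type.
big = tags_in_sort_order(records, "big")         # [1, 1, 2, 2]

# Big endian integer order matches plain byte-string (lexicographic) order.
raw = [bytes([tag]) + payload for tag, payload in records]
same_as_string_sort = sorted(raw) == sorted(
    raw, key=lambda r: int.from_bytes(r, "big")
)
```

With the tag in the high bits, all keys of one type form a single contiguous range, so reading one counter type means scanning one range rather than the whole tree.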
I noticed that early on, and didn't think it would create any real issues because in general we're not iterating over accounting, we're looking up specific keys. But there's one exception: at startup, we have to read (some) disk accounting keys into memory - some accounting counter types are mirrored in memory in an eytzinger tree for fast access.
And since we now have per-inode counters (which we'll want for implementing defragmentation later on), the accounting btree is already drastically bigger than it was originally expected to be - so this needs to be fixed or mounts are going to be getting very slow (I would not be surprised if people are already seeing this and just haven't mentioned it).
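For readers who haven't met the eytzinger layout mentioned above: it stores a sorted set in an array in breadth-first order of an implicit binary search tree, so a lookup walks the array by index arithmetic instead of chasing pointers. The following is a generic textbook sketch using 1-based indexing, with function names chosen for illustration; it is not taken from the bcachefs sources.

```python
def eytzinger_build(sorted_keys):
    """Lay out sorted keys in breadth-first (eytzinger) order.
    Slot 0 is unused; the children of slot i are slots 2*i and 2*i + 1."""
    n = len(sorted_keys)
    tree = [None] * (n + 1)
    it = iter(sorted_keys)

    def fill(i):
        # An in-order walk of the implicit tree visits slots in sorted order.
        if i <= n:
            fill(2 * i)
            tree[i] = next(it)
            fill(2 * i + 1)

    fill(1)
    return tree


def eytzinger_find(tree, key):
    """Return the slot holding key, or None if it is not present."""
    i = 1
    while i < len(tree):
        if tree[i] == key:
            return i
        # Go left (2*i) if key is smaller, right (2*i + 1) if it is larger.
        i = 2 * i + (key > tree[i])
    return None


tree = eytzinger_build([10, 20, 30, 40, 50, 60, 70])
# tree is now [None, 40, 20, 60, 10, 30, 50, 70]
slot = eytzinger_find(tree, 50)      # 6
missing = eytzinger_find(tree, 35)   # None
```

Building such a structure at mount time requires reading every key that goes into it, which is where the sort order of the accounting btree starts to matter.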

So expect these changes to be in my tree soon (next week or two). I'll ask people to start testing them when they're ready, and they should be landing in 6.14.