bcachefs

More on fully persistent allocation information

Added 2019-02-18 17:55:01 +0000 UTC

So, to recap: bcachefs now persists allocation information on clean shutdown, so mounting after a clean shutdown doesn't require walking any metadata. However, we're not yet keeping allocation information updated as it's modified - that's my current project.

There's two main components to this. Firstly, there's the filesystem wide sector counts, which are now broken out by replica sets (i.e. "number of sectors of data replicated across drives x, y and z). Since typically every write operation will be updating these counters, and I'm allergic to shared cachelines, these sector counts are stored as percpu counters - which does make things interesting for writing them out. Additionally, when we write them out they're written as part of a particular journal entry, and we need to be writing out what the value was at that particular journal sequence number - due to pipelining we do in the journalling code, there will already be operations newer than that journal sequence number updating those counts by the time all the operations up to and including that journal sequence number are finished.

That part is solved now, though. The solution is to maintain three sets of sector counts: a base set, which isn't stored as percpu counters and isn't updated, and two sets of percpu counters - because we can have up to two different journal entries with operations in flight at a time. At any given point in time, summing all three sets of sector counts together gives us the current true sector counts.

Any operation that updates sector counts will be doing so as part of an operation that is getting journalled in a particular journal entry - thus, when we go to update the sector counts, we can pick the one of the two different percpu counters that corresponds to the journal entry it's being done for. Then, just prior to that journal entry being written, all operations modifying the corresponding set of sector counts will be completed - so we can add those sector counts to the base sector counts and zero them out, and read out the sector counts from the base set and add them to the journal entry.

So with the way we're using the three sets of sector counts - the base set is "counters at time t", where t is some journal sequence number, and then the two sets of percpu counters are the deltas for time t + 1 and t + 2. At any given time we can add the deltas for t + 1 to the base counters, and the base is now t + 1, the other set of deltas is for t + 2, and now we have a zeroed out set of counters for t + 3.

So, that's the filesystem wide sector counts. The other thing that has to be persisted are the per bucket sector counts - these are stored as keys in the alloc btree. That means that every btree update - specifically, every update to the extents btree - that happens to update bucket sector counts also needs to be updating the alloc btree.

We've got the infrastructure for this - I've already described in a previous post the btree transaction infrastructure. So there aren't any particularly interesting technical challenges here, it's just a whole lot of plumbing to write. When we've got an extent btree update to do, just prior to doing the update we need to walk the new extent and all extents being overwritten and add the updates to the alloc btree to the btree transaction - possibly truncating the new extent, because a new extent can be overwriting arbitrarily many existing extents and all the alloc btree updates might not fit in the btree transaction or journal entry.

There is one other major challenge though - journal replay.

The way journal replay currently works is it just iterates over every key in the journal and uses the normal btree update interface to redo all those updates. This is the way it's always worked, ever since bcache first gained a journal long long ago, and it's dead simple.

The trouble is, that means journal replay may require allocating new btree nodes, which means _prior_ to journal replay we need to have consistent allocation information - at least consistent enough to start the allocator.

So the way this has always worked is that we run our mark and sweep gc prior to journal replay, and we run it both on the btree and on all the keys in the journal to be replayed. This will double count some things, since we're counting things in the btree and in the journal that may get overwritten by journal replay, but that's no big deal - when journal replay happens and does the overwrites that double counting will get fixed.

But, if prior to journal replay we already have all our allocation information (because it was persistent) - well, it's consistent with the state of the btree _after_ journal replay has finished, it's not consistent with how we calculate it prior to journal replay.

Which means that we can't actually verify it until after journal replay, but if journal replay requires new allocations which change the state of the allocation information... and this is where my head starts to hurt.

My current thinking is that journal replay is going to have to get quite a bit more complicated - we're going to have to sort the keys to be replayed, handling overwrites, and then walk them in parallel with walking the btree in initial gc. Except of course the point of all this is to get rid of that initial gc - so it may be that the only point of this exercise is just so that we actually have some way of testing that allocation information is getting persisted correctly. Which is a pretty important reason, still...