bcachefs

Status update - new on disk format change

Added 2021-07-16 16:46:40 +0000 UTC

There's a new on disk format change, and it's a required upgrade - you'll want to upgrade your kernel and tools at the same time to stay in sync.

The new update closes a hole in our ability to verify metadata that goes back to bcache: since btree nodes are log structured, historically we've had no way to detect lost btree writes (unless they were the first write to a given node).

Naturally, this eventually bit us (after 10 years of this design being in use!). The one bug I'm certain about was that on unclean shutdown in a multi device filesystem with metadata replication, a write could have made it to one replica of a btree node but not another. Reading from and resuming writing to the replica where the btree write didn't make it would be fine: we didn't need or want that write since it was newer than the most recent journal flush. But if we read from the replica where the write did make it and resumed writing at that point, the other replica would now have a gap in the btree node entries on disk - and later reading from that replica wouldn't find btree node entries after that gap. Oops.

It was also a hole in our authenticated encryption design: to be secure against an adversarial storage device (e.g. in the cloud), once we've read the root of our chain of trust (superblock, or most recent journal entry), we should always be able to detect corrupted metadata, including lost or extra metadata.

Fortunately - a little while back we gained journalling of updates to interior btree nodes, which was the main thing we needed to close this hole without regressing on performance. Now, after every btree node write we update the pointer to that node with the number of sectors currently written to that node - meaning we always know definitively what we expect to find when we're reading in a btree node. This completely eliminates the previously mentioned bug with replicated btree writes (I'd already fixed it, but this is a much better fix).
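
To make the idea concrete, here is a rough Python sketch of the read-side check. It is not the actual bcachefs code: NodePtr, NodeEntry, LostWriteError and read_node are hypothetical names invented for illustration, and checksums, replicas and the real on disk layout are left out. The only thing it tries to show is that once the pointer records how many sectors have been written, a reader can both ignore anything past that point and notice when something before it is missing.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeEntry:
    sectors: int  # size of this log entry on disk, in sectors


@dataclass
class NodePtr:
    # Hypothetical field: sectors written to the node so far,
    # updated in the pointer after every btree node write.
    sectors_written: int


class LostWriteError(Exception):
    """The node holds fewer (or differently sized) entries than the pointer says."""


def read_node(ptr: NodePtr, on_disk: List[NodeEntry]) -> List[NodeEntry]:
    """Return the entries covered by ptr, or raise if the node doesn't match it."""
    entries = []
    offset = 0
    for entry in on_disk:
        if offset >= ptr.sectors_written:
            # Anything past the recorded extent was never committed
            # to the pointer, so we don't even look at it.
            break
        entries.append(entry)
        offset += entry.sectors
    if offset != ptr.sectors_written:
        # We know exactly what we expected to find, and it isn't there.
        raise LostWriteError(
            f"expected {ptr.sectors_written} sectors, found {offset}"
        )
    return entries
```

In this sketch, a replica that received an extra, uncommitted write simply has that entry ignored, while a replica that is missing a committed write raises LostWriteError instead of silently returning a truncated node.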

It also means we'll be able to delete most of the journal sequence number blacklist machinery: for background, to maintain sequential consistency after a crash we have to ignore btree writes that are newer than the newest journal commit after a crash, and we do this by recording in btree node entries the newest journal sequence number they have updates from, and after unclean shutdown blacklisting the next N journal sequence numbers and ignoring btree node entries that match those sequence numbers. It was pretty cool stuff, but it's unneeded now (since we'll never even look at those btree node entries) and the new way of doing things is an overall simplification.
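
For readers who haven't seen the blacklist mechanism, the following is a minimal Python sketch of the idea as described above. It is not the real implementation: SeqEntry, blacklist_after_unclean_shutdown and usable_entries are made-up names, and the sketch assumes the blacklist is a contiguous range of N sequence numbers that the journal never reuses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SeqEntry:
    # Newest journal sequence number this btree node entry has updates from.
    journal_seq: int


def blacklist_after_unclean_shutdown(newest_commit_seq: int, n: int) -> range:
    """Blacklist the next n journal sequence numbers after the newest commit.

    Assumption for this sketch: the journal never hands these numbers out
    again, so only pre-crash btree writes can carry them.
    """
    return range(newest_commit_seq + 1, newest_commit_seq + 1 + n)


def usable_entries(entries: List[SeqEntry], blacklist: range) -> List[SeqEntry]:
    """Drop btree node entries that are newer than the newest journal commit."""
    return [e for e in entries if e.journal_seq not in blacklist]
```

With the sectors-written count stored in the pointer, entries like these are never read in the first place, which is why this filtering step becomes unnecessary.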

Also, some more btree node locking improvements have landed. One of the explicit goals of bcachefs is being soft realtime - we should never be holding locks for an unbounded amount of time, or while we're doing IO. As prep work for the btree-node-ptr-update patch, I finished fixing the last remaining places we could potentially block on btree IO with btree locks held, and extended lockdep to add assertions so that we know we're not doing this in the future. We're not quite realtime yet - I still see some tail latencies that shouldn't be there, and on my todo list is extending our current latency measuring infrastructure to measure latency in a lot more places so I can hunt down where it's coming from - but on overall tail latency bcachefs is, to my knowledge, dramatically better than any other filesystem.

And, snapshots are getting close to the final polishing stages - snapshot deletion is working now! Looking forward to merging it soon.