bcachefs

Progress towards faster mount times - new transaction infrastructure

Added 2018-07-06 23:21:14 +0000 UTC

I've talked a bit before about the new transaction infrastructure I've been working on, but to recap:

bcachefs has, for quite some time, had the ability to use multiple btree iterators simultaneously, and to do multiple btree updates atomically - the main btree update function takes a list of (iterator, new key) pairs and does all the updates atomically by grabbing write locks on all the relevant btree nodes simultaneously (and in the correct order) and putting all the updates in one journal reservation.

This was originally created for rename: if you skim through it, notice how it initializes a couple btree iterators, links them together, and then sets up a single call to bch2_btree_insert_at() at the bottom - that call is what does the actual rename, fully atomically. The rename code doesn't have to concern itself with locking or logging or transactions whatsoever.

https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/dirent.c#n203

Besides rename, up to that point it had been possible to implement all the normal posix filesystem functionality without any sort of fancy transactional infrastructure, primarily by controlling the order of all the updates in a more complicated operation. For example, creating a new file requires creating a directory entry and creating a new inode (and for directory creation, updating i_nlinks on the parent). If make sure we always create the inode first, then the directory entry, we'll never end up in a situation after a crash where we have directory entries that point to garbage/missing inodes - we'll just leak some references to inodes, and that can be handled with a simple garbage collection pass.

This is sort of like the softupdates approach, from BSD - except much simpler because it's only order of btree updates we care about, not physical writes to disk - the btree and the journal do all the hard work for us.

The _only_ downside of this approach is that it means we have to walk all the inodes and dirents at mount time, which means mount times are always going to be an issue with large enough filesystems. So, alas, this approach wasn't ever going to last forever.

Well, the good news is most of the operations we care about making fully atomic (create, link, rename, etc.) only involve a few updates - the functionality we already have for multiple atomic updates will work for all these operations.

The trouble is, these high level filesystem operations require calling into a bunch of different code - the inode code to create a new inode, the dirent code, the xattr code (if the new file will have acls) - the existing interface would require us to allocate somewhere all the state for each chunk of code that has to live for the duration of the whole atomic operation (primarily every btree iterator, and all the new keys), and keep track of for each chunk of code what has to be inserted at the end.

Basically, the linked iterator approach isn't very composable - trying to use it as is for all the high level filesystem operations would turn things into a hairy mess of bookkeeping. Not fun.

So, the new transaction infrastructure is really just a wrapper around the existing linked iterators/atomic update interface that handles all those issues. Specifically, given a transaction context, it lets you:

- allocate a new btree iterator, either initializing it as normal or by copying an existing iterator

- allocate memory which will live until the transaction exists and be freed then - e.g. for a new key to be inserted

- add an update to the list of updates to be done when the transaction commits

There's a bit of magic involved in that bcachefs's btree code heavily relies on restarts - e.g. if traversing an iterator would require taking a lock that would deadlock (and trylock fails), we'll drop all the locks, retraverse all the iterators in the correct order, and then tell the caller other iterators were invalidated and the whole operation has to be restarted from the top; because the btree iterator code retraversed all the iterators in the correct order, when the caller restarts it won't hit that potential deadlock.

What this means is that on transaction restart, when we start redoing the whole operation (e.g. creating a file), call back into the inode code and the inode code goes to allocate an iterator again - we have to give it the same iterator it had before, not allocate and initialize a brand new iterator.

But, with that bit handled, existing code that uses linked iterators can be converted to the new btree_trans infrastructure and be completely functionally equivalent - only now different chunks of code can sanely be composed together into larger atomic operations.

The end result looks pretty neat, in my opinion. Here's the new __bch2_create() - the code that creates a new file or directory:

https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/fs.c?h=bcachefs-locking&id=31110f60dca0d6f753551190f37915eca8b4b701#n380

The transactional part is from about line 420 to line 450 - everything from bch2_trans_begin() to bch2_trans_commit(). In that block of code, we're:

- creating a new inode

- creating acls, if enabled (which are stored as keys in the xattr btree)

- creating a new dirent

- updating the inode for the parent directory (updating mtime and ctime, and incrementing i_nlink if we were creating a directory).

This did require completely restructuring the create path, though: the old create path modified VFS state (e.g. creating the new inode in the inode cache) before the rest of the operation was complete - same as every other Linux filesystem I've looked at. But we can't be modifying global state elsewhere if we might be bailing out and restarting from the beginning.

There is a major upside of this approach though - other filesystems typically solve this problem with rather complex/heavyweight logging: for every operation that has to be atomic, we do the operation by writing log entries as we go that say "I'm creating file foo in directory bar", then "I'm creating filename foo in directory bar, got filename 4099" - i.e. by logging enough information so that journal replay can finish the operation.

That means you have to define all these log entries for all the various operations, and write replay code for each operation...

I'm pretty sure bcachefs will have to implement that style of logging eventually (e.g. to make fcollapse atomic) - but not yet! With this approach, the whole create either happens or doesn't happen - the code you see in __bch2_create() is it, there's nothing else to be implemented.

So, as of right now I'm working through the various operations, converting them to the new transaction interface and making them atomic. Create is done, link (i.e. for creating a hardlink) is done, rename in exchange mode is done. Symlink, unlink, and renames that overwrite aren't done yet, those will all take more work. But, it's starting to feel like progress...