Status update - persistent alloc info
Added 2019-03-04 20:30:31 +0000 UTC
So, first some background:
Fully persistent allocation info is going to require updating the alloc btree every time we update the extents btree - one key in the alloc btree for every pointer in an extent being inserted or overwritten.
That introduces a bit of a difficulty, in that an extent can overwrite an unbounded number of existing extents (though we can trim the extent being inserted rather than inserting it all at once, and let it be merged again later), and every update done in the transaction requires another linked btree iterator. Btree iterators are also a little bit big (176 bytes, currently). On top of that, this is probably going to become a source of lock contention: a lot of keys in the alloc btree fit in the same leaf node, and the way we do allocations means write operations will tend to hit the same leaf nodes.
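To make the trimming idea concrete, here is a minimal sketch in Python (not the actual bcachefs code). It assumes extents are half-open `(start, end)` ranges kept sorted and non-overlapping. The names `trim_insert`, `insert_in_pieces` and `max_overlaps` are made up for illustration. The point is only that capping the number of overwritten extents per insert also caps how many alloc btree updates (and iterators) one transaction needs.

```python
def trim_insert(existing, start, end, max_overlaps):
    """Shrink [start, end) so it overlaps at most max_overlaps existing extents.

    `existing` is a sorted list of non-overlapping half-open (start, end)
    ranges. This is a hypothetical model, not the bcachefs data structure.
    """
    if max_overlaps < 1:
        raise ValueError("max_overlaps must be at least 1")
    overlaps = 0
    for e_start, e_end in existing:
        if e_end <= start or e_start >= end:
            continue  # no overlap with the extent being inserted
        if overlaps == max_overlaps:
            # Budget used up: stop just before this extent.
            return start, e_start
        overlaps += 1
    return start, end


def insert_in_pieces(existing, start, end, max_overlaps):
    """Yield the bounded pieces that together cover [start, end)."""
    while start < end:
        piece = trim_insert(existing, start, end, max_overlaps)
        yield piece
        start = piece[1]  # continue with whatever was trimmed off


existing = [(0, 10), (10, 20), (20, 30), (30, 40)]
for piece in insert_in_pieces(existing, 5, 35, max_overlaps=2):
    print(piece)
```

Each piece would go in its own transaction, and adjacent pieces could be merged back together afterwards.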
But, I wrote some code a while back that will be useful here. We run into the same lock contention issues on the inodes btree with multithreaded write workloads, because inodes in bcachefs are tiny and we can fit a lot of them into a single leaf node, and workloads where every write also has to update the inode are not uncommon.
So, my solution was basically a write cache for the btree:
https://evilpiepirate.org/git/bcachefs.git/commit/?id=d2677b6f47f12504bb81b35a0052e2663c41fab1
The idea is that when we do the update, we only write it to the journal, not the btree - the btree only gets updated later, when triggered by journal reclaim. Simple, not a lot of code, and it's a nice performance boost - with this, on the worst-case multithreaded write workloads we're now about even with XFS.
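As a rough illustration of the write-cache idea, here is a toy model in Python. It is a sketch under assumptions, not the code in the commit above: the class name `DeferredUpdateCache`, its fields, and the dict standing in for the btree are all hypothetical. It leaves out locking and how the flush itself is journalled.

```python
class DeferredUpdateCache:
    """Toy model: updates hit the journal first, the btree only on reclaim."""

    def __init__(self):
        self.btree = {}    # stands in for the real btree
        self.journal = []  # append-only list of (seq, key, value)
        self.pending = {}  # key -> (seq, value): journalled, not yet in btree
        self.seq = 0

    def update(self, key, value):
        # Fast path: only the journal and the in-memory cache are touched,
        # so no btree leaf node needs to be locked.
        self.seq += 1
        self.journal.append((self.seq, key, value))
        self.pending[key] = (self.seq, value)

    def lookup(self, key):
        # Reads have to check pending updates before the btree.
        if key in self.pending:
            return self.pending[key][1]
        return self.btree.get(key)

    def reclaim(self, up_to_seq):
        # Journal reclaim: flush pending keys whose newest entry is at or
        # before up_to_seq, then drop the journal entries up to that point.
        for key, (seq, value) in list(self.pending.items()):
            if seq <= up_to_seq:
                self.btree[key] = value
                del self.pending[key]
        self.journal = [e for e in self.journal if e[0] > up_to_seq]


cache = DeferredUpdateCache()
cache.update("inode:42", {"size": 4096})
cache.update("inode:42", {"size": 8192})
print(cache.lookup("inode:42"))  # answered from the pending cache
cache.reclaim(up_to_seq=2)
print(cache.btree)               # the btree sees only the latest value
```

Repeated updates to the same key collapse into a single btree update at reclaim time, which is where the contention win comes from in this model.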
Except there was one catch, which is why I hadn't pushed out the patch to make use of this functionality until last night. The btree update we do when triggered by journal reclaim just uses the normal btree update path, which means it rejournals the key (this is important for a number of reasons). So we're now in a situation where, in order to free up space in the journal, we need there to still be room in the journal - else we deadlock.
This means we needed a whole new mechanism to reserve space in the journal ahead of time - without getting a reservation on a particular journal entry. And there were other new potential deadlocks, due to new dependencies between journal reclaim, the btree code and the allocator, and within the journalling code itself. The TL;DR is that retrofitting this kind of reservation mechanism onto the journalling code meant making the free space calculations considerably more sophisticated and rigorous, as well as various other refactoring of journal reclaim.
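The shape of the reservation idea can be sketched with a toy model in Python. This is an assumption-laden illustration, not how the bcachefs journal is implemented: the `Journal` class, its `prereserved` counter and the method names are invented here. Each deferred update sets aside the space its later flush will need to rejournal the key, and the free space calculation subtracts those outstanding reservations, so a flush can never find the journal full.

```python
class JournalFull(Exception):
    pass


class Journal:
    """Toy journal that pre-reserves space for future reclaim flushes."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0         # space held by entries not yet reclaimed
        self.prereserved = 0  # space promised to flushes that have not run

    def free(self):
        # Outstanding pre-reservations count against free space.
        return self.capacity - self.used - self.prereserved

    def deferred_update(self, size):
        # Needs room for the entry now AND for the rejournal at flush time.
        if 2 * size > self.free():
            raise JournalFull()
        self.used += size
        self.prereserved += size

    def flush(self, size):
        # Run by reclaim: rejournal the key out of the pre-reserved space...
        assert self.prereserved >= size
        self.prereserved -= size
        self.used += size
        assert self.used + self.prereserved <= self.capacity
        # ...after which the original entry can be dropped.
        self.used -= size

    def node_written(self, size):
        # Once the btree node is on disk, the rejournalled entry can go too.
        self.used -= size


j = Journal(capacity=8)
j.deferred_update(2)
j.deferred_update(2)
print(j.free())  # nothing left for new updates, but flushes are covered
j.flush(2)
j.node_written(2)
print(j.free())  # reclaim made progress without needing any free space
```

In this model the invariant `used + prereserved <= capacity` is what rules out the deadlock: new updates are refused early, rather than reclaim getting stuck later.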
But! That's all finally done and passing all the torture tests, and was pushed out last night, as well as the patch to make use of deferred btree updates for inode updates.
Next step is probably to integrate deferred btree updates better with the rest of the btree transaction/iterator code, and start making use of them for updating the alloc btree.