bcachefs

Upcoming changes

Added 2020-11-20 20:59:13 +0000 UTC

Recently, Dave Chinner's been helping out with some performance testing of bcachefs on his big dual socket machine, which has been great for finding scalability issues. He's also been able to provide really useful comparisons and microbenchmarks where bcachefs is still behind xfs, and in response I've been working on a bunch of improvements - in particular to the journalling code.

- When allocating new inodes, we now use the index of the core we're running on for the high bits of the new inode number. The effect is that new inodes created on different cores end up in different btree leaf nodes, and not just for the inodes btree but for all the other btrees that are indexed by inode number (extents, dirents, xattrs). This alone was a drastic improvement in scalability on create-heavy workloads (e.g. rsync).

- We've now got a shrinker for the btree key cache. The btree key cache is a writeback cache for the btree in which objects are indexed by hash, with one object/lock per key; it's for the alloc and inodes btrees where we have lots of repeated updates of the same objects, and it helps both with lock contention and with avoiding more expensive btree updates (until keys are flushed from the key cache by journal reclaim).

Previously, we weren't able to free bkey_cached objects until the filesystem was unmounted - but Paul McKenney wrote us some new SRCU code, and now we have a shrinker that can free them. Thanks, Paul! I've been doing quite a bit of work on tuning the shrinker and journal reclaim, and now we're able to do a huge multithreaded rm -rf from one of Dave's tests without OOMing.
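As a rough illustration of the per-core inode numbering described in the first item above, here is a minimal sketch. The bit split (8 bits for the core index), the macro names and the helpers `make_inum` and `inum_cpu` are all assumptions made for this example, not the layout or the names bcachefs actually uses.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: the top CPU_BITS bits hold the core index,
 * the remaining low bits hold a per-core counter. */
#define CPU_BITS   8u
#define LOCAL_BITS (64u - CPU_BITS)
#define LOCAL_MASK ((UINT64_C(1) << LOCAL_BITS) - 1)

/* Build an inode number from a core index and a per-core counter. */
static uint64_t make_inum(unsigned cpu, uint64_t local)
{
	assert(cpu < (1u << CPU_BITS));
	assert(local <= LOCAL_MASK);
	return ((uint64_t)cpu << LOCAL_BITS) | local;
}

/* Recover the core index that allocated a given inode number. */
static unsigned inum_cpu(uint64_t inum)
{
	return (unsigned)(inum >> LOCAL_BITS);
}
```

Because the core index sits in the high bits, inode numbers handed out on different cores sort far apart from each other, so in any btree keyed by inode number they would tend to land in different leaf nodes.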

Another issue Dave found was that on some workloads we were getting bottlenecked on the journal, due to having to issue a journal write that's a fua + a cache flush every time we fill up a journal entry (generally 512k - 2M, depending on the bucket size the device was formatted with). Flush/fua writes can have pretty high latency, depending on your device, so the journal would end up waiting on the write for the previous entry to complete before it could start a new one.

To fix this I've been working on a patch that increases the number of pipelined journal entries from 2 to 4, and also another patch that removes the need to have every journal write be a flush+fua write - most journal writes can now be normal data writes, and it's only when an application issues an fsync or we hit the journal_write delay (defaults to 1 second) that we do a flush. This one is a feature bit change, since on recovery we have to know which journal entries were written as flush writes.
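The flush policy described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `struct journal_state`, its fields and both functions are hypothetical names invented for this example, time is modelled as a plain millisecond counter, and the recovery side (knowing which entries were flush writes) is not modelled at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state; names are illustrative, not bcachefs's. */
struct journal_state {
	uint64_t last_flush_ms;  /* time of the last flush write */
	uint64_t flush_delay_ms; /* maximum time allowed between flushes */
};

/* A write must be a flush write if an fsync is waiting on it,
 * or if the flush delay has elapsed since the last flush. */
static bool journal_write_needs_flush(const struct journal_state *j,
				      uint64_t now_ms, bool fsync_pending)
{
	if (fsync_pending)
		return true;
	return now_ms - j->last_flush_ms >= j->flush_delay_ms;
}

/* Decide how the next journal write is issued and record the result.
 * Returns true for a flush write, false for a normal data write. */
static bool journal_issue_write(struct journal_state *j,
				uint64_t now_ms, bool fsync_pending)
{
	bool flush = journal_write_needs_flush(j, now_ms, fsync_pending);

	if (flush)
		j->last_flush_ms = now_ms;
	return flush;
}
```

With a delay of 1000 ms, a burst of journal writes with no fsync pending would go out as normal writes until a second has passed since the last flush, at which point the next write becomes a flush write.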

After I finish all this I'll be working on some lock contention on the journal lock in the fsync path, which shows up on big dbench workloads. I've also got some vfs changes in the pipeline for the system inode hash table - it uses a global spinlock, and that's a major bottleneck when benchmarking bcachefs vs. xfs, which does its own thing instead. My patches replace the system inode hash table with per superblock rhashtables, which are the pretty nice automatically resizable hash tables the Linux kernel has.

So that's everything in the pipeline - look for these changes to be up soon.