bcachefs



Zoned device support

I've started working on support for zoned devices and laying out what needs to be done. This will get us native support for SMR (shingled magnetic recording) hard drives and (even more exciting!) fancy new ZNS SSDs, which strip away most of the FTL (flash translation layer): by folding that into the filesystem, we'll get better performance (especially latency), lower write amplification, and more capacity. Now that the allocator rewrite has landed we have support for much larger buckets - up to a terabyte - which is enough for SMR zones.

A zoned device is divided up into zones, and zones are generally append only (SMR hard drives also have some normal zones that support random writes). The idea is that bcachefs buckets map to zones: when we allocate a bucket, we write to it once in an append only fashion, then never write to it again until we discard and reuse the whole thing. Mostly.
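To make the append-only rule concrete, here is a minimal toy model of a single zone. It is only a sketch, not bcachefs code: the `Zone` class, its field names, and the sector-based units are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    """Toy model of one sequential-write zone (all sizes in sectors)."""
    start: int              # first sector of the zone on the device
    capacity: int           # writable sectors in the zone
    write_pointer: int = 0  # sectors written since the last reset

    def append(self, nr_sectors: int) -> int:
        """Write at the write pointer; return the device sector written to."""
        if nr_sectors <= 0:
            raise ValueError("nothing to write")
        if self.write_pointer + nr_sectors > self.capacity:
            raise ValueError("write would run past the end of the zone")
        sector = self.start + self.write_pointer
        self.write_pointer += nr_sectors
        return sector

    def reset(self) -> None:
        """Discard the whole zone so it can be written from the start again."""
        self.write_pointer = 0
```

A bucket that maps to such a zone can only grow at its write pointer, and the only way to reclaim space is to reset the whole thing, which is the same lifecycle a bucket already has.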

So in general, not much has to change. For ZNS SSDs that have no normal zones we need a different way of writing the superblock - for them we'll want to use the first two buckets as a ringbuffer. We've also got to issue the appropriate commands when we finish writing to a zone and when we reset a zone. Both are easy enough.
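One way the two-bucket superblock ringbuffer could work is sketched below. This is an assumption about the design, not the actual implementation: the `SuperblockRing` class, the per-copy sequence number, and the fixed number of slots per zone are all hypothetical.

```python
class SuperblockRing:
    """Toy model: two append-only zones holding versioned superblock copies."""

    def __init__(self, slots_per_zone: int):
        self.slots_per_zone = slots_per_zone
        self.zones = [[], []]   # each zone is a list of (seq, payload)
        self.cur = 0            # zone currently being appended to
        self.seq = 0            # sequence number of the newest copy

    def write(self, payload: bytes) -> None:
        if len(self.zones[self.cur]) == self.slots_per_zone:
            # Current zone is full: reset the *other* zone and move to it.
            # The full zone still holds the newest copy until the new
            # write lands, so there is always one valid superblock.
            self.cur ^= 1
            self.zones[self.cur] = []   # stands in for a zone reset
        self.seq += 1
        self.zones[self.cur].append((self.seq, payload))

    def read_latest(self) -> bytes:
        """Scan both zones; return the copy with the highest sequence number."""
        copies = [copy for zone in self.zones for copy in zone]
        if not copies:
            raise LookupError("no superblock written yet")
        return max(copies, key=lambda copy: copy[0])[1]
```

The point of alternating between two zones is that a reset never destroys the only valid copy: the zone being reset is always the older one.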

One interesting thing is that with ZNS SSDs, the zones don't necessarily all have the same size - we have to query each zone's capacity (and the device's address space is then sparse). So bucket_size is no longer constant for a given device, but that's not too bad of a change.
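A rough sketch of what a per-bucket size lookup might look like follows. It assumes zones start at a fixed stride in the device's address space while each zone's writable capacity can be smaller than that stride; the names `ZoneInfo`, `build_zone_table`, and `bucket_offset_to_sector` are invented for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ZoneInfo:
    start: int      # first sector of the zone in the device address space
    capacity: int   # writable sectors; may be less than the zone stride


def build_zone_table(zone_stride: int, capacities: List[int]) -> List[ZoneInfo]:
    """Build a per-bucket table from capacities queried from the device."""
    zones = []
    for i, cap in enumerate(capacities):
        if not 0 < cap <= zone_stride:
            raise ValueError("capacity must fit within the zone stride")
        zones.append(ZoneInfo(start=i * zone_stride, capacity=cap))
    return zones


def bucket_offset_to_sector(zones: List[ZoneInfo], bucket: int, offset: int) -> int:
    """Translate (bucket, offset) to a device sector, rejecting the gaps."""
    zone = zones[bucket]
    if not 0 <= offset < zone.capacity:
        raise ValueError("offset is outside this bucket's writable capacity")
    return zone.start + offset


def usable_sectors(zones: List[ZoneInfo]) -> int:
    """Total writable space: the sum of capacities, not stride * nr_zones."""
    return sum(z.capacity for z in zones)
```

For example, with a stride of 1024 sectors and capacities of 1000, 900, and 1024, bucket 1 starts at sector 1024 but only its first 900 sectors are writable, so sectors 1924 through 2047 are a hole in the address space.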

The really tricky part, it turns out, is that with ZNS SSDs not all zones can be in the active state at the same time - only a relatively small fixed number can be appended to at any given time, and once you finish writing to a zone it can't be appended to anymore until it's been reset/erased.
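The active zone limit can be pictured as a small state machine in the allocator. The sketch below is a toy model under assumed names (`ActiveZoneLimiter`, `open_zone`, `finish_zone`, `reset_zone`); it only tracks which zones are empty, active, or finished and does not talk to a real device.

```python
class ActiveZoneLimiter:
    """Toy model of a device that allows only max_active open zones."""

    def __init__(self, nr_zones: int, max_active: int):
        self.max_active = max_active
        self.free = list(range(nr_zones))   # empty zones, ready to open
        self.active = set()                 # zones that can be appended to
        self.finished = set()               # closed until they are reset

    def open_zone(self):
        """Return a newly active zone, or None if the limit is reached."""
        if len(self.active) >= self.max_active or not self.free:
            return None   # caller must finish a zone (or wait) first
        zone = self.free.pop(0)
        self.active.add(zone)
        return zone

    def finish_zone(self, zone: int) -> None:
        """Stop appending to a zone; this frees up one active slot."""
        self.active.remove(zone)
        self.finished.add(zone)

    def reset_zone(self, zone: int) -> None:
        """Erase a finished zone so it can be opened again later."""
        self.finished.remove(zone)
        self.free.append(zone)
```

With four zones and a limit of two, a third `open_zone()` returns `None` until one of the first two zones is finished, and a finished zone only becomes allocatable again after `reset_zone()`.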

For data writes this is fine: active zones correspond nicely to our internal concept of an open_bucket. For btree nodes, though, this does complicate things. Btree nodes are log structured, and any given btree node can be appended to until it's full on disk - then it's compacted and rewritten, and possibly split. Also, we generally put multiple btree nodes in a bucket (when btree node size is smaller than bucket size) - that doesn't work with zones that require strictly appending writes.

I think the direction this is going to take us is that btree nodes are no longer going to be contiguous on disk, and each individual write that appends to a btree node is going to be a separate allocation. Fortunately, we already update btree node parent pointers after every write - we update the count of sectors that have been written to a btree node after every write so that we can detect missing data when reading in a btree node, and possibly retry the read from another device. Doing btree node allocations for every btree node write though - that's going to take some careful thought.
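To illustrate the idea of a btree node made of separately allocated pieces, here is a speculative sketch. It is not the bcachefs on-disk format: `NodePointer`, `BumpAllocator`, `append_to_node`, and `check_read` are hypothetical names, and the allocator simply hands out sectors sequentially the way an append-only zone would.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class NodePointer:
    """Hypothetical parent-side pointer to a btree node stored in pieces."""
    extents: List[Tuple[int, int]] = field(default_factory=list)  # (sector, nr)

    @property
    def sectors_written(self) -> int:
        return sum(nr for _, nr in self.extents)


class BumpAllocator:
    """Stand-in allocator that hands out sectors sequentially."""

    def __init__(self, start: int = 0):
        self.next = start

    def alloc(self, nr_sectors: int) -> int:
        sector = self.next
        self.next += nr_sectors
        return sector


def append_to_node(ptr: NodePointer, allocator: BumpAllocator, nr_sectors: int) -> int:
    """Allocate space for one node write and record it in the parent pointer."""
    sector = allocator.alloc(nr_sectors)
    ptr.extents.append((sector, nr_sectors))
    return sector


def check_read(ptr: NodePointer, sectors_found: int) -> bool:
    """Detect missing data: the read must cover everything the pointer lists."""
    return sectors_found == ptr.sectors_written
```

Because other writes can land between two appends to the same node, the node's pieces end up scattered, and the parent pointer is what ties them back together and tells a reader how much data to expect.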

The code I've written so far is up for those interested to peruse:

https://evilpiepirate.org/git/bcachefs.git/log/?h=zones

