bcachefs

Erasure coding has been pushed

Added 2018-11-14 05:15:47 +0000 UTC

It's not production ready yet - stripe-level copygc isn't implemented, so disk fragmentation could leave your filesystem filled with partially empty stripes and stuck. Aside from that, it should be functional.

To use it, just enable the erasure_code option, either at mount time

mount -o erasure_code=true

or via sysfs

echo 1 > /sys/fs/bcachefs/*/options/erasure_code

or, just for a certain file or directory

setfattr -n bcachefs.erasure_code -v 1 /mnt/foo

Your data will then be written out in Reed-Solomon encoded stripes, as with RAID5/6. Like ZFS, bcachefs erasure coding avoids the write hole problem that arises when you update existing stripes. Unlike ZFS, we don't have to fragment writes to do it - ZFS avoids the issue by turning each write into its own stripe. In bcachefs, foreground writes are instead replicated, with one of the replicas going to buckets that are queued up to be turned into stripes in the background. Once we have an entire stripe's worth of new data, we write out the P/Q blocks, then go back and update the data pointers to include a pointer to the stripe entry, dropping the now unneeded extra replicas at the same time.
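To make the write path above concrete, here is a toy sketch in Python. It is not bcachefs code: the names (StripeStager, reconstruct, BLOCK_SIZE, DATA_BLOCKS) are made up for illustration, the stripe geometry is arbitrary, and it computes only a single XOR parity block (the P block of a RAID5-style stripe) rather than full Reed-Solomon P/Q. It only models the idea that writes are replicated until a whole stripe's worth of data exists, at which point parity is written and the extra replicas are dropped.

```python
from functools import reduce

BLOCK_SIZE = 4   # bytes per block; tiny so the example is easy to follow
DATA_BLOCKS = 3  # data blocks per stripe (hypothetical geometry)


def xor_blocks(blocks):
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


class StripeStager:
    """Toy model: every write is replicated, and one copy is queued for a stripe."""

    def __init__(self):
        self.replicas = []  # extra replicas, kept until the stripe is complete
        self.pending = []   # blocks queued up to become a stripe
        self.stripes = []   # finished stripes as (data blocks, P block)

    def write(self, block):
        assert len(block) == BLOCK_SIZE
        self.replicas.append(block)  # the ordinary replica
        self.pending.append(block)   # the copy destined for a stripe
        if len(self.pending) == DATA_BLOCKS:
            self._finish_stripe()

    def _finish_stripe(self):
        data = list(self.pending)
        parity = xor_blocks(data)    # stands in for writing the P block
        self.stripes.append((data, parity))
        self.pending.clear()
        self.replicas.clear()        # extra replicas are no longer needed


def reconstruct(data, parity, lost):
    """Rebuild the data block at index `lost` from the survivors and parity."""
    survivors = [b for i, b in enumerate(data) if i != lost]
    return xor_blocks(survivors + [parity])
```

Writing three blocks through a StripeStager leaves it with one finished stripe and no replicas, and reconstruct(data, parity, i) returns the original block i. Because existing stripes are never updated in place in this model, there is no window where data and parity disagree, which is the write hole the paragraph above refers to.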

Effectively, we're doing full data journalling, which lets us keep writing out our data in the ideal layout, preserving as much locality and keeping it as contiguous as we want. Currently, we're not quite writing out data in the ideal layout - due to how the allocator currently works, the buckets we allocate for the extra replicas won't get reused right away. If we can reuse them as soon as the stripe is finished and they're no longer needed, they ought to affect the final layout of the data almost not at all. Even better, if those buckets can be reused for new writes quickly and without any cache flush commands in between, those journalling writes will never actually hit the physical platters or flash media - they'll be overwritten in the disk's write cache before that happens, which would mean they'd have essentially zero cost in terms of write bandwidth.

So, please go forth and play with it and report back - just don't use it for any real data until stripe-level copygc is implemented (and there's been more testing, of course).

I may not get to stripe-level copygc for a while though - I'm eyeing up reflink next.