bcachefs
Erasure coding has been pushed

It's not production ready yet - stripe level copygc isn't implemented, so disk fragmentation could leave your filesystem filled with partially empty stripes and stuck. Aside from that, it should be functional.

To use it, just enable the erasure_code option, either at mount time

mount -o erasure_code=true

or via sysfs

echo 1 > /sys/fs/bcachefs/*/options/erasure_code

or, just for a certain file or directory

setfattr -n bcachefs.erasure_code -v 1 /mnt/foo

Your data will then be written out in Reed-Solomon encoded stripes, as with RAID5/6. Like ZFS, bcachefs erasure coding avoids the write hole problem that arises when you update existing stripes. Unlike ZFS, we don't have to fragment writes to do it - ZFS avoids the issue by turning each write into its own stripe. Instead, foreground writes are replicated, but one of the replicas goes to buckets that are queued up to be turned into stripes in the background. Once we have an entire stripe's worth of new data, we write out the P/Q blocks, then go back and update the data pointers to include a pointer to the stripe entry, and at the same time drop the now unneeded extra replicas.
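
To make that write path concrete, here's a toy sketch in Python. It is not the bcachefs code: `StripeBuilder`, `xor_blocks` and `reconstruct` are made-up names, and it only computes a single XOR parity block (the RAID5-style P block), whereas a real Reed-Solomon Q block needs Galois field arithmetic that isn't shown. It just models the sequence: replicate foreground writes, wait for a full stripe, write parity, drop the extra replicas.

```python
def xor_blocks(blocks):
    """XOR equal-length byte blocks together (a RAID5-style P block)."""
    length = len(blocks[0])
    assert all(len(b) == length for b in blocks)
    out = bytearray(length)
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


class StripeBuilder:
    """Hypothetical model: buffer replicated writes, then turn them into a stripe."""

    def __init__(self, data_blocks_per_stripe):
        self.width = data_blocks_per_stripe
        self.pending = []         # replicas queued up to become a stripe
        self.extra_replicas = []  # stand-in for the other replica of each write
        self.stripes = []         # finished (data_blocks, parity) pairs

    def write(self, block):
        # Foreground write: the data is replicated, nothing is striped yet.
        self.pending.append(block)
        self.extra_replicas.append(block)
        if len(self.pending) == self.width:
            self._finish_stripe()

    def _finish_stripe(self):
        # A full stripe of new data exists: write parity, then drop the
        # extra replicas, since the stripe now provides the redundancy.
        parity = xor_blocks(self.pending)
        self.stripes.append((list(self.pending), parity))
        self.pending.clear()
        self.extra_replicas.clear()


def reconstruct(data_blocks, parity, missing_index):
    """Rebuild one lost data block from the survivors plus the parity block."""
    survivors = [b for i, b in enumerate(data_blocks) if i != missing_index]
    return xor_blocks(survivors + [parity])
```

Because parity is only ever computed over a complete set of new blocks, no existing stripe is updated in place, which is where the write hole would otherwise come from.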

Effectively, we're doing full data journalling, which lets us keep writing out our data in the ideal layout, preserving as much locality and keeping things as contiguous as we want. Currently, we're not quite writing out data in the ideal layout - due to how the allocator currently works, the buckets we allocate for the extra replicas won't get reused right away. If we can reuse them as soon as the stripe is finished and they're no longer needed, they ought to affect the final layout of the data almost not at all. Even better, if we can have those buckets be reused for new writes quickly and without any cache flush commands in between, those journalling writes will never actually hit the physical platters or flash media - they'll be overwritten in the disk's write cache before they happen, which would mean they'd have essentially zero cost in terms of write bandwidth.
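
Here's a toy model of the write cache argument, again in Python and again only a sketch. `WriteBackCache` is a made-up class, and it assumes the drive only writes back on an explicit flush; real drives can write cached data back whenever they like, so treat the counts as a best case.

```python
class WriteBackCache:
    """Toy model of a disk's volatile write cache (not a real drive's behaviour)."""

    def __init__(self):
        self.dirty = {}        # block address -> latest data not yet on media
        self.media = {}        # what has actually been written to the media
        self.media_writes = 0  # number of blocks written to the media

    def write(self, addr, data):
        # A later write to the same address replaces the cached one.
        self.dirty[addr] = data

    def flush(self):
        # Cache flush command: everything dirty must reach the media.
        for addr, data in self.dirty.items():
            self.media[addr] = data
            self.media_writes += 1
        self.dirty.clear()


# Bucket reused before any flush: only the final contents hit the media.
fast = WriteBackCache()
fast.write(7, b"extra replica")
fast.write(7, b"new data")
fast.flush()

# Flush in between: the extra replica is written out and costs bandwidth.
slow = WriteBackCache()
slow.write(7, b"extra replica")
slow.flush()
slow.write(7, b"new data")
slow.flush()
```

In the first case `fast.media_writes` is 1, in the second `slow.media_writes` is 2. If a flush does land in between, the extra replica is on the media and does its job, so nothing is lost either way.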

So, please go forth and play with it and report back - just don't use it for any real data until stripe level copygc is implemented (and there's been more testing, of course).

I may not get to stripe level copygc for a while though - I'm eyeing up reflink next.

Comments

Interesting, as I indeed know that ZFS has serious problems in this area. I got lost, though, on this part: "if we can have those buckets be reused for new writes quickly and without any cache flush commands in between, those journalling writes will never actually hit the physical platters or flash media - they'll be overwritten in the disk's write cache before they happen". What is the point of writing a temporary replicated copy first, when it will not be persistent on disk, before the final striped copy is touched? Can you clarify?

Yeah, it should somewhat reduce the performance penalty of wide stripes - though, rebuild performance will be similar to conventional raid5/6. It already handles disks of different size - it'll do about what you describe. If you're just using replication across multiple disks of different size (more disks than nr_replicas), the allocator will just preferentially stripe across the disks with more free space so that all the disks fill up at about the same time.

Kent Overstreet

I imagine this approach will decrease the performance penalty of using wide stripes? Will the code handle disks of different sizes? E.g., if I have five 8TB drives and three 12TB drives, and would like two-disk redundancy, will it fall back to writing 3 mirrored copies to the 12TB drives once the 8TB drives are full?

Now you can update the overview page, which still says "Erasure encoding (Reed-Solomon, i.e. RAID5/6): Not yet started". :-)

Ahh that is a good optimisation.. thanks!

veritanuda

