Erasure coding has been pushed
Added 2018-11-14 05:15:47 +0000 UTC
It's not production ready yet - stripe level copygc isn't implemented, so disk fragmentation could lead to your filesystem filling up with partially empty stripes and getting stuck. Aside from that, it should be functional.
To use it, just enable the erasure_code option, either at mount time
mount -o erasure_code=true
or via sysfs
echo 1 > /sys/fs/bcachefs/*/options/erasure_code
or, just for a certain file or directory
setfattr -n bcachefs.erasure_code -v 1 /mnt/foo
Your data will then be written out in Reed-Solomon encoded stripes, as with RAID5/6. Like ZFS, bcachefs erasure coding avoids the write hole problem that arises when you update existing stripes. Unlike ZFS, we don't have to fragment writes to do it - ZFS avoids the issue by turning each write into its own stripe. Instead, foreground writes are replicated, but one of the replicas goes to buckets that are queued up to be turned into stripes in the background. Once we have an entire stripe's worth of new data, we write out the p/q blocks, then go back and update the data pointers to include a pointer to the stripe entry, dropping the now unneeded extra replicas at the same time.
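The write path described above can be sketched as a toy model. This is an illustration only, not bcachefs code: the class and function names, the block size, and the stripe width are all made up for the example, and it computes just a single XOR parity block (the "p" block) rather than real Reed-Solomon p/q.

```python
from functools import reduce

BLOCK_SIZE = 4   # bytes per block; tiny on purpose (hypothetical value)
DATA_BLOCKS = 3  # data blocks per stripe (hypothetical value)


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


class ToyStripeWriter:
    """Toy model: each write stays replicated until its stripe is complete."""

    def __init__(self):
        self.pending = []         # blocks sitting in buckets queued to become a stripe
        self.extra_replicas = {}  # block id -> temporary second copy
        self.stripes = []         # finished stripes as (data blocks, parity block)
        self.next_id = 0

    def write(self, block):
        assert len(block) == BLOCK_SIZE
        block_id = self.next_id
        self.next_id += 1
        # Foreground write: one copy goes towards the future stripe, the
        # other is an ordinary replica protecting it in the meantime.
        self.pending.append((block_id, block))
        self.extra_replicas[block_id] = block
        if len(self.pending) == DATA_BLOCKS:
            self._finish_stripe()
        return block_id

    def _finish_stripe(self):
        data = [block for _, block in self.pending]
        parity = xor_blocks(data)
        self.stripes.append((data, parity))
        # Parity now covers the data, so the extra replicas can be dropped.
        for block_id, _ in self.pending:
            del self.extra_replicas[block_id]
        self.pending = []


def rebuild(data, parity, lost_index):
    """Recover one lost data block from the surviving blocks plus parity."""
    survivors = [b for i, b in enumerate(data) if i != lost_index]
    return xor_blocks(survivors + [parity])


writer = ToyStripeWriter()
for payload in (b"aaaa", b"bbbb", b"cccc"):
    writer.write(payload)
data, parity = writer.stripes[0]
print(rebuild(data, parity, 1))  # prints b'bbbb'
```

Note that no existing stripe is ever updated in place: a stripe only comes into existence once all of its data blocks are already on disk, which is why this scheme sidesteps the write hole.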
Effectively, we're doing full data journalling, which lets us keep writing out our data in the ideal layout, preserving as much locality and keeping it as contiguous as we want. Currently, we're not quite writing out data in the ideal layout - due to how the allocator currently works, the buckets we allocate for the extra replicas won't get reused right away, whereas if we can reuse them as soon as the stripe is finished and they're no longer needed, we ought to be able to have them affect the final layout of the data almost not at all. Even better, if we can have those buckets be reused for new writes quickly and without any cache flush commands in between, those journalling writes will never actually hit the physical platters or flash media - they'll be overwritten in the disk's write cache before they happen, which would mean they'd have essentially zero cost in terms of write bandwidth.
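To put rough numbers on the bandwidth point, here is a back-of-the-envelope sketch. It is a simplified model under stated assumptions, not a measurement: the function name and parameters are hypothetical, and it ignores metadata, the journal, and partially filled stripes. It compares the case where the temporary replicas reach the media with the case where they are overwritten in the disk's write cache first.

```python
def bytes_written(data_bytes, data_blocks, parity_blocks,
                  extra_replicas=1, replicas_absorbed=False):
    """Estimate how much reaches the media for `data_bytes` of user data.

    Assumptions (not taken from bcachefs itself): parity overhead is
    parity_blocks / data_blocks of the data, and each temporary extra
    replica costs one full copy of the data unless the write cache
    absorbs it before it is flushed.
    """
    parity = data_bytes * parity_blocks / data_blocks
    temporary = 0 if replicas_absorbed else data_bytes * extra_replicas
    return data_bytes + parity + temporary


# Hypothetical 4+2 stripe, 400 units of user data, one temporary replica.
worst = bytes_written(400, 4, 2)
best = bytes_written(400, 4, 2, replicas_absorbed=True)
print(worst, best)  # prints 1000.0 600.0
```

In this model the only cost left in the best case is the parity itself, which is the sense in which the journalling writes would be essentially free.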
So, please go forth and play with it and report back - just don't use it for any real data until stripe level copygc is implemented (and there's been more testing, of course).
I may not get to stripe level copygc for a while though - I'm eyeing up reflink next.
Comments
Interesting, as I indeed know that ZFS has serious problems in this area. I got lost, though, on this part: "if we can have those buckets be reused for new writes quickly and without any cache flush commands in between, those journalling writes will never actually hit the physical platters or flash media - they'll be overwritten in the disk's write cache before they happen". What is the point of writing a temporary replicated copy first, when it will not be persistent on disk, before the final striped copy is touched? Can you clarify?
2019-04-09 09:25:48 +0000 UTC
Yeah, it should somewhat reduce the performance penalty of wide stripes - though, rebuild performance will be similar to conventional raid5/6. It already handles disks of different size - it'll do about what you describe. If you're just using replication across multiple disks of different size (more disks than nr_replicas), the allocator will just preferentially stripe across the disks with more free space so that all the disks fill up at about the same time.
Kent Overstreet
2019-02-09 21:55:19 +0000 UTC
I imagine this approach will decrease the performance penalty of using wide stripes? Will the code handle disks of different sizes? E.g., if I have five 8TB drives and three 12TB drives, and would like two-disk redundancy, will it fall back to writing 3 mirrored copies to the 12TB drives once the 8TB drives are full?
2019-02-09 21:38:25 +0000 UTC
Now you can update the overview page, which still says "Erasure encoding (Reed-Solomon, i.e. RAID5/6): Not yet started". :-)
2018-11-30 22:07:50 +0000 UTC
Ahh that is a good optimisation.. thanks!
veritanuda
2018-11-14 09:56:30 +0000 UTC