Bcachefs extents - compression, checksumming
Added 2018-08-13 21:55:07 +0000 UTC
One topic that was asked about recently was compression in bcachefs, so I thought I'd write a bit about how extents are represented, since a bunch of stuff falls out of that.
In bcachefs, checksumming and compression are done per extent, not per block or per page. This means we store one checksum per extent and if the data is compressed, it'll be compressed all at once instead of being broken up into page size chunks (like btrfs does).
This means we're making some tradeoffs. Whenever we read data from an extent that is compressed or checksummed (or both), we have to read the entire extent, even if we only wanted 4k of data out of a 128k extent. Because of this, we limit the maximum size of checksummed/compressed extents to 128k by default. So small random reads will be slow if the data was written out in larger chunks and isn't cached.
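To make the read-side cost concrete, here is a minimal Python sketch. It is not bcachefs code: the `Extent` class, its field names, the use of zlib and CRC32, and the choice to checksum the compressed payload are all assumptions made for illustration. It only shows why a 4k read from a checksummed, compressed extent has to touch the whole payload.

```python
import zlib

EXTENT_SIZE = 128 * 1024  # default cap on checksummed/compressed extents, per the text
PAGE_SIZE = 4 * 1024


class Extent:
    """Hypothetical in-memory model of one checksummed, compressed extent."""

    def __init__(self, data: bytes):
        assert len(data) <= EXTENT_SIZE
        self.payload = zlib.compress(data)        # compressed all at once
        self.checksum = zlib.crc32(self.payload)  # one checksum for the whole extent

    def read(self, offset: int, length: int):
        """Return (requested bytes, number of payload bytes that had to be read)."""
        # The single checksum covers the entire payload, so every byte of it
        # has to be read and verified before any part can be trusted.
        if zlib.crc32(self.payload) != self.checksum:
            raise IOError("checksum mismatch")
        data = zlib.decompress(self.payload)
        return data[offset:offset + length], len(self.payload)


extent = Extent(bytes(range(256)) * 512)  # 128k of sample data
page, bytes_read = extent.read(2 * PAGE_SIZE, PAGE_SIZE)
print(len(page), "bytes requested,", bytes_read, "payload bytes read")
```

Once the extent has been decompressed, the rest of it can go into the page cache, which is why neighbouring reads end up cheap.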
But outside of benchmarks that's a pretty rare scenario, and there are some significant advantages:
- Our metadata is significantly smaller, since we're storing a lot fewer checksums. Smaller metadata makes everything else faster. And on real-world workloads, having to read entire extents is almost a non-issue: with buffered IO, we can round a given read request up to wherever the extent stops, so that the next read for that data will be cached. Purely random IO workloads are not the norm; most workloads have some locality to them.
- Much better compression ratio. If you break your data up into page-size chunks before you compress it, you can efficiently read only a page at a time, but a page (4k, generally) is not a lot of data to be compressing at once. Compression algorithms are fundamentally able to do a better job (i.e. find more redundancy) if you feed them more data at once. Also, if you're compressing at 4k granularity you've got painful alignment issues to deal with: you want your data on disk to be block aligned, but if you compress 4k and then round it back up to your block size you're losing a lot of your compression ratio, or all of it if your blocksize is 4k.
- And it simplifies the IO path quite a bit. Everything in the IO path works in terms of extents: there's nothing smaller than an extent we have to write code to handle.
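The alignment point in the compression bullet can be checked with a few lines of Python. This is only a sketch under assumptions: zlib stands in for whatever compressor the filesystem uses, the 4k block size and the sample data are made up, and `on_disk_size` is a hypothetical helper, not anything from bcachefs.

```python
import zlib

BLOCK_SIZE = 4096
EXTENT_SIZE = 128 * 1024


def round_up(n, align=BLOCK_SIZE):
    """Round n up to the next multiple of align."""
    return -(-n // align) * align


def on_disk_size(data, chunk_size):
    """Bytes used when data is compressed in chunk_size pieces, each block aligned."""
    total = 0
    for start in range(0, len(data), chunk_size):
        compressed = zlib.compress(data[start:start + chunk_size])
        total += round_up(len(compressed))
    return total


# Made-up, highly compressible sample data, trimmed to exactly one 128k extent.
data = b"".join(b"record %06d: status=ok\n" % i for i in range(6000))[:EXTENT_SIZE]

per_page = on_disk_size(data, BLOCK_SIZE)     # compress 4k at a time
per_extent = on_disk_size(data, EXTENT_SIZE)  # compress the whole extent at once

print("4k chunks:   ", per_page, "bytes on disk")
print("whole extent:", per_extent, "bytes on disk")
```

With a 4k block size, every compressed 4k chunk rounds straight back up to 4k, so the per-page variant uses as much space as the uncompressed data. The whole-extent variant only pays the rounding cost once.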
One other disadvantage of this approach, with compression, is that it turns out to be very difficult (and probably not practical) to guarantee that copygc won't make your data take up more space on disk, in the process of moving things around. The reason is that when we rewrite an extent we may have to fragment it, if there isn't enough space for it in the bucket we're currently writing to - it's a bin packing problem. And if the extent was compressed, fragmenting it means it's almost definitely going to take up more space on disk.
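Here is a small Python illustration of why fragmenting a compressed extent tends to cost space. Everything in it is an assumption for the sake of the example: zlib as the compressor, 512-byte sectors, a split at the 64k mark standing in for "the bucket ran out of room", and deliberately contrived data (one random 16k block repeated eight times) so that the effect is easy to see.

```python
import random
import zlib

SECTOR = 512


def on_disk(data):
    """Compressed size of data, rounded up to whole sectors."""
    size = len(zlib.compress(data))
    return -(-size // SECTOR) * SECTOR


# Contrived 128k extent: a random 16k block repeated 8 times. The random
# block itself is incompressible, but each repeat is cheap to encode as
# long as the compressor has already seen the block.
rng = random.Random(0)
block = bytes(rng.getrandbits(8) for _ in range(16 * 1024))
extent = block * 8

whole = on_disk(extent)

# Pretend the current bucket only had room for the first 64k, so the
# extent is rewritten as two independently compressed fragments.
split_at = 64 * 1024
fragmented = on_disk(extent[:split_at]) + on_disk(extent[split_at:])

print("one extent:   ", whole, "bytes")
print("two fragments:", fragmented, "bytes")
```

Each fragment has to store the random block from scratch, because the second fragment cannot refer back to data that now lives in a different extent. Real data is less extreme, but the direction is the same.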
This particular problem is the reason I still haven't flipped disk space accounting (i.e. what df shows) to compressed size: we're still counting up how much data we have as if it were all uncompressed. If I flip that switch without making any other changes, assertions pop when copygc moves data and discovers that it caused disk usage to go up without having a disk reservation. I could hack around that (just have the copygc path grab a disk reservation...), but I've been uncomfortable doing so without a better answer to the real problem, which is that copygc really should not cause data to take up more space than it did before.
My current thinking is that when compression is enabled we're just going to have to live with the fact that copygc might occasionally cause a particular extent to take up more space, as long as on average it's compacting data. For that though, I'm going to need to make some improvements to the data move code so that it can merge adjacent extents that it's moving. Right now if it's moving uncompressed data the extents can be merged by the btree code, after the data has been written, but merging compressed extents requires feeding the data into the write path at the same time.
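The difference between the two kinds of merge can be sketched as follows. This is a toy model, not the bcachefs data structures: the `Extent` record, its field names, and both merge functions are hypothetical, and zlib again stands in for the real compressor.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Extent:
    """Hypothetical extent record; field names are illustrative only."""
    file_offset: int  # logical start within the file
    disk_offset: int  # where the payload starts on disk
    payload: bytes    # bytes as stored on disk
    live_size: int    # uncompressed length
    compressed: bool


def merge_keys(a, b):
    """Merge two uncompressed extents after the fact, without touching data.

    Only works when they are adjacent both logically and on disk.
    Returns None when the extents cannot be merged this way.
    """
    if a.compressed or b.compressed:
        return None
    if a.file_offset + a.live_size != b.file_offset:
        return None
    if a.disk_offset + len(a.payload) != b.disk_offset:
        return None
    return Extent(a.file_offset, a.disk_offset, a.payload + b.payload,
                  a.live_size + b.live_size, False)


def merge_by_rewrite(a, b, new_disk_offset):
    """Merge two logically adjacent compressed extents by recompressing them.

    Both payloads go back through the "write path" together, so the
    merge has to happen while the data is being moved, not afterwards.
    """
    if a.file_offset + a.live_size != b.file_offset:
        return None
    data = zlib.decompress(a.payload) + zlib.decompress(b.payload)
    return Extent(a.file_offset, new_disk_offset, zlib.compress(data),
                  len(data), True)
```

Two compressed payloads sitting next to each other on disk are still two separate compressed streams, so there is no key-only shortcut for them; that is the case `merge_keys` refuses.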
So that'll happen at some point, and then I'll flip the disk space accounting switch to compressed size, and then everyone who's using compression will magically see more free space in their filesystem when they upgrade to the new kernel :)