XaiJu
bcachefs
bcachefs

patreon


Bcachefs extents - compression, checksumming

One topic that was asked about recently was compression in bcachefs, so I thought I'd write a bit about how extents are represented as a bunch of stuff falls out of that.

In bcachefs, checksumming and compression are done per extent, not per block or per page. This means we store one checksum per extent and if the data is compressed, it'll be compressed all at once instead of being broken up into page size chunks (like btrfs does).

This means we're making some tradeoffs. Whenever we read some data from an extent that is compressed or checksummed (or both), we have to read the entire extent, even if we only wanted to read 4k of data and the extent was 128k - because of this, we limit the maximum size of checksummed/compressed extents to 128k by default. So this does mean that small random reads will be slow, if the data was written out in larger chunks and the data isn't cached.

But outside of benchmarks that's a pretty rare scenario, and there's some significant advantages:

 - Our metadata is significantly smaller, since we're storing a lot fewer checksums. Smaller metadata makes everything else faster. And on real world workloads having to read entire extents is almost a non issue, because with buffered IO for a given read request we can round it up to wherever the extent stops, so that the next read for that data will be cached. Purely random IO workloads are not the norm, most workloads have some locality to them.

 - Much better compression ratio. If you break your data up into page size chunks before you compress it that does mean that you can efficiently read only a page at a time, but a page (4k, generally) is not a lot of data to be compressing at once. Compression algorithms are fundamentally able to do a better job (i.e. find more redundancy) if you feed them more data at once, and also if you're compressing at 4k granularity now you've got painful alignment issues to deal with - you want your data on disk to be block aligned, but if you compress 4k and then round it back up to your block size you're losing a lot of your compression ratio - all of it if your blocksize is 4k.

 - And it simplifies the IO path quite a bit. Everything in the IO path works in terms of extents: there's nothing smaller than an extent we have to write code to handle.

One other disadvantage of this approach, with compression, is that it turns out to be very difficult (and probably not practical) to guarantee that copygc won't make your data take up more space on disk, in the process of moving things around. The reason is that when we rewrite an extent we may have to fragment it, if there isn't enough space for it in the bucket we're currently writing to - it's a bin packing problem. And if the extent was compressed, fragmenting it means it's almost definitely going to take up more space on disk.

This particular problem is the reason I still haven't flipped disk space accounting (i.e. what df shows) to compressed size - we're still counting up how much data we have as if it was all uncompressed; if I flip that switch without making any other changes assertions pop when copygc moves data and discovers it did something that caused disk usage to go up and it didn't have a disk reservation. I could hack around that (just have the copygc path grab a disk reservation...), but I've been uncomfortable with that without a better answer to the real problem, which is that copygc really should not cause data to take up more space than it did before.

My current thinking is that when compression is enabled we're just going to have to live with the fact that copygc might occasionally cause a particular extent to take up more space, as long as on average it's compacting data. For that though, I'm going to need to make some improvements to the data move code so that it can merge adjacent extents that it's moving. Right now if it's moving uncompressed data the extents can be merged by the btree code, after the data has been written, but merging compressed extents requires feeding the data into the write path at the same time.

So that'll happen at some point, and then I'll flip the disk space accounting switch to compressed size, and then everyone who's using compression will magically see more free space in their filesystem when they upgrade to the new kernel :)

Comments

Hey Kent. any update on the upstreaming effort? Last mention I see was a true few months ago, and requiring a custom built kernel is the major thing blocking I and many others from giving bcachefs a try

According to btrfs docs, it writes uncompressed data first, and later it compresses data COW style, so it uses bandwidth 3x that's needed(write-read-write again fragmented). Much worse is that, it has another sidefect: it wears out SSD at least twice more, than needed (imagine what write-amplification it causes to easily compressible stuff like text files[source code, man pages, etc]). I hope that bcachefs doesn't do stupid things like that

Thanks Natascha. Maybe it's a patreon page and shouldn't be the best site for it, but as you see all announcements, news, etc goes there (unlike official site, which is more complete, but is a bit laggy and there's no real news/announcements section. And last time I've tried to access mailing lists, it seems it's closed (for developers only) and theres no public archives. So all bcachefs PR goes there, and I think that Kent made it on purpose ;-)

You are right that gc stands for garbage collector to be precise it stands for copying garbage collector. Buckets are what other people call chunks in garbage collection. If you don't know what garbage collection is just read up on it. Bcachefs uses garbage collection as a means for defragmenting the device. PS what do you expect? This is the patreon page and not really the best place to finde answers to such questions even thought I agree and some of bcachefs terminology should be explained better/at all, but I don't think that this is a priority for Kent.

Really, nobody could explain or point me to the docs explaining what 'bucket' or 'copygc' in context of COW FS means?

Hi, I'm waiting for day when bcachefs will support subvolumes, so I'm reading any article or post there, but still are unhappy because cannot understand every article there, because unexplained terminology(that's hard too google in this context) there, like 'bucket' or 'copygc' (found something about garbage collector, but it was JavaScript code). Maybe it would be good to write some common dictionary for beginners(while I was familiar with old/basic/nocow terminology like blocks,extents,inode,FAT and is very well documented/explained your articales sounds very cryptic in many placed to me[I wonder also if only to m e;-])

What would happen if copygc refuses to fragment compressed extents? Would that make its job impossible?


More Creators