bcachefs

Compression + copygc

Added 2016-08-06 05:12:49 +0000 UTC

I'm kicking myself for not noticing this sooner (most likely I saw it months ago and then forgot about it because I'm terrible about taking notes...). Do not use bcachefs with compression enabled yet - if copygc ever has to run you'll very soon hit a BUG_ON().

The issue is that copygc will often have to split extents that it's rewriting: it's copying extents from various mostly empty buckets into new buckets, and very often the extents it's moving won't exactly fill up a new bucket - think bin packing. So, the extent gets split - we put as much as we can into the bucket we're filling up, then start a new bucket with the rest of the extent.

But with compression, this is bad - very bad: if a compressed extent is split, the two new extents will be recompressed individually and will almost certainly be larger in total than the original extent - possibly not compressing at all, since each piece's compressed size gets rounded up to the block size.
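
To make the rounding effect concrete, here's a rough sketch. It uses Python's zlib as a stand-in compressor, and the 4k block size and the sample data are assumptions picked for illustration, not what bcachefs actually does. It compresses a 64k extent whole, then as two 32k halves, rounding each result up to the block size:

```python
import zlib

BLOCK_SIZE = 4096  # assumed block size in bytes, for illustration only


def on_disk_size(data: bytes) -> int:
    """Compressed size of `data`, rounded up to a whole number of blocks."""
    compressed = len(zlib.compress(data))
    blocks = -(-compressed // BLOCK_SIZE)  # ceiling division
    return blocks * BLOCK_SIZE


# Highly compressible 64k extent (made-up sample data).
extent = (b"bcachefs " * 8192)[:64 * 1024]

whole = on_disk_size(extent)
half = len(extent) // 2
split = on_disk_size(extent[:half]) + on_disk_size(extent[half:])

print(f"whole extent: {whole} bytes on disk")
print(f"split in two: {split} bytes on disk")
```

With this sample data the whole extent fits in a single block, while each half needs a block of its own, so splitting doubles the space used. Less compressible data will behave differently, but the rounding can only ever hurt.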

If moving data around in the background causes it to take up more space on disk than it did before, everything goes out the window - we have no way of calculating any bounds on how much space we need to store a given amount of data (except by going off the uncompressed size). In the worst case, with a mostly full filesystem and foreground writes that deviously overwrite existing data so as to force copygc to do the most possible work, all the existing data will be rewritten and split many times until nothing is compressed at all anymore.

Fuck. Fuckity fuck fuck fuck.

So, copygc cannot split existing extents. There's no way around that that I can see.

What if we just don't split extents as we're rewriting them - what if we just waste space, and skip to the next bucket if the extent we're rewriting won't fit in our current bucket?

Well, the maximum size of a compressed extent is 64k minus one block (maximum uncompressed size is 64k, and if the compressed size were 64k we wouldn't compress it). Let's deal in 512 byte sectors, and say our block size is one sector: 64k is 128 sectors, so the maximum size of a compressed extent is 127 sectors. If a bucket had 127 sectors left, we'd always be able to fit a compressed extent in it, so the maximum amount of space per bucket this could cause us to waste (our maximum internal fragmentation) is 126 sectors.

With 2 MB buckets (4096 sectors), that's ~3% - that's workable, the default reserve for copygc is 10%. With smaller buckets, that's not a solution.
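
The arithmetic for other bucket sizes can be sketched like this (the function name and the list of bucket sizes are just made up for illustration; the 127 sector maximum comes from the one-sector block size assumed above):

```python
SECTOR_SIZE = 512
MAX_COMPRESSED_SECTORS = 127  # 64k minus one 512 byte block


def worst_case_waste(bucket_bytes: int) -> float:
    """Worst-case fraction of a bucket left unused if extents are never split."""
    bucket_sectors = bucket_bytes // SECTOR_SIZE
    # With MAX_COMPRESSED_SECTORS free any extent fits, so at most one
    # sector less than that can be left unused.
    max_wasted = MAX_COMPRESSED_SECTORS - 1
    return max_wasted / bucket_sectors


for size_kib in (128, 256, 512, 1024, 2048):
    frac = worst_case_waste(size_kib * 1024)
    print(f"{size_kib:5d} KiB bucket: up to {frac:.1%} wasted")
```

At 128k buckets the worst case is close to half the bucket, which is nowhere near fitting inside a 10% reserve.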

And as far as I can tell we really don't have any other bounds on internal fragmentation, so with this approach we really couldn't enable compression when using smaller buckets.

The only real, general solution I can think of is: fragment _compressed_ extents. That is, allow a single compressed extent (compressed as one chunk) to reside in physically discontiguous locations on disk.

This is not a pretty solution - to read any of a compressed extent (even if you only want one sector out of it) you have to read the entire thing so that you can decompress it (it's the same with checksummed extents - you have to read the entire extent so you can verify the checksum). If the extent is fragmented, you now have to read every fragment in order to read any part of the extent.

Also, this would probably have to be an incompatible on-disk format change, and it'll complicate various code dealing with extents. It's definitely an ugly solution.

But it's the only workable solution I can think of, and practically I don't think the downsides are too bad - with typical bucket sizes only a small fraction of compressed extents will be fragmented, and with some optimizations we can probably make it very rare in practice (e.g. if we just say "don't fragment if the amount of space we'd waste in the current bucket is small enough", this would never happen with 2 MB buckets).
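
A minimal sketch of that "don't fragment if the waste is small enough" rule, just to pin down the idea. The names (choose_action, max_waste_sectors) and the threshold values are hypothetical and nothing like this exists in the code yet:

```python
from enum import Enum


class Action(Enum):
    PLACE = "write the whole extent into the current bucket"
    SKIP = "waste the rest of this bucket and start a new one"
    FRAGMENT = "fragment the compressed extent across buckets"


def choose_action(free_sectors: int, extent_sectors: int,
                  max_waste_sectors: int) -> Action:
    """Decide what copygc does with one compressed extent.

    max_waste_sectors is a hypothetical tunable: the most space we're
    willing to throw away per bucket to avoid fragmenting.
    """
    if extent_sectors <= free_sectors:
        return Action.PLACE
    if free_sectors <= max_waste_sectors:
        return Action.SKIP
    return Action.FRAGMENT


# 2 MB bucket, threshold of 126 sectors (~3% of 4096): an extent that
# doesn't fit is at most 127 sectors, so at most 126 sectors are free.
print(choose_action(free_sectors=100, extent_sectors=127, max_waste_sectors=126))

# Small bucket with a tight threshold: skipping would waste too much.
print(choose_action(free_sectors=100, extent_sectors=127, max_waste_sectors=16))
```

With a threshold of 126 sectors an extent that doesn't fit always leaves at most 126 sectors free, so the FRAGMENT case is never reached. That threshold is about 3% of a 2 MB bucket, which is why fragmentation wouldn't happen there; with small buckets the threshold has to be much tighter and some extents do get fragmented.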

If anyone else has any other bright ideas though I'd love to hear them.