XaiJu
bcachefs
bcachefs

patreon


Compression + copygc

I'm kicking myself for not noticing this sooner (most likely I saw it months ago and then forgot about it because I'm terrible about taking notes...). Do not use bcachefs with compression enabled yet - if copygc ever has to run you'll very soon hit a BUG_ON().

The issue is that copygc will often have to split extents that it's rewriting: it's copying extents from various mostly empty buckets into new buckets, and very often the extents it's moving won't exactly fill up a new bucket - think bin packing. So, the extent gets split - we put as much as we can into the bucket we're filling up, then start a new bucket with the rest of the extent.

But with compression, this is bad - very bad: if a compressed extent is split, the two new extents will be recompressed individually and will almost definitely be larger than the original extent - possibly not compressing at all, due to rounding up to the block size.

If moving data around in the background causes it to take up more space on disk than it did before, everything goes out the window - we have no way of calculating any bounds on how much space we need to store a given amount of data (except by going off the uncompressed size). In the worst case, with a mostly full filesystem and foreground writes that deviously overwrite existing data so as to force copygc to do the most possible work, all the existing data will be rewritten and split many times until nothing is compressed at all anymore.

Fuck. Fuckity fuck fuck fuck.

So, copygc cannot split existing extents. There's no way around that that I can see.

What if we just don't split extents as we're rewriting them - what if we just waste space, and skip to the next bucket if the extent we're rewriting won't fit in our current bucket?

Well, the maximum size of a compressed extent is 64k - one block (maximum uncompressed size is 64k, and if the compressed size was 64k we wouldn't compress it). Let's deal in 512 byte sectors, and say our block size is one sector: maximum size of a compressed extent is 127. If a bucket had 127 sectors left, we'd always be able to fit a compressed extent in it, so the maximum amount of space per bucket this could cause us to waste (our maximum internal fragmentation) is 126 sectors.

With 2 Mb buckets, that's ~3% - that's workable, the default reserve for copy gc is 10%. With smaller buckets, that's not a solution.

And as far as I can tell we really don't have any other bounds on internal fragmentation, so with this approach we really couldn't enable compression when using smaller buckets.

The only real, general solution I can think of is: fragment _compressed_ extents. That is, allow a single compressed extent (compressed as one chunk) to reside in physically discontiguous locations on disk.

This is not a pretty solution - to read any of a compressed extent (even if you only want one sector out of it) you have to read the entire thing so that you can decompress it (it's the same with checksummed extents - you have to read the entire extent so you can verify the checksum). If the extent is fragmented - you now have to read every fragment in order to read any part of the extent.

Also, this would probably have to be an incompatible on disk format change, and it'll complicate various code dealing with extents. It's definitely an ugly solution.

But it's the only workable solution I can think of, and practically I don't think the downsides are too bad - with typical bucket sizes only a small fraction of compressed extents will be fragmented, and with some optimizations we can probably make it very rare in practice (e.g. if we just say "don't fragment if the amount of space we'd waste in the current bucket is small enough", this would never happen with 2 mb buckets).

If anyone else has any other bright ideas though I'd love to hear them.

Comments

It would be useful for multi booting if the user needed to resize the Linux bcachefs partition. EDIT: I was meant to ask if partition resizing would be supported both online and offline? This file system looks like it has everything, I'm assuming it has delayed allocation (to avoid fragmented files)? Both ZFS and Btrfs apparently have problems with torrents not working well with delayed allocation and copy on write; the torrent files get very messy. As for testing bcachefs, I would test - were it available from the Ubuntu installer GUI, like Btrfs has been for years. I have tested Btrfs and hoping to test yours in a future version of Ubuntu. Yours looks a million times better and I would definitely trust my data on it. I know it would need GRUB2 support. Btrfs feels very slow and ZFS is not offered by the Ubuntu installer GUI.

Dave494

If copygc left buckets partly filled, then the only thing that could finish filling them up would be foreground writes - it could _maybe_ be made to work, but it'd mean foreground writes would get horribly fragmented and it'd be a really fragile solution - it would subtly break or at least complicate some pretty fundamental invariants we use to reason about allocation in bcache. In general it is highly likely that copygc will be able to defragment data, and recompress data in larger chunks, but it's not something we can depend on.

Kent Overstreet

Doesn't currently, but it'll be trivial to add.

Kent Overstreet

You can't do random reads inside of compressed data - for a given compressed chunk, to read any of it you have to decompress the entire chunk. That means you want your compressed chunks to be relatively small, and you definitely want them to be contiguous on disk - otherwise for a small read to part of a compressed chunk you'll then have to issue multiple reads, to read all the fragments. But you don't want them to be too small - if you only compress 4k at a time, your compression ratio is going to be terrible. Taken together, the two constraints mean you really want to be compressing extents.

Kent Overstreet

LZ4, which is pretty much the best compression algorithm for filesystems currently, isn't dictionary based at all. Sharing dictionaries does sound nice in theory, but it would be a lot of additional complexity so it'll be quite awhile before we can even think about such things. Low hanging fruit first.

Kent Overstreet

This is a very generic thought on FS compression: # Solution now All systems I am aware of use block or stream compression techniques to compression certain blocks/extends/chunks/sectors/... on disc by simply applying a chosen compression method to that part of data. This makes it (relatively) easy to implement and allows to swap/plug-in different algorithms. The problem here is: this is not how file systems work in general and you problem is only one of the bad effects. # Proper (?) way Normally the data blocks you compress share a good amount of common information. For example, chunks of a PDF file contain similar commands, XML data implies a tag-structure and so on. This also holds for inter-file information. It is likely that files store d within the same folder share mutual information. Compression algorithms could deal with it by using the same dictionary for all of these data blocks or by using a hierarchical structure. I know that this is way more complicated than the simple approach we see nowadays but it would allow the following features: a) splitting extends at (not arbitrary!) places and decompressing them individually b) hitting better compression rates c) maybe better performance since you could cache dictionary information (= mutual information) within the RAM As said, this is a VERY generic idea and may have major problems.

Marco Neumann

I've got an Ubuntu netinst USB image that supports* installing onto bcachefs, if you'd like to try it. * For sufficiently hackish values of “support”: You need to first set up a normal, non-bcachefs, partition scheme via the partitioner, then drop to a terminal, format your bcachefs volume, and mount at /target.

Thanks for the work, Kent! I do have a few question to better understand this problem: 1) Does bcachefs fill up buckets sequentially? Would it be possible to leave some half filled, skip the the next, and fill up the previous one later? 2) If copygc copies fragmented extents into empty buckets (except maybe for the first bucket it writes to), it is likely that fragmented extents are merged, isn't it? Wouldn't that cause the compressed size to decrease in most cases?

Forgot to ask this at the beginning: does bcachefs support partition resizing i.e. shrink or grow? If your file system could be one of the options included in Ubuntu installer for testing, I would happily test it in real-world use, rather than using VirtualBox. A PayPal donation button on your site would be good for donations too, from the less wealthy of us. ;-)

Dave494

I'm not claiming to have understood your design on this (<a href="https://bcache.evilpiepirate.org/BcacheGuide/" rel="nofollow noopener" target="_blank">https://bcache.evilpiepirate.org/BcacheGuide/</a>#index2h2) fully, but it looks like the difficulties come from putting both functionalities (in-bucket extent positioning and compression) in the same layer. (Grumbling; apparently I cannot write two paragraphs in one single reply here?) Did you consider handling compression in a different layer above extents, so that gccopy would just see "data"? What were your reasons not to do that? It looks to me that introducing a mechanism to "fragment extents across buckets" is actually going to add one more layer to the whole setup anyway, but it looks like the insertion point (for the new layer in the hierarchy) is going to make things harder to handle in the lower level.


More Creators