Compression + copygc
Added 2016-08-06 05:12:49 +0000 UTCI'm kicking myself for not noticing this sooner (most likely I saw it months ago and then forgot about it because I'm terrible about taking notes...). Do not use bcachefs with compression enabled yet - if copygc ever has to run you'll very soon hit a BUG_ON().
The issue is that copygc will often have to split extents that it's rewriting: it's copying extents from various mostly empty buckets into new buckets, and very often the extents it's moving won't exactly fill up a new bucket - think bin packing. So, the extent gets split - we put as much as we can into the bucket we're filling up, then start a new bucket with the rest of the extent.
But with compression, this is bad - very bad: if a compressed extent is split, the two new extents will be recompressed individually and will almost definitely be larger than the original extent - possibly not compressing at all, due to rounding up to the block size.
If moving data around in the background causes it to take up more space on disk than it did before, everything goes out the window - we have no way of calculating any bounds on how much space we need to store a given amount of data (except by going off the uncompressed size). In the worst case, with a mostly full filesystem and foreground writes that deviously overwrite existing data so as to force copygc to do the most possible work, all the existing data will be rewritten and split many times until nothing is compressed at all anymore.
Fuck. Fuckity fuck fuck fuck.
So, copygc cannot split existing extents. There's no way around that that I can see.
What if we just don't split extents as we're rewriting them - what if we just waste space, and skip to the next bucket if the extent we're rewriting won't fit in our current bucket?
Well, the maximum size of a compressed extent is 64k - one block (maximum uncompressed size is 64k, and if the compressed size was 64k we wouldn't compress it). Let's deal in 512 byte sectors, and say our block size is one sector: maximum size of a compressed extent is 127. If a bucket had 127 sectors left, we'd always be able to fit a compressed extent in it, so the maximum amount of space per bucket this could cause us to waste (our maximum internal fragmentation) is 126 sectors.
With 2 Mb buckets, that's ~3% - that's workable, the default reserve for copy gc is 10%. With smaller buckets, that's not a solution.
And as far as I can tell we really don't have any other bounds on internal fragmentation, so with this approach we really couldn't enable compression when using smaller buckets.
The only real, general solution I can think of is: fragment _compressed_ extents. That is, allow a single compressed extent (compressed as one chunk) to reside in physically discontiguous locations on disk.
This is not a pretty solution - to read any of a compressed extent (even if you only want one sector out of it) you have to read the entire thing so that you can decompress it (it's the same with checksummed extents - you have to read the entire extent so you can verify the checksum). If the extent is fragmented - you now have to read every fragment in order to read any part of the extent.
Also, this would probably have to be an incompatible on disk format change, and it'll complicate various code dealing with extents. It's definitely an ugly solution.
But it's the only workable solution I can think of, and practically I don't think the downsides are too bad - with typical bucket sizes only a small fraction of compressed extents will be fragmented, and with some optimizations we can probably make it very rare in practice (e.g. if we just say "don't fragment if the amount of space we'd waste in the current bucket is small enough", this would never happen with 2 mb buckets).
If anyone else has any other bright ideas though I'd love to hear them.
Comments
It would be useful for multi booting if the user needed to resize the Linux bcachefs partition. EDIT: I was meant to ask if partition resizing would be supported both online and offline? This file system looks like it has everything, I'm assuming it has delayed allocation (to avoid fragmented files)? Both ZFS and Btrfs apparently have problems with torrents not working well with delayed allocation and copy on write; the torrent files get very messy. As for testing bcachefs, I would test - were it available from the Ubuntu installer GUI, like Btrfs has been for years. I have tested Btrfs and hoping to test yours in a future version of Ubuntu. Yours looks a million times better and I would definitely trust my data on it. I know it would need GRUB2 support. Btrfs feels very slow and ZFS is not offered by the Ubuntu installer GUI.
Dave494
2016-08-09 20:22:55 +0000 UTCIf copygc left buckets partly filled, then the only thing that could finish filling them up would be foreground writes - it could _maybe_ be made to work, but it'd mean foreground writes would get horribly fragmented and it'd be a really fragile solution - it would subtly break or at least complicate some pretty fundamental invariants we use to reason about allocation in bcache. In general it is highly likely that copygc will be able to defragment data, and recompress data in larger chunks, but it's not something we can depend on.
Kent Overstreet
2016-08-09 19:29:41 +0000 UTCDoesn't currently, but it'll be trivial to add.
Kent Overstreet
2016-08-09 19:24:38 +0000 UTCYou can't do random reads inside of compressed data - for a given compressed chunk, to read any of it you have to decompress the entire chunk. That means you want your compressed chunks to be relatively small, and you definitely want them to be contiguous on disk - otherwise for a small read to part of a compressed chunk you'll then have to issue multiple reads, to read all the fragments. But you don't want them to be too small - if you only compress 4k at a time, your compression ratio is going to be terrible. Taken together, the two constraints mean you really want to be compressing extents.
Kent Overstreet
2016-08-09 19:24:12 +0000 UTCLZ4, which is pretty much the best compression algorithm for filesystems currently, isn't dictionary based at all. Sharing dictionaries does sound nice in theory, but it would be a lot of additional complexity so it'll be quite awhile before we can even think about such things. Low hanging fruit first.
Kent Overstreet
2016-08-09 19:19:24 +0000 UTCThis is a very generic thought on FS compression: # Solution now All systems I am aware of use block or stream compression techniques to compression certain blocks/extends/chunks/sectors/... on disc by simply applying a chosen compression method to that part of data. This makes it (relatively) easy to implement and allows to swap/plug-in different algorithms. The problem here is: this is not how file systems work in general and you problem is only one of the bad effects. # Proper (?) way Normally the data blocks you compress share a good amount of common information. For example, chunks of a PDF file contain similar commands, XML data implies a tag-structure and so on. This also holds for inter-file information. It is likely that files store d within the same folder share mutual information. Compression algorithms could deal with it by using the same dictionary for all of these data blocks or by using a hierarchical structure. I know that this is way more complicated than the simple approach we see nowadays but it would allow the following features: a) splitting extends at (not arbitrary!) places and decompressing them individually b) hitting better compression rates c) maybe better performance since you could cache dictionary information (= mutual information) within the RAM As said, this is a VERY generic idea and may have major problems.
Marco Neumann
2016-08-09 14:15:36 +0000 UTCI've got an Ubuntu netinst USB image that supports* installing onto bcachefs, if you'd like to try it. * For sufficiently hackish values of “support”: You need to first set up a normal, non-bcachefs, partition scheme via the partitioner, then drop to a terminal, format your bcachefs volume, and mount at /target.
2016-08-08 02:06:13 +0000 UTCThanks for the work, Kent! I do have a few question to better understand this problem: 1) Does bcachefs fill up buckets sequentially? Would it be possible to leave some half filled, skip the the next, and fill up the previous one later? 2) If copygc copies fragmented extents into empty buckets (except maybe for the first bucket it writes to), it is likely that fragmented extents are merged, isn't it? Wouldn't that cause the compressed size to decrease in most cases?
2016-08-07 22:56:00 +0000 UTCForgot to ask this at the beginning: does bcachefs support partition resizing i.e. shrink or grow? If your file system could be one of the options included in Ubuntu installer for testing, I would happily test it in real-world use, rather than using VirtualBox. A PayPal donation button on your site would be good for donations too, from the less wealthy of us. ;-)
Dave494
2016-08-07 11:30:38 +0000 UTC