bcachefs

Lots of new changes/features:

It's been far too long since the last announcement - lots of stuff has been

happening. The biggest milestone has been all the breaking on disk format

changes finally landing, but there's been lots of other stuff going on, too.


On the subject of the breaking on disk format changes - there's an excellent

chance this'll be the last breaking change, so if you're thinking about trying

out bcachefs this is an excellent time. Also, if you have a filesystem in the

old format, code to read your filesystem is available in the bcachefs-v0 branches

of both linux-bcache and bcache-tools.


What all has changed since the last announcement:


Related to the on disk format changes, we have...

 - Encryption


   We now have whole filesystem encryption - and this is modern authenticated

   encryption, using ChaCha20 and Poly1305. Bcachefs's encryption isn't a direct

   competitor to ext4's encryption - unlike ext4, we can't currently encrypt

   only part of the filesystem, and then mount and use the rest of the

   filesystem without providing the encryption key. It's more of a better

   dm-crypt - block layer encryption is somewhat of a pile of hacks [1] and it's

   not possible to do authenticated encryption at the block layer, but it is in

   a copy on write filesystem.


   In my (relatively brief) performance testing, bcachefs's encryption performs

   for me almost identically to dm-crypt (which I was surprised by, given that

   they're using completely different ciphers).


   Before you go out and switch to bcachefs encryption though - please be aware,

   the encryption design and code have seen some outside review but they really

   do need more before I'd trust them with anything critical.


   [1] https://sockpuppet.org/blog/2014/04/30/you-dont-want-xts/

 

 - Backup superblocks


   This has been badly needed since our superblocks are now often > 4k and thus

   torn writes leading to checksum failures are a real issue.


 - New inode format


   The new inode format is both more compact and more easily extensible than the

   old one - average real world inode size is now 50-60 bytes. You know what

   makes a filesystem feel fast? Being able to fit all your metadata in ram :)


 - Lots of small changes for better support for multiple devices and replication


   Multiple device support (including caching/tiering) is getting to be pretty

   robust and usable (and people are successfully using it for their root

   filesystems - for a while now, actually). The tooling is getting better; the

   main priority at this point needs to be documentation.


   For replication (i.e. raid1/10), the core functionality all works - you can

   create a replicated filesystem, write data to it, and take one of the drives

   offline while the filesystem is in use - it keeps working, and you can keep

   writing data to it. However, there are still quite a few things that need to be

   finished before it will actually be useful for protecting your data - we need

   to add better tracking for which drives have data and how that data is

   replicated (so we know whether we can take a drive offline or mount without

   it without losing data), as well as replication aware disk space accounting

   and rereplication/scrubbing. But it's coming.


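Going back to the backup superblocks item: the reason a torn write matters can be shown with a small sketch. Everything here is an assumption made for illustration - CRC32 stands in for whatever checksum bcachefs really uses, and the layout (checksum followed by payload) and function names are hypothetical.

```python
import struct
import zlib
from typing import List, Optional


def seal(payload: bytes) -> bytes:
    """Prefix a payload with its CRC32 (a stand-in for the real checksum)."""
    return struct.pack("<I", zlib.crc32(payload)) + payload


def is_valid(copy: bytes) -> bool:
    """Check that the stored checksum matches the payload that follows it."""
    if len(copy) < 4:
        return False
    (stored,) = struct.unpack("<I", copy[:4])
    return stored == zlib.crc32(copy[4:])


def first_valid(copies: List[bytes]) -> Optional[bytes]:
    """Return the payload of the first copy whose checksum verifies."""
    for copy in copies:
        if is_valid(copy):
            return copy[4:]
    return None


# A payload larger than 4k, so a write can be interrupted partway through.
payload = b"superblock " * 600
primary = seal(payload)
backup = seal(payload)

# Simulate a torn write: only the first 4096 bytes of the primary landed.
torn = primary[:4096] + b"\x00" * (len(primary) - 4096)

if __name__ == "__main__":
    print("torn primary valid:", is_valid(torn))
    print("recovered from backup:", first_valid([torn, backup]) == payload)
```

With only one copy, the torn superblock would simply fail its checksum; with a backup, the reader can fall through to the next copy that verifies.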

Most of the activity lately has actually been happening in the userspace

tooling, though:


We now have a userspace fsck: we've actually had most of fsck implemented for

quite a while, but it was implemented in the kernel so it was only possible to

run it at mount time (it runs by default on every mount, because I err towards

paranoia). The new userspace fsck is much more convenient though - it takes all

the normal options (e.g. -n for dry run) and is able to prompt if it finds an

inconsistency.


We didn't get a whole new fsck tool that runs in userspace - what's actually new

here is that I wrote a shim layer to build almost all the bcachefs code in

userspace as part of bcache-tools, which uses it as a library.


This is really cool, and it's made it easy to write some other very useful

tools/subcommands: One is "bcache dump", which takes a filesystem and dumps all

the metadata to a sparse qcow2 image. This is really useful for debugging - if

your bcachefs filesystem gets into a bad state and fsck isn't able to fix it,

dump the metadata and send it to me and I'll debug it from that. We've already

used it for exactly that - and for me as the developer, it was a hell of a lot

easier to debug and teach fsck to fix that particular issue that way instead of

having to either get remote access, or debug by sending the user patches and

waiting for them to be tested. So of the recent changes this might be the one I'm

happiest about :)


We can also now migrate filesystems to bcachefs in place! The bcache migrate

command takes an existing filesystem, fallocates a big file in it, creates a new

filesystem (in userspace) on the block device but using only the space reserved

by that file it fallocated - and then walks the contents of the original

filesystem creating pointers to all your existing data.
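
The idea behind the in-place migration can be sketched as a toy model. This is not how bcache migrate is implemented; the `Extent` tuples, the `MigratedFs` class and the `migrate` function are all hypothetical, and block numbers are made up. It only shows the invariant described above: new metadata lives inside the reserved region, and file data is pointed at rather than copied.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Extent = Tuple[int, int]  # (start block, length in blocks)


def overlaps(a: Extent, b: Extent) -> bool:
    """True if two block ranges share at least one block."""
    return a[0] < b[0] + b[1] and b[0] < a[0] + a[1]


@dataclass
class MigratedFs:
    metadata_region: Extent            # space taken from the fallocated file
    pointers: Dict[str, List[Extent]]  # file name -> extents of existing data


def migrate(old_files: Dict[str, List[Extent]], reserved: Extent) -> MigratedFs:
    """Build new-filesystem metadata that points at the old data in place."""
    for name, extents in old_files.items():
        for ext in extents:
            # The reserved file must not share blocks with any existing data,
            # otherwise writing new metadata would clobber the old filesystem.
            if overlaps(ext, reserved):
                raise ValueError(f"reserved region overlaps data of {name!r}")
    # No data is copied: the new filesystem records pointers to the old extents.
    return MigratedFs(
        metadata_region=reserved,
        pointers={name: list(extents) for name, extents in old_files.items()},
    )


# Made-up layout of the original filesystem and of the fallocated file.
old_files: Dict[str, List[Extent]] = {
    "a.txt": [(10, 4)],
    "b.bin": [(20, 8), (40, 2)],
}
reserved: Extent = (100, 50)

if __name__ == "__main__":
    new_fs = migrate(old_files, reserved)
    print(new_fs.metadata_region, new_fs.pointers)
```

Because nothing outside the reserved region is written in this model, the old filesystem stays intact and both views of the data remain usable.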


You can then mount that new filesystem and verify that everything is correct

without overwriting anything in the existing filesystem (by passing mount the

offset where bcache migrate put the superblock) - and you can even mount both

the old and the new filesystems at the same time (use mount -o noexcl when

mounting the bcachefs filesystem) and use rsync --itemize-changes to verify that

the filesystems really are identical, which is how I test it.


Aside from all that, there's been numerous fixes and performance improvements -

we're still looking for benchmarks/workloads where bcachefs lags other

filesystems, and as we find them they get fixed. Good rigorous performance

testing with new benchmarks is always appreciated.

Comments

Will try posting again... I would REALLY LIKE TO TRY this file system on my existing Ubuntu partition (I also have Windows installed). I have no idea how I could convert my Ubuntu ext4 into bcachefs. Even better, when I come across faulty hardware, this would be a great "real world" test for bcachefs. I know someone who dropped their laptop and now the hard drive has gone awry, would be interesting to see how this fs behaves on a bad drive. I was wondering if AES encryption - with x86 AES hardware acceleration (AES-NI) - would be used with the encryption? Never heard of ChaCha until now.

Dave494

Great to see such a huge amount of progress! EDIT: Unfortunately, I can't even promise to test things, but I hope that this is a sign of getting closer to upstream. We need a real competitor to zfs. (No, btrfs isn't it.)

Why ChaCha, btw? Is it faster than AES using AES-NI? Anyway, nice to see such progress just after I became a supporter :-D

Feel free to grab the code I wrote for dumping the qcow2 image, if it's useful to you: https://evilpiepirate.org/git/bcache-tools.git/tree/qcow2.c?h=dev

Kent Overstreet

You made me look into the QCOW2 format and thanks for that, it was a valuable discovery for me.

