bcachefs

patreon


bcachefs posts

Fastest ordered key value store?

Based on all the numbers I've seen, it looks like bcachefs's b-tree might actually be the fastest ordered key value store around (there are faster persistent hash tables).


If anyone knows of anything that might be faster, I'd love to hear it - and I might rig up some head to head benchmarks.

More bcachefs btree benchmarks

rand_mixed is 3/4 lookups, 1/4 update

Lookups, both sequential and random, are beautifully fast.

Random updates ought to scale better than that, but I haven't profiled it yet so I'm not sure what's going on.

Could probably do better on sequential inserts/deletes too, but I haven't even tried to optimize those yet.

seq_insert:  10.0M with 1 threads in    11 sec,  1067 nsec per iter,  914k per sec
seq_lookup:  10.0M with 1 threads in  ...

Doing some pure btree performance tests

I should have done this ages ago...


bcachefs: btree_perf_test() doing 10.0M rand_insert:

bcachefs: btree_perf_test() done in 27 sec, 2587 nsec per iter, 377k per sec

bcachefs: btree_perf_test() doing 10.0M rand_lookup:

bcachefs: btree_perf_test() done in 10 sec, 1001 nsec per iter, 974k per sec

bcachefs: btree_perf_test() doing 10.0M seq_lookup:

bcachefs: btree_perf_test() done in 0 sec, 24 nsec per iter, 39.7M per sec
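
As a rough cross-check on the figures above (not part of the original post), the "per sec" column can be recomputed from the "nsec per iter" column. The sketch below assumes the k and M suffixes are binary (1024-based). That is an inference from the numbers, not something the post states: it reproduces the rand_insert and seq_lookup lines, while rand_lookup comes out at about 976k rather than the reported 974k, presumably because the printed nanosecond figure is rounded. The helper names are hypothetical.

```python
def ops_per_sec(nsec_per_iter: float) -> float:
    """Convert a per-iteration latency in nanoseconds to iterations per second."""
    return 1e9 / nsec_per_iter


def humanize_binary(n: float) -> str:
    """Format a rate with 1024-based k/M suffixes (assumed, not confirmed by the post)."""
    for suffix, scale in (("M", 1024 ** 2), ("k", 1024)):
        if n >= scale:
            value = n / scale
            # One decimal place for small values, none for values of 100 or more.
            return f"{value:.1f}{suffix}" if value < 100 else f"{value:.0f}{suffix}"
    return f"{n:.0f}"


# Latencies taken from the log lines above.
for name, nsec in (("rand_insert", 2587), ("rand_lookup", 1001), ("seq_lookup", 24)):
    print(f"{name}: {humanize_binary(ops_per_sec(nsec))} per sec")
```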


Filesystem resize is implemented

You can now expand a filesystem on a device - shrink isn't implemented just yet. The command is bcachefs device resize, and it takes the same arguments as resize2fs:

bcachefs device resize /dev/sdb 10G

If you don't specify a size, it uses the current size of the device.

Online and offline resize are both supported.

Update and new bcachefs fs usage command

Replication tests are finally all passing! This means that device removal and write error handling (for replicated writes) should finally be fully working.

Those two codepaths have in common that they need to modify pointers to existing btree nodes - removing the pointer to the device that either failed to write or is being removed - which is a particularly tricky operation, partially due to the btree node cache being physically indexed. Also fixed a whole bunch of bugs in the code that tracks...

v4.13 rebase is up


Per inode checksumming, compression options

Just pushed out a patch to add per-inode versions of some options that could previously only be set globally. Currently this is just checksum type and compression type, but more will be added in the future. The options are exposed as xattrs, and if you set them on a directory they'll be inherited on create.

For example:

$ setfattr -n bcachefs.data_checksum -v none /mnt/foo

$ setfattr -n bcachefs.compression -v lz4 /mnt/foo

The options will take effect for newly written data -...
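
The same xattrs can also be set programmatically. Here is a minimal sketch, assuming Linux and Python's standard os.setxattr; the attribute names come from the setfattr examples above, while the helper name set_bcachefs_option and its injectable setxattr parameter are hypothetical conveniences for illustration and testing.

```python
import os


def set_bcachefs_option(path, name, value, setxattr=None):
    """Set a per-inode option by writing the xattr "bcachefs.<name>" on path.

    setxattr is injectable so the helper can be exercised without a real
    bcachefs mount; by default it uses os.setxattr, which is Linux-only.
    """
    if setxattr is None:
        setxattr = os.setxattr
    # os.setxattr expects the attribute value as bytes.
    setxattr(path, f"bcachefs.{name}", value.encode())


# Equivalent to the two setfattr commands above (needs a bcachefs mount):
#   set_bcachefs_option("/mnt/foo", "data_checksum", "none")
#   set_bcachefs_option("/mnt/foo", "compression", "lz4")
```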

What's cooking

I've been back to work for several months but procrastinating and neglecting the updates...

The biggest thing I've been spending my time on lately has been improving the test infrastructure and test suite and chasing bugs. It seems I was putting this off for far too long, relying just on xfstests - this made sense while I was focusing on completing and debugging the filesystem interface, but more recently work has focused on the core IO path - the part of the code that handles things like...

Status update

Took this earlier today in Nashville :)

On the bcachefs front - might be announcing a corporate sponsor in the next few days! Stay tuned.

Status update


Status update - moving, portability/FUSE

If you've noticed things have been quieter lately, you haven't been imagining things - I've been busy with getting ready for a rather big move, and in the process I'm taking an extended road trip. I'm pretty happy about it - I've been feeling stuck in a rut and it's been hard to make progress writing code, and I think a change of scenery has been long overdue.

In the short term though, all my bigger machines are packed up onto a trailer which is going to make a lot of my normal work more diffi...

Replication update

As I think I mentioned a while ago, for replication the last big item left was IO error handling - that is, handling IO errors without just going read only when we've got another replica to read from (for reads) or when only some of the replicas for a replicated write failed.

The really tricky one was btree node write error handling, since on btree node write error we have to note somewhere that we can no longer read from the replica that failed, and we also have to note in the superblock ...

Faster fsck and mount times

Recently pushed a patch to add prefetching of btree nodes. It's a rather minor change compared to the stuff I'm still working on for replication, but it does improve both mount and fsck times by around 2x - not too shabby for a relatively simple change.

On larger filesystems, bcachefs's mount times are still too slow - this is really only a stopgap measure until I implement persistent allocation information and a few other things. Fsck performance appears to be quite good compared to othe...

Status update - debugging and replication

Debugging, debugging, more debugging...

If you've been wondering at the slow progress, that's where all my time's been going. The unfortunate reality about creating a filesystem is that a filesystem, much more so than most software, isn't all that useful if it's only, say, 90% debugged - you don't want a filesystem that doesn't eat your data _most_ of the time. And chasing down those last few bugs, the ones that are hardest to reproduce and find, is just a long slow slog. And not the fun kind, ...

New website, git repositories

http://bcachefs.org/

Don't have any _new_ content there yet, it's just all the existing stuff in one place. Would love to have people help out on the website.

Also, bcachefs has its own git repositories now - also linked to by the new website.

More importantly - the bcachefs code in the kernel tree now lives in fs/bcachefs, it's (finally) been forked from bcache. That's one of the prerequisites for upstreaming knoc...

Lots of new changes/features:

It's been far too long since the last announcement - lots of stuff has been happening. The biggest milestone has been all the breaking on disk format changes finally landing, but there's been lots of other stuff going on, too.

On the subject of the breaking on disk format changes - there's an excellent chance this'll be the last breaking change, so if you're thinking about trying out bcachefs this is an excellent time. Also, if you have a filesystem in the...

Performance improvements in bcachefs-testing

First off, some background on where we're at currently, regarding metadata IO:

 - A userspace process will never block on IO - i.e., wait for a journal write or a btree node write - unnecessarily. Never ever. The only reason your userspace process will end up blocked waiting for a metadata write to complete is either: you asked to (fsync), or resource exhaustion (either the journal filling up, or we don't have enough memory to allocate a new btree node without flushing some other d...

Testing needed

Lately, the big bottleneck is getting to be testing - I really need more people willing to try out the latest code and make sure it isn't going to eat anyone's data before I push it out for general consumption. I do a lot of testing myself already - honestly, that's where most of my time goes - but filesystems are complicated beasts, and bugs in filesystem code tend to have more severe consequences than in most other code.

Help with the test framework would also be useful, the coverage of the ...

Upcoming performance improvements

These patches haven't landed yet, and the numbers should be higher when I'm done - but the fsmark numbers are now looking really nice. Delete performance is massively improved, too.

Time to completion for fs_mark -v -n 200000 -s 4096 -k -S 1 -D 1000 -N 1000 -t 10:

bcachefs: 

real 0m53.354s
user 0m21.063s
sys 3m17.084s


ext4:

real 2m6.306s
user 0m26.659s
sys 3m47.379s

xfs:

real 1m11.003s
user ...
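
For a quick comparison of the wall-clock ("real") times listed above, the following sketch (not from the original post) converts the time(1)-style values to seconds and expresses each as a multiple of the bcachefs run. Only the three "real" values visible above are used; the helper name to_seconds is hypothetical.

```python
import re


def to_seconds(t: str) -> float:
    """Parse a time(1)-style duration such as "2m6.306s" into seconds."""
    m = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", t)
    if m is None:
        raise ValueError(f"unrecognized duration: {t!r}")
    return int(m.group(1)) * 60 + float(m.group(2))


# "real" times copied from the fs_mark runs above.
real = {"bcachefs": "0m53.354s", "ext4": "2m6.306s", "xfs": "1m11.003s"}

baseline = to_seconds(real["bcachefs"])
ratios = {fs: to_seconds(t) / baseline for fs, t in real.items()}

for fs, ratio in ratios.items():
    print(f"{fs}: {to_seconds(real[fs]):.3f}s ({ratio:.2f}x the bcachefs time)")
```

On these numbers ext4 takes roughly 2.4x and xfs roughly 1.3x as long as bcachefs to complete the run.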

Tiering should now be working - testing requested

Tiering should finally be working with the last big batch of fixes I pushed.

Chris Halse Rogers (RAOF in the #bcache IRC channel) has been testing it. He has been seeing an intermittent deadlock while copying large amounts of data, which may or may not be tiering related: if anyone else hits it, I'd really appreciate it if you could grab backtraces. Run "echo t > /proc/sysrq-trigger" and then grab the full dmesg log - that should be enough to figure out what the deadlock is.

I haven...

All about tiering

First off, a word about definitions. In bcachefs, tiering is caching by another name: storage devices can be assigned to different tiers, and we can use a faster tier to cache a slower tier.

In some other storage systems, tiering means a setup where data can be dynamically moved between different tiers, but it often implies that the migration of data between tiers happens slowly in the background (not as data is accessed), and a given chunk of data only lives in one tier at a time - that is, w...

Encryption - motivation

Just saw this really excellent article about disk encryption - this explains better than I could the issues with encryption at the block layer:

http://sockpuppet.org/blog/2014/04/30/you-dont-want-xts/

This also explains the motivation for doing encryption in bcachefs: with a copy on write filesystem, where we can take the needs of encryption into account when we're designing it and specif...

Updates

 - Encryption's mostly done, got some useful feedback from the design doc.

 - Starting to work on multiple devices and replication again. Found some "there's no way this could have possibly worked" bugs with tiering - evidently I've neglected all the multiple device stuff for too long.

Hoping to have tiering ready for people to use before too long.

Encryption design doc is finished

https://bcache.evilpiepirate.org/Encryption/

Encryption

Been studying random papers/RFCs/Dan Bernstein's code and figuring out the plan for adding encryption to bcachefs... doing crypto right is hard. In storage land, I'm not sure anyone really gets it right - if you're doing block storage (e.g. dm-crypt), or if you're adding encryption to an existing filesystem, you're kind of screwed since you have no place to stick a nonce. Unless I missed it when I was reading the code, ext4 doesn't even try to use block or file offset or anything - it's just AES...

Compression + copygc

I'm kicking myself for not noticing this sooner (most likely I saw it months ago and then forgot about it because I'm terrible about taking notes...). Do not use bcachefs with compression enabled yet - if copygc ever has to run you'll very soon hit a BUG_ON().

The issue is that copygc will often have to split extents that it's rewriting: it's copying extents from various mostly empty buckets into new buckets, and very often the extents it's moving won't exactly fill up a new bucket - think bin...

Transactions

This is my current project, so I thought I'd write something about it and how this area of bcachefs works.


So, for some background: every remotely modern filesystem has some sort of facility for what database people just call transactions - doing complex operations atomically, so that in the event of a crash the operation either happened completely or not at all - we're never left in a halfway state.


For example, creating a new file requires doing a few different things:

Compression has landed

The last bit - disk space accounting - is finished, so it should actually be useful now.

Please test it out - I'd like to hear how well it's working for people.


Currently lz4 and gzip are supported, and lz4 is the recommended option. I'd like to add more compression algorithms in the future - in particular, lzma for cold data would be really nifty.


What's still missing is interfaces/tooling: right now, the only way to see compressed and uncompressed numbers is via /sys/...

July 2016 patron supported

Thank you so much for the support this month!

Compression

Finally figured out how to make compressed disk usage accounting work. It's a surprisingly thorny issue - I'll have to write more about it later.

The TL;DR is - disk usage is only allowed to increase when you're getting a disk reservation (which is also where you'd get -ENOSPC). We have to be able to arbitrarily move data around (or rewrite fragmented data) without the amount of disk space used - however we define that number - increasing. But extents getting partially overwritten - in par...
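
The invariant in the TL;DR can be illustrated with a toy model. This is a sketch under stated assumptions, not how bcachefs implements its accounting: the class DiskAccounting, its methods, and the sector-based units are all hypothetical. The only point it makes is that reserve() is the single place usage can grow (and the single place ENOSPC is reported), while rewriting or moving existing data may only keep usage the same or shrink it.

```python
import errno
import os


class DiskAccounting:
    """Toy model: usage may only increase via reserve()."""

    def __init__(self, capacity_sectors: int):
        self.capacity = capacity_sectors
        self.used = 0

    def reserve(self, sectors: int) -> None:
        """Take a disk reservation; the only operation that can fail with ENOSPC."""
        if self.used + sectors > self.capacity:
            raise OSError(errno.ENOSPC, os.strerror(errno.ENOSPC))
        self.used += sectors

    def rewrite(self, old_sectors: int, new_sectors: int) -> None:
        """Move or rewrite existing data; must never increase usage."""
        if new_sectors > old_sectors:
            raise ValueError("rewrite would increase usage without a reservation")
        self.used -= old_sectors - new_sectors


acct = DiskAccounting(capacity_sectors=100)
acct.reserve(60)      # usage grows only here
acct.rewrite(20, 15)  # moving data may shrink usage, never grow it
print(acct.used)
```

The hard part the post alludes to, partially overwritten compressed extents, is exactly the case where a naive definition of "used" would violate the rewrite() check; the toy model does not attempt to solve that.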
