Based on all the numbers I've seen, it looks like bcachefs's b-tree might actually be the fastest ordered key-value store around (there are faster persistent hash tables).
If anyone knows of anything that might be faster, I'd love to hear about it - and I might rig up some head-to-head benchmarks.
2018-02-04 23:42:01 +0000 UTC
rand_mixed is 3/4 lookups, 1/4 updates
Lookups, both sequential and random, are beautifully fast.
Random updates ought to scale better than that, but I haven't profiled it yet so I'm not sure what's going on.
Could probably do better on sequential inserts/deletes too, but I haven't even tried to optimize those yet.
seq_insert: 10.0M with 1 threads in 11 sec, 1067 nsec per iter, 914k per sec
seq_lookup: 10.0M with 1 threads in ...
2018-01-28 03:32:51 +0000 UTC
I should have done this ages ago...
bcachefs: btree_perf_test() doing 10.0M rand_insert:
bcachefs: btree_perf_test() done in 27 sec, 2587 nsec per iter, 377k per sec
bcachefs: btree_perf_test() doing 10.0M rand_lookup:
bcachefs: btree_perf_test() done in 10 sec, 1001 nsec per iter, 974k per sec
bcachefs: btree_perf_test() doing 10.0M seq_lookup:
bcachefs: btree_perf_test() done in 0 sec, 24 nsec per iter, 39.7M per sec
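A quick sanity check on these figures, as a sketch only: converting the per-iteration times into rates reproduces the logged "per sec" values only if "k" and "M" are read as 1024 and 1024*1024. That reading is an inference from the numbers, not something the log states, and per_sec is a helper name made up for this example.

```python
# Assumption: the log's "k" and "M" suffixes are binary multiples.
KI = 1024
MI = 1024 * 1024


def per_sec(nsec_per_iter):
    """Iterations per second implied by a per-iteration time in nanoseconds."""
    return 1e9 / nsec_per_iter


# Per-iteration times copied from the log lines above.
results = [("rand_insert", 2587), ("rand_lookup", 1001), ("seq_lookup", 24)]

for name, nsec in results:
    rate = per_sec(nsec)
    print(f"{name}: {rate / KI:.0f}k per sec (binary), "
          f"{rate / 1000:.0f}k per sec (decimal)")
```

The logged rates were presumably computed from the exact elapsed time rather than from the rounded nanosecond figure, so small differences in the last digit are expected.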
2018-01-24 20:07:19 +0000 UTC
You can now expand a filesystem on a device - shrink isn't implemented just yet. The command is bcachefs device resize, and it takes the same arguments as resize2fs:
bcachefs device resize /dev/sdb 10G
If you don't specify a size, it uses the current size of the device.
Online and offline resize are both supported.
2018-01-02 23:41:21 +0000 UTC
Replication tests are finally all passing! This means that device removal and write error handling (for replicated writes) should finally be fully working.
Those two codepaths have in common that they need to modify pointers to existing btree nodes - removing the pointer to the device that either failed to write or is being removed - which is a particularly tricky operation, partially due to the btree node cache being physically indexed. Also fixed a whole bunch of bugs in the code that tracks...
2017-12-24 21:21:48 +0000 UTC
2017-12-16 16:53:54 +0000 UTC
Just pushed out a patch to add per-inode versions of some options that could previously only be set globally. Currently this is just checksum type and compression type, but more will be added in the future. The options are exposed as xattrs, and if you set them on a directory they'll be inherited on create.
For example:
$ setfattr -n bcachefs.data_checksum -v none /mnt/foo
$ setfattr -n bcachefs.compression -v lz4 /mnt/foo
The options will take effect for newly written data -...
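The "inherited on create" behaviour can be pictured with a toy model. This is a sketch, not bcachefs code: the Inode class and its methods are hypothetical, and it assumes a child copies its parent directory's options once, at creation time, and is not updated afterwards.

```python
class Inode:
    """Toy inode holding per-inode options (hypothetical, for illustration)."""

    def __init__(self, name, parent=None):
        self.name = name
        # Assumption: options are copied from the parent once, on create.
        self.opts = dict(parent.opts) if parent is not None else {}

    def set_opt(self, key, value):
        self.opts[key] = value


# Mirror the two setfattr commands above on a directory.
foo = Inode("foo")
foo.set_opt("bcachefs.data_checksum", "none")
foo.set_opt("bcachefs.compression", "lz4")

# A file created in the directory picks up the directory's options.
child = Inode("bar", parent=foo)

# Changing the directory later only affects files created after the change
# (in this model).
foo.set_opt("bcachefs.compression", "gzip")
later = Inode("baz", parent=foo)

print(child.opts)
print(later.opts)
```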
2017-12-15 16:21:18 +0000 UTC
I've been back to work for several months but procrastinating and neglecting the updates...
The biggest thing I've been spending my time on lately has been improving the test infrastructure and test suite and chasing bugs. It seems I was putting this off for far too long, relying just on xfstests - this made sense while I was focusing on completing and debugging the filesystem interface, but more recently work has focused more on the core IO path - the part of the code that handles things like...
2017-12-13 16:07:29 +0000 UTC
Took this earlier today in Nashville :)
On the bcachefs front - might be announcing a corporate sponsor in the next few days! Stay tuned.
2017-08-21 23:48:22 +0000 UTC
2017-07-18 00:19:41 +0000 UTC
If you've noticed things have been quieter lately, you haven't been imagining things - I've been busy with getting ready for a rather big move, and in the process I'm taking an extended road trip. I'm pretty happy about it - I've been feeling stuck in a rut and it's been hard to make progress writing code, and I think a change of scenery has been long overdue.
In the short term though, all my bigger machines are packed up onto a trailer which is going to make a lot of my normal work more diffi...
2017-06-13 23:48:51 +0000 UTC
As I think I mentioned a while ago, for replication the last big item left was IO error handling - that is, handling IO errors without just going read only when we've got another replica to read from (for reads) or when only some of the replicas for a replicated write failed.
The really tricky one was btree node write error handling, since on btree node write error we have to note somewhere that we can no longer read from the replica that failed, and we also have to note in the superblock ...
2017-05-15 08:12:50 +0000 UTC
Recently pushed a patch to add prefetching of btree nodes. It's a rather minor change compared to the stuff I'm still working on for replication, but it does improve both mount and fsck times by around 2x - not too shabby for a relatively simple change.
On larger filesystems, bcachefs's mount times still are too slow - this is really only a stopgap measure until I implement persistent allocation information and a few other things. Fsck performance appears to be quite good compared to othe...
2017-04-25 14:49:53 +0000 UTC
Debugging, debugging, more debugging...
If you've been wondering at the slow progress, that's where all my time's been going. The unfortunate reality about creating a filesystem is that a filesystem, much more so than most software, isn't all that useful if it's only, say, 90% debugged - you don't want a filesystem that doesn't eat your data _most_ of the time. And chasing down those last few bugs, the ones that are hardest to reproduce and find, is just a long slow slog. And not the fun kind, ...
2017-04-11 05:18:01 +0000 UTC
http://bcachefs.org/
Don't have any _new_ content there yet, it's just all the existing stuff in one place. Would love to have people help out on the website.
Also, bcachefs has its own git repositories now - also linked to by the new website.
More importantly - the bcachefs code in the kernel tree now lives in fs/bcachefs, it's (finally) been forked from bcache. That's one of the prerequisites for upstreaming knoc...
2017-03-22 09:27:50 +0000 UTC
It's been far too long since the last announcement - lots of stuff has been
happening. The biggest milestone has been all the breaking on disk format
changes finally landing, but there's been lots of other stuff going on, too.
On the subject of the breaking on disk format changes - there's an excellent
chance this'll be the last breaking change, so if you're thinking about trying
out bcachefs this is an excellent time. Also, if you have a filesystem in the
2017-03-16 00:04:34 +0000 UTC
First off, some background on where we're at currently, regarding metadata IO:
- A userspace process will never block on IO - i.e., wait for a journal write or a btree node write - unnecessarily. Never ever. The only reason your userspace process will end up blocked waiting for a metadata write to complete is either: you asked to (fsync), or resource exhaustion (either the journal filling up, or we don't have enough memory to allocate a new btree node without flushing some other d...
2016-12-03 03:59:18 +0000 UTC
Lately, the big bottleneck is getting to be testing - I really need more people willing to try out the latest code and make sure it isn't going to eat anyone's data before I push it out for general consumption. I do a lot of testing myself already - honestly, that's where most of my time goes - but filesystems are complicated beasts, and bugs in filesystem code tend to have more severe consequences than in most other code.
Help with the test framework would also be useful, the coverage of the ...
2016-11-06 08:27:02 +0000 UTC
These patches haven't landed yet, and the numbers should be higher when I'm done - but the fsmark numbers are now looking really nice. Delete performance is massively improved, too.
Time to completion for fs_mark -v -n 200000 -s 4096 -k -S 1 -D 1000 -N 1000 -t 10:
bcachefs:
real 0m53.354s
user 0m21.063s
sys 3m17.084s
ext4:
real 2m6.306s
user 0m26.659s
sys 3m47.379s
xfs:
real 1m11.003s
user ...
2016-10-24 10:48:01 +0000 UTC
Tiering should finally be working with the last big batch of fixes I pushed.
Chris Halse Rogers (RAOF in the #bcache IRC channel) has been testing it. He has been seeing an intermittent deadlock while copying large amounts of data, which may or may not be tiering related: if anyone else hits it, I'd really appreciate it if you could grab backtraces. Run "echo t > /proc/sysrq-trigger" and then grab the full dmesg log - that should be enough to figure out what the deadlock is.
I haven...
2016-09-13 02:10:00 +0000 UTC
First off, a word about definitions. In bcachefs, tiering is caching by another name: storage devices can be assigned to different tiers, and we can use a faster tier to cache a slower tier.
In some other storage systems, tiering means a setup where data can be dynamically moved between different tiers, but it often implies that the migration of data between tiers happens slowly in the background (not as data is accessed), and a given chunk of data only lives in one tier at a time - that is, w...
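To make the "tiering is caching" definition concrete, here is a toy read path. It is a sketch under stated assumptions, not how bcachefs implements it: TieredStore is a hypothetical name, the tiers are plain dictionaries, and the sketch assumes data is copied into the faster tier when it is read, with the copy on the slower tier left in place.

```python
class TieredStore:
    """Toy two-tier store: a faster tier caching a slower one (hypothetical)."""

    def __init__(self, slow_tier):
        self.slow = slow_tier  # slower tier: holds every block
        self.fast = {}         # faster tier: holds cached copies only

    def read(self, block):
        """Return (data, tier) where tier names the tier that served the read."""
        if block in self.fast:
            return self.fast[block], "fast"
        data = self.slow[block]
        # Cache a copy on access; the slow-tier copy stays where it is,
        # so the block now lives on both tiers.
        self.fast[block] = data
        return data, "slow"


store = TieredStore({0: b"aaaa", 1: b"bbbb"})
first = store.read(0)
second = store.read(0)
print(first[1], second[1])
```

The first read is served from the slow tier and the second from the fast tier, which is the difference from a migration-style setup where a block has exactly one home at a time.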
2016-09-13 01:40:07 +0000 UTC
Just saw this really excellent article about disk encryption - this explains better than I could the issues with encryption at the block layer:
http://sockpuppet.org/blog/2014/04/30/you-dont-want-xts/
This also explains the motivation for doing encryption in bcachefs: with a copy on write filesystem, where we can take the needs of encryption into account when we're designing it and specif...
2016-09-06 04:21:00 +0000 UTC
- Encryption's mostly done, got some useful feedback from the design doc.
- Starting to work on multiple devices and replication again. Found some "there's no way this could have possibly worked" bugs with tiering - evidently I've neglected all the multiple device stuff for too long.
Hoping to have tiering ready for people to use before too long.
2016-09-05 02:40:02 +0000 UTC
Been studying random papers/RFCs/Dan Bernstein's code and figuring out the plan for adding encryption to bcachefs... doing crypto right is hard. In storage land, I'm not sure anyone really gets it right - if you're doing block storage (e.g. dm-crypt), or if you're adding encryption to an existing filesystem, you're kind of screwed since you have no place to stick a nonce. Unless I missed it when I was reading the code, ext4 doesn't even try to use block or file offset or anything - it's just AES...
2016-08-07 16:10:52 +0000 UTC
I'm kicking myself for not noticing this sooner (most likely I saw it months ago and then forgot about it because I'm terrible about taking notes...). Do not use bcachefs with compression enabled yet - if copygc ever has to run you'll very soon hit a BUG_ON().
The issue is that copygc will often have to split extents that it's rewriting: it's copying extents from various mostly empty buckets into new buckets, and very often the extents it's moving won't exactly fill up a new bucket - think bin...
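Why copying extents into new buckets forces splits can be seen in a small packing sketch. This is a toy model, not the copygc code: pack_extents is a hypothetical helper, sizes are in arbitrary units, and it ignores compression entirely.

```python
def pack_extents(extent_sizes, bucket_size):
    """Copy extents into fixed-size buckets in order, splitting any extent
    that does not fit in the space left in the current bucket.

    Returns a list of buckets, each a list of the piece sizes written to it.
    bucket_size must be positive.
    """
    buckets = [[]]
    free = bucket_size
    for size in extent_sizes:
        while size > 0:
            if free == 0:
                # Current bucket is full: start a new one.
                buckets.append([])
                free = bucket_size
            piece = min(size, free)
            buckets[-1].append(piece)
            free -= piece
            size -= piece
    return buckets


print(pack_extents([6, 5, 7], bucket_size=8))
```

With extents of 6, 5 and 7 units and buckets of 8, the second and third extents each straddle a bucket boundary and end up in two pieces.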
2016-08-06 05:12:49 +0000 UTC
This is my current project, so I thought I'd write something about it and how this area of bcachefs works.
So, for some background: every remotely modern filesystem has some sort of facility for what database people just call transactions - doing complex operations atomically, so that in the event of a crash the operation either happened completely or not at all - we're never left in a halfway state.
For example, creating a new file requires doing a few different things:
2016-08-02 12:40:59 +0000 UTC
The last bit - disk space accounting - is finished, so it should actually be useful now.
Please test it out - I'd like to hear how well it's working for people.
Currently lz4 and gzip are supported, and lz4 is the recommended option. I'd like to add more compression algorithms in the future - in particular, lzma for cold data would be really nifty.
What's still missing is interfaces/tooling: right now, the only way to see compressed and uncompressed numbers is via /sys/...
2016-08-02 10:39:12 +0000 UTC
Thank you so much for the support this month!
2016-07-31 23:59:00 +0000 UTC
Finally figured out how to make compressed disk usage accounting work. It's a surprisingly thorny issue - I'll have to write more about it later.
The TL;DR is - disk usage is only allowed to increase when you're getting a disk reservation (which is also where you'd get -ENOSPC). We have to be able to arbitrarily move data around (or rewrite fragmented data) without the amount of disk space used - however we define that number - increasing. But extents getting partially overwritten - in par...
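The rule stated in the TL;DR can be written down as a small invariant. This is only a sketch of that rule with hypothetical names (DiskAccounting, reserve, rewrite) and made-up units; it says nothing about how bcachefs actually counts compressed or partially overwritten extents.

```python
import errno


class DiskAccounting:
    """Toy accounting: usage may only grow through a reservation."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0

    def reserve(self, sectors):
        """The only place usage can increase, and so the only place
        that can fail with ENOSPC."""
        if self.used + sectors > self.capacity:
            raise OSError(errno.ENOSPC, "no space left on device")
        self.used += sectors

    def rewrite(self, old_sectors, new_sectors):
        """Moving or rewriting existing data must never increase usage."""
        if new_sectors > old_sectors:
            raise ValueError("rewrite would increase disk usage")
        self.used -= old_sectors - new_sectors


acct = DiskAccounting(capacity=100)
acct.reserve(60)
acct.rewrite(20, 16)  # e.g. rewriting fragmented data into less space
print(acct.used)
```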
2016-07-26 11:34:27 +0000 UTC