Repeat after me: you do not fuck around in filesystem land
Added 2024-09-05 13:19:57 +0000 UTC
So, kernel programming is known for being a high stakes, high pressure, toxic environment to work in. There's a reason for that: the consequences for screwing up are really high, and you can make your co-workers and a lot of users really miserable if you screw things up.
I've seen plenty of bugs that took weeks or months to track down, and a bug that takes down the entire machine wastes a lot of people's time - not to mention that if you're writing code that's included in the kernel, and you get it wrong, you have the opportunity to crash a lot of machines.
Filesystem land is a step up from that.
In kernel land, generally the worst you can do is crash the machine (or introduce a really nasty security bug, but - let's not talk about those).
In filesystem land, we're responsible for people's data, and if we screw up, people's data and machines go inaccessible until we fix it - or worse, data gets lost. That's a really, really bad day.
Recently, there's been an absolute ton of drama out of Debian land - and elsewhere, but the Debian kerfuffle was the really notable one. A blog post, which I will not link, went on at great length about the difficulties and drama of attempting to package bcachefs-tools for Debian; high on drama, low on technical content.
And in the Debian bug tracker where they're considering a new maintainer, it's a lot of the same thing - a lot of talking about who's at fault, nothing to say about the technical issues that caused problems the last time around. (And as it left people unable to access their filesystems, because they were stuck on an old broken version of -tools, that needs to be talked about.)
A good engineering culture is one where we can talk about what went wrong without fingerpointing, without excessively trying to assign blame, and where we all take responsibility for our mistakes and look for things we can do better. Post-mortems, as some call them, can be a good way of doing things. A good engineering culture is where we can all maintain our focus on the technical issues at hand; that's why we're here.
So.
Let's all have an attitude of learning what we can, and trying to do better.
Comments
Just a quick note, I rarely comment, but I feel this is important. Is any publicity good publicity? I doubt it.
> From: Kent Overstreet
> Date: Sat, 5 Oct 2024 18:54:19 -0400
> I've got a team lined up, just secured funding to start paying them and it looks like I'm about to secure more.
I feel like there is no way that sending that response to Linus was helpful in securing funding. Sometimes it is both less work and more helpful to say nothing. Please think a few days and run the next reply if any via your friends before you send any more stuff for the good of the project. Or maybe just take 1 week holiday before you continue. Thanks for considering. Cheers
Mark. K.
2024-10-08 20:56:31 +0000 UTC
Hi Kent, thanks for the wise words. Having read only the superficial first post of both sides, I can see that it looks like bridges were burned. However, I think it is very valuable to come to some kind of solution for the users out there, maybe not now, but maybe in the next years. The users (well, the majority of users running the majority of the infrastructure our society runs on) value stability over everything else. I'm not running Debian oldoldstable because I hate innovation, I'm running oldoldstable because migrating my stuff from one Debian release to the next takes weeks of effort and there are only so many weeks in a year, so upgrading every two years to keep up with Debian's "frantic" release schedule is not a thing. I understand that that means I won't be using bcachefs this decade, probably, and that is fine, I'm patient. But it would be great to use bcachefs in production some time in 2032 or so. (-: Cheers, Mika
Mika Pflüger
2024-09-06 14:24:13 +0000 UTC
Secondarily, we need to make sure we have a package maintainer who can do updates in a timely manner, but if the unbundling requirement is sorted I don't expect that to be a real problem. There might need to be a discussion down the line about how updates will work for stable (I'm not backporting bugfixes any time soon, I don't have cycles for that, so getting fixes and staying in sync with the kernel means stable has to track the latest released version. But that's far down the line, I don't think anyone's thinking about stable yet).
Kent Overstreet
2024-09-06 13:32:34 +0000 UTC
Yeah, certainly. We just need to ask up front if the powers-that-be will be willing to relax the Rust dependency unbundling requirement (and make sure that discussion stays on track, and doesn't veer off into the drama it did before). If they say no we don't need to push further - we can support Debian users with a ppa.
Kent Overstreet
2024-09-06 13:29:42 +0000 UTC
I wonder if I can help with the Debian conversation, let me know if you would be open to that.
Martin Stadler
2024-09-06 12:58:42 +0000 UTC
nah, I'm not talking about Linus. Linus and I get annoyed with each other, but I respect the hell out of him and I hope he thinks the same of me; there's a perspective that you only get by being the responsible person for a huge project for a long time, and I think we share some of that. What he and I disagree on is just judgement calls; the kernel is in a more mature place than bcachefs is, so of course we're going to have differences in how we want to do things. The Debian people, on the other hand... let me just stop myself right there, because I don't want to go bashing, but I think there's things about keeping the focus on the technical that they would do well to keep in mind...
Kent Overstreet
2024-09-06 04:33:53 +0000 UTC
Hello Kent, I don't comment much but I've been supporting on Patreon for 4.5 years. I'm hoping you can clarify, I'm confused about this post. This reads to me like extreme conservatism, like you're saying that filesystem development is exceptionally difficult and you have to be exceptionally careful and it's an exceptionally large responsibility. So it seems like you're expanding on this https://lore.kernel.org/lkml/bczhy3gwlps24w3jwhpztzuvno7uk7vjjk5ouponvar5qzs3ye@5fckvo2xa5cz/ Reading through some discussion (e.g. on hackernews) it seems like there is some disagreement, especially from Linus, as to size of patches, development speed, etc. For instance in that reply, I apologize for being direct, but Linus's message https://lore.kernel.org/lkml/CAHk-=wjwn-YAJpSNo57+BB10fZjsG6OYuoL0XToaYwyz4fi1MA@mail.gmail.com/ was going in detail about fixing bugs in stable, what's a necessary "small fix" versus development, how to manage the risk, and your reply basically blew him off, it didn't substantively reply to anything he said, just reiterating that you know what you're doing, and Linus didn't reply further. You're saying on here, LKML, hackernews, etc that you develop very conservatively, that you have manual and automated testing, etc, but it seems as though there's skepticism. What do you think about this? I am excited for bcachefs to see future success but it's scary for Linus to be displeased.
Lurf Jurv
2024-09-06 04:28:57 +0000 UTC
Fairy nuff. But having seen how things work in kernel development back in the day, I can't say I am terribly surprised. Yeah, if your code relies on someone else's fixes, then you either wait for them to or help them to. Anyway, I am just a random opinion on the internet. I look forward to seeing bcachefs land in unstable or testing some day.
veritanuda
2024-09-05 19:57:08 +0000 UTC
Rust is a major improvement over C for writing reliable code, so - Debian's going to have to figure this out. The cargo packaging model (not npm; Cargo does it better) is also an improvement for our ability to debug and QA - it makes it possible to bisect for regressions caused by dependency updates. If you don't want to do drama, try digging a little more on these issues instead of just spouting the "common wisdom". The common wisdom is not always right.
Kent Overstreet
2024-09-05 19:38:52 +0000 UTC
You've been misinformed. On the last pull request that Linus didn't like, the "other subsystem" was also code that I wrote and maintain. Back in the day, Linus and others were rewriting core mm code in RC kernels on more than one occasion. bcachefs is still new, so of course there's still a rush to get fixes out; issues are being found and fixed at a pretty high rate.
Kent Overstreet
2024-09-05 19:37:11 +0000 UTC
Only that large patch sets, especially those that touch other systems, submitted without consultation with those other systems, are not good practice. It shows a state of development that is very much in flux, or that showstopper bugs have been exposed which should have been caught earlier in testing. Kernel filesystem development is not a race, it is a pilgrimage. There is no prize for being first, and there is no stage when you can say it is completely finished. So long as it is stable and becomes as easy (or tolerable) to maintain as other fs drivers, then it can be considered one of the team. FWIW it took a long time for ZFS to reach there, and so here we are today. I consider bcachefs very much beta at this stage, and a LOT more testing in the real world is needed before it can become anyone's default.
veritanuda
2024-09-05 19:21:44 +0000 UTC
What's your point?
Kent Overstreet
2024-09-05 17:43:08 +0000 UTC
The Kernel dev path is necessarily brutal because there are a lot of opportunities to really screw things up. As for Debian, they are, rightly, paranoid about anything that messes with their stability. Something, I for one, appreciate. So building user space tools in Rust, which is a moving target, much like npm, was probably not the most universal choice. Especially as rust-coreutils is not even in Stable atm. I don't do drama and prefer to stick to facts and practicalities. If the maintainership of the user space tools has to use versions of libs that are not currently in any version of Debian, then it can become unnecessarily tiresome. Have to say, I see their point.
veritanuda
2024-09-05 15:48:02 +0000 UTC
Kent, I have been supporting your efforts for a very long time now, but as a very long in the tooth Linux user I have to say, filesystem debugging should not be done in live kernels. Simple as. I am painfully aware of the move from ext to 2 to 3 and then 4 and having lost data migrating systems across those machines.
veritanuda
2024-09-05 15:21:10 +0000 UTC
Well, "toxic" has a lot of meaning behind it. To the outside world, the way we communicate in kernel land often seems toxic, to the point of terrifying - but that is a nearly inevitable consequence of the kind of work we do. Because there are things that just _have_ to be done right, and no one has a ton of free time on their hands, when someone is screwing up you need to be able to get the point across quickly - this isn't kindergarten. And it's a big community, where we don't know everyone well and we don't know the standards everyone has for their own work - when someone is starting to do something dangerous, yelling and shouting can be 100% necessary. It's not ideal, but the kind of work environment where you can take on dangerous tasks without the yelling and shouting takes building up a lot of trust and camaraderie, and the priority has to be on making sure the work gets done right. But - the flip side of all that "toxic" stuff, is that what the outside world views as toxic is done with a purpose, to keep people's attention focused on the task at hand, on the technical issues that matter. Viewed that way, would you really mind getting shouted at on occasion if it meant that you avoid wasting a ton of time going down blind alleys and chasing stupid bugs? There's something to be said for not taking things personally when tensions get high; that's professionalism.
Kent Overstreet
2024-09-05 14:17:05 +0000 UTC
> toxic environment to work in. There's a reason for that: the consequences for screwing up are really high
"Reason" as in that's a factor for how the situation ended up toxic. Not as toxicity is necessary due to the high stakes. It becomes a source of bad work done. Not even accounting for people distancing themselves from the given project area and fueling a lack of workforce which won't help for having things done correctly.
> A good engineering culture is one where we can talk about what went wrong without fingerpointing, without excessively trying to assign blame, and where we all take responsibilities for our mistakes and look for things we can do better. Post-mortems, as some call them, can be a good way of doing things. A good engineering culture is where we can all maintain our focus on the technical issues at hand; that's why we're here.
+100 Ok great so that where you were going. Thanks to the people doing their best and sorry that things turned out toxic
tuxayo
2024-09-05 14:01:29 +0000 UTC