It is worth noting that the class of bugs described here (logic errors in highly concurrent state machines, incorrect hardware assumptions) wouldn't necessarily be caught by the borrow checker. Rust is fantastic for memory safety, but it will not stop you from misunderstanding the spec of a network card or writing a race condition in unsafe logic that interacts with DMA.
That said, if we eliminated the 70% of bugs that are memory safety issues, the signal-to-noise ratio for finding these deep logic bugs would improve dramatically. We spend so much time tracing segfaults that we miss the subtle corruption bugs.
While the bugs you describe are indeed things that aren't directly addressed by Rust's borrow checker, I think the article covers more ground than your comment implies.
For example, a significant portion (most?) of the article is simply analyzing the gathered data, like grouping bugs by subsystem:
Subsystem          Bug Count   Avg Lifetime
drivers/can              446   4.2 years
networking/sctp          279   4.0 years
networking/ipv4        1,661   3.6 years
usb                    2,505   3.5 years
tty                    1,033   3.5 years
netfilter              1,181   2.9 years
networking             6,079   2.9 years
memory                 2,459   1.8 years
gpu                    5,212   1.4 years
bpf                      959   1.1 years
Or by type:
Bug Type           Count   Avg Lifetime   Median
race-condition     1,188   5.1 years      2.6 years
integer-overflow     298   3.9 years      2.2 years
use-after-free     2,963   3.2 years      1.4 years
memory-leak        2,846   3.1 years      1.4 years
buffer-overflow      399   3.1 years      1.5 years
refcount           2,209   2.8 years      1.3 years
null-deref         4,931   2.2 years      0.7 years
deadlock           1,683   2.2 years      0.8 years
And the section describing common patterns for long-lived bugs (10+ years) lists the following:
> 1. Reference counting errors
> 2. Missing NULL checks after dereference
> 3. Integer overflow in size calculations
> 4. Race conditions in state machines
All of which cover more ground than listed in your comment.
Furthermore, the 19-year-old bug case study is a refcounting error not related to highly concurrent state machines or hardware assumptions.
It’s also worth noting that Rust doesn’t prevent integer overflow, and by default it doesn’t panic on it in release builds; it wraps silently. Instead, the safety model assumes you’ll catch the overflowed number at the point of use, e.g. via the bounds check when you use it to index something (a constant source of bugs in unsafe code, where those checks are absent).
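To make that concrete, here is a small sketch of those semantics (using the explicit `wrapping_add`/`checked_add` methods so the behavior is the same in debug and release builds), showing that the overflow itself is silent and the bounds check at the use site is where safe Rust actually catches it:

```rust
fn main() {
    let size: u32 = u32::MAX;

    // Explicit wrapping arithmetic: u32::MAX + 16 wraps around to 15.
    // Plain `size + 16` would do the same in a default release build
    // (overflow-checks off), but would panic in a debug build.
    let wrapped = size.wrapping_add(16);
    assert_eq!(wrapped, 15);

    // checked_add surfaces the overflow as None instead of hiding it.
    let checked = size.checked_add(16);
    assert_eq!(checked, None);

    // Safety is recovered at the use site: indexing in safe Rust is
    // bounds-checked, so the wrapped value can't silently read out of
    // bounds the way raw pointer arithmetic in unsafe code can.
    let buf = [0u8; 4];
    assert!(buf.get(wrapped as usize).is_none());

    println!("wrapped={wrapped}, checked={checked:?}");
}
```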
I’m bullish about Rust in the kernel, but it will not solve all of the kinds of race conditions you see in that kind of context.
It always surprised me how the top-of-the-line analyzers, whether commercial or OSS, never really implemented C-style reference count checking. Maybe someone out there has written something that works well, but I haven’t seen it.
Rust is not just about memory safety. It also has algebraic data types, RAII, and other features, which will greatly help in catching these kinds of silly logic bugs.
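As a quick illustration of the algebraic-data-types point (with a hypothetical `ConnState` type, not anything from the article): encoding a state machine as a sum type means every consumer must handle every state, so forgetting a state becomes a compile error rather than a latent logic bug.

```rust
// Hypothetical connection state machine encoded as a sum type.
enum ConnState {
    Idle,
    Connecting { retries: u8 },
    Established,
    Closed,
}

fn can_send(s: &ConnState) -> bool {
    // Exhaustive match: adding a new variant to ConnState makes this
    // function fail to compile until the new state is handled.
    match s {
        ConnState::Established => true,
        ConnState::Idle | ConnState::Connecting { .. } | ConnState::Closed => false,
    }
}

fn main() {
    assert!(can_send(&ConnState::Established));
    assert!(!can_send(&ConnState::Connecting { retries: 3 }));
    assert!(!can_send(&ConnState::Idle));
    assert!(!can_send(&ConnState::Closed));
    println!("all states handled");
}
```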
Is this an irrational fear, I wonder? Reminds me of methods used in the political discourse.
> Is this an irrational fear, I wonder? Reminds me of methods used in the political discourse.
In a sad sort of way, I think it's hilarious that HN users have been so completely conditioned to expect Rust evangelism any time a topic like this comes up that they wanted to get ahead of it.
Not sure who it says more about, but it sure does say a whole lot.
It's worth noting that if you write memory safe code but mis-program a DMA transfer, or trigger a bug in a PCIe device, it's possible for the hardware to give you memory-safety problems by splatting invalid data over a region that's supposed to contain something else.
In my experience it's closer to 5%.
Basically, 70% of high-severity bugs are memory-safety bugs.
[1] https://www.chromium.org/Home/chromium-security/memory-safet...
At one point, I found serious bugs (crashing our product) that had existed for over 15 years. (And that was 10 years ago).
Rust may not be perfect but it gives me hope that some classes of stupidity will either be avoided or made visible (like every function being unsafe because the author was a complete idiot).
You are right about that, but even just using sum types eliminates a lot of logic errors, too.
The Rust phantom zealotry is unfortunately real.
[1] Aha, but the chilling effect of dismissing RIR comments before they are even posted...
Rewriting it all in Rust is extremely expensive, so it won't be done (soon).
Our bug dataset was much smaller, though, as we unfortunately had to pinpoint every bug-introducing commit ourselves. It's nice to see the Linux project uses proper "Fixes: " tags.
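For anyone curious what those tags buy you, here is a tiny sketch (a hypothetical helper, not the article's actual tooling) of the first step in computing a bug's lifetime: extracting the introducing commit's hash from a "Fixes:" trailer so its date can be compared with the fix's date.

```rust
// Extract the commit hash from a "Fixes: <hash> (\"subject\")" trailer
// in a commit message, if one is present.
fn fixes_hash(commit_msg: &str) -> Option<&str> {
    commit_msg
        .lines()
        .find_map(|line| line.trim().strip_prefix("Fixes: "))
        .and_then(|rest| rest.split_whitespace().next())
}

fn main() {
    let msg = "net: fix refcount leak\n\n\
               Fixes: a1b2c3d4e5f6 (\"net: add foo\")\n\
               Signed-off-by: Someone <someone@example.com>";
    assert_eq!(fixes_hash(msg), Some("a1b2c3d4e5f6"));
    assert_eq!(fixes_hash("no trailer here"), None);
    println!("{:?}", fixes_hash(msg));
}
```

With the hash in hand, `git show -s --format=%ct <hash>` gives the introduction timestamp, and the lifetime is just the difference to the fix commit's timestamp.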
Sort of. They often don't.
IMHO the fact that a bug hides for years can also be an indication that the bug had low severity/low priority, and therefore that the overall quality is very good. Unless the time represents how long it takes to reproduce and resolve a known bug, but in that case I would not say that the "bug hides" in the kernel.
Not really true. A lot of very severe bugs have lurked for years and even decades. Heartbleed comes to mind.
The reason these bugs often lurk for so long is because they very often don't cause a panic, which is why they can be really tricky to find.
For example, use-after-free bugs are really dangerous. However, in most code, it's a pretty safe bet that nothing dangerous happens when the use-after-free is triggered, especially if the pointer is used shortly after the free and dies shortly after it. In many cases, the erroneous read or write doesn't break anything.
The same is true of race conditions (which are some of the longest-lived bugs). In a lot of cases, you won't know you have a race condition because contention on the lock is low, so the race isn't exposed. And even when it is, it can be very tricky to reproduce, as the race isn't likely to play out the same way twice.
I don’t know much about Heartbleed, but Wikipedia says:
> Heartbleed is a security bug… It was introduced into the software in 2012 and publicly disclosed in April 2014.
Two years doesn’t sound like “years or even decades” to me? But again, I don’t know much about Heartbleed so I may be missing something. It does say it was also patched in 2014, not just discovered then.
It doesn't seem to indicate that. It indicates the bug just isn't in tested code or isn't reached often. It could still be a very severe bug.
The issue with longer lived bugs is that someone could have been leveraging it for longer.
One criticism of Rust (and, no, I'm not saying "rewrite it in Rust", to be clear) is that the borrow checker can be hard to use whereas many C++ engineers (in particular, for some reason) seem to argue that it's easier to write in C++. I have two things to say about that:
1. It's not easier in C++. Nothing is. C++ simply allows you to make mistakes without telling you. Getting things correct in C++ is just as difficult as in any other language, if not more so due to the language complexity; and
2. The Rust borrow checker isn't hard or difficult to use. What you're doing is hard and difficult to do correctly.
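A small illustration of that point: the unsynchronized version of this shared-counter pattern simply doesn't compile in safe Rust (a `&mut` counter can't be captured by multiple threads), so the version the borrow checker accepts is the one that's actually correct under concurrency. A sketch, not tied to any kernel code:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Shared ownership (Arc) plus a lock (Mutex) is what the compiler
    // demands before it will let eight threads touch one counter.
    let counter = Arc::new(Mutex::new(0u64));

    let mut handles = Vec::new();
    for _ in 0..8 {
        let c = Arc::clone(&counter);
        handles.push(thread::spawn(move || {
            for _ in 0..1000 {
                *c.lock().unwrap() += 1;
            }
        }));
    }
    for h in handles {
        h.join().unwrap();
    }

    // Deterministic: the lost-update race that would make this flaky
    // in C doesn't compile in safe Rust in the first place.
    assert_eq!(*counter.lock().unwrap(), 8000);
    println!("{}", counter.lock().unwrap());
}
```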
This is why I favor cooperative multitasking and using battle-tested concurrency abstractions whenever possible. For example, the cooperative async-await of Hack, and the model of a single thread responding to a request and then discarding everything in PHP/Hack, is virtually ideal (IMHO) for serving Web traffic.
I remember reading about Google's work on various C++ tooling including valgrind and that they exposed concurrency bugs in their own code that had lain dormant for up to a decade. That's Google with thousands of engineers and some very talented engineers at that.
There are entire classes of structures that, no, aren't hard to do properly, but that the borrow checker makes artificially hard due to design limitations that are known to be sub-optimal.
No, doubly linked lists and partially editable data structures aren't inherently hard. It's a Rust limitation that a piece of code can't take enough ownership of them to edit them safely.
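For illustration, here is the kind of workaround safe Rust forces for a doubly linked node (a hypothetical `Node` type): shared ownership via `Rc`, interior mutability via `RefCell`, and a `Weak` back-pointer to break the ownership cycle. The standard library's own `LinkedList` sidesteps all of this by using `unsafe` internally.

```rust
use std::cell::RefCell;
use std::rc::{Rc, Weak};

// A doubly linked node in safe Rust: no single piece of code can own
// both directions outright, so ownership is shared and back-links are
// weak to avoid a reference-count cycle (which would leak).
struct Node {
    value: i32,
    next: Option<Rc<RefCell<Node>>>,
    prev: Option<Weak<RefCell<Node>>>,
}

fn main() {
    let a = Rc::new(RefCell::new(Node { value: 1, next: None, prev: None }));
    let b = Rc::new(RefCell::new(Node { value: 2, next: None, prev: None }));

    // Link the two nodes in both directions.
    a.borrow_mut().next = Some(Rc::clone(&b));
    b.borrow_mut().prev = Some(Rc::downgrade(&a));

    // Walk forward through the strong pointer...
    let fwd = Rc::clone(a.borrow().next.as_ref().unwrap());
    assert_eq!(fwd.borrow().value, 2);

    // ...and backward by upgrading the weak pointer.
    let back = b.borrow().prev.as_ref().unwrap().upgrade().unwrap();
    assert_eq!(back.borrow().value, 1);

    println!("forward={}, back={}", fwd.borrow().value, back.borrow().value);
}
```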
Just worth noting that it is a significant extrapolation from only "28%" of fix commits to assume that the average is 2 years.
“There’s a crash while using this config file.” Something more complex than that, but ultimately a crash of some kind.
Years later, like 20 years later, the bug was closed. You see, they re-wrote the config parser in Rust, and now this is fixed.
That’s cool but it’s not the part I remember. The part I always think about is, imagine responding to the bug right after it was opened with “sorry, we need to go off and write our own programming language before this bug is fixed. Don’t worry, we’ll be back, it’s just gonna take some time.”
Nobody would believe you. But yet, it’s what happened.
The anti-Firefox mob really is striving to take shots at it.
The point of the article isn't a criticism of Linux, but an analysis that leads to more productive code review.
It's not uncommon for the bugs they found to be rediscovered 6-7 years later.
1. Tons of bugs are reported upstream by grsecurity historically.
2. Tons of critical security mitigations in the kernel were outright invented by that team. ASLR, SMAP, SMEP, NX, etc.
3. They were completely FOSS until very recently.
4. They have always maintained that they are entirely willing to upstream patches but that it's a lot of work and would require funding. Upstream has always been extremely hostile towards attempts to take small pieces of Grsecurity and upstream them.
Profiting from selling their patchset is not the whole story, though. grsec was public and free for a long time and there were many effects at play preventing the kernel from adopting it.
Here is a device driver bug that was around 11 years.
https://www.bitdefender.com/en-us/blog/hotforsecurity/google...
One bug is all it takes to compromise the entire system.
The monolithic UNIX kernel was a good design in the 60s; Today, we should know better[0][1].
Say the USB system runs in its own isolated process. Great, but if someone pwns the USB process they can change disk contents, intercept and inject keystrokes, etc. You can usually leverage that into a whole system compromise.
Same with most subsystems: GPU, network, file system process compromises are all easily leveraged to pwn the whole system.
So is Mach, by the way, if you can afford the microkernel performance overhead.
Impressive results on the model, I'm surprised they improved it with very simple heuristics. Hopefully this tool will be made available to the kernel developers and integrated to the workflow.
My Pixel 8 runs a stable minor release of the 6.1 kernel, which was released more than 4 years ago. Yes, fixes get backported to it, but the new features in 6.2->6.19 stay unused on that hardware. All the major distros suffer from the same problem; most people are not running the latest kernels in production.
Most hyperscalers are running old kernel versions to which they backport fixes. If you go to Linux conferences, you hear folks from big companies mentioning 4.xx, even 3.xx kernels, in 2025.
I have a server which has many peripherals and multiple GPUs. Now, I can use vfio and vfio-pci to memory-map and access their registers in user space. My question is, how could I start with kernel driver development? And I specifically mean the dev setup.
Would it be a good idea to use vfio with or without a vm to write and test drivers? How to best debug, reload and test changing some code of an existing driver?
John Gall, The Systems Bible
Am I the only unreasonable maniac who wants a very long-term stable, seL4-like capability-based, ubiquitous, formally-verified μkernel that rarely/never crashes completely* because drivers are just partially-elevated programs sprinkled with transaction guards and rollback code for critical multiple resource access coordination patterns? (I miss hacking on MINIX 2.)
* And never need to reboot or interrupt server/user desktop activities because the core μkernel basically never changes since it's tiny and proven correct.
On a related note, I'm seeing a correlation between "level of hoopla" and "level of attention/maintenance." While it's hard to distinguish that correlation from "level of use," the fact that CAN is so far down the list suggests to me that hoopla matters; it's everywhere but nobody talks about it. If a kernel bug takes down someone's datacenter, boy are we gonna hear about it. But if a kernel bug makes a DeviceNet widget freak out in a factory somewhere? Probably not going to make the front page of HN, let alone CNN.
A CAN with 10,000 machines total and relatively fixed applications is either going to trigger the bug right off the bat and then work around it, or trigger the bug so rarely it won't be recognized as a kernel issue.
General purpose systems running millions and millions of units with different workloads are an evolutionary breeding ground for finding bugs and exploits.
We call AI models "open source" if you can download the binary and not the source. Why not programs?
Who's "we"? There's been quite a lot of pushback on this naming scheme from the OSS community, with many preferring the term "open weights".
the weights of a model aren't equivalent to the binary output of source code, no matter how you try to stretch the metaphor.
> why not
Because we aren't beholden to change all definitions and concepts because some guy at some corp said so.
We want human-readable, comprehensible, reproducible, and maintainable sources at minimum when we say open source.