It is worth noting that the class of bugs described here (logic errors in highly concurrent state machines, incorrect hardware assumptions) wouldn't necessarily be caught by the borrow checker. Rust is fantastic for memory safety, but it will not stop you from misunderstanding the spec of a network card or writing a race condition in unsafe logic that interacts with DMA.
That said, if we eliminated the 70% of bugs that are memory safety issues, the signal-to-noise ratio for finding these deep logic bugs would improve dramatically. We spend so much time tracing segfaults that we miss the subtle corruption bugs.
While the bugs you describe are indeed things that aren't directly addressed by Rust's borrow checker, I think the article covers more ground than your comment implies.
For example, a significant portion (most?) of the article is simply analyzing the gathered data, like grouping bugs by subsystem:
Subsystem          Bug Count   Avg Lifetime
drivers/can              446   4.2 years
networking/sctp          279   4.0 years
networking/ipv4        1,661   3.6 years
usb                    2,505   3.5 years
tty                    1,033   3.5 years
netfilter              1,181   2.9 years
networking             6,079   2.9 years
memory                 2,459   1.8 years
gpu                    5,212   1.4 years
bpf                      959   1.1 years
Or by type:
Bug Type           Count   Avg Lifetime   Median
race-condition     1,188   5.1 years      2.6 years
integer-overflow     298   3.9 years      2.2 years
use-after-free     2,963   3.2 years      1.4 years
memory-leak        2,846   3.1 years      1.4 years
buffer-overflow      399   3.1 years      1.5 years
refcount           2,209   2.8 years      1.3 years
null-deref         4,931   2.2 years      0.7 years
deadlock           1,683   2.2 years      0.8 years
And the section describing common patterns for long-lived bugs (10+ years) lists the following:
> 1. Reference counting errors
> 2. Missing NULL checks after dereference
> 3. Integer overflow in size calculations
> 4. Race conditions in state machines
All of which cover more ground than listed in your comment.
Furthermore, the 19-year-old bug case study is a refcounting error not related to highly concurrent state machines or hardware assumptions.
It’s also worth noting that Rust doesn’t prevent integer overflow, and by default it doesn’t panic on it in release builds; it wraps silently. Instead, the safety model assumes you’ll catch the overflowed number at the point of use, e.g. via the bounds check when you use it to index something (a constant source of bugs in unsafe code, where those checks are absent).
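To make that concrete, here is a small sketch of those semantics (using the explicit `wrapping_add`/`checked_add` methods so the behavior is the same in debug and release builds), showing that the overflow itself is silent and the bounds check at the use site is where safe Rust actually catches it:

```rust
fn main() {
    let size: u32 = u32::MAX;

    // Explicit wrapping arithmetic: u32::MAX + 16 wraps around to 15.
    // Plain `size + 16` would do the same in a default release build
    // (overflow-checks off), but would panic in a debug build.
    let wrapped = size.wrapping_add(16);
    assert_eq!(wrapped, 15);

    // checked_add surfaces the overflow as None instead of hiding it.
    let checked = size.checked_add(16);
    assert_eq!(checked, None);

    // Safety is recovered at the use site: indexing in safe Rust is
    // bounds-checked, so the wrapped value can't silently read out of
    // bounds the way raw pointer arithmetic in unsafe code can.
    let buf = [0u8; 4];
    assert!(buf.get(wrapped as usize).is_none());

    println!("wrapped={wrapped}, checked={checked:?}");
}
```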
I’m bullish about Rust in the kernel, but it will not solve all of the kinds of race conditions you see in that kind of context.
It always surprised me how the top-of-the-line analyzers, whether commercial or OSS, never really implemented C-style reference count checking. Maybe someone out there has written something that works well, but I haven’t seen it.
Rust is not just about memory safety. It also has algebraic data types, RAII, and other features, which will greatly help in catching these kinds of silly logic bugs.
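As a quick illustration of the algebraic-data-types point (with a hypothetical `ConnState` type, not anything from the article): encoding a state machine as a sum type means every consumer must handle every state, so forgetting a state becomes a compile error rather than a latent logic bug.

```rust
// Hypothetical connection state machine encoded as a sum type.
enum ConnState {
    Idle,
    Connecting { retries: u8 },
    Established,
    Closed,
}

fn can_send(s: &ConnState) -> bool {
    // Exhaustive match: adding a new variant to ConnState makes this
    // function fail to compile until the new state is handled.
    match s {
        ConnState::Established => true,
        ConnState::Idle | ConnState::Connecting { .. } | ConnState::Closed => false,
    }
}

fn main() {
    assert!(can_send(&ConnState::Established));
    assert!(!can_send(&ConnState::Connecting { retries: 3 }));
    assert!(!can_send(&ConnState::Idle));
    assert!(!can_send(&ConnState::Closed));
    println!("all states handled");
}
```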
Is this an irrational fear, I wonder? Reminds me of methods used in the political discourse.
> Is this an irrational fear, I wonder? Reminds me of methods used in the political discourse.
In a sad sort of way, I think it's hilarious that HN users have been so completely conditioned to expect Rust evangelism any time a topic like this comes up that they wanted to get ahead of it.
Not sure who it says more about, but it sure does say a whole lot.
It's worth noting that if you write memory safe code but mis-program a DMA transfer, or trigger a bug in a PCIe device, it's possible for the hardware to give you memory-safety problems by splatting invalid data over a region that's supposed to contain something else.
In my experience it's closer to 5%.
Basically, 70% of high-severity bugs are memory-safety bugs.
[1] https://www.chromium.org/Home/chromium-security/memory-safet...
At one point, I found serious bugs (crashing our product) that had existed for over 15 years. (And that was 10 years ago).
Rust may not be perfect but it gives me hope that some classes of stupidity will either be avoided or made visible (like every function being unsafe because the author was a complete idiot).
You are right about that, but even just using sum types eliminates a lot of logic errors, too.
The Rust phantom zealotry is unfortunately real.
[1] Aha, but the chilling effect of dismissing RIR comments before they are even posted...
Rewriting it all in Rust is extremely expensive, so it won't be done (soon).
Our bug dataset was much smaller, though, as we unfortunately had to pinpoint every bug-introducing commit ourselves. It's nice to see the Linux project uses proper "Fixes: " tags.
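For anyone curious what those tags buy you, here is a tiny sketch (a hypothetical helper, not the article's actual tooling) of the first step in computing a bug's lifetime: extracting the introducing commit's hash from a "Fixes:" trailer so its date can be compared with the fix's date.

```rust
// Extract the commit hash from a "Fixes: <hash> (\"subject\")" trailer
// in a commit message, if one is present.
fn fixes_hash(commit_msg: &str) -> Option<&str> {
    commit_msg
        .lines()
        .find_map(|line| line.trim().strip_prefix("Fixes: "))
        .and_then(|rest| rest.split_whitespace().next())
}

fn main() {
    let msg = "net: fix refcount leak\n\n\
               Fixes: a1b2c3d4e5f6 (\"net: add foo\")\n\
               Signed-off-by: Someone <someone@example.com>";
    assert_eq!(fixes_hash(msg), Some("a1b2c3d4e5f6"));
    assert_eq!(fixes_hash("no trailer here"), None);
    println!("{:?}", fixes_hash(msg));
}
```

With the hash in hand, `git show -s --format=%ct <hash>` gives the introduction timestamp, and the lifetime is just the difference to the fix commit's timestamp.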
Sort of. They often don't.
IMHO the fact that a bug hides for years can also be an indication that the bug had low severity/low priority, and therefore that the overall quality is very good. Unless the time represents how long it takes to reproduce and resolve a known bug, but in that case I would not say that the "bug hides" in the kernel.
Not really true. A lot of very severe bugs have lurked for years and even decades. Heartbleed comes to mind.
The reason these bugs often lurk for so long is because they very often don't cause a panic, which is why they can be really tricky to find.
For example, use-after-free bugs are really dangerous. However, in most code, it's a pretty safe bet that nothing dangerous happens when the use-after-free is triggered, especially if the pointer is used shortly after the free and dies shortly after it. In many cases, the erroneous read or write doesn't break anything.
The same is true of race conditions (which are some of the longest-lived bugs). In a lot of cases, you won't know you have a race condition because contention on the lock is low, so the race isn't exposed. And even when it is, it can be very tricky to reproduce, as the race isn't likely to play out the same way twice.
I don’t know much about Heartbleed, but Wikipedia says:
> Heartbleed is a security bug… It was introduced into the software in 2012 and publicly disclosed in April 2014.
Two years doesn’t sound like “years or even decades” to me? But again, I don’t know much about Heartbleed so I may be missing something. It does say it was also patched in 2014, not just discovered then.
It doesn't seem to indicate that. It indicates the bug just isn't in tested code or isn't reached often. It could still be a very severe bug.
The issue with longer lived bugs is that someone could have been leveraging it for longer.
One criticism of Rust (and, no, I'm not saying "rewrite it in Rust", to be clear) is that the borrow checker can be hard to use whereas many C++ engineers (in particular, for some reason) seem to argue that it's easier to write in C++. I have two things to say about that:
1. It's not easier in C++. Nothing is. C++ simply allows you to make mistakes without telling you. Getting things correct in C++ is just as difficult as in any other language, if not more so due to the language complexity; and
2. The Rust borrow checker isn't hard or difficult to use. What you're doing is hard and difficult to do correctly.
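A small illustration of that point: the unsynchronized version of this shared-counter pattern simply doesn't compile in safe Rust (a `&mut` counter can't be captured by multiple threads), so the version the borrow checker accepts is the one that's actually correct under concurrency. A sketch, not tied to any kernel code:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Shared ownership (Arc) plus a lock (Mutex) is what the compiler
    // demands before it will let eight threads touch one counter.
    let counter = Arc::new(Mutex::new(0u64));

    let mut handles = Vec::new();
    for _ in 0..8 {
        let c = Arc::clone(&counter);
        handles.push(thread::spawn(move || {
            for _ in 0..1000 {
                *c.lock().unwrap() += 1;
            }
        }));
    }
    for h in handles {
        h.join().unwrap();
    }

    // Deterministic: the lost-update race that would make this flaky
    // in C doesn't compile in safe Rust in the first place.
    assert_eq!(*counter.lock().unwrap(), 8000);
    println!("{}", counter.lock().unwrap());
}
```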
This is why I favor cooperative multitasking and using battle-tested concurrency abstractions whenever possible. For example, the cooperative async-await of Hack, and the model of a single thread responding to a request and then discarding everything in PHP/Hack, is virtually ideal (IMHO) for serving Web traffic.
I remember reading about Google's work on various C++ tooling including valgrind and that they exposed concurrency bugs in their own code that had lain dormant for up to a decade. That's Google with thousands of engineers and some very talented engineers at that.
There are entire classes of structures that, no, aren't hard to do properly, but that the borrow checker makes artificially hard due to design limitations that are known to be sub-optimal.
No, doubly linked lists and partially editable data structures aren't inherently hard. It's a Rust limitation that a piece of code can't take enough ownership of them to edit them safely.
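For illustration, here is the kind of workaround safe Rust forces for a doubly linked node (a hypothetical `Node` type): shared ownership via `Rc`, interior mutability via `RefCell`, and a `Weak` back-pointer to break the ownership cycle. The standard library's own `LinkedList` sidesteps all of this by using `unsafe` internally.

```rust
use std::cell::RefCell;
use std::rc::{Rc, Weak};

// A doubly linked node in safe Rust: no single piece of code can own
// both directions outright, so ownership is shared and back-links are
// weak to avoid a reference-count cycle (which would leak).
struct Node {
    value: i32,
    next: Option<Rc<RefCell<Node>>>,
    prev: Option<Weak<RefCell<Node>>>,
}

fn main() {
    let a = Rc::new(RefCell::new(Node { value: 1, next: None, prev: None }));
    let b = Rc::new(RefCell::new(Node { value: 2, next: None, prev: None }));

    // Link the two nodes in both directions.
    a.borrow_mut().next = Some(Rc::clone(&b));
    b.borrow_mut().prev = Some(Rc::downgrade(&a));

    // Walk forward through the strong pointer...
    let fwd = Rc::clone(a.borrow().next.as_ref().unwrap());
    assert_eq!(fwd.borrow().value, 2);

    // ...and backward by upgrading the weak pointer.
    let back = b.borrow().prev.as_ref().unwrap().upgrade().unwrap();
    assert_eq!(back.borrow().value, 1);

    println!("forward={}, back={}", fwd.borrow().value, back.borrow().value);
}
```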
Just worth noting that it is a significant extrapolation from only "28%" of fix commits to assume that the average is 2 years.
“There’s a crash while using this config file.” Something more complex than that, but ultimately a crash of some kind.
Years later, like 20 years later, the bug was closed. You see, they re-wrote the config parser in Rust, and now this is fixed.
That’s cool but it’s not the part I remember. The part I always think about is, imagine responding to the bug right after it was opened with “sorry, we need to go off and write our own programming language before this bug is fixed. Don’t worry, we’ll be back, it’s just gonna take some time.”
Nobody would believe you. But yet, it’s what happened.
The anti-Firefox mob really is striving to take shots at it.
The point of the article isn't a criticism of Linux, but an analysis that leads to more productive code review.
It's not uncommon for the bugs they found to be rediscovered 6-7 years later.
1. Tons of bugs are reported upstream by grsecurity historically.
2. Tons of critical security mitigations in the kernel were outright invented by that team. ASLR, SMAP, SMEP, NX, etc.
3. They were completely FOSS until very recently.
4. They have always maintained that they are entirely willing to upstream patches but that it's a lot of work and would require funding. Upstream has always been extremely hostile towards attempts to take small pieces of Grsecurity and upstream them.
Profiting from selling their patchset is not the whole story, though. grsec was public and free for a long time and there were many effects at play preventing the kernel from adopting it.
Here is a device driver bug that was around 11 years.
https://www.bitdefender.com/en-us/blog/hotforsecurity/google...
One bug is all it takes to compromise the entire system.
The monolithic UNIX kernel was a good design in the 60s; Today, we should know better[0][1].
Say the USB system runs in its own isolated process. Great, but if someone pwns the USB process they can change disk contents, intercept and inject keystrokes, etc. You can usually leverage that into a whole system compromise.
Same with most subsystems: GPU, network, file system process compromises are all easily leveraged to pwn the whole system.
So is Mach, by the way, if you can afford the microkernel performance overhead.
Impressive results on the model, I'm surprised they improved it with very simple heuristics. Hopefully this tool will be made available to the kernel developers and integrated to the workflow.
My Pixel 8 runs a stable minor release of the 6.1 kernel, which was released more than 4 years ago. Yes, fixes get backported to it, but the new features in 6.2->6.19 stay unused on that hardware. All the major distros suffer from the same problem; most people are not running the latest kernels in production.
Most hyperscalers are running old kernel versions to which they backport fixes. If you go to Linux conferences, you hear folks from big companies mentioning 4.xx, even 3.xx kernels, in 2025.
I have a server which has many peripherals and multiple GPUs. Now, I can use vfio and vfio-pci to memory-map and access their registers in user space. My question is, how could I start with kernel driver development? And I specifically mean the dev setup.
Would it be a good idea to use vfio with or without a vm to write and test drivers? How to best debug, reload and test changing some code of an existing driver?
John Gall, The Systems Bible
Am I the only unreasonable maniac who wants a very long-term stable, seL4-like capability-based, ubiquitous, formally-verified μkernel that rarely/never crashes completely* because drivers are just partially-elevated programs sprinkled with transaction guards and rollback code for critical multiple resource access coordination patterns? (I miss hacking on MINIX 2.)
* And never need to reboot or interrupt server/user desktop activities because the core μkernel basically never changes since it's tiny and proven correct.
On a related note, I'm seeing a correlation between "level of hoopla" and "level of attention/maintenance." While it's hard to distinguish that correlation from "level of use," the fact that CAN is so far down the list suggests to me that hoopla matters; it's everywhere but nobody talks about it. If a kernel bug takes down someone's datacenter, boy are we gonna hear about it. But if a kernel bug makes a DeviceNet widget freak out in a factory somewhere? Probably not going to make the front page of HN, let alone CNN.
A CAN with 10,000 machines total and relatively fixed applications is either going to trigger the bug right off the bat and then work around it, or trigger the bug so rarely it won't be recognized as a kernel issue.
General purpose systems running millions and millions of units with different workloads are an evolutionary breeding ground for finding bugs and exploits.
We call AI models "open source" if you can download the binary and not the source. Why not programs?
Who's "we"? There's been quite a lot of pushback on this naming scheme from the OSS community, with many preferring the term "open weights".
the weights of a model aren't equivalent to the binary output of source code, no matter how you try to stretch the metaphor.
> why not
Because we aren't beholden to change all definitions and concepts because some guy at some corp said so.
We want human-readable, comprehensible, reproducible, and maintainable sources at minimum when we say open source.