10% of Firefox crashes are caused by bitflips (opens in new tab)

(mas.to)

921 pointsmarvinborner22d ago486 comments

486 comments

I've told this story before on HN, but my biz partner at ArenaNet, Mike O'Brien (creator of battle.net) wrote a system in Guild Wars circa 2004 that detected bitflips as part of our bug triage process, because we'd regularly get bug reports from game clients that made no sense.

Every frame (i.e. ~60FPS) Guild Wars would allocate random memory, run math-heavy computations, and compare the results with a table of known values. Around 1 out of 1000 computers would fail this test!

We'd save the test result to the registry and include the result in automated bug reports.

The common causes we discovered for the problem were:

- overclocked CPU

- bad memory wait-state configuration

- underpowered power supply

- overheating due to under-specced cooling fans or dusty intakes

These problems occurred because Guild Wars was rendering outdoor terrain, and so pushed a lot of polygons compared to many other 3d games of that era (which can clip extensively using binary-space partitioning, portals, etc. that don't work so well for outdoor stuff). So the game caused computers to run hot.

Several years later I learned that Dell computers had larger-than-reasonable analog component problems because Dell sourced the absolute cheapest stuff for their computers; I expect that was also a cause.

And then a few more years on I learned about RowHammer attacks on memory, which was likely another cause -- the math computations we used were designed to hit a memory row quite frequently.

Sometimes I'm amazed that computers even work at all!

Incidentally, my contribution to all this was to write code to launch the browser upon test-failure, and load up a web page telling players to clean out their dusty computer fan-intakes.

PunchyHamster20d ago

> Several years later I learned that Dell computers had larger-than-reasonable analog component problems because Dell sourced the absolute cheapest stuff for their computers; I expect that was also a cause.

Case in point: I was getting memory errors on my gaming machine, that persisted even after replacing the sticks. It caused windows bluesreen maybe once a month so I kinda lived with it as I couldn't afford to replace whole setup (I theoretized something on motherboard is wrong)

Then my power supply finally died (it was cheap-ish, not cheap-est but it had few years already). I replaced it, lo and behold, memory errors were gone

1 more reply

dvngnt_20d ago

GW1 was my childhood. The MMO with no monthly fees appealed to my Mom and I met friends for years. The 8 skill build system was genius, as was the cut scenes featuring your player character. If there's ever a 3rd game I would love to see something allowing for more expression through build creation though I could see how that's hard to balance.

3 more replies

dpe8220d ago

As a mobile dev at YouTube I'd periodically scroll through crash reports associated with code I owned and the long tail/non-clustered stuff usually just made absolutely no sense and I always assumed at least some of it was random bit flips, dodgy hardware, etc.

2 more replies

Analemma_20d ago

There's a famous Raymond Chen post about how a non-trivial percentage of the blue screen of death reports they were getting appeared to be caused by overclocking, sometimes from users who didn't realize they had been ripped off by the person who sold them the computer: https://devblogs.microsoft.com/oldnewthing/20050412-47/?p=35.... Must've been really frustrating.

2 more replies

Helmut1000120d ago

I don't understand why ECC memory is not the norm these days. It is only slightly more expensive, but solves all these problems. Some consumer mainboards even support it already.

8 more replies

jodrellblank20d ago

This is getting off-topic but I’m amazed by this ability to reach out to computers around the world as a sensor array and infer things we can’t easily find out in other ways. It’s in popular culture and HN comments most often as spyware and mass surveillance of people, and that’s a bit of a shame.

GPS location and movement data is what gives Google maps its near-real-time view of traffic on all roads, and busy-ness of all shops.

I think they collect location data from people riding public transport so they can tell you how long people wait on average at bus stops before getting on a bus.

Does Google collect atmospheric pressure readings from phone altimeters and use it for weather models? Could they?

Kindle collects details on books people read, how far they read, where they stop, which sections they highlight and quote, which words they look up in dictionaries.

I wonder if anyone’s curated a list of things like this which do happen or have been tried, excluding the “gathers user data for advertising” category which would become the biggest one, drowning out everything else.

I think current phones use accelerometer data to detect possible car crashes and call emergency services. Google could use that in aggregate to identify accident blackspots but I don’t know if they do. But that would be less useful because the police already know everywhere a big accident happens because people call the police. So that’s data easily found a different way.

2 more replies

mobilio20d ago

Yup!

I've read this decade ago... https://www.codeofhonor.com/blog/whose-bug-is-this-anyway

1 more reply

Modified301920d ago

Thanks to asrock motherboards for AMD’s threadripper 1950x working with ECC memory, that’s what I learned to overclock on.

I eventually discovered with some timings I could pass all the usual tests for days, but would still end up seeing a few corrected errors a month, meaning I had to back off if I wanted true stability. Without ECC, I might never have known, attributing rare crashes to software.

From then on I considered people who think you shouldn’t overlock ECC memory to be a bit confused. It’s the only memory you should be overlocking, because it’s the only memory you can prove you don’t have errors.

I found that DDR3 and DDR4 memory (on AMD systems at least) had quite a bit of extra “performance” available over the standard JEDEC timings. (Performance being a relative thing, in practice the performance gained is more a curiosity than a significant real life benefit for most things. It should also be noted that higher stated timings can result in worse performance when things are on the edge of stability.)

What I’ve noticed with DDR5, is that it’s much harder to achieve true stability. Often even cpu mounting pressure being too high or low can result in intermittent issues and errors. I would never overclock non-ECC DDR5, I could never trust it, and the headroom available is way less than previous generations. It’s also much more sensitive to heat, it can start having trouble between 50-60 degrees C and basically needs dedicated airflow when overclocking. Note, I am not talking about the on chip ECC, that’s important but different in practice from full fat classic ECC with an extra chip.

I hate to think of how much effort will be spent debugging software in vain because of memory errors.

4 more replies

jug20d ago

As a community alpha tester of GW1, this was a fun read! Such an educational journey and what a well organized and fruitful one too. We could see the game taking shape before our eyes! As a European, I 100% relied on being young and single with those American time zones. :D Tests could end in my group at like 3 am, lol.

1 more reply

aiiane20d ago

I remember one of the first impressions I had in GW1 during test events was the sense of scale in the world that still managed to avoid excessive harsh geometry angles for the most part. Not surprised to hear it was pushing more polygons than average.

P.S. GW1 remains one of my favorite games and the source of many good memories from both PvP and PvE. From fun stories of holding the Hall of Heroes to some unforgettable GvG matches, y'all made a great game.

pndy21d ago

I didn't expect to read bits of GW story here from one of the founders - thanks!

arprocter20d ago

>Sometimes I'm amazed that computers even work at all!

Funny you say this, because for a good while I was running OC'd RAM

I didn't see any instability, but Event Viewer was a bloodbath - reducing the speed a few notches stopped the entries (iirc 3800MHz down to 3600)

cookiengineer20d ago

I kind of wanted to confirm that. At that time I was still using a Compaq business laptop on which I played Guild Wars.

The Turion64 chipset was the worst CPU I've ever bought. Even 10 years old games had rendering artefacts all over the place, triangle strips being "disconnected" and leading to big triangles appearing everywhere. It was such a weird behavior, because it happened always around 10 minutes after I started playing. It didn't matter _what_ I was playing. Every game had rendering artefacts, one way or the other.

The most obvious ones were 3d games like CS1.6, Guild Wars, NFSU(2), and CC Generals (though CCG running better/longer for whatever reason).

The funny part behind the VRAM(?) bitflips was that the triangles then connected to the next triangle strip, so you had e.g. large surfaces in between houses or other things, and the connections were always in the same z distance from the camera because game engines presorted it before uploading/executing the functional GL calls.

After that laptop I never bought these types of low budget business laptops again because the experience with the Turion64 was just so ridiculously bad.

monster_truck20d ago

Every interesting bug report I've read about Guild Wars is Dwarf Fortress tier. A very hardcore, longtime player who was recounting some of the better ones to me shared a most excellent one wrt spirits or ghosts, some sort of player summoned thing that were sticking around endlessly and causing OOM errors?

samiv20d ago

Plot twist. The memory bit flip checking code was actually buggy and contained UB.

No, seriously did you actually verify the code for correctness before relying on it's results?

Dylan1680720d ago

> And then a few more years on I learned about RowHammer attacks on memory, which was likely another cause -- the math computations we used were designed to hit a memory row quite frequently.

For that one I'd guess no, because under normal circumstances hot locations like that will stay in cache.

taneq20d ago

Wow, that’s really interesting! I always suspected bit flips happened undetected way more than we thought, so it’s great to get some real life war stories about it. Also thanks for Guild Wars, many happy hours spent in GW2. :)

nxobject20d ago

Oh god yes… Dell OptiPlexes and bad caps went together in those days. I’m half convinced Valve put the gray towers in Counter-Strike so IT employees wasting time could shoot them up for therapy.

Agentlien20d ago

That's a really cool anecdote. The overclock makes sense. When we released Need For Speed (2015) I spent some time in our "war room", monitoring incoming crash reports and doing emergency patches for the worst issues.

The vast majority of crashes came from two buckets:

1. PCs running below our minimum specs

2. Bugs in MSI Afterburner.

1 more reply

andrepd20d ago

Amazing story! Reminds me of old gamasutra posts like these https://web.archive.org/web/20170522151205/http://www.gamasu...

PaulHoule20d ago

Back in the 90's I had an overclocked AMD486 machine which seemed OK most of the time but had segfaults compiling the Linux kernel. I sent in a bug report and Alan Cox closed it saying it was the fault of my machine being overclocked.

I dialed the machine back to the rated speed but it failed completely within 6 months.

jiggawatts20d ago

Some multiplayer real-time strategy (RTS) games used deterministic fixed-point maths and incremental updates to keep the players in sync. Despite this, there would be the occasional random de-sync kicking someone out of a game, more than likely because of bit flips.

1 more reply

benatkin20d ago

Yikes. Dude, you're getting a Packard Bell.

sidewndr4620d ago

Well wow I wasn't expecting to see yet another story from Patrick Wyatt here in the comments! Much appreciated, I've enjoyed reading everything you've written over the years.

fennecbutt20d ago

That's awesome. But also guild waaars, GW2 I played from beta for years, but it just got boring. Endless expansions with weird story.

We need GW3 already but my fear is mmo as a genre is dying.

1 more reply

SunnyNeon20d ago

How did you determine which of the causes it was?

danielEM20d ago

> problems because Dell sourced the absolute cheapest stuff for their computers;

Price itself has nothing to cause problems, it is either bad design or false or incomplete data on datasheets or all of it. Please STOP spreading this narrative, the right thing is to make ads, datasheets, marketing materials etc, etc to tell you the truth that is necessary for you to make proper decision as client/consumer.

hsbauauvhabzb20d ago

Did you/he ever consider redundant allocation for high value content and hash checks for low value assets that are still important?

I imagine the largest volume of game memory consumption is media assets which if corrupted would really matter, and the storage requirement for important content would be reasonably negligible?

3 more replies

just_testing20d ago

I loved reading your comment and got curious: how he detected the bitflips?

1 more reply

Salgat20d ago

Mike is such a legend.

yownie19d ago

this exactly the type of stories I come to HN to read, thanks!

rurban20d ago

I hate HW soo much. To revise the biggest problems in computing, beside out of tokens: HW bugs

Animats20d ago

ECC should have become standard around the time memories passed 1GB.

It's seriously annoying that ECC memory is hard to get and expensive, but memory with useless LEDs attached is cheap.

loeg20d ago

It's not even ECC price/availability that bothers me so much, it's that getting CPUs and motherboards that support ECC is non-trivial outside of the server space. The whole consumer class ecosystem is kind of shitty. At least AMD allows consumer class CPUs to kinda sorta use ECC, unlike Intel's approach where only the prosumer/workstation stuff gets ECC.

2 more replies

justin6620d ago

> ECC should have become standard around the time memories passed 1GB.

Ironically, that's around the time Intel started making it difficult to get ECC on desktop machines using their CPUs. The Pentium 3 and 440BX chipset, maxing out at 1GB, were probably the last combo where it pretty commonly worked with a normal desktop board and normal desktop processor.

WatchDog20d ago

All DDR5 ram has some amount of error correction built in, because DDR5 is much more prone to bit flipping, it requires it.

I'm not really sure if this makes it overall more or less reliable than DDR2/3/4 without ECC though.

4 more replies

tombert20d ago

I am not sure I've ever seen a laptop that has ECC memory. I'm sure they exist but I don't think I've seen it.

I would definitely like to have a laptop with ECC, because obviously I don't want things to crash and I don't want corrupted data or anything like that, but I don't really use desktop computers anymore.

1 more reply

oybng20d ago

For the unaware, Intel is to blame for this

1 more reply

aforwardslash20d ago

ECC are traditionally slower, quite more complex, and they dont completely eliminate the problem (most memories correct 1 bit per word and detect 2 bits per word). They make sense when environmental factors such as flaky power, temperature or RF interference can be easily discarded - such as a server room. But yeah, I agree with you, as ECC solves like 99% of the cases.

3 more replies

ece20d ago

Looking back, I actually think the older the RAM the more likely you're able to notice bit-flips and they harm your workflow. EDO RAM was the worst in my experience (my first computer), SDRAM was a bit better, and random bit-flips atleast under load got very rare after DDR2. I think Google even had a paper comparing DDR1 vs DDR2 (link: https://static.googleusercontent.com/media/research.google.c...).

That said, memory DIMM capacity increases with even a small chance of bit-flips means lots of people will still be affected.

hedora20d ago

ECC is standard at this point (current RAM flips so many bits it's basically mandatory). Also, most CPUs have "machine checks" that are supposed to detect incorrect computations + alert the OS.

However, there are still gaps. For one thing, the OS has to be configured to listen for + act on machine check exceptions.

On the hardware level, there's an optional spec to checksum the link between the CPU and the memory. Since it's optional, many consumer machines do not implement it, so then they flip bits not in RAM, but on the lines between the RAM and the CPU.

It's frustrating that they didn't mandate error detection / correction there, but I guess the industry runs on price discrimination, so most people can't have nice things.

adonovan21d ago

Very interesting. The Go toolchain has an (off by default) telemetry system. For Go 1.23, I added the runtime.SetCrashOutput function and used it to gather field reports containing stack traces for crashes in any running goroutine. Since we enabled it over a year ago in gopls, our LSP server, we have discovered hundreds of bugs.

Even with only about 1 in 1000 users enabling telemetry, it has been an invaluable source of information about crashes. In most cases it is easy to reconstruct a test case that reproduces the problem, and the bug is fixed within an hour. We have fixed dozens of bugs this way. When the cause is not obvious, we "refine" the crash by adding if-statements and assertions so that after the next release we gain one additional bit of information from the stack trace about the state of execution.

However there was always a stubborn tail of field reports that couldn't be explained: corrupt stack pointers, corrupt g registers (the thread-local pointer to the current goroutine), or panics dereferencing a pointer that had just passed a nil check. All of these point to memory corruption.

In theory anything is possible if you abuse unsafe or have a data race, but I audited every use of unsafe in the executable and am convinced they are safe. Proving the absence of data races is harder, but nonetheless races usually exhibit some kind of locality in what variable gets clobbered, and that wasn't the case here.

In some cases we have even seen crashes in non-memory instructions (e.g. MOV ZR, R1), which implicates misexecution: a fault in the CPU (or a bug in the telemetry bookkeeping, I suppose).

As a programmer I've been burned too many times by prematurely blaming the compiler or runtime for mistakes in one's own code, so it took a long time to gain the confidence to suspect the foundations in this case. But I recently did some napkin math (see https://github.com/golang/go/issues/71425#issuecomment-39685...) and came to the conclusion that the surprising number of inexplicable field reports--about 10/week among our users--is well within the realm of faulty hardware, especially since our users are overwhelmingly using laptops, which don't have parity memory.

I would love to get definitive confirmation though. I wonder what test the Firefox team runs on memory in their crash reporting software.

aforwardslash20d ago

> In some cases we have even seen crashes in non-memory instructions (e.g. MOV ZR, R1), which implicates misexecution: a fault in the CPU (or a bug in the telemetry bookkeeping, I suppose).

Thats the thing. Bit flips impact everything memory-resident - that includes program code. You have no way of telling what instruction was actually read when executing the line your instrumentation may say corresponds to the MOV; or it may have been a legit memory operation, but instrumentation is reporting the wrong offset. There are some ways around it, but - generically - if a system runs a program bigger than the processor cache and may have bit flips - the output is useless, including whatever telemetry you use (because it is code executed from ram and will touch ram).

1 more reply

nitwit00520d ago

You might consider adding the CPU temperature to the report, if there's a reasonable way to get it (haven't tried inside a VM). Then you could at least filter out extremely hot hardware.

1 more reply

jamesfinlayson20d ago

Interesting reading - I've occasionally seen some odd crashes in an iOS app that I'm partly responsible for. It's running some ancient version of New Relic that doesn't give stack traces but it does give line numbers and it's always on something that should never fail (decoding JSON that successfully decoded thousands of times per day).

I never dug too deeply but the app is still running on some out of support iPads so maybe it's random bit flips.

sieep20d ago

Ive been trying to push my boss towards more analytics/telemetry in production that focus on crashes, thanks for sharing.

tczMUFlmoNk20d ago

> Even with only about 1 in 1000 users enabling telemetry

How do you know the number/proportion of users who run without telemetry enabled, since by definition you're not collecting their data?

(Not imputing any malice, genuinely curious.)

1 more reply

charcircuit20d ago

>All of these point to memory corruption.

Actually "dereferencing a pointer that had just passed a nil check" could be from a flow control fault where the branch fails to be taken correctly.

kleiba20d ago

Firefox is about the only piece of software in my setup that occasionally crashes. I say "occasionally" for lack of a better word, it's not "all the time", but it is definitely more than I would want to.

If that was caused by bad memory, I would expect other software to be similarly affected and hence crash with about comparable frequency. However, it looks like I'm falling more into the other 90% of cases (unsurprisingly) because I do not observe other software crashing as much as firefox does.

Also, this whole crashing business is a fairly recent effect - I've been running firefox for forever and I cannot remember when it last was as much of an issue as it has become recently for me.

tuetuopay20d ago

Just check your memory with memtest.

Two years ago, I've had Factorio crash once on a null pointer exception. I reported the crash to the devs and, likely because the crash place had a null check, they told me my memory was bad. Same as you I said "wait no, no other software ever crashed weirdly on this machine!", but they were adamant.

Lo and behold, I indeed had one of my four ram sticks with a few bad addresses. Not much, something like 10-15 addresses tops. You need bad luck to hit one of those addresses when the total memory is 64GB. It's likely the null pointer check got flipped.

Browsers are good candidates to find bad memory: they eat a lot of ram, they scatter data around, they have a large chunk, and have JITs where a lot of machine code gets loaded left and right.

2 more replies

vultour20d ago

I spend probably thousands of hours in Firefox every year and I don't think I've ever had it crash.

1 more reply

mathw20d ago

Of course, nobody is claiming that there aren't lots of Firefox crashes which are caused by bugs in Firefox. Quite the opposite, based on these figures. What people find interesting is that the amount they're suspecting are down to hardware faults is way higher than most people would have expected.

Agingcoder20d ago

It depends on what you bitflip.

I once had a bitflip pattern causing lowercase ascii to turn into uppercase ascii in a case insensitive system. Everything was fine until it tried to uppercase numbers and things went wrong

The first time I had to deal with faulty ram ( more than 20y ago ), the bug would never trigger unless I used pretty much the whole dimm stick and put meaningful stuff in it etc in my case linking large executables , or untargzipping large source archives.

Flipping a pixel had no impact though

pflanze20d ago

Do you happen to use memory resource limits? I used to run Firefox under some, like everything, to prevent it from potentially making the whole system unresponsible, and at the same time had frequent cases of Firefox showing random visual corruptions and crashes. At some point I realized that it was because it was running out of memory, and didn't check malloc failures, thus just continued to run and corrupting memory. (That was some 6-8 years ago, maybe Firefox does better now?)

1 more reply

bmicraft20d ago

I've had some very bad ram (lots of errors found when tested) and consistently the only thing that actually crashed because of it was Firefox.

zvqcMMV6Zcr20d ago

For me the only software crashing(CTD ) was Factorio. Nothing else had any issues. I tried removing mods, searching for one that started causing issues. Memtestx86 said everything is OK. Replacing one stick of RAM instantly fixed all issues.

haspok20d ago

The most frequent crashes I have with Firefox are when I type in a text area (such as this one right now, or on Reddit, for example). The longer the text I type is, the more probable it is that it's going to crash. Or maybe it doesn't crash, just grinds to such a slow pace that it is equivalent to a crash.

My suspicion has always been some kind of a memory leak, but memory corruption also makes sense.

Unfortunately, Chrome (which I use for work - Firefox is for private stuff) has NEVER crashed on me yet. Certainly not in the past 5 years. Which is odd. I'm on Linux btw.

3 more replies

jlarocco20d ago

Firefox has a long history of denying problems, blaming the user, and fixing the issue years later.

It used to be memory usage, now it's crashing.

1 more reply

lqet20d ago

> Firefox is about the only piece of software in my setup that occasionally crashes.

I would add Thunderbird to that list.

xxs20d ago

run y-cruncher if you'd like to test memory and overall stability. It's decent test and a lot better than memtest (in my experience)

LunaSea20d ago

If only the had written Firefox in Rust, they wouldn't have had these issues .

bob102920d ago

I've written genetic programming experiments that do not require an explicit mutation operator because the machine would tend to flip bits in the candidate genomes under the heavy system load. It took me a solid week to determine that I didn't actually have a bug in my code. It happens so fast on my machine (when it's properly loaded) that I can depend on it to some extent.

rcbdev20d ago

Hyrum's law in action.

https://xkcd.com/1172/

thegrim3321d ago

A 5 part thread where they say they're "now 100% positive" the crashes are from bitflips, yet not a single word is spent on how they're supposedly detecting bitflips other than just "we analyze memory"?

rincebrain21d ago

The simplest way to do this, what I believe memtest86 and friends do, is to write a fixed pattern over a region of memory and then read it back later and see if it changed; then you write patterns that require flipping the bits that you wrote before, and so on.

Things like [1] will also tell you that something corrupted your memory, and if you see a nontrivial (e.g. lots of bits high and low) magic number that has only a single bit wrong, it's probably not a random overwrite - see the examples in [2].

There's also a fun prior example of experiments in this at [3], when someone camped on single-bit differences of a bunch of popular domains and examined how often people hit them.

edit: Finally, digging through the Mozilla source, I would imagine [4] is what they're using as a tester when it crashes.

[1] - https://github.com/mozilla-firefox/firefox/commit/917c4a6bfa...

[2] - https://bugzilla.mozilla.org/show_bug.cgi?id=1762568

[3] - https://media.defcon.org/DEF%20CON%2019/DEF%20CON%2019%20pre...

[4] - https://github.com/mozilla-firefox/firefox/blob/main/toolkit...

2 more replies

tredre321d ago

> last year we deployed an actual memory tester that runs on user machines after the browser crashes.

He doesn't explain anything indeed but presumably that code is available somewhere.

2 more replies

hexyl_C_gut21d ago

It sounds like they don't know that the crashes are from bitflips but those crashes are from people with flaky memory which probably caused the crash?

wmf20d ago

A common case is a pointer that points to unallocated address space triggers a segfault and when you look at the pointer you can see that it's valid except for one bit.

1 more reply

hrmtst9383720d ago

I think claiming '100% positive' without explaining how you detect bitflips is a red flag, because credible evidence looks like ECC error counters and machine check events parsed by mcelog or rasdaemon, reproducible memtest86 failures, or software page checksums that mismatch at crash time.

Ask them to publish raw MCE and ECC dumps with timestamps correlated to crashes, or reproduce the failure with controlled fault injection or persistent checksums, because without that this reads like a hypothesis dressed up as a verdict.

1 more reply

KenoFischer20d ago

I'll submit my bit flip story for consideration also :) https://julialang.org/blog/2020/09/rr-memory-magic/

shevy-java20d ago

> In other words up to 10% of all the crashes Firefox users see are not software bugs, they're caused by hardware defects!

Bold claim. From my gut feeling this must be incorrect; I don't seem to get the same amount of crashes using chromium-based browsers such as thorium.

WhatsTheBigIdea20d ago

Your gut may be leading you astray?

I also find that firefox crashes much more than chrome based browsers, but it is likely that chrome's superior stability is better handing of the other 90% of crashes.

If 50% of chrome crashes were due to bit flips, and bit flips effect the two browsers at basically the same rate, that would indicate that chrome experiences 1/5th the total crashes of firefox... even though the bit flip crashes happen at the same rate on both browsers.

It would have been better news for firefox if the number of crashes due to faulty hardware were actually much higher! These numbers indicate the vast majority of firefox crashes are actually from buggy software : (

1 more reply

LM35820d ago

10% of crashes does not imply 10% of your crashes.

1 more reply

BeetleB20d ago

Are people getting so many FF crashes? Mine rarely does. I leave it running, opening and closing tabs, for weeks on end.

12 more replies

bsder20d ago

> Bold claim. From my gut feeling this must be incorrect

RAM flips are common. This kind of thing is old and has likely gotten worse.

IBM had data on this. DEC had data on this. Amazon/Google/Microsoft almost certainly had data on this. Anybody who runs a fleet of computers gets data on this, and it is always eye opening how common it is.

ZFS is really good at spotting RAM flips.

bichiliad20d ago

I think they claim that if your computer has bad hardware, you're probably sending a lot of _additional_ crashes to their telemetry system. Your hardware might be working just fine, but the guy next to you might be sending 30% more crashes.

saati20d ago

I haven't seen a single firefox or chrome crash in months now, you should really stress-test your hardware.

3 more replies

nimih20d ago

> Bold claim.

I agree. Good thing he doesn't back up his claim with any sort of evidence or reasoned argument, or you'd look like a huge moron!

1 more reply

shakna20d ago

Chromium has better handling for bitflip errors. Mostly due to the Discardable buffers they make such extensive use of.

The hardware bugs are there. They're just handled.

1 more reply

hedora20d ago

I've had zero crashes in safari, ff or chrome in recent memory (except maybe OOMs). (Though I don't use Windows, so maybe that's part of the reason stuff just works?)

Perhaps you're part of the group driving hardware crashes up to 10% and need to fix your machine.

1 more reply

sfink20d ago

There's a very good chance your system also does not have flaky memory. Most don't. You're not contradicting the post.

Zambyte20d ago

What do you mean "the same amount"? If your browser never crashes, 10% of zero is zero.

pizza23420d ago

>> In other words up to 10% of all the crashes Firefox users see are not software bugs, they're caused by hardware defects!

> Bold claim. From my gut feeling this must be incorrect; I don't seem to get the same amount of crashes using chromium-based browsers such as thorium.

That's a misinterpretation. The finding refers to the composition of crashes, not the overall crash rate (which is not reported by the post). Brought to the extreme, there may have been 10 (reported) crashes in history of Firefox, and 1 due to faulty hardware, and the statement would still be correct.

estimator729220d ago

He addresses this in the thread.

phyzome20d ago

...normally browsers don't crash at all. Something's wrong with your computer.

maxerickson20d ago

I mean, I've had quite some number of crashes that I can't correlate to anything.

Hardware problems are just as good a potential explanation for those as anything else.

cellular20d ago

Maybe if Firefox tabs weren't such a memory hog it would be only 0.005% !

KennyBlanken20d ago

"Software engineer thinks everyone's hardware is broken, couldn't possibly be bugs in his code" sums it up about right.

camkego21d ago

It is rumored heavily on HN that when the first employee of Google, Craig Silverstein was asked about his biggest regret, he said: "Not pushing for ECC memory."

keyringlight20d ago

One of the points Linus Torvalds made a few years back was that enthusiasts/PC gamers should be pissed that consumer product availability/support for ECC is spotty because as mentioned up-thread they're the kind of user that will push their system, and if memory is the cause of instability there will be a smoking gun (and they can then set the speed within its stable capacity). Diagnosing bad RAM is a pain in the rear even if you're actively looking for a cause, never mind trying to get a general user to go further than blaming software or gremlins in the system for weirdness on whatever frequency it's occurring at.

adonovan21d ago

It's true that in the very early days Google used cheap computers without ECC memory, and this explains the desire for checksums in older storage formats such as RecordIO and SSTable, but our production machines have used ECC RAM for a long time now.

srean21d ago

One of the nicest guys I have met. Was an intern at Google at that time, firing off mapreduces then (2003-2004) was quite a blast. The Peter Weinberger theme T-shirt too.

kdklol21d ago

I'm glad to see somebody is getting some data on this, I feel bad memory is one of the most underrated issues in computing generally. I'd like to see a more detailed writeup on this, like a short whitepaper.

bergheim20d ago

Strange. I have a tab hoarding problem, I often have over 1000 tabs open [1][2], and I cannot remember the last time Firefox crashed. I'm thinking it must have been years? I use ublock origin though, which might help since ads do their best to steal your computer and soul through any means possible of course.

I also use a bunch of other extensions though, dark reader, vimium, sideberry... I'd expect me to be a bit more exposed than the average user. Yet it's just rock stable for me. Maybe it just works better on linux?

1: I know this because I installed https://addons.mozilla.org/en-US/firefox/addon/tab-counter-p... to check :)

2: However after finding Karakeep I don't actually have 1000 tabs anymore!

andoando20d ago

I dont get the people with 10+ tabs open drives me crazy, how do you even know whats what?

Just bookmark shit you want to keep!

petterroea20d ago

As someone who has a strong background from hobby projects with five-digit users before going into work, I think one of the most interesting differences I experienced was that the problems you see at scale simply don't exist on small scale projects. Bit flips/bad memory is one of them.

bhelkey20d ago

I would love to see DDR4 vs DDR5 bitflips. As I understand it DDR5 must come with some level of ECC [1].

[1] https://www.corsair.com/us/en/explorer/diy-builder/memory/is...

drpixie20d ago

From Corsair

>> DDR5 technology comes with an exclusive data-checking feature that serves to improve memory cell reliability and increase memory yield for memory manufacturers. This inclusion doesn't make it full ECC memory though.

"Proper" ECC has a wider memory buss, so the CPU emits checksum bits that are saved alongside every word of memory, and checked again by the CPU when memory is read. Eg. a 64 bit machine would actually have 72 bit memory.

DDR5 "ECC" uses error correction only within the memory stick. It's there to reduce the error rate, so otherwise unacceptable memory is usable - individual cells have become so small that they are not longer acceptably reliable by themselves!

kevin_thibedeau20d ago

DDR5 comes with marginal DRAM that is patched up with ECC to boost yields. It's not the same as fully reliable RAM.

2 more replies

silon4220d ago

I wish also for desktop vs laptop ram comparison.

moconnor20d ago

Bit flips aren’t always bad hardware. I remember an anecdote from Sandia from my HPC days - they found they were getting more bit flips on some machines than others on their cluster and sometimes correlated.

Turned out at their altitude cosmic rays were flipping bits in the top-most machines in the racks, sometimes then penetrating lower and flipping bits in more machines too.

newscracker20d ago

This is quite surprising to me, since I thought the percentage would be a lot lesser.

But I don’t really know what the Firefox team does with crash reports and in making Firefox almost crash proof.

I have been using it at work on Windows and for the last several years it always crashes on exit. I have religiously submitted every crash report. I even visit the “about:crashes” page to see if there are any unsubmitted ones and submit them. Occasionally I’ll click on the bugzilla link for a crash, only to see hardly any action or updates on those for months (or longer).

Granted that I have a small bunch of extensions (all WebExtensions), but this crash-on-exit happens due to many different causes, as seen in the crash reports. I’m too loathe to troubleshoot with disabling all extensions and then trying it one by one. Why should an extension even cause a crash, especially when its a WebExtension (unlike the older XUL extensions that had a deeper integration into the browser)? It seems like there are fundamental issues within Firefox that make it crash prone.

I can make Firefox not crash if I have a single window with a few tabs. That use case is anyway served by Edge and Chrome. The main reasons I use Firefox, apart from some ideological ones, are that it’s always been much better at handling multiple windows and tons of tabs and its extensibility (Manifest V2 FTW).

I would sincerely appreciate Firefox not crashing as often for me.

ordu20d ago

It is hard to judge, but a crash on exit seems to me a possible consequence of a damaged memory. Firefox frees all the resources and collects the garbage. I expect it to touch a lot of memory locations, and do something with values retrieved.

> this crash-on-exit happens due to many different causes, as seen in the crash reports

It points to the same direction: all these different causes are just symptoms, the root cause is hiding deeper, and it is triggered by the firefox stopping.

It is all is not a guarantee that the root cause is bitflips, but you can rule it out by testing your memory.

1 more reply

rebelwebmaster20d ago

Can you share a link to a crash report from about:crashes? Sounds like some kind of shutdown hang getting force-killed maybe?

INTPenis20d ago

That's super interesting because I remember Linus Torvalds saying he requires ECC RAM in his computers, because he got tired of weird issues that were resolved by a reboot.

But non-ECC is fine for most of us mortals gaming and streaming.

I would expect pro gamers to opt for ECC though.

tredre321d ago

> In other words up to 10% of all the crashes Firefox users see are not software bugs, they're caused by hardware defects! If I subtract crashes that are caused by resource exhaustion (such as out-of-memory crashes) this number goes up to around 15%.

Crashes caused by resource exhaustion are still software bugs in Firefox. At least on sane operating systems where memory isn't over-comitted.

rockdoe20d ago

What's the expected behavior of a JavaScript program that allocates all memory on the machine?

1 more reply

LorenPechtel21d ago

Memory isn't the only resource.

soletta20d ago

I’ve also found that compiling large packages in GCC or similar tends to surface problems with the system’s RAM. Which probably means most typical software is resilient to a bit-flip; makes you wonder how many typos in actual documents might have been caused by bad R@M.

sfink20d ago

That's exactly how my bad RAM manifested itself. In fact, I was compiling Firefox, and gcc would get a segmentation fault at some random point during compilation. I'd have to clobber and restart the hour-long build. It was only when gcc started crashing while compiling other things that I even started considering the possibility of hardware failure. I'm a software developer, and based on what I produce myself, I just assume that all software is horribly buggy. ;-)

fooker20d ago

This seems like the kind of metric that 3 users with 15 year old machines can skew significantly.

Has to be normalized, and outliers eliminated in some consistent manner.

rockdoe20d ago

I'm pretty sure I saw them present on exactly this at FOSDEM?

conartist621d ago

Also a polite reminder that most of those crashes will be concentrated on machines with faulty memory so the naive way of stating the statistic may overestimate its impact to the average user. For the average user this is the difference between 4/5 crashes are from software bugs and 5/5 crashes are from software bugs, and for a lot of people it will still be 5/5

kev00920d ago

It's high enough that I would wonder if some systems software issues are mixed in, like rare races in malloc or page table management.

jurakovic20d ago

There is this app https://github.com/Smerity/bitflipped _Your computer is a cosmic ray detector. Literally._

ryukoposting20d ago

It's worth noting that the thread says "up to 10%," not "10%" as the title suggests. So it's reasonable to believe the rate is as low as 5% based on the only real figure given (25000 / 470000)

I think our education system should include a unit on "marketing bullshit" sometime early in elementary school. Maybe as part of math class, after they learn inequalities. "Ok kids, remind me, what does 'up to' mean?" "less than or equal to!"

Habgdnv20d ago

I bought my PC like 2 weeks ago and ran my ram at 5800 to test its limits and forgot to lower it. After few strange crashes of my fedora desktop - super strange behavior, apps refuse start/stop, can't even escape to the console... I ran memtest today and it lit all red in the first 2 minutes! Then I log in to my stable desktop at 5200 MT and I see this in the front HN page! What are the chances?!!

mrguyorama21d ago

People I think are overindexing on this being about "Bad hardware".

We have long known that single bit errors in RAM are basically "normal" in terms of modern computers. Google did this research in 2009 to quantify the number of error events in commodity DRAM https://static.googleusercontent.com/media/research.google.c...

They found 25,000 to 70,000 errors per billion device hours per Mbit and more than 8% of DIMMs affected by errors per year.

At the time, they did not see an increase in this rate in "new" RAM technologies, which I think is DDR3 at that time. I wonder if there has been any change since then.

A few years ago, I changed from putting my computer to sleep every night, to shutting it down every night. I boot it fresh every day, and the improvements are dramatic. RAM errors will accumulate if you simply put your computer to sleep regularly.

jmalicki21d ago

There is DRAM which is mildly defective but got past QC.

There are power suppliers that are mildly defective but got past QC.

There are server designs where the memory is exposed to EMI and voltage differences that push it to violate ever more slightly that push it past QC.

Hardware isn't "good" or "bad", almost all chips produced probably have undetected mild defects.

There are a ton of causes for bitflips other than cosmic rays.

For instance, that specific google paper you cited found a 3x increase in bitflips as datacenter temperature increased! How confident are you the average Firefox user's computer is as temperature-controlled as a google DC?

It also found significantly higher rates as RAM ages! There are a ton of physical properties that can cause this, especially when running 24/7 at high temperatures.

morelikeborelax20d ago

I used to partake in all RAM discussions online. Here, reddit, every technical hardware forum and anywhere workstations were being talked about.

The sentiment was always ECC is a waste and a scam. My goodness the unhinged posts from people who thought it was a trick and couldn't fathom you don't know you're having bits flipped without it. "it's a rip off" without even looking and seeinf often the price was just that of the extra chip.

I've discussed it for 20 years since the first Mac Pro and people just did not want to hear that it had any use. Even after the Google study.

Consumers giving professionals advice. Was same with workstation graphics cards.

hinkley20d ago

Every so often when I'm doing refactoring work and my list of worries has decreased to the point I can start thinking of new things to worry about, I worry about how as we reduce the accidental complexity of code and condense the critical bytes of the working memory tighter and tighter, how we are leaning very hard on very few bytes and hoping none of them ever bitflip.

I wonder sometimes if we shouldn't be doing like NASA does and triple-storing values and comparing the calculations to see if they get the same results.

1 more reply

shiroiuma21d ago

It'd be interesting to see how your experience would differ if you put it to sleep at night after switching to ECC RAM.

Unfortunately, not that many consumer platforms make this possible or affordable.

1 more reply

dbolgheroni20d ago

When debugging something, I often remember the the quote, often misattributed to Einstein: "Insanity is doing the same thing over and over again and expecting different results". Then I remember about bitflips, and run a second, maybe a third time, just expecting the next bit to flip to not be in the routine I'm trying to debug.

spiffy202520d ago

Travis Long had done something similar in 2022 at Mozilla.

https://blog.mozilla.org/data/2022/04/13/this-week-in-glean-...

fasteo20d ago

>>> In the last week we received ~470000 crash reports, these do not represent all crashes because it's an opt-in system, the real number of crashes will be several times larger

Having the number of unique machines would be great to see how skewed this estimate is.

sfink20d ago

To be fully accurate, it would also require tracking unique machines when collecting crash reports.

bilekas20d ago

Just out of interest is ECC memory supposed to me more resilient to these types of failure?

Neil4420d ago

I guess the percentage of crashes due to hardware is high because people with faulty hardware are experiencing the vast majority of crashes. It sounds kind of dumb when put like that, I'm actually surprised it's that low a percentage.

danbruc20d ago

I guess the percentage of crashes due to hardware is high because people with faulty hardware are experiencing the vast majority of crashes.

It is not that simple, it does not only depend on the hardware but also the code. It is like a race, what happens first - you hit a bug in the code or your hardware glitches? If the code is bug free, then all crashes will be due to hardware issues, whether faulty hardware or stray particles from the sun. When the code is one giant bug and crashes immediately every time, then you will need really faulty hardware or have to place a uranium rod on top of your RAM and point a heat gun at your CPU to crash before you hit the first bug, i.e. almost all crashes will be due to bugs.

So what you observe will depend on the prevalence of faulty hardware and how long it takes to hit an hardware issue vs how buggy the code is and how long it takes to hit a bug.

samus20d ago

Maybe a partial solution would be to duplicate pointer data, compare pointers at every deference and panics if it doesn't match up. In essence a poor man's version of ECC. It's a considerable runtime overhead, but it might be possible to hide it behind a flag, only to be turned on to reproduce bugs. Also, anti-cheat measures already do something similar.

Certain data is more sensitive as well and requires extra protection. Pointers and indexes obviously, which might send the whole application on a wild goose chase around memory. But also machine code, especially JIT-generated traces, is worth to be checksummed and verified before executing it.

kmoser21d ago

The next logical step would be to somehow inform users so they could take action to replace the bad memory. I realize this is a challenge given the anonymized nature of the crash data, but I might be willing to trade some anonymity in exchange for stability.

titaniumtravel21d ago

The easy solution for that is to just do that analysis locally... Firefox doesn't submit the full core dumps anyhow for this exact reason and therefore needs to do some preprocessing in any case.

sfink20d ago

I think the firefox crash reporter does now? It does a limited memory scan and reports problems it finds. No privacy violations required.

That's different from what you're suggesting, because you're right that the crash reports are analyzed with heuristics to guess at memory corruption. Aside from the privacy implications, though, I think that would have too many false alarms. A single bit flip is usually going to be an out of bounds write, not bad RAM.

shiroiuma21d ago

>The next logical step would be to somehow inform users so they could take action to replace the bad memory.

This isn't really feasible: have you looked at memory prices lately? The users can't afford to replace bad memory now.

3 more replies

brador21d ago

How many are caused by cosmic radiation bitflips?

emmelaich20d ago

An SO question indicates "10 GB of memory should show an ECC event every 1,000 to 10,000 hours,"

https://stackoverflow.com/questions/2580933/cosmic-rays-what...

bakugo20d ago

I was running my PC with bad memory for a few weeks last year. Firefox crashed a LOT, way more than any other application I used during that time, so I've probably contributed a decent amount to these numbers...

shevy-java20d ago

It could be that firefox is written inefficiently though.

1 more reply

fastaguy8819d ago

It is perhaps worth noting that the 25,000 bit flips/out of 470,000 crashes (in a week) are probably not coming from all Firefox users. It would be useful to know how many of those crashes (and bit flips) are happening on the same machine. And whether the crashes/bit flips continue on the same machine continue from week to week.

I can certainly imagine that a very small fraction of Firefox users are generating these results, so that bit flips are not a problem generally.

Grisu_FTP17d ago

IIRC Linus Torvalds said in the Linus Tech Tips video he was in that he thinks many of the bluescreens Windows gets a bad rep for are actually bitflips that happen due to most desktops not using ECC-Memory.

Since i have seen this video this question has been in my mind from time to time.

strongpigeon20d ago

I might be too late to this thread to get an answer but I do wonder how much of those bitflips are due to rowhammer-style attacks. Firefox runs trillions of lines of untrusted code a day with a non-insignificant part that is of malicious intent. I wouldn’t be shocked if some of those “analog” crashes are due to that.

_0xdd20d ago

So, why aren't we all using ECC in 2026?

matja20d ago

Doesn't sell as well as a beautiful citrus blush milled aluminium case.

lunar_rover20d ago

Intel intentionally ripped ECC out of the sweet spot products to charge premium and unfortunately they succeeded.

Pentium G4560 supports ECC, Core i7 10700 doesn't.

1 more reply

haspok20d ago

Because 99% of laptops don't have it, and can't be memory upgraded?

stinkbeetle20d ago

This matches what I have long said, which is that adding ECC memory to consumer devices will not result in any incredible stability improvement. It will barely be a blip really.

As we know from Google and other papers, most of these 10% of flips will be caused by broken or marginal hardware, of which a good proportion of which could be weeded out by running a memory tester for a while. So if you do that you're probably looking a couple out of every hundred crashes being caused by bitflips in RAM. A couple more might be due to other marginal hardware. The vast majority software.

How often does your computer or browser crash? How many times per year? About 2-3 for me that I can remember. So in 50 years I might save myself one or two crashes if I had ECC.

ECC itself takes about 12.5% overhead/cost. I have also had a couple of occasions where things have been OOM-killed or ground to a halt (probably because of memory shortage). Could be my money would be better spent with 10% more memory than ECC.

People like to rave and rant at the greedy fatcats in the memory-industrial complex screwing consumers out of ECC, but the reality is it's not free and it's not a magical fix. Not when software causes the crashes.

Software developers like Linus get incredibly annoyed about bug reports caused by bit flips. Which is understandable. I have been involved in more than one crazy Linux kernel bug that pulled in hardware teams bringing up new CPU that irritated the bug. And my experience would be far from unique. So there's a bit of throwing stones in glass houses there too. Software might be in a better position to demand improvement if they weren't responsible for most crashes by an order of magnitude...

CamouflagedKiwi20d ago

This is a pretty big claim which seems to imply this is much more common than expected, but there's no real information here and the numbers don't even stack up:

> That's one crash every twenty potentially caused by bad/flaky memory, it's huge! And because it's a conservative heuristic we're underestimating the real number, it's probably going to be at least twice as much.

So the data actually only supports 5% being caused by bitflips, then there's a magic multiple of 2? Come on. Let alone this conservative heuristic that is never explained - what is it doing that makes him so certain that it can never be wrong, and yet also detects these at this rate?

sinuhe6920d ago

Oh, on my old PC, FF sometimes mysteriously crashed for apparently no reason. I sent bug reports and cleared the profile and it seemed to help for a while, then it crashed again. Much later, I suspected and tested the RAM and turned out, it had a faulty module!

SeanSullivan8620d ago

Hmm, can someone educate me here? Why don't bit flips ever seem to impact the results of calculations in settings like big-data analytics on AWS?

Is it a difference between server hardware managed by knowledgeable people and random hardware thrown together by home PC builders?

huhhuh20d ago

In Belgium elections, a party received 4096 unaccounted votes likely due to a bit flip: https://en.wikipedia.org/wiki/Electronic_voting_in_Belgium#R....

zadikian20d ago

Servers and pro workstations normally have ECC RAM.

matja20d ago

You can only detect what you measure. Are these big-data analytics processes running multiple times to detect differences?

OkGoDoIt20d ago

Presumably professional hardware uses ECC memory, which automatically corrects these kinds of errors.

m304720d ago

Stucke's talk about DNS being hazardous to your health is one of my all time favorites: https://www.youtube.com/watch?v=4PSc9BJDWhM

dana32120d ago

And.. how do they not know its their software being leaky and causing these bitflips?

These are potential bitflips.

I found an issue only yesterday in firefox that does not happen in other browsers on specific hardware.

My guess is that the software is riddled with edge-case bugs.

devy20d ago

I wonder if Chrome dev team can corroborate on this finding in their crash reporting.

charcircuit20d ago

When I had bad memory, Firefox was the only program which would crash because of it. I think there is also something to say about how Firefox's design could be improved to handle them better.

d--b20d ago

Does anyone know how they can detect hardware defects like this? This sounds like an incredibly hard problem. And I don’t see how they can do this without impacting performance significantly.

rockdoe20d ago

If the crash is isolated (no other reports) and flipping one bit in the crashing pointer value would make the pointer valid, it's assumed to be a bitflip. This obviously will only catch a minor portion of bitflips, i.e. any image or video data with bitflips wouldn't crash.

From what he's saying they run an actual memory test after a crash, too.

lifeisstillgood20d ago

I’m pretty sure Torvalds tells a story of spending days hunting down a compiler bug, only to find it was memory, and then simply never using anything other than EC memory again.

10+% is huge

seanalltogether20d ago

He specifically mentions this story in the LTT video from a few months ago.

https://youtu.be/mfv0V1SxbNA?si=hS4ZMRYqqLXMkxJW&t=526

stnvh21d ago

Try running two instances of Firefox in parallel with different profiles, then do a normal quit / close operation on one after any use. Demons exist here.

quesera20d ago

Describe "demons"?

I run four Firefox instances simultaneously, most of the time. No issues to report.

1 more reply

rs_rs_rs_rs_rs20d ago

What is your expected behaviour?

AndriyKunitsyn20d ago

>That fancy ARM-based MacBook with RAM soldered on the CPU package? We've got plenty of crashes from those, good luck replacing that RAM without super-specialized equipment and an extraordinarily talented technician doing the job.

CPU caches and registers - how exactly are they different from a RAM on a SoC in this regard?

benjaminl20d ago

In just about every way. CPU caches are made from SRAM and live on the CPU itself. Main system RAM is made from DRAM and live on separate chips even if they are soldered into the same physical package (system in package or SiP). The RAM still isn't on the SoC.

1 more reply

phs250120d ago

For one thing, static vs dynamic RAM. Static RAM (which is what's used for your typical CPU cache) is implemented with flip-flops and doesn't need to be refreshed, reads aren't destructive like DRAM, etc.

stinkbeetle20d ago

At that level, they are not different. They could suffer from UE due to defect, marginal system (voltage, temperature, frequency), or radiation upset, suffer electromigration/aging, etc. And you can't replace them either.

CPUs tend to be built to tolerate upsets, like having ECC and parity in arrays and structures whereas the DRAM on a Macbook probably does not. But there is no objective standard for these things, and redundancy is not foolproof it is just another lever to move reliability equation with.

wmf20d ago

Caches and registers are also subject to bitflips. In many CPUs the caches use ECC so it's less of a problem. Intel did a study showing that many bits in registers are unused so flipping them doesn't cause problems.

darkhorn20d ago

What brands or types of memory cards are less likely to crash by bitflips?

estimator729220d ago

ECC

ptek20d ago

So does this mean bool true = 3 or should bool true = 5?

This will bloat the code a bit.

alok-g20d ago

Interesting. Seems like software could be made a notch more robust by encoding true and false with a larger number of bit differences.

1 more reply

pulkas20d ago

what happens if bitflip occurs while you are detecting bitflip?

bitflippin...

matja20d ago

I have a machine with a 6 year uptime that was slowly accumulating single bit error corrections. The EDAC counter mysteriously stopped at 308 last year, and hasn't changed since, so I wonder if a bitflip in the counter circuit made it stop...

est20d ago

so could software engineering sommehow catch those crashes?

aforwardslash20d ago

Going to be downvoted, but I call bullshit on this. Bitflips are frequent (and yes ECC is an improvement but does not solve the problem), but not that frequent. One can either assume users that enabled telemetry are an odd bunch with flaky hardware, or the implementation isnt actually detecting bitflips (potentially, as the messages indicate), but a plathora of problems. Having a 1/10 probability a given struct is either processed wrong, parsed wrong or saved wrong would have pretty severe effects in many, many scenarios - from image editing to cad. Also, bitflips on flaky hardware dont choose protection rings - it would also affect the OS routines such as reading/writing to devices and everything else that touches memory. Yup, i've seen plenty of faulty ram systems (many WinME crashes were actually caused by defective ram sticks that would run fine with W98), it doesnt choose browsers or applications.

dheera20d ago

It says 10% of crashes

If Firefox itself has so few bugs that it crashes very infrequently, it is not contradictory to what you are saying.

I wouldn't be surprised if 99% of crashes in my "hello world" script are caused by bit flips.

1 more reply

tempaccount505020d ago

How can you possibly be this confident if you don't know the number of times Firefox was run and number of bug reports submitted? Say it's run 100,000,000 times, 1000 reports are submitted, and 10 are bit flips. Seems reasonable. You're misinterpreting what they are saying.

1 more reply

groundzeros201520d ago

Also having worked in big software with many users, this also doesn't match the data we had.

The only explanation I can see is if Firefox is installed on a user base of incredibly low quality hardware.

aforwardslash20d ago

I forgot to mention - yes Im assuming 100% of firefox instances crash, if run long enough; I (still) use firefox as a second browser.

vsgherzi21d ago

is there a way to get the memory tester he mentioned? Is it open source? Once Ram goes bad is there a way or recovering it or is it toasted forever?

foresto21d ago

You can map known-bad memory regions to avoid using them.

https://www.memtest86.com/blacklist-ram-badram-badmemorylist...

1 more reply

vizzier21d ago

https://www.memtest86.com/

Errors may be caused by bad seating/contact in the slots or failing memory controllers (generally on the CPU nowadays) but if you have bad sticks they're generally done for.

RachelF20d ago

This is the best way of marking regions of RAM as bad in Windows:

https://github.com/prsyahmi/BadMemory

I've used it for many years. It only fixes physical hardware faults, not timing errors. For example if a RAM cell is damaged by radiation, not if you're overclocking your RAM.

wosined20d ago

The title should start with "Up to 10%"

phendrenad220d ago

Guesstimation at its finest.

1over13720d ago

Curious why this article is written into divided up chunks?

wmf20d ago

They're tweets.

wakawaka2820d ago

Ugh just write a real blog post dude.

eek212120d ago

Definitely going to hard disagree with Gabriele Svelto's take. I could point to the comments, however, let me bring up my own experiences across personal devices and organizational devices. In particular, note where he says this:

"I can't answer that question directly because crash reports have been designed so that they can't be tracked down to a single user. I could crunch the data to find the ones that are likely coming from the same machine, but it would require a bit of effort and it would still only be a rough estimate."

You can't claim any percentage if you don't know what you are measuring. Based on his hot take, I can run an overclocked machine have firefox crash a few hundred thousand times a day and he'll use my data to support his position. Further, see below:

First: A pre-text: I use Firefox, even now, despite what I post below. I use it because it is generally reliable, outside of specific pain points I mention, free, open source, compatible with most sites, and for now, is more privacy oriented than chrome.

Second: On both corporate and home devices, Firefox has shown to crash more often than Chrome/Chromium/Electron powered stuff. Only Safari on Windows beats it out in terms of crashes, and Safari on Windows is hot garbage. If bit flips were causing issues, why are chromium based browsers such as edge and Chrome so much more reliable?

Third: Admittedly, I do not pay close enough attention to know when Firefox sends crash reports, however, what I do know is that it thinks it crashes far more often than it does. A `sudo reboot` on linux, for example, will often make firefox think it crashed on my machine. (it didn't, Linux just kills everything quickly, flushes IO buffers, and reboots...and Firefox often can't even recover the session after...)

Fourth: some crashes ARE repeatable (see above), which means bit flips aren't the issue.

Just my thoughts.

jesup20d ago

force-kills like sudo reboot will show UI on restart indicating it didn't shut down cleanly, but that isn't reported as a crash. You can see how often you actually crash via about:crashes (and also see what happened)

hedora20d ago

Do you have any evidence that Firefox crashes more?

Also, the latest version of Safari for Windows was released in 2012. How old is your Firefox?

nickhodge20d ago

Rust would fix this. Oh wait.

titzer20d ago

I had a refurbished ThinkPad that had memory corruption. I only noticed because Firefox started to crash an unreasonable amount. Ran memcheck through BIOS and sure enough it was bad RAM.

Have we considered that maybe Firefox is the cause of bad memory?

sfink20d ago

It is.

If a tree falls in the forest with nobody around to hear it, does it make a sound?

If a computer flips bits while it's not doing anything with that memory, does it have bad RAM?

A fair number of people pretty much only use their computers as web browsers.

QED

wartywhoa2320d ago

But muh memory-safe Rust!!! :'(

nubinetwork21d ago

470k crashes in a week? Considering how low their market share is, that would suggest every install crashes several times a day... I gotta call bs.

refulgentis21d ago

470k crashes / week

67k crashes / day

claim: "Given # of installs is X; every install must be crashing several times a day"

We'll translate that to: "every install crashes 5 times a day"

67k crashes day / 5 crashes / install

12k installs

Your claim is there's 12k firefox users? Lol

bArray20d ago

> In the last week we received ~470000 crash reports, these do not represent all crashes because it's an opt-in system, the real number of crashes will be several times larger.

470k crashes in a single week, and this is under-reported! I bet the number of crashes is far higher. My snap Firefox on Ubuntu would lock-up, forcing me to kill it from the system monitor, and this was never reported as a crash.

Once upon a time I wrote software for safety critical systems in C/C++, where the code was deployed and expected to work for 10 years (or more) and interact with systems not built yet. Our system could lose power at any time (no battery) and we would have at best 1ms warning.

Even if Firefox moves to Rust, it will not resolve these issues. 5% of their crashes could be coming from resource exhaustion, likely mostly RAM - why is this not being checked prior to allocation? 5% of their crashes could be resolved tomorrow if they just checked how much RAM was available prior to trying to allocate it. That accounts for ~23k crashes a week. Madness.

With the RAM shortages and 8GB looking like it will remain the entry laptop norm, we need to start thinking more carefully about how software is developed.

NotGMan21d ago

>> In other words up to 10% of all the crashes Firefox users see are not software bugs, they're caused by hardware defects!

I find this impossible to believe.

If this were so all devs for apps, games, etc... would be talking about this but since this is the first time I'm hearing about this I'm seriously doubting this.

>> This is a bit skewed because users with flaky hardware will crash more often than users with functioning machines, but even then this dwarfs all the previous estimates I saw regarding this problem.

Might be the case, but 10% is still huge.

There imo has to be something else going on. Either their userbase/tracking is biased or something else...

4 more replies

j / k navigate · click thread line to collapse

486 comments

netcoyote21d ago

We'd save the test result to the registry and include the result in automated bug reports.

The common causes we discovered for the problem were:

- overclocked CPU

- bad memory wait-state configuration

- underpowered power supply

- overheating due to under-specced cooling fans or dusty intakes

And then a few more years on I learned about RowHammer attacks on memory, which was likely another cause -- the math computations we used were designed to hit a memory row quite frequently.

Sometimes I'm amazed that computers even work at all!

Incidentally, my contribution to all this was to write code to launch the browser upon test-failure, and load up a web page telling players to clean out their dusty computer fan-intakes.

PunchyHamster20d ago

Then my power supply finally died (it was cheap-ish, not cheap-est but it had few years already). I replaced it, lo and behold, memory errors were gone

dvngnt_20d ago

dpe8220d ago

Analemma_20d ago

Helmut1000120d ago

I don't understand why ECC memory is not the norm these days. It is only slightly more expensive, but solves all these problems. Some consumer mainboards even support it already.

8 more replies

jodrellblank20d ago

GPS location and movement data is what gives Google maps its near-real-time view of traffic on all roads, and busy-ness of all shops.

I think they collect location data from people riding public transport so they can tell you how long people wait on average at bus stops before getting on a bus.

Does Google collect atmospheric pressure readings from phone altimeters and use it for weather models? Could they?

Kindle collects details on books people read, how far they read, where they stop, which sections they highlight and quote, which words they look up in dictionaries.

2 more replies

mobilio20d ago

Yup!

I've read this decade ago... https://www.codeofhonor.com/blog/whose-bug-is-this-anyway

1 more reply

Modified301920d ago

Thanks to asrock motherboards for AMD’s threadripper 1950x working with ECC memory, that’s what I learned to overclock on.

I hate to think of how much effort will be spent debugging software in vain because of memory errors.

4 more replies

jug20d ago

1 more reply

aiiane20d ago

pndy21d ago

I didn't expect to read bits of GW story here from one of the founders - thanks!

arprocter20d ago

>Sometimes I'm amazed that computers even work at all!

Funny you say this, because for a good while I was running OC'd RAM

I didn't see any instability, but Event Viewer was a bloodbath - reducing the speed a few notches stopped the entries (iirc 3800MHz down to 3600)

cookiengineer20d ago

I kind of wanted to confirm that. At that time I was still using a Compaq business laptop on which I played Guild Wars.

The most obvious ones were 3d games like CS1.6, Guild Wars, NFSU(2), and CC Generals (though CCG running better/longer for whatever reason).

After that laptop I never bought these types of low budget business laptops again because the experience with the Turion64 was just so ridiculously bad.

monster_truck20d ago

samiv20d ago

Plot twist. The memory bit flip checking code was actually buggy and contained UB.

No, seriously did you actually verify the code for correctness before relying on it's results?

Dylan1680720d ago

> And then a few more years on I learned about RowHammer attacks on memory, which was likely another cause -- the math computations we used were designed to hit a memory row quite frequently.

For that one I'd guess no, because under normal circumstances hot locations like that will stay in cache.

taneq20d ago

nxobject20d ago

Oh god yes… Dell OptiPlexes and bad caps went together in those days. I’m half convinced Valve put the gray towers in Counter-Strike so IT employees wasting time could shoot them up for therapy.

Agentlien20d ago

The vast majority of crashes came from two buckets:

1. PCs running below our minimum specs

2. Bugs in MSI Afterburner.

1 more reply

andrepd20d ago

Amazing story! Reminds me of old gamasutra posts like these https://web.archive.org/web/20170522151205/http://www.gamasu...

PaulHoule20d ago

I dialed the machine back to the rated speed but it failed completely within 6 months.

jiggawatts20d ago

1 more reply

benatkin20d ago

Yikes. Dude, you're getting a Packard Bell.

sidewndr4620d ago

Well wow I wasn't expecting to see yet another story from Patrick Wyatt here in the comments! Much appreciated, I've enjoyed reading everything you've written over the years.

fennecbutt20d ago

That's awesome. But also guild waaars, GW2 I played from beta for years, but it just got boring. Endless expansions with weird story.

We need GW3 already but my fear is mmo as a genre is dying.

1 more reply

SunnyNeon20d ago

How did you determine which of the causes it was?

danielEM20d ago

> problems because Dell sourced the absolute cheapest stuff for their computers;

hsbauauvhabzb20d ago

Did you/he ever consider redundant allocation for high value content and hash checks for low value assets that are still important?

I imagine the largest volume of game memory consumption is media assets which if corrupted would really matter, and the storage requirement for important content would be reasonably negligible?

3 more replies

just_testing20d ago

I loved reading your comment and got curious: how he detected the bitflips?

1 more reply

Salgat20d ago

Mike is such a legend.

yownie19d ago

this exactly the type of stories I come to HN to read, thanks!

rurban20d ago

I hate HW soo much. To revise the biggest problems in computing, beside out of tokens: HW bugs

Animats20d ago

ECC should have become standard around the time memories passed 1GB.

It's seriously annoying that ECC memory is hard to get and expensive, but memory with useless LEDs attached is cheap.

loeg20d ago

2 more replies

justin6620d ago

> ECC should have become standard around the time memories passed 1GB.

WatchDog20d ago

All DDR5 ram has some amount of error correction built in, because DDR5 is much more prone to bit flipping, it requires it.

I'm not really sure if this makes it overall more or less reliable than DDR2/3/4 without ECC though.

4 more replies

tombert20d ago

I am not sure I've ever seen a laptop that has ECC memory. I'm sure they exist but I don't think I've seen it.

1 more reply

oybng20d ago

For the unaware, Intel is to blame for this

1 more reply

aforwardslash20d ago

3 more replies

ece20d ago

That said, memory DIMM capacity increases with even a small chance of bit-flips means lots of people will still be affected.

hedora20d ago

ECC is standard at this point (current RAM flips so many bits it's basically mandatory). Also, most CPUs have "machine checks" that are supposed to detect incorrect computations + alert the OS.

However, there are still gaps. For one thing, the OS has to be configured to listen for + act on machine check exceptions.

It's frustrating that they didn't mandate error detection / correction there, but I guess the industry runs on price discrimination, so most people can't have nice things.

adonovan21d ago

In some cases we have even seen crashes in non-memory instructions (e.g. MOV ZR, R1), which implicates misexecution: a fault in the CPU (or a bug in the telemetry bookkeeping, I suppose).

I would love to get definitive confirmation though. I wonder what test the Firefox team runs on memory in their crash reporting software.

aforwardslash20d ago

> In some cases we have even seen crashes in non-memory instructions (e.g. MOV ZR, R1), which implicates misexecution: a fault in the CPU (or a bug in the telemetry bookkeeping, I suppose).

1 more reply

nitwit00520d ago

You might consider adding the CPU temperature to the report, if there's a reasonable way to get it (haven't tried inside a VM). Then you could at least filter out extremely hot hardware.

1 more reply

jamesfinlayson20d ago

I never dug too deeply but the app is still running on some out of support iPads so maybe it's random bit flips.

sieep20d ago

Ive been trying to push my boss towards more analytics/telemetry in production that focus on crashes, thanks for sharing.

tczMUFlmoNk20d ago

> Even with only about 1 in 1000 users enabling telemetry

How do you know the number/proportion of users who run without telemetry enabled, since by definition you're not collecting their data?

(Not imputing any malice, genuinely curious.)

1 more reply

charcircuit20d ago

>All of these point to memory corruption.

Actually "dereferencing a pointer that had just passed a nil check" could be from a flow control fault where the branch fails to be taken correctly.

kleiba20d ago

Also, this whole crashing business is a fairly recent effect - I've been running firefox for forever and I cannot remember when it last was as much of an issue as it has become recently for me.

tuetuopay20d ago

Just check your memory with memtest.

Browsers are good candidates to find bad memory: they eat a lot of ram, they scatter data around, they have a large chunk, and have JITs where a lot of machine code gets loaded left and right.

2 more replies

vultour20d ago

I spend probably thousands of hours in Firefox every year and I don't think I've ever had it crash.

1 more reply

mathw20d ago

Agingcoder20d ago

It depends on what you bitflip.

I once had a bitflip pattern causing lowercase ascii to turn into uppercase ascii in a case insensitive system. Everything was fine until it tried to uppercase numbers and things went wrong

Flipping a pixel had no impact though

pflanze20d ago

1 more reply

bmicraft20d ago

I've had some very bad ram (lots of errors found when tested) and consistently the only thing that actually crashed because of it was Firefox.

zvqcMMV6Zcr20d ago

haspok20d ago

My suspicion has always been some kind of a memory leak, but memory corruption also makes sense.

Unfortunately, Chrome (which I use for work - Firefox is for private stuff) has NEVER crashed on me yet. Certainly not in the past 5 years. Which is odd. I'm on Linux btw.

3 more replies

jlarocco20d ago

Firefox has a long history of denying problems, blaming the user, and fixing the issue years later.

It used to be memory usage, now it's crashing.

1 more reply

lqet20d ago

> Firefox is about the only piece of software in my setup that occasionally crashes.

I would add Thunderbird to that list.

xxs20d ago

run y-cruncher if you'd like to test memory and overall stability. It's decent test and a lot better than memtest (in my experience)

LunaSea20d ago

If only the had written Firefox in Rust, they wouldn't have had these issues .

bob102920d ago

rcbdev20d ago

Hyrum's law in action.

https://xkcd.com/1172/

thegrim3321d ago

rincebrain21d ago

There's also a fun prior example of experiments in this at [3], when someone camped on single-bit differences of a bunch of popular domains and examined how often people hit them.

edit: Finally, digging through the Mozilla source, I would imagine [4] is what they're using as a tester when it crashes.

[1] - https://github.com/mozilla-firefox/firefox/commit/917c4a6bfa...

[2] - https://bugzilla.mozilla.org/show_bug.cgi?id=1762568

[3] - https://media.defcon.org/DEF%20CON%2019/DEF%20CON%2019%20pre...

[4] - https://github.com/mozilla-firefox/firefox/blob/main/toolkit...

2 more replies

tredre321d ago

> last year we deployed an actual memory tester that runs on user machines after the browser crashes.

He doesn't explain anything indeed but presumably that code is available somewhere.

2 more replies

hexyl_C_gut21d ago

It sounds like they don't know that the crashes are from bitflips but those crashes are from people with flaky memory which probably caused the crash?

wmf20d ago

A common case is a pointer that points to unallocated address space triggers a segfault and when you look at the pointer you can see that it's valid except for one bit.

1 more reply

hrmtst9383720d ago

1 more reply

KenoFischer20d ago

I'll submit my bit flip story for consideration also :) https://julialang.org/blog/2020/09/rr-memory-magic/

shevy-java20d ago

> In other words up to 10% of all the crashes Firefox users see are not software bugs, they're caused by hardware defects!

Bold claim. From my gut feeling this must be incorrect; I don't seem to get the same amount of crashes using chromium-based browsers such as thorium.

WhatsTheBigIdea20d ago

Your gut may be leading you astray?

I also find that firefox crashes much more than chrome based browsers, but it is likely that chrome's superior stability is better handing of the other 90% of crashes.

1 more reply

LM35820d ago

10% of crashes does not imply 10% of your crashes.

1 more reply

BeetleB20d ago

Are people getting so many FF crashes? Mine rarely does. I leave it running, opening and closing tabs, for weeks on end.

12 more replies

bsder20d ago

> Bold claim. From my gut feeling this must be incorrect

RAM flips are common. This kind of thing is old and has likely gotten worse.

ZFS is really good at spotting RAM flips.

bichiliad20d ago

saati20d ago

I haven't seen a single firefox or chrome crash in months now, you should really stress-test your hardware.

3 more replies

nimih20d ago

> Bold claim.

I agree. Good thing he doesn't back up his claim with any sort of evidence or reasoned argument, or you'd look like a huge moron!

1 more reply

shakna20d ago

Chromium has better handling for bitflip errors. Mostly due to the Discardable buffers they make such extensive use of.

The hardware bugs are there. They're just handled.

1 more reply

hedora20d ago

I've had zero crashes in safari, ff or chrome in recent memory (except maybe OOMs). (Though I don't use Windows, so maybe that's part of the reason stuff just works?)

Perhaps you're part of the group driving hardware crashes up to 10% and need to fix your machine.

1 more reply

sfink20d ago

There's a very good chance your system also does not have flaky memory. Most don't. You're not contradicting the post.

Zambyte20d ago

What do you mean "the same amount"? If your browser never crashes, 10% of zero is zero.

pizza23420d ago

>> In other words up to 10% of all the crashes Firefox users see are not software bugs, they're caused by hardware defects!

> Bold claim. From my gut feeling this must be incorrect; I don't seem to get the same amount of crashes using chromium-based browsers such as thorium.

estimator729220d ago

He addresses this in the thread.

phyzome20d ago

...normally browsers don't crash at all. Something's wrong with your computer.

maxerickson20d ago

I mean, I've had quite some number of crashes that I can't correlate to anything.

Hardware problems are just as good a potential explanation for those as anything else.

cellular20d ago

Maybe if Firefox tabs weren't such a memory hog it would be only 0.005% !

KennyBlanken20d ago

"Software engineer thinks everyone's hardware is broken, couldn't possibly be bugs in his code" sums it up about right.

camkego21d ago

It is rumored heavily on HN that when the first employee of Google, Craig Silverstein was asked about his biggest regret, he said: "Not pushing for ECC memory."

keyringlight20d ago

adonovan21d ago

srean21d ago

One of the nicest guys I have met. Was an intern at Google at that time, firing off mapreduces then (2003-2004) was quite a blast. The Peter Weinberger theme T-shirt too.

kdklol21d ago

bergheim20d ago

1: I know this because I installed https://addons.mozilla.org/en-US/firefox/addon/tab-counter-p... to check :)

2: However after finding Karakeep I don't actually have 1000 tabs anymore!

andoando20d ago

I dont get the people with 10+ tabs open drives me crazy, how do you even know whats what?

Just bookmark shit you want to keep!

petterroea20d ago

bhelkey20d ago

I would love to see DDR4 vs DDR5 bitflips. As I understand it DDR5 must come with some level of ECC [1].

[1] https://www.corsair.com/us/en/explorer/diy-builder/memory/is...

drpixie20d ago

From Corsair

kevin_thibedeau20d ago

DDR5 comes with marginal DRAM that is patched up with ECC to boost yields. It's not the same as fully reliable RAM.

2 more replies

silon4220d ago

I wish also for desktop vs laptop ram comparison.

moconnor20d ago

Turned out at their altitude cosmic rays were flipping bits in the top-most machines in the racks, sometimes then penetrating lower and flipping bits in more machines too.

newscracker20d ago

This is quite surprising to me, since I thought the percentage would be a lot lesser.

But I don’t really know what the Firefox team does with crash reports and in making Firefox almost crash proof.

I would sincerely appreciate Firefox not crashing as often for me.

ordu20d ago

> this crash-on-exit happens due to many different causes, as seen in the crash reports

It points to the same direction: all these different causes are just symptoms, the root cause is hiding deeper, and it is triggered by the firefox stopping.

It is all is not a guarantee that the root cause is bitflips, but you can rule it out by testing your memory.

1 more reply

rebelwebmaster20d ago

Can you share a link to a crash report from about:crashes? Sounds like some kind of shutdown hang getting force-killed maybe?

INTPenis20d ago

That's super interesting because I remember Linus Torvalds saying he requires ECC RAM in his computers, because he got tired of weird issues that were resolved by a reboot.

But non-ECC is fine for most of us mortals gaming and streaming.

I would expect pro gamers to opt for ECC though.

tredre321d ago

Crashes caused by resource exhaustion are still software bugs in Firefox. At least on sane operating systems where memory isn't over-comitted.

rockdoe20d ago

What's the expected behavior of a JavaScript program that allocates all memory on the machine?

1 more reply

LorenPechtel21d ago

Memory isn't the only resource.

soletta20d ago

sfink20d ago

fooker20d ago

This seems like the kind of metric that 3 users with 15 year old machines can skew significantly.

Has to be normalized, and outliers eliminated in some consistent manner.

rockdoe20d ago

I'm pretty sure I saw them present on exactly this at FOSDEM?

conartist621d ago

kev00920d ago

It's high enough that I would wonder if some systems software issues are mixed in, like rare races in malloc or page table management.

jurakovic20d ago

There is this app https://github.com/Smerity/bitflipped _Your computer is a cosmic ray detector. Literally._

ryukoposting20d ago

It's worth noting that the thread says "up to 10%," not "10%" as the title suggests. So it's reasonable to believe the rate is as low as 5% based on the only real figure given (25000 / 470000)

Habgdnv20d ago

mrguyorama21d ago

People I think are overindexing on this being about "Bad hardware".

They found 25,000 to 70,000 errors per billion device hours per Mbit and more than 8% of DIMMs affected by errors per year.

At the time, they did not see an increase in this rate in "new" RAM technologies, which I think is DDR3 at that time. I wonder if there has been any change since then.

jmalicki21d ago

There is DRAM which is mildly defective but got past QC.

There are power suppliers that are mildly defective but got past QC.

There are server designs where the memory is exposed to EMI and voltage differences that push it to violate ever more slightly that push it past QC.

Hardware isn't "good" or "bad", almost all chips produced probably have undetected mild defects.

There are a ton of causes for bitflips other than cosmic rays.

It also found significantly higher rates as RAM ages! There are a ton of physical properties that can cause this, especially when running 24/7 at high temperatures.

morelikeborelax20d ago

I used to partake in all RAM discussions online. Here, reddit, every technical hardware forum and anywhere workstations were being talked about.

I've discussed it for 20 years since the first Mac Pro and people just did not want to hear that it had any use. Even after the Google study.

Consumers giving professionals advice. Was same with workstation graphics cards.

hinkley20d ago

I wonder sometimes if we shouldn't be doing like NASA does and triple-storing values and comparing the calculations to see if they get the same results.

1 more reply

shiroiuma21d ago

It'd be interesting to see how your experience would differ if you put it to sleep at night after switching to ECC RAM.

Unfortunately, not that many consumer platforms make this possible or affordable.

1 more reply

dbolgheroni20d ago

spiffy202520d ago

Travis Long had done something similar in 2022 at Mozilla.

https://blog.mozilla.org/data/2022/04/13/this-week-in-glean-...

fasteo20d ago

>>> In the last week we received ~470000 crash reports, these do not represent all crashes because it's an opt-in system, the real number of crashes will be several times larger

Having the number of unique machines would be great to see how skewed this estimate is.

sfink20d ago

To be fully accurate, it would also require tracking unique machines when collecting crash reports.

bilekas20d ago

Just out of interest is ECC memory supposed to me more resilient to these types of failure?

Neil4420d ago

danbruc20d ago

I guess the percentage of crashes due to hardware is high because people with faulty hardware are experiencing the vast majority of crashes.

So what you observe will depend on the prevalence of faulty hardware and how long it takes to hit an hardware issue vs how buggy the code is and how long it takes to hit a bug.

samus20d ago

kmoser21d ago

titaniumtravel21d ago

The easy solution for that is to just do that analysis locally... Firefox doesn't submit the full core dumps anyhow for this exact reason and therefore needs to do some preprocessing in any case.

sfink20d ago

I think the firefox crash reporter does now? It does a limited memory scan and reports problems it finds. No privacy violations required.

shiroiuma21d ago

>The next logical step would be to somehow inform users so they could take action to replace the bad memory.

This isn't really feasible: have you looked at memory prices lately? The users can't afford to replace bad memory now.

3 more replies

brador21d ago

How many are caused by cosmic radiation bitflips?

emmelaich20d ago

An SO question indicates "10 GB of memory should show an ECC event every 1,000 to 10,000 hours,"

https://stackoverflow.com/questions/2580933/cosmic-rays-what...

bakugo20d ago

shevy-java20d ago

It could be that firefox is written inefficiently though.

1 more reply

fastaguy8819d ago

I can certainly imagine that a very small fraction of Firefox users are generating these results, so that bit flips are not a problem generally.

Grisu_FTP17d ago

Since i have seen this video this question has been in my mind from time to time.

strongpigeon20d ago

_0xdd20d ago

So, why aren't we all using ECC in 2026?

matja20d ago

Doesn't sell as well as a beautiful citrus blush milled aluminium case.

lunar_rover20d ago

Intel intentionally ripped ECC out of the sweet spot products to charge premium and unfortunately they succeeded.

Pentium G4560 supports ECC, Core i7 10700 doesn't.

1 more reply

haspok20d ago

Because 99% of laptops don't have it, and can't be memory upgraded?

stinkbeetle20d ago

This matches what I have long said, which is that adding ECC memory to consumer devices will not result in any incredible stability improvement. It will barely be a blip really.

How often does your computer or browser crash? How many times per year? About 2-3 for me that I can remember. So in 50 years I might save myself one or two crashes if I had ECC.

CamouflagedKiwi20d ago

This is a pretty big claim which seems to imply this is much more common than expected, but there's no real information here and the numbers don't even stack up:

sinuhe6920d ago

SeanSullivan8620d ago

Hmm, can someone educate me here? Why don't bit flips ever seem to impact the results of calculations in settings like big-data analytics on AWS?

Is it a difference between server hardware managed by knowledgeable people and random hardware thrown together by home PC builders?

huhhuh20d ago

In Belgium elections, a party received 4096 unaccounted votes likely due to a bit flip: https://en.wikipedia.org/wiki/Electronic_voting_in_Belgium#R....

zadikian20d ago

Servers and pro workstations normally have ECC RAM.

matja20d ago

You can only detect what you measure. Are these big-data analytics processes running multiple times to detect differences?

OkGoDoIt20d ago

Presumably professional hardware uses ECC memory, which automatically corrects these kinds of errors.

m304720d ago

Stucke's talk about DNS being hazardous to your health is one of my all time favorites: https://www.youtube.com/watch?v=4PSc9BJDWhM

dana32120d ago

And.. how do they not know its their software being leaky and causing these bitflips?

These are potential bitflips.

I found an issue only yesterday in firefox that does not happen in other browsers on specific hardware.

My guess is that the software is riddled with edge-case bugs.

devy20d ago

I wonder if Chrome dev team can corroborate on this finding in their crash reporting.

charcircuit20d ago

When I had bad memory, Firefox was the only program which would crash because of it. I think there is also something to say about how Firefox's design could be improved to handle them better.

d--b20d ago

Does anyone know how they can detect hardware defects like this? This sounds like an incredibly hard problem. And I don’t see how they can do this without impacting performance significantly.

rockdoe20d ago

From what he's saying they run an actual memory test after a crash, too.

lifeisstillgood20d ago

I’m pretty sure Torvalds tells a story of spending days hunting down a compiler bug, only to find it was memory, and then simply never using anything other than EC memory again.

10+% is huge

seanalltogether20d ago

He specifically mentions this story in the LTT video from a few months ago.

https://youtu.be/mfv0V1SxbNA?si=hS4ZMRYqqLXMkxJW&t=526

stnvh21d ago

Try running two instances of Firefox in parallel with different profiles, then do a normal quit / close operation on one after any use. Demons exist here.

quesera20d ago

Describe "demons"?

I run four Firefox instances simultaneously, most of the time. No issues to report.

1 more reply

rs_rs_rs_rs_rs20d ago

What is your expected behaviour?

AndriyKunitsyn20d ago

CPU caches and registers - how exactly are they different from a RAM on a SoC in this regard?

benjaminl20d ago

1 more reply

phs250120d ago

stinkbeetle20d ago

wmf20d ago

darkhorn20d ago

What brands or types of memory cards are less likely to crash by bitflips?

estimator729220d ago

ECC

ptek20d ago

So does this mean bool true = 3 or should bool true = 5?

This will bloat the code a bit.

alok-g20d ago

Interesting. Seems like software could be made a notch more robust by encoding true and false with a larger number of bit differences.

1 more reply

pulkas20d ago

what happens if bitflip occurs while you are detecting bitflip?

bitflippin...

matja20d ago

est20d ago

so could software engineering sommehow catch those crashes?

aforwardslash20d ago

dheera20d ago

It says 10% of crashes

If Firefox itself has so few bugs that it crashes very infrequently, it is not contradictory to what you are saying.

I wouldn't be surprised if 99% of crashes in my "hello world" script are caused by bit flips.

1 more reply

tempaccount505020d ago

1 more reply

groundzeros201520d ago

Also having worked in big software with many users, this also doesn't match the data we had.

The only explanation I can see is if Firefox is installed on a user base of incredibly low quality hardware.

aforwardslash20d ago

I forgot to mention - yes Im assuming 100% of firefox instances crash, if run long enough; I (still) use firefox as a second browser.

vsgherzi21d ago

is there a way to get the memory tester he mentioned? Is it open source? Once Ram goes bad is there a way or recovering it or is it toasted forever?

foresto21d ago

You can map known-bad memory regions to avoid using them.

https://www.memtest86.com/blacklist-ram-badram-badmemorylist...

1 more reply

vizzier21d ago

https://www.memtest86.com/

Errors may be caused by bad seating/contact in the slots or failing memory controllers (generally on the CPU nowadays) but if you have bad sticks they're generally done for.

RachelF20d ago

This is the best way of marking regions of RAM as bad in Windows:

https://github.com/prsyahmi/BadMemory

I've used it for many years. It only fixes physical hardware faults, not timing errors. For example if a RAM cell is damaged by radiation, not if you're overclocking your RAM.

wosined20d ago

The title should start with "Up to 10%"

phendrenad220d ago

Guesstimation at its finest.

1over13720d ago

Curious why this article is written into divided up chunks?

wmf20d ago

They're tweets.

wakawaka2820d ago

Ugh just write a real blog post dude.

eek212120d ago

Fourth: some crashes ARE repeatable (see above), which means bit flips aren't the issue.

Just my thoughts.

jesup20d ago

hedora20d ago

Do you have any evidence that Firefox crashes more?

Also, the latest version of Safari for Windows was released in 2012. How old is your Firefox?

nickhodge20d ago

Rust would fix this. Oh wait.

titzer20d ago

I had a refurbished ThinkPad that had memory corruption. I only noticed because Firefox started to crash an unreasonable amount. Ran memcheck through BIOS and sure enough it was bad RAM.

Have we considered that maybe Firefox is the cause of bad memory?

sfink20d ago

It is.

If a tree falls in the forest with nobody around to hear it, does it make a sound?

If a computer flips bits while it's not doing anything with that memory, does it have bad RAM?

A fair number of people pretty much only use their computers as web browsers.

QED

wartywhoa2320d ago

But muh memory-safe Rust!!! :'(

nubinetwork21d ago

470k crashes in a week? Considering how low their market share is, that would suggest every install crashes several times a day... I gotta call bs.

refulgentis21d ago

470k crashes / week

67k crashes / day

claim: "Given # of installs is X; every install must be crashing several times a day"

We'll translate that to: "every install crashes 5 times a day"

67k crashes day / 5 crashes / install

12k installs

Your claim is there's 12k firefox users? Lol

bArray20d ago

> In the last week we received ~470000 crash reports, these do not represent all crashes because it's an opt-in system, the real number of crashes will be several times larger.

With the RAM shortages and 8GB looking like it will remain the entry laptop norm, we need to start thinking more carefully about how software is developed.

NotGMan21d ago

>> In other words up to 10% of all the crashes Firefox users see are not software bugs, they're caused by hardware defects!

I find this impossible to believe.

If this were so all devs for apps, games, etc... would be talking about this but since this is the first time I'm hearing about this I'm seriously doubting this.

Might be the case, but 10% is still huge.

There imo has to be something else going on. Either their userbase/tracking is biased or something else...

4 more replies

j / k navigate · click thread line to collapse