Google and read up - it is a problem, has killed people, has thrown election results, and much more.
It's such a common problem that bitsquatting is a real thing :)
Want to do an experiment? Pick a bitsquatted domain for a common site, and see how often you get hits.
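For anyone who wants to try the experiment, here is a minimal sketch of how the candidate names could be generated. It assumes a simple `name.tld` domain and only flips bits in the first label; the function name `bitsquat_candidates` and the use of `example.com` are purely illustrative, not taken from any existing tool.

```python
import string

# Characters allowed in a DNS label (hostnames are case-insensitive).
VALID = set(string.ascii_lowercase + string.digits + "-")

def bitsquat_candidates(domain):
    """Return names that differ from `domain` by exactly one flipped bit.

    Simplification: only the first label (before the first dot) is mutated.
    """
    name, dot, rest = domain.lower().partition(".")
    out = set()
    for i, ch in enumerate(name):
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit))
            if flipped not in VALID:
                continue  # result is not a legal hostname character
            candidate = name[:i] + flipped + name[i + 1:]
            # A label may not start or end with a hyphen.
            if candidate.startswith("-") or candidate.endswith("-"):
                continue
            out.add(candidate + dot + rest)
    return sorted(out)

if __name__ == "__main__":
    for c in bitsquat_candidates("example.com"):
        print(c)
```

Each printed name is one single-bit memory error away from the original. Whether a given candidate is still unregistered has to be checked separately.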
As for the case of bitflips killing someone: Bitflips are not the root cause here. The root cause is that somebody engineered something life-critical that mistakenly assumed hardware can not fail. Bitflips are just one of many reasons for hardware failure.
So those systems didn't fail when a bitflip happened?
> The root cause is that somebody engineered something life-critical that mistakenly assumed hardware can not fail.
The systems I am aware of were designed with bitflips in mind. NO software can handle arbitrary amounts of bitflips. ALL software designed to mitigate bitflips only lowers the odds via various forms of redundancy. (For context, I've written code for NASA, written a few proposals on making things more radiation hardened, and my PhD thesis was on a new class of error correcting codes - so I do know a little about making redundant software and hardware specifically designed to mitigate bitflips).
By claiming a bitflip didn't kick off the problems, and trying to push the cause elsewhere, you may as well blame all of engineering for making a device that can kill on failure.
So your argument is a red herring
>On the whole, you fail to make a case that preventing bitflips is the solution to a problem
Yes, had those bitflips been prevented, or not happened, those fatalities would not have happened.
>Ya, I'm not buying that bitflips are a problem.
If bitflips are not a problem then we don't need ECC ram (or ECC almost anything!) which is clearly used a lot. So bitflips are enough of a problem that a massively widespread technology is in place to handle precisely that problem.
I guess you've never written a program and watched bits flip on computers you control? You should try it - it's a good exercise to see how often it does happen.
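A rough sketch of what such a program could look like: fill a large buffer with a known pattern, then rescan it periodically and report any byte that changed. The names `find_flips` and `watch`, the buffer size, and the scan interval are all assumptions for illustration. On a machine with ECC RAM, corrected errors will never show up here, and any hit could just as well be a failing DIMM as a cosmic ray.

```python
import time

PATTERN = 0xAA  # 10101010, so flips in either direction are visible

def find_flips(buf, pattern=PATTERN):
    """Return (offset, observed_byte, flipped_bits_mask) for every changed byte."""
    if buf.count(pattern) == len(buf):  # fast path: nothing changed
        return []
    return [(i, b, b ^ pattern) for i, b in enumerate(buf) if b != pattern]

def watch(size_bytes=64 * 1024 * 1024, interval_s=60.0, rounds=None):
    """Hold a patterned buffer in memory and rescan it every `interval_s` seconds.

    Runs forever when `rounds` is None. Returns the number of flips seen.
    """
    buf = bytearray([PATTERN]) * size_bytes
    total = 0
    done = 0
    while rounds is None or done < rounds:
        time.sleep(interval_s)
        for offset, seen, mask in find_flips(buf):
            print(f"{time.ctime()}: offset {offset:#x} read {seen:#04x}, "
                  f"flipped bits {mask:#010b}", flush=True)
            buf[offset] = PATTERN  # restore so the same flip is not reported twice
            total += 1
        done += 1
    return total
```

Calling `watch()` and leaving it running for days or weeks is the idea. A larger buffer and a longer run both raise the odds of catching something.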
I guess you define something being a problem differently than I or the ECC ram industry do.
I didn't say that. I'm saying that the root cause (as in "root cause analysis") is not the bitflip. Designating the bitflip as the root cause is like analyzing your drunk driving accident and concluding that the root cause must be ethanol, rather than your drinking habits.
> The systems I am aware of were designed with bitflips in mind. NO software can handle arbitrary amounts of bitflips. ALL software designed to mitigate bitflips only lowers the odds via various forms of redundancy.
Of course, and I'm not actually arguing that adding in ECC is completely worthless to that effect, though it is close to worthless. Luckily, ECC is quite cheap, if not free, so throwing it in there makes sense.
However, suppose ECC would increase the cost by several magnitudes, would it still be worth it? Obviously not. Redundancy alone reduces the probability of spurious failure by several magnitudes, and simply increasing redundancy would be far cheaper than adding in ECC.
> If bitflips are not a problem then we don't need ECC ram (or ECC almost anything!) which is clearly used a lot. So bitflips are enough of a problem that a massively widespread technology is in place to handle precisely that problem.
My point is that either bitflips don't really matter, in case data integrity is not mission critical, or ECC doesn't actually solve the problem, in case data integrity is mission critical.
If you have solved the problem of data integrity through redundancy, then ECC doesn't make much of a difference anymore. If you haven't solved the problem, then ECC will only prevent a vanishingly small subset of disasters that are awaiting you.
> I guess you've never written a program and watched bits flip on computers you control? You should try it - it's a good exercise to see how often it does happen.
I don't care how often it happens. I care about the odds of a bitflip causing an actual problem. If a computer crashes, that's okay, it'll reboot. If any data were to be corrupted, it would most likely happen at the disk level and not the DRAM level.
> I guess you define something being a problem differently than I or the ECC ram industry do.
Of course, somebody who sells ECC RAM will want to convince you that ECC actually solves a real problem. The same can be said about the nutritional supplement industry, or many other industries that rely on make-believe.
Yes, that is clear.
> If you have solved the problem of data integrity...
As above, this is not a binary, black and white thing, but you keep presenting it as such. It's probabilistic, and higher protection is not free - the tradeoff is engineering.
> Redundancy alone reduces the probability of spurious failure by several magnitudes
ECC "alone reduces the probability of spurious failure by several magnitudes". That's why it is used.
Naive redundancy ignores almost a century of better methods from forward error correcting codes. I have a feeling your idea of redundancy is having multiple exact copies of a system or data and having them vote, which is a terribly expensive way to do data protection when there are vastly better methods.
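To make the cost difference concrete, here is a small sketch comparing triple-copy voting against a textbook Hamming(7,4) code. Both correct any single flipped bit in 4 data bits, but the voting scheme stores 12 bits while the Hamming code stores 7. This is a toy illustration with made-up function names, not the code any particular ECC memory uses; ECC DRAM typically uses a wider SECDED code with 8 check bits per 64 data bits.

```python
def hamming74_encode(d):
    """4 data bits -> 7-bit codeword laid out as [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of the error, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

def triple_encode(d):
    """Naive redundancy: store three full copies."""
    return list(d) * 3

def triple_decode(c):
    """Majority vote across the three copies, bit by bit."""
    n = len(c) // 3
    return [1 if c[i] + c[i + n] + c[i + 2 * n] >= 2 else 0 for i in range(n)]

if __name__ == "__main__":
    data = [1, 0, 1, 1]
    for name, enc, dec in (("hamming(7,4)", hamming74_encode, hamming74_decode),
                           ("triple copy", triple_encode, triple_decode)):
        word = enc(data)
        word[2] ^= 1  # simulate a single bitflip
        print(name, "stored bits:", len(word), "recovered:", dec(word) == data)
```

The gap widens as the word gets longer: the number of check bits in a Hamming-style code grows roughly with the logarithm of the data length, while copying always costs 200% extra.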
>Of course, somebody who sells ECC RAM will want to convince you that ECC actually solves a real problem. The same can be said about the nutritional supplement industry, or many other industries that rely on make-believe.
And we're done. If you don't think ECC helps a real problem then I see why you don't understand bitflips causing problems. Good luck.