(Of course, in extreme scenarios, like at Google scale, even ECC can fail to catch multi-bit errors, but in almost all non-pathological scenarios, SECDED[1] is enough to catch all erroneous cases.)
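For the curious, here's a toy Python sketch of the SECDED idea: a Hamming(7,4) code plus an overall parity bit, so any single-bit flip is corrected and any double-bit flip is detected. Real ECC DIMMs use a (72,64) code, but the mechanics are the same:

```python
# Toy SECDED: Hamming(7,4) plus an overall parity bit at position 0.
# One flipped bit gets corrected; two flipped bits get detected.

def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    c = [0] * 8
    c[3], c[5], c[6], c[7] = d[0], d[1], d[2], d[3]
    c[1] = c[3] ^ c[5] ^ c[7]                  # covers positions 1,3,5,7
    c[2] = c[3] ^ c[6] ^ c[7]                  # covers positions 2,3,6,7
    c[4] = c[5] ^ c[6] ^ c[7]                  # covers positions 4,5,6,7
    c[0] = c[1]^c[2]^c[3]^c[4]^c[5]^c[6]^c[7]  # overall parity
    return c

def decode(c):
    """Return (status, corrected nibble or None)."""
    s = 0                                      # syndrome = error position
    if c[1] ^ c[3] ^ c[5] ^ c[7]: s |= 1
    if c[2] ^ c[3] ^ c[6] ^ c[7]: s |= 2
    if c[4] ^ c[5] ^ c[6] ^ c[7]: s |= 4
    parity_ok = (c[0] == c[1]^c[2]^c[3]^c[4]^c[5]^c[6]^c[7])
    if s == 0 and parity_ok:
        status = "clean"
    elif not parity_ok:
        status = "corrected"                   # single-bit error
        if s: c[s] ^= 1                        # s == 0 means p0 itself flipped
    else:
        return "uncorrectable", None           # double-bit error detected
    nib = c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)
    return status, nib

word = encode(0b1011)
word[6] ^= 1                                   # one flip: corrected
print(decode(word))                            # ('corrected', 11)
word = encode(0b1011)
word[6] ^= 1; word[2] ^= 1                     # two flips: detected, not corrected
print(decode(word))                            # ('uncorrectable', None)
```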
Intel deciding that consumers (including those buying Haswell-E CPUs) do not need ECC really irks me. It's textbook market segmentation from a near monopoly.
Currently you cannot have your cake and eat it:
You cannot have the best single-thread performance (offered by overclocking the Haswell-E series or the Skylake 6700K) and also have ECC.
So if one is building the ultimate workstation, you face a hard choice: do you go with the X99 chipset (no ECC, but you can overclock), or do you go with server motherboards built on the C610 chipset, which are quite limited as far as consumer interests go?
The Intel mobile Xeons are interesting: they now provide an avenue for ECC on a laptop.
Generally, if you are willing to give up a single clock bin in exchange for ECC, you end up with a cheaper (and cooler) system that's more reliable. Likewise, if you want the cheapest 4c/8t CPU, it's a Xeon, NOT an i7.
I don't feel particularly artificially segmented. Additionally, the high-end desktop motherboards tend to be more expensive than the server boards: I often find a nice server board at $180, while the nice desktop boards are often another $100. Sure, they are marketed to gamers, but I really just want nice, reliable power and cooling, and it's not clear which of the cheaper desktop boards are really going to last 24/7 for 5 years.
Today I'd buy the E3-1270 for $339 over the $350 i7-6700K. Keep in mind the K chips carry a premium AND they don't come with a fan like the non-K chips do. Sure, it's 3.6–4.0 GHz instead of 4.0–4.2 GHz, but that's not a particularly noticeable difference, especially since both thermally throttle as needed.
I think ECC is well justified because it doesn't just detect DIMM errors, but also motherboard errors, CPU errors, and socket (DIMM or CPU) errors. If a node randomly crashes or hangs, it's very hard to track down why... unless you have ECC, which will often help you pin it down. I'd much rather see something strange show up in mcelog than wait for a hang, or worse, a corruption.
Most of my "ECC" errors have actually been motherboard, socket, or (in AMD's case) CPU errors. When I look at larger samples, some DIMMs are WAY less reliable than others, strongly implying it's not high-energy particles, but something out of spec.
Two questions:
(1) IIRC, some operating systems, on seeing some ECC errors (maybe just the uncorrectable ones, or maybe also the correctable ones), would mark the associated memory, or block of memory, as faulty, possibly stopping the application using that memory, and continue on. Is this done by current operating systems?
(2) What would Windows Server do with a thread, process, address space or whatever the heck that encountered a memory error detected by ECC, especially one that was uncorrectable?
I'm eager to know since I'm eager to build a server, with ECC memory, and run Windows Server in production.
(2) As far as I know, the normal consequence of a detected multi-bit error is a system reset.
The margins between working and non-working DRAM these days are extremely small. E.g. Rowhammer demonstrated that even user-space programs could readily obliterate main memory, without even trying very hard to do so.[1]
But, maybe in this case he's right. It's not like "open source Internet forum software" is anything that's mission critical. If there's an occasional garble in a character or two, will the latte-swilling hipsters even notice? :-)
Just like the original Google servers he points to. Who cares if they occasionally screwed up search results because they didn't have ECC memory? Overall the experience was still 100x better than using something like Altavista.
The average consumer knows that more "jigabits" are better and more "jigahertz" is better (see Intel NetBurst for how badly that can go wrong).
See a link elsewhere in this thread: someone posted a memory-error presentation that talked about FIT, failures in time. But the average consumer doesn't know what that is.
Hence we get a race to the bottom. PC assemblers are willing to sell their mothers into slavery if it can save them $0.05 in build cost. ECC doesn't fit into that narrative.
BTW ECC is "in every computer" nowadays. As yet another poster mentioned, Intel CPUs use ECC internally to protect their caches.
Dell was pretty good at shaving pennies and providing WalMart-ized desktops and servers.
I think the offerings need to be optimized: reduce and cut features to just what's necessary based on actual, intended uses, rather than guessing, throwing every possible feature into a retail desktop, or offering a blizzard of different, poorly explained SKUs (what's the diff btwn A78Z-VX and A78C-VX+?).
Related, see also: http://cr.yp.to/hardware/ecc.html
It does matter for stuff like big databases and ERP.
DDR4 implemented some mitigations against such attacks, as well as some additional soft ECC mechanisms, but as these types of attacks are fairly new, it's not yet clear how effective they are.
Memory errors are especially insidious given how common they are. ECC is worth it.
I forget now why we even thought to build a server without ECC RAM, but I sure learned my lesson after that.
I had a background process [1] on each that simply allocated a 128 MB buffer, filled it with a known data pattern, and then went into an infinite loop that slept a while, woke up and checked the integrity of the buffer, and if any of the data had changed logged the change and restored the data pattern.
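A minimal sketch of that checker (Python here for brevity; the pattern byte and wake-up interval below are my assumptions, not necessarily what the original process used):

```python
import logging
import time

BUF_SIZE = 128 * 1024 * 1024   # 128 MB, as in the test described above
PATTERN = 0xA5                 # assumed pattern: alternating 10100101 bits
CHECK_INTERVAL = 3600          # assumed: wake up once an hour

logging.basicConfig(filename="bitflip.log", level=logging.INFO)
buf = bytearray([PATTERN]) * BUF_SIZE

while True:
    time.sleep(CHECK_INTERVAL)
    if buf.count(PATTERN) == BUF_SIZE:
        continue               # fast path: nothing changed
    for i, b in enumerate(buf):
        if b != PATTERN:
            # Log the change, then restore the pattern and keep watching.
            logging.info("offset %d: 0x%02x (expected 0x%02x)", i, b, PATTERN)
            buf[i] = PATTERN
```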
Based on the error rates I'd seen published, I expected to catch a few errors. For example, using the rate that Tomte's comment [2] cites I think I'd expect about 6 errors a year.
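The conversion is simple enough to do inline. Note the ~700 FIT/Mbit figure below is an assumption I picked to land near that number, not the actual rate from the cited comment:

```python
# Back-of-envelope: expected errors per year in a 128 MB buffer.
# FIT = failures per 10^9 device-hours; ~700 FIT/Mbit is an assumed
# mid-range figure, used here only to illustrate the conversion.
fit_per_mbit = 700
mbits = 128 * 8                   # 128 MB = 1024 Mbit
hours_per_year = 24 * 365
errors_per_year = fit_per_mbit * mbits * hours_per_year / 1e9
print(f"~{errors_per_year:.1f} errors/year expected")   # ~6.3
```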
I never caught an error.
I also have two desktops with ECC (a 2008 Mac Pro and a 2009 Mac Pro). I've used the 2008 Mac Pro every working day since I bought it in 2008, and the 2009 Mac Pro every day since I bought it in 2009. Neither of them has ever reported correcting an error.
I have no idea why I have not been able to see an error.
Just wait, and relax. You'll get there eventually.
Anyone who has bought a popular bitsquatted domain name can attest to this.
And I'm sure there are many other vectors of attacks using this flaw.
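For the unfamiliar: a bitsquat is a registered domain that is one bit-flip away from a popular one, so a machine whose memory corrupts the name resolves the attacker's domain instead. A rough sketch of how such candidates are generated:

```python
import string

# Flip each bit of each character in a domain name and keep the
# variants that are still valid hostname characters.
VALID = set(string.ascii_lowercase + string.digits + "-")

def bitsquats(domain: str):
    name, _, tld = domain.partition(".")
    out = set()
    for i, ch in enumerate(name):
        for bit in range(8):
            c = chr(ord(ch) ^ (1 << bit))
            if c in VALID:
                out.add(name[:i] + c + name[i + 1:] + "." + tld)
    out.discard(domain)
    return sorted(out)

# Any of these could be registered to catch traffic from machines
# that flip one bit of "example.com" in memory.
print(bitsquats("example.com")[:5])
```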
a) J-L. Autran, P. Roche, C. Sudre et al., "Altitude SEE Test European Platform (ASTEP) and First Results in CMOS 130 nm SRAM," IEEE Transactions on Nuclear Science, vol. 54, no. 4, pp. 1002–1009, Aug. 2007.
b) R. C. Baumann, "Radiation-Induced Soft Errors in Advanced Semiconductor Technologies," IEEE Transactions on Device and Materials Reliability, vol. 5, no. 3, Sep. 2005.
c) R. Mastipuram and E. C. Wee, "Soft errors' impact on system reliability," Cypress Semiconductor, 2004.
d) C. Constantinescu, "Trends and Challenges in VLSI Circuit Reliability," Intel / IEEE Computer Society, 2003.
e) P. E. Dodd and L. W. Massengill, "Basic mechanisms and modeling of single-event upset in digital microelectronics," IEEE Trans. Nucl. Sci., vol. 50, no. 3, pp. 583–602, Jun. 2003.
f) F. W. Sexton, "Destructive single-event effects in semiconductor devices and ICs," IEEE Trans. Nucl. Sci., vol. 50, no. 3, pp. 603–621, Jun. 2003.
g) R. Ronen and A. Mendelson, "Coming Challenges in Microarchitecture and Architecture," Proceedings of the IEEE, vol. 89, no. 3, pp. 325–340, Mar. 2001.
h) A. Johnston, "Scaling and Technology Issues for Soft Error Rates," 4th Annual Research Conference on Reliability, Stanford University, Oct. 2000.
i) International Technology Roadmap for Semiconductors (ITRS), several papers.
If that's correct, the math is simple: you have bit flips in your PC about once a day.
It's just that (a) you often won't notice those transient errors (one pixel in your multi-megapixel photo is one bit off) and (b) a lot of your RAM is probably unused.
Also, most modern processors use ECC for their caches (even when the main memory is non-ECC), and the caches serve the vast majority of memory requests, so it is unlikely that intermediate values in a tight computation are affected by non-ECC RAM. That adds to how silent bit flips are on consumer systems.
Without ECC these errors would have unknown consequences. They could happen in some unused region of memory, or they could happen in a dirty page in the filesystem cache. It's not fun to discover that your filesystem has been silently corrupted an unknown time after the fact.
Maybe Google doesn't need ECC. Their data is duplicated across several machines and it's extremely unlikely that a few corrupt servers would lead to any data loss.
However, on a smaller scale (and just like RAID), it's cheaper to have ECC than to add more servers for extra redundancy.
People need to adapt to a world where we get more cores instead of faster execution per core. You can't compare the late-'90s growth in per-core execution speed with the situation we have today.
Write software for an environment where the number of cores scale, instead of an environment where the execution speed of a single core is more important.
Is that so bad? He's writing and hosting the code, and he's paying the bill to do it. Seems to me he should be able to pick how to do it.
Edit: found it: https://www.youtube.com/watch?v=ZPbyDSvGasw
Also, I still see "fire hazard" when I look at the early Google racks. No idea how Equinix let them get away with it. Too much ivory tower going on there. Not enough "you know we're liable if we burn down the colo with that crap, right?"
So you could have a PCB fire, but PCBs are made to be flame retardant. You could have a wire insulation fire, but the amount of material would be so low that it wouldn't be able to start a fire anywhere else.
So I am basically saying there isn't really anything there that could sustain a fire, and there isn't a lot of energy to start ignition in the first place.
So yeah, they're pretty flame retardant.
If it didn't burn down Google's stuff, it could have burned down other people's gear. I have decades of experience here; I'm not an ivory tower nerd. Any datacenter/colo provider worth their salt will jump on you immediately for having cardboard in your environment. DRT makes you unbox everything outside the various colos and won't even let cardboard enter.
The Xeon E3-1270 v5 goes from 3.6 to 4.0 GHz and costs only 10% more than the i7-6700 (3.4–4.0 GHz).
Also, the Xeon E3-1230 v5 goes from 3.4 to 3.8 GHz (same base clock) and costs less than the Core i7-6700.
In general, if you have the choice, you should never buy non-Xeon CPUs, for both desktops and servers: ECC memory is essential if you don't want a significant chance of having to replace your RAM after discovering mysterious problems with your system.
Will someone die if the data gets corrupted? No? Then non-ECC should be enough. And you should have checksums everywhere anyway.
What happens to the data after you have read it to memory and successfully verified the checksum? You probably process it in memory, and have no idea afterwards if the changes are due to your code, or because of errors.
Of course, you can now propose to also checksum and verify the data while it is in memory, which is basically what ECC does, in hardware, for cheap, requiring no CPU cycles.
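A software version of that in-memory check might look like the sketch below (zlib's CRC32 standing in for whatever checksum you'd pick). It works, but every read burns CPU cycles on re-verification, which is exactly what ECC hardware spares you:

```python
import zlib

class CheckedBuffer:
    """A buffer that carries its own checksum, verified on every read."""

    def __init__(self, data: bytes):
        self.data = data
        self.crc = zlib.crc32(data)

    def read(self) -> bytes:
        # Re-verify on every access: catches in-memory corruption,
        # but at a real CPU cost that ECC hardware avoids.
        if zlib.crc32(self.data) != self.crc:
            raise RuntimeError("in-memory corruption detected")
        return self.data

    def write(self, data: bytes):
        self.data = data
        self.crc = zlib.crc32(data)

buf = CheckedBuffer(b"important payload")
payload = buf.read()   # correct, but not free
```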