(Of course, in extreme scenarios, like at Google scale, even ECC can fail to catch multi-bit errors, but in almost all non-pathological scenarios, SECDED[1] is enough to catch all erroneous cases.)
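For the curious, here's a toy Python sketch of the SECDED idea: a Hamming(7,4) code plus an overall parity bit, so any single-bit flip is corrected and any double-bit flip is detected. Real ECC DIMMs use a (72,64) code, but the mechanics are the same:

```python
# Toy SECDED: Hamming(7,4) plus an overall parity bit at position 0.
# One flipped bit gets corrected; two flipped bits get detected.

def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    c = [0] * 8
    c[3], c[5], c[6], c[7] = d[0], d[1], d[2], d[3]
    c[1] = c[3] ^ c[5] ^ c[7]                  # covers positions 1,3,5,7
    c[2] = c[3] ^ c[6] ^ c[7]                  # covers positions 2,3,6,7
    c[4] = c[5] ^ c[6] ^ c[7]                  # covers positions 4,5,6,7
    c[0] = c[1]^c[2]^c[3]^c[4]^c[5]^c[6]^c[7]  # overall parity
    return c

def decode(c):
    """Return (status, corrected nibble or None)."""
    s = 0                                      # syndrome = error position
    if c[1] ^ c[3] ^ c[5] ^ c[7]: s |= 1
    if c[2] ^ c[3] ^ c[6] ^ c[7]: s |= 2
    if c[4] ^ c[5] ^ c[6] ^ c[7]: s |= 4
    parity_ok = (c[0] == c[1]^c[2]^c[3]^c[4]^c[5]^c[6]^c[7])
    if s == 0 and parity_ok:
        status = "clean"
    elif not parity_ok:
        status = "corrected"                   # single-bit error
        if s: c[s] ^= 1                        # s == 0 means p0 itself flipped
    else:
        return "uncorrectable", None           # double-bit error detected
    nib = c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)
    return status, nib

word = encode(0b1011)
word[6] ^= 1                                   # one flip: corrected
print(decode(word))                            # ('corrected', 11)
word = encode(0b1011)
word[6] ^= 1; word[2] ^= 1                     # two flips: detected, not corrected
print(decode(word))                            # ('uncorrectable', None)
```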
Intel deciding that consumers (including those buying Haswell-E CPUs) do not need ECC really irks me. It's textbook market segmentation from a near monopoly.
Currently you cannot have your cake and eat it:
You cannot have the best single-thread performance (offered by overclocking the Haswell-E series or the Skylake 6700K) and also have ECC.
So if one is building the ultimate workstation, you face a hard choice: do you go with the X99 chipset (no ECC, but you can overclock), or do you go with server motherboards built on the C610 chipset, which are quite limited as far as consumer interests go?
The Intel mobile Xeons are interesting: they now provide an avenue for ECC on a laptop.
Generally, if you are willing to give up a single clock bin in exchange for ECC, you end up with a cheaper (and cooler) system that's more reliable. Likewise, if you want the cheapest 4c/8t CPU, it's a Xeon, NOT an i7.
I don't feel particularly artificially segmented. Additionally, the high-end desktop motherboards tend to be more expensive than the server boards: I often find a nice server board at $180, while the nice desktop boards are often another $100. Sure, they are marketed to gamers, but I really just want nice, reliable power and cooling, and it's not clear which of the cheaper desktop boards are really going to last 24/7 for 5 years.
Today I'd buy the E3-1270 for $339 over the $350 i7-6700K. Keep in mind the K chips carry a premium AND they don't come with a fan like the non-K chips do. Sure, it's 3.6–4.0 GHz instead of 4.0–4.2 GHz, but that's not a particularly noticeable difference, especially since both thermally throttle as needed.
I think ECC is well justified because it doesn't just detect DIMM errors, but also motherboard errors, CPU errors, and socket (DIMM or CPU) errors. If a node randomly crashes or hangs, it's very hard to track down why... unless you have ECC, which will often help you pin it down. I'd much rather see something strange show up in mcelog than wait for a hang, or worse, a corruption.
Most of my "ECC" errors have actually been motherboard, socket, or (in AMD's case) CPU errors. When I look at larger samples, some DIMMs are WAY less reliable than others, strongly implying it's not high-energy particles, but something out of spec.
Two questions:
(1) IIRC, some operating systems, on seeing some ECC errors (maybe just the uncorrectable ones, or maybe also the correctable ones), would mark the associated memory, or block of memory, as faulty, possibly stopping the application using that memory, and continue on. Is this done by current operating systems?
(2) What would Windows Server do with a thread, process, address space or whatever the heck that encountered a memory error detected by ECC, especially one that was uncorrectable?
I'm eager to know since I'm eager to build a server, with ECC memory, and run Windows Server in production.
(2) As far as I know, the normal consequence of a detected multi-bit error is a system reset.
The margins between working and non-working DRAM these days are extremely small. E.g. Rowhammer demonstrated that even user-space programs could readily obliterate main memory, without even trying very hard to do so.[1]
But, maybe in this case he's right. It's not like "open source Internet forum software" is anything that's mission critical. If there's an occasional garble in a character or two, will the latte-swilling hipsters even notice? :-)
Just like the original Google servers he points to. Who cares if they occasionally screwed up search results because they didn't have ECC memory? Overall the experience was still 100x better than using something like Altavista.
The average consumer knows that more "jigabits" are better and more "jigahertz" is better (see Intel NetBurst for how badly that can go wrong).
See a link elsewhere in this thread: someone posted a memory-error presentation that talked about FIT, failures in time. But the average consumer doesn't know what that is.
Hence we get a race to the bottom. PC assemblers are willing to sell their mothers into slavery if it can save them $0.05 in build cost. ECC doesn't fit into that narrative.
BTW ECC is "in every computer" nowadays. As yet another poster mentioned, Intel CPUs use ECC internally to protect their caches.
Dell was pretty good at shaving pennies and providing WalMart-ized desktops and servers.
I think the offerings need to be optimized: reduce and cut features to just what's necessary based on actual, intended uses, rather than guessing, throwing every possible feature into a retail desktop, or offering a blizzard of different, poorly explained SKUs (what's the diff btwn A78Z-VX and A78C-VX+?).
Related, see also: http://cr.yp.to/hardware/ecc.html
It does matter for stuff like big databases and ERP.
DDR4 implemented some mitigations against such attacks, as well as some additional soft ECC mechanisms, but as these types of attacks are fairly new, it's not yet clear how effective they are.
Memory errors are especially insidious given how common they are. ECC is worth it.
I forget now why we even thought to build a server without ECC RAM, but I sure learned my lesson after that.
I had a background process [1] on each that simply allocated a 128 MB buffer, filled it with a known data pattern, and then went into an infinite loop that slept a while, woke up and checked the integrity of the buffer, and if any of the data had changed logged the change and restored the data pattern.
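A minimal sketch of that checker (Python here for brevity; the pattern byte and wake-up interval below are my assumptions, not necessarily what the original process used):

```python
import logging
import time

BUF_SIZE = 128 * 1024 * 1024   # 128 MB, as in the test described above
PATTERN = 0xA5                 # assumed pattern: alternating 10100101 bits
CHECK_INTERVAL = 3600          # assumed: wake up once an hour

logging.basicConfig(filename="bitflip.log", level=logging.INFO)
buf = bytearray([PATTERN]) * BUF_SIZE

while True:
    time.sleep(CHECK_INTERVAL)
    if buf.count(PATTERN) == BUF_SIZE:
        continue               # fast path: nothing changed
    for i, b in enumerate(buf):
        if b != PATTERN:
            # Log the change, then restore the pattern and keep watching.
            logging.info("offset %d: 0x%02x (expected 0x%02x)", i, b, PATTERN)
            buf[i] = PATTERN
```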
Based on the error rates I'd seen published, I expected to catch a few errors. For example, using the rate that Tomte's comment [2] cites I think I'd expect about 6 errors a year.
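The conversion is simple enough to do inline. Note the ~700 FIT/Mbit figure below is an assumption I picked to land near that number, not the actual rate from the cited comment:

```python
# Back-of-envelope: expected errors per year in a 128 MB buffer.
# FIT = failures per 10^9 device-hours; ~700 FIT/Mbit is an assumed
# mid-range figure, used here only to illustrate the conversion.
fit_per_mbit = 700
mbits = 128 * 8                   # 128 MB = 1024 Mbit
hours_per_year = 24 * 365
errors_per_year = fit_per_mbit * mbits * hours_per_year / 1e9
print(f"~{errors_per_year:.1f} errors/year expected")   # ~6.3
```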
I never caught an error.
I also have two desktops with ECC (a 2008 Mac Pro and a 2009 Mac Pro). I've used the 2008 Mac Pro every working day since I bought it in 2008, and the 2009 Mac Pro every day since I bought it in 2009. Neither of them has ever reported correcting an error.
I have no idea why I have not been able to see an error.
Just wait, and relax. You'll get there eventually.
Anyone who has bought a popular bitsquatted domain name can attest to this.
And I'm sure there are many other vectors of attacks using this flaw.
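For the unfamiliar: a bitsquat is a registered domain that is one bit-flip away from a popular one, so a machine whose memory corrupts the name resolves the attacker's domain instead. A rough sketch of how such candidates are generated:

```python
import string

# Flip each bit of each character in a domain name and keep the
# variants that are still valid hostname characters.
VALID = set(string.ascii_lowercase + string.digits + "-")

def bitsquats(domain: str):
    name, _, tld = domain.partition(".")
    out = set()
    for i, ch in enumerate(name):
        for bit in range(8):
            c = chr(ord(ch) ^ (1 << bit))
            if c in VALID:
                out.add(name[:i] + c + name[i + 1:] + "." + tld)
    out.discard(domain)
    return sorted(out)

# Any of these could be registered to catch traffic from machines
# that flip one bit of "example.com" in memory.
print(bitsquats("example.com")[:5])
```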
a) J-L. Autran, P. Roche, C. Sudre et al., "Altitude SEE Test European Platform (ASTEP) and First Results in CMOS 130 nm SRAM," IEEE Transactions on Nuclear Science, vol. 54, no. 4, pp. 1002–1009, Aug. 2007.
b) R. C. Baumann, "Radiation-Induced Soft Errors in Advanced Semiconductor Technologies," IEEE Transactions on Device and Materials Reliability, vol. 5, no. 3, Sep. 2005.
c) R. Mastipuram and E. C. Wee, "Soft errors' impact on system reliability," Cypress Semiconductor, 2004.
d) C. Constantinescu, "Trends and Challenges in VLSI Circuit Reliability," Intel / IEEE Computer Society, 2003.
e) P. E. Dodd and L. W. Massengill, "Basic mechanisms and modeling of single-event upset in digital microelectronics," IEEE Trans. Nucl. Sci., vol. 50, no. 3, pp. 583–602, Jun. 2003.
f) F. W. Sexton, "Destructive single-event effects in semiconductor devices and ICs," IEEE Trans. Nucl. Sci., vol. 50, no. 3, pp. 603–621, Jun. 2003.
g) R. Ronen and A. Mendelson, "Coming Challenges in Microarchitecture and Architecture," Proceedings of the IEEE, vol. 89, no. 3, pp. 325–340, Mar. 2001.
h) A. Johnston, "Scaling and Technology Issues for Soft Error Rates," 4th Annual Research Conference on Reliability, Stanford University, Oct. 2000.
i) International Technology Roadmap for Semiconductors (ITRS), several papers.
If that's correct, the math is simple: you have bit flips in your PC about once a day.
It's just that (a) you often won't notice those transient errors (one pixel in your multi-megapixel photo is one bit off) and (b) a lot of your RAM is probably unused.
Also, most modern processors use ECC for their caches (even when the main memory is non-ECC), and the caches serve the vast majority of memory requests, so it is unlikely that intermediate values in a tight computation are affected by non-ECC RAM. That adds to how silent bit flips are on consumer systems.
Without ECC these errors would have unknown consequences. They could happen in some unused region of memory, or they could happen in a dirty page in the filesystem cache. It's not fun to discover that your filesystem has been silently corrupted an unknown time after the fact.
Maybe Google doesn't need ECC. Their data is duplicated across several machines and it's extremely unlikely that a few corrupt servers would lead to any data loss.
However, on a smaller scale (and just like RAID), it's cheaper to have ECC than to add more servers for extra redundancy.
People need to adapt to a world where we get more cores instead of faster execution per core. You can't compare the late-'90s growth in per-core execution speed with the situation we have today.
Write software for an environment where the number of cores scale, instead of an environment where the execution speed of a single core is more important.
Is that so bad? He's writing and hosting the code, and he's paying the bill to do it. Seems to me he should be able to pick how to do it.
Edit: found it: https://www.youtube.com/watch?v=ZPbyDSvGasw
Also, I still see "fire hazard" when I look at the early Google racks. No idea how Equinix let them get away with it. Too much ivory tower going on there. Not enough "you know we're liable if we burn down the colo with that crap, right?"
So you could have a PCB fire, but PCBs are made to be flame retardant. You could have a wire insulation fire, but the amount of material would be so low that it wouldn't be able to start a fire anywhere else.
So I am basically saying there isn't really anything there that could sustain a fire, and there isn't a lot of energy to start ignition in the first place.
So yeah, they're pretty flame retardant.
If it didn't burn down Google's stuff, it could have burned down other people's gear. I have decades of experience here; I'm not an ivory tower nerd. Any datacenter/colo provider worth their salt will jump on you immediately for having cardboard in your environment. DRT makes you unbox everything outside the various colos and won't even let cardboard enter.
The Xeon E3-1270 v5 goes from 3.6 to 4.0 GHz and costs only 10% more than the i7-6700 (3.4–4.0 GHz).
Also, the Xeon E3-1230 v5 goes from 3.4 to 3.8 GHz (same base clock) and costs less than the Core i7-6700.
In general, if you have the choice, you should never buy non-Xeon CPUs, for both desktops and servers: ECC memory is essential if you don't want a significant chance of having to replace your RAM after discovering mysterious problems with your system.
Will someone die if the data gets corrupted? No? Then non-ECC should be enough. And you should have checksums everywhere anyway.
What happens to the data after you have read it to memory and successfully verified the checksum? You probably process it in memory, and have no idea afterwards if the changes are due to your code, or because of errors.
Of course, you can now propose to also checksum and verify the data while it is in memory, which is basically what ECC does, in hardware, for cheap, requiring no CPU cycles.
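A software version of that in-memory check might look like the sketch below (zlib's CRC32 standing in for whatever checksum you'd pick). It works, but every read burns CPU cycles on re-verification, which is exactly what ECC hardware spares you:

```python
import zlib

class CheckedBuffer:
    """A buffer that carries its own checksum, verified on every read."""

    def __init__(self, data: bytes):
        self.data = data
        self.crc = zlib.crc32(data)

    def read(self) -> bytes:
        # Re-verify on every access: catches in-memory corruption,
        # but at a real CPU cost that ECC hardware avoids.
        if zlib.crc32(self.data) != self.crc:
            raise RuntimeError("in-memory corruption detected")
        return self.data

    def write(self, data: bytes):
        self.data = data
        self.crc = zlib.crc32(data)

buf = CheckedBuffer(b"important payload")
payload = buf.read()   # correct, but not free
```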