I'm too lazy to run the exact numbers right now, but with "4 GB, 96% chance, three days" as the hypothesis, I think you'll find that an experimental result of "8 GB, zero errors observed, 14 days" is highly statistically significant.
Edit: rough back-of-the-napkin estimate: you're seeing no events across roughly 10x the trials (2x the number of bits and ~5x the number of days). If the hypothesis is true, your experimental result has probability (1 - 0.96)^10, which is about 10^-14. Conclusion: the hypothesis is false.
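A quick sketch of that arithmetic in Python, taking the ~10 equivalent trials from the rough scaling above rather than an exact count:

    # If a 4 GB machine has a 96% chance of at least one bit flip per
    # three-day window, what's the probability of seeing zero flips
    # across the equivalent of ~10 such windows?
    # (8 GB for 14 days ~= 2x the bits * ~4.7x the days ~= 10 trials)

    p_flip_per_trial = 0.96      # hypothesized P(>=1 flip | 4 GB, 3 days)
    n_equivalent_trials = 10     # 2 * (14 / 3), rounded

    p_no_flips = (1 - p_flip_per_trial) ** n_equivalent_trials
    print(f"P(no flips | hypothesis) ~= {p_no_flips:.2e}")  # ~1.0e-14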
There are a lot of variables that go into RAM errors, including manufacturing quality and the condition of the RAM, the DIMM, the DIMM slot, the motherboard generally, the power supply, the wiring, and the temperature of all of those. Google was known for cost cutting in its servers, especially early on, so I wouldn't be surprised if some of that resulted in a higher bit-flip rate than you'd see in commercially available servers. Things like running bare motherboards supported only at the edges cause excess strain and can affect the resistance and capacitance of traces on the board (and in extreme cases, break the traces).
No, it doesn't. You're assuming an even distribution of errors, which is very much not the case.
Google found that the average number of errors is in that range, but they also found that only about a third of their servers saw any errors at all in a given year.
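A toy model of what that concentration does to the statistics (the numbers here are made up for illustration, not Google's actual figures): if a third of machines carry all the errors, the fleet-wide average can look high even though most individual machines are clean, so one machine seeing zero errors tells you much less than the average suggests.

    import random

    # Made-up numbers: one third of machines account for all errors
    # (~9,000/year each); the other two thirds see none.
    random.seed(0)
    n_machines = 10_000
    errors_per_machine = [
        9_000 if random.random() < 1 / 3 else 0
        for _ in range(n_machines)
    ]

    mean_errors = sum(errors_per_machine) / n_machines
    zero_fraction = errors_per_machine.count(0) / n_machines
    print(f"fleet mean errors/year:  {mean_errors:,.0f}")   # ~3,000
    print(f"machines with no errors: {zero_fraction:.0%}")  # ~67%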