Data centers are too reliable

(dmerr.tumblr.com)

93 points · dm_mongodb · 14y ago · 25 comments

TamDenholm14y ago· 6 in thread

I wonder if it's cheaper (and/or better) for a hosting provider to spend less on reliability and instead build in multiple redundancy. Thoughts?

Also, I've never heard the expression 4 nines before, can anyone explain that for me?

tangothedog14y ago

It's a common way to describe uptime. It means 99.99% uptime/availability, which works out to under 1 hour of downtime per year.

http://en.wikipedia.org/wiki/Nines_(engineering)

synnik14y ago

Multiple Redundancy IS reliability. As long as everything stays running from the customer/business perspective, you are reliable.

When you start to define reliability as a specific layer of your infrastructure, you are setting yourself up for problems. You should be redundant at all layers, thereby reliable across your entire stack.

mdda14y ago

4-nines is 99.99% reliability (downtime of about 1 hour per year). More detail: http://en.wikipedia.org/wiki/High_availability

esrauch14y ago

The number of nines refers to how reliable something is. "Four nines" means that it is up 99.99% of the time, or 0.01% of downtime, which is 0.0001 * 365 * 24 = just shy of 1 hour of downtime per year (or 1 day downtime every 30 years).

These things are usually specified in SLAs (service level agreements): if the provider exceeds the amount of downtime in the SLA, you don't have to pay. If Amazon has a 4-nines SLA, that means you don't get any discount as long as their downtime stays within roughly 1 hour per year.
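
For anyone who wants to check the arithmetic above, here is a minimal Python sketch. The function name is made up for illustration, and it assumes a 365-day year and that "N nines" means an unavailability of 10^-N.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years (assumption)


def downtime_hours_per_year(nines: int) -> float:
    """Allowed downtime per year, in hours, for an availability of `nines` nines."""
    unavailability = 10 ** -nines  # e.g. 4 nines -> 0.0001
    return unavailability * HOURS_PER_YEAR


if __name__ == "__main__":
    for n in range(2, 6):
        hours = downtime_hours_per_year(n)
        print(f"{n} nines: {hours:.3f} hours/year ({hours * 60:.1f} minutes)")
```

Four nines comes out to about 0.876 hours, roughly 52.6 minutes per year, which matches the "just shy of 1 hour" figure.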

pavel_lishin14y ago

> I wonder if it's cheaper (and/or better) for a hosting provider to spend less on reliability and instead build in multiple redundancy. Thoughts?

Wouldn't it look the same from my end?

Hovertruck14y ago

I assume it refers to uptime, e.g. 99.99% uptime.

spydum14y ago· 4 in thread

It used to be, most application code was written to expect everything beneath it (transport, network, data, physical) would work flawlessly.

Then people realized that was just not happening in the real world:

- Dual-channel interfaces would still fail (bad line card on the switch).

- Redundant Storage frames could suffer catastrophic outages.

- Datacenters with dozens of layers of redundancy could still fail due to fiber cuts, or faulty transformers.

- Unexpected crap happens, and you can never account for all of it.

So gradually, a movement began: virtualized computing and automated provisioning tools really gained traction and maturity. A shift towards more defensive programming emerged: we now don't trust the bottom layers to work. We plan ahead for them, we expect failure. Google, publishing their server specs and approach, really pushed this forward: people realized you could run serious infrastructure on mickey mouse hardware.

No longer is reliability needed at the component layer. We can move the responsibility much farther up the stack, and just add resources to cope.

I think many people (especially in the enterprise space, where change is slow and new tech is adopted slowly) have not adjusted for this change yet. They continue to buy uber-expensive server gear with redundant power supplies, mirrored disks (for redundancy, not performance reasons), ECC memory, and so on. They don't realize the $10,000 they spent on a single server could have bought three less redundant machines that would have tripled their computing power. For the most part, they are right: you can't do these things until you fundamentally change your approach to software.
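
As a rough illustration of what "expecting failure" can look like in application code, here is a minimal Python sketch of trying several replicas in turn instead of trusting one to always work. Everything in it (`call_with_failover`, the two stand-in replica functions) is hypothetical and not taken from any real system; real code would also want timeouts, backoff, and health checks.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(replicas: Sequence[Callable[[], T]]) -> T:
    """Try each replica in order; return the first success, raise if all fail."""
    errors: List[Exception] = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:  # assume any exception means "this replica is down"
            errors.append(exc)
    raise RuntimeError(f"all {len(replicas)} replicas failed: {errors}")


# Hypothetical stand-ins for calls to two cheap, individually unreliable servers.
def broken_replica() -> str:
    raise ConnectionError("replica unreachable")


def healthy_replica() -> str:
    return "ok"


if __name__ == "__main__":
    print(call_with_failover([broken_replica, healthy_replica]))
```

The caller never needs to know which machine answered, which is the sense in which responsibility for reliability moves up the stack.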

mjb14y ago

> ECC checking memory

There is a balance, and I don't feel that throwing out ECC memory is necessarily the right choice for the majority of server applications. Low hardware cost needs to be achieved with a holistic approach - simply buying the cheapest possible components is unlikely to lead to the lowest cost unless your software and datacenter designs are really specialized for it.

DRAM errors are rather common in real systems[1]. There are two big hidden costs to this. The first one is the risk of silent data corruption. Unless you are willing to write your software in a way that is very careful to check all calculations, you run the risk of getting the wrong answer. The other hidden cost is operational: memory errors are often difficult to diagnose and you have to pay a highly skilled human to do it, as well as lose the use of the server while it is being done.

It may be that buying ECC RAM decreases the cost-per-page reliably served of your entire operation. If you are Google scale then that may not be the case, but for nearly all smaller operations it is.

'Enterprise' type hard drives are another potential long-term saving by spending more up front. Having a human replace a disk, and having the server down for the time it takes for a disk to be replaced, is expensive. If you have a large number of disks, especially if you are sensitive to small numbers of IO errors, it may be worth paying more up front.

Using an external view of Google's architecture to say 'cheap hardware is always good' is too simplistic. Yes, there is good evidence that single-host reliability mechanisms like RAID might give a poor ROI. Yes, redundancy is a powerful way to get reliability. But, before you take this to the extreme, you need to have carefully designed applications, carefully designed datacenters, and extremely low per-host operations costs (probably through aggressive automation). Unless you have these things, the optimal cost-per-request server design for your company may be very different from the ones Google, Facebook and Amazon chose.

[1] http://www.cs.toronto.edu/%7Ebianca/papers/sigmetrics09.pdf

EricBurnett14y ago

Indeed. And even at Google scale where 'good enough' is an art form, ECC memory is deemed to be worth the money. The paper you cited does its study on Google hardware, and I can confirm that it's still used today.

throwawayday14y ago

I'd say that most companies aren't at the level where they code for this - they buy COTS software for their business, and much of it is still of the 'put all your eggs in one basket and watch that basket' mentality. I know it works that way at my place of employment. No matter how much I gnash my teeth and wail, the programmers still haven't caught up to current tech.

6ren14y ago

For the network aspects: http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Comput...

snprbob8614y ago· 2 in thread

I'm reminded of Netflix's Chaos Monkey:

http://techblog.netflix.com/2010/12/5-lessons-weve-learned-u...

nooneelse14y ago

Ever since reading that last year, whenever a movie I'm watching stops to buffer, I wonder if it could have been the Chaos Monkey.

jvoorhis14y ago

I think of the chaos monkey more like a baseball umpire: perhaps whenever your movie doesn't stop to buffer, you might thank the chaos monkey.

pdenya14y ago· 1 in thread

Link bait headline. Obviously the issue here is that data centers are not reliable enough. A good failover option is necessary, but generators and UPSes don't represent the primary costs behind data centers; removing them won't come close to cutting prices in half.

The point seems to be to expect failure and make sure you're prepared, but increasing the likelihood of failures for a slight decrease in cost is not a good way to go about it.

mbreese14y ago

I think that the argument is that since the data center is "reliable enough", you don't feel the need to adequately test your failover option/procedures. If you knew that your data center would go down more frequently, you would better test your failover.

The question then becomes, when is it economically more viable to use multiple low(er) availability data centers as opposed to one hyper available one? (And since you should be designing for data center failover anyway, that is a fixed cost).

wccrawford14y ago· 1 in thread

If redundancy was cheap and easy, we'd do it even if we had reliable data centers.

It's not, and that's why we don't have it.

mbreese14y ago

And therein lies the argument... that it would be cheaper to build data-center redundancy into your design than it would be to ensure that your data center had 5-nines of uptime.

And if you're large enough to require 5-nines (or even 4) of uptime, you should be looking at data-center level redundancy anyway.

jzawodn14y ago· 1 in thread

Hm. We haven't had the same experience. In the last 3+ years, I recall 1 unplanned outage that could be blamed on "the data center".

regularfry14y ago

We've seen a couple, and I've heard about a couple more. It always seems to be the UPSes, oddly enough.

mjb14y ago

As you try and push a system towards 100% reliability, you need to understand your risk model better and better. When you get to levels around 5 nines, very unlikely events which could cause an hour's outage every 20 years start to dominate. In a system as complex as a large datacenter, it is always going to be difficult and expensive to understand all of these risks, and even more difficult and expensive to design around them.

That is why redundancy is so important. Instead of fighting a battle which is exponentially increasing in difficulty, you choose to optimize the reliability of a single component. You give up optimizing each of the really complex subsystems (datacenters) at a certain level (3 or 4 nines) and focus on optimizing the reliability of a really simple component for detecting failures and directing traffic to the online datacenter.

Reliability engineers have known this for a really long time. If you can fit redundancy into your design, it is almost always a cheaper way to approach high reliability than optimizing the reliability of each subsystem.
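
A back-of-the-envelope version of this argument, as a Python sketch. It rests on two loud assumptions that are not in the comment above: datacenter failures are statistically independent (real outages are often correlated), and the only other thing that can fail is the traffic-directing component. The function and parameter names are invented for illustration.

```python
def redundant_availability(
    site_availability: float,
    sites: int,
    director_availability: float = 1.0,
) -> float:
    """Availability of `sites` redundant datacenters behind one traffic director.

    Assumes independent site failures: the service is down only if every site
    is down at once, or if the director itself is down.
    """
    all_sites_down = (1.0 - site_availability) ** sites
    return director_availability * (1.0 - all_sites_down)


if __name__ == "__main__":
    # Two 3-nines datacenters with a perfect director.
    print(redundant_availability(0.999, 2))
    # Same sites, but the director is only 5 nines, so it becomes the ceiling.
    print(redundant_availability(0.999, 2, director_availability=0.99999))
```

Under these assumptions two 3-nines sites give about six nines between them, so the overall figure ends up bounded by the simple director component, which is the piece worth optimizing.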

rfugger14y ago

For 99% of applications, there just isn't budget or a need for software failover. Making data centers continually more reliable serves this 99% well.

apaprocki14y ago

If you do not model your software for disaster recovery survivability and routinely test scenarios where you lose a data center, there will certainly be something bad that pops up when it happens in a real-world scenario. Once systems become too complex with too many interacting pieces you need to run real-world DR situations on a schedule to ensure something isn't missed.
