0 points · mixermachine · 1y ago · 0 comments

1 in a million (1% * 1% * 1%, at a 1% failure chance per server per month) is the probability that all three servers die in one month without the broken ones being swapped out. So at some point in the month all the data is gone.

If you replace the failed (or failing) node right away, the failure probability goes down greatly. What you actually need is the probability of a node going down within a 30 minute time span, assuming the migration can be done in 30 min.

(I hope this calculation is correct)

If the probability is 1% per month (a month being 43800 minutes), then 1%/(43800/30) = (1/1460)% probability per 30 min.

For three instances, the percentages have to be converted to plain probabilities before multiplying: (1/1460)% = 1/146 000, so (1/146 000) * (1/146 000) * (1/146 000) = 1/3 112 136 000 000 000 probability per 30 min that all go down.

Calculated for one month: 1/3 112 136 000 000 000 * (43800/30) = 1/2 131 600 000 000

So roughly one in 2.1 trillion that all three servers go down in the same 30 minute time span somewhere in one month. After the 30 minutes another replica will already be available, making the data safe.
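A minimal sketch of the same arithmetic in Python, working in plain probabilities instead of percentages so the percent signs cannot get lost when multiplying. The variable names are made up for illustration, and it keeps the same simplifying assumptions: independent failures, a constant failure rate, and a fixed 30 minute replacement window.

```python
from fractions import Fraction

MINUTES_PER_MONTH = 43_800        # figure used above (about 30.4 days)
WINDOW_MINUTES = 30               # assumed time to bring up a replacement
P_NODE_MONTH = Fraction(1, 100)   # assumed 1% failure chance per node per month
NODES = 3

# Number of 30 minute windows in a month: 43800 / 30 = 1460
windows_per_month = Fraction(MINUTES_PER_MONTH, WINDOW_MINUTES)

# Chance that one node fails inside one specific window (linear approximation)
p_node_window = P_NODE_MONTH / windows_per_month

# Chance that all nodes fail inside the same window, assuming independence
p_all_window = p_node_window ** NODES

# Summed over every window in the month (a union bound, fine for tiny values)
p_all_month = p_all_window * windows_per_month

# For comparison: nobody replaces anything for the whole month
p_no_replacement = P_NODE_MONTH ** NODES

print(f"per window:         1 in {p_all_window.denominator}")
print(f"per month:          1 in {p_all_month.denominator}")
print(f"without replacing:  1 in {p_no_replacement.denominator}")
```

Exact fractions are used here only to avoid floating point rounding in very small numbers; plain floats would give the same order of magnitude.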

I'm happy to be corrected. The probability course was some years back :)

2 comments · 1 top-level

TylerE · 1y ago · 1 in thread

One thing I will suggest: you’re assuming failures are uncorrelated and have an equally weighted chance per unit of time.

Neither is a good assumption in my experience. Failures being correlated to any degree greatly increases the chances of what the aviation world refers to as “the holes in the Swiss cheese lining up”.
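To put a rough number on that, here is a small sketch using a simple common-cause ("beta factor") model. This is an assumption made for illustration, not something from the thread: a fraction `beta` of each node's failure probability is attributed to a shared event (power, network, a bad deploy) that takes out every node at once, and the rest is independent. The function name and the 1% value for `beta` are hypothetical.

```python
def p_all_fail(p_node: float, beta: float, nodes: int = 3) -> float:
    """Probability that all nodes fail in one window under a beta-factor model.

    p_node: total failure probability of a single node in the window
    beta:   assumed share of that probability caused by a shared event
    """
    p_common = beta * p_node          # shared event: every node goes down
    p_indep = (1.0 - beta) * p_node   # remaining independent part per node
    # Either the shared event happens, or all nodes fail independently
    return p_common + (1.0 - p_common) * p_indep ** nodes


# 1% per month spread over 1460 windows of 30 minutes, as in the estimate above
p_node = 0.01 / 1460

independent = p_all_fail(p_node, beta=0.0)
correlated = p_all_fail(p_node, beta=0.01)   # assume 1% of failures are shared

print(f"independent:  {independent:.3e}")
print(f"1% shared:    {correlated:.3e}")
print(f"ratio:        {correlated / independent:.3e}")
```

Even with only a small shared component, the common-cause term dominates the result by many orders of magnitude, because it does not get cubed.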

mixermachine (OP) · 1y ago

You are 100% correct. It depends heavily on where the servers reside. This is just a rough estimate for the case where the failures are unrelated.
