My post was directed at folks who said "web sites that went down as a result of such an unprecedented EC2 problem made an engineering mistake by not building their systems to withstand it."
Again, I am emphasizing "engineering mistake", not a business mistake, a funding mistake, or a resource-allocation mistake.
My point is there are things you rationally protect against. But at some point, putting up defenses against more and more bad things stops being rational.
Where that point lies (the point where more defense stops being rational) differs from system to system.
(I presume knowing that would explain the purpose of the gedankenexperiment.)
I guess I’m not sure how relevant the parable is. It costs very little to keep coffee and a pot at your house: you will use them anyway, and keeping them there costs nothing day-to-day. That isn’t true of, say, a backup system that sits idle waiting to take over in the event of a catastrophic failure.
Then again, the Netflix strategy (run 3 clusters at 60% utilization, so that losing one leaves 2 clusters at 90%) is superficially similar to usually having the options of brewing at home, walking, or driving for coffee, and occasionally being forced to fall back on just one of them. Superficially.
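The Netflix numbers are just capacity arithmetic: 3 clusters × 60% is 180% of one cluster's worth of load, which 2 surviving clusters can carry at 90% each. A minimal sketch, assuming load is spread evenly and can shift freely to the survivors (the helper name is hypothetical, not anything Netflix published):

```python
# Hypothetical helper: per-cluster utilization after `failed` clusters drop out,
# assuming the total load is spread evenly and can move freely to the survivors.
def surviving_utilization(clusters: int, utilization_pct: int, failed: int = 1) -> float:
    total_load = clusters * utilization_pct  # load measured in "% of one cluster"
    survivors = clusters - failed
    if survivors <= 0:
        raise ValueError("no clusters left to absorb the load")
    return total_load / survivors

print(surviving_utilization(3, 60))  # 90.0: still fits after losing one cluster
print(surviving_utilization(3, 70))  # 105.0: over capacity after losing one cluster
```

The idle headroom (40% of each cluster) is the ongoing cost of that defense, which is exactly the kind of expense the coffee-pot parable doesn't have.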