> Dell/EMC says "Hey, here is drive replacement." We do it, 2 hours later, the volume is knocked offline. Apparently, there was mismatch between backplane version, drive version and through some weird edge case, it knocked the volume offline. Yes, they fixed it, no it wasn't pretty since a bunch of applications had to be recovered.
Anecdotal (as is my position). I can theoretically understand this happening, but not only have I never seen it, such an issue would need to be escalated. That's a "this is unacceptable" high-level phone call. It's also a call that someone in actual authority is more than likely to answer, whereas IME unless you have SERIOUS spend with big cloud you'll be lucky to make it a rung or two up sales/support.
Plus, backups and redundancies should prevent even the failure of a chassis/storage/etc from being a critical issue.
> their failures tend to be you twiddling your thumbs vs hair on fire on phone with the vendor trying to get it resolved
As a Founder/CTO I have the opposite take - put me and my team in a position to /do something/ vs sitting around waiting for AWS to come back whenever it decides to, while they obscure comms, don't update the fake status dashboards, etc. Meanwhile you're telling your customer "Ummm, we don't know - Amazon has a problem. When it comes back I guess it's back".
Coming from a background in telecom, healthcare, and nuclear energy, I can't believe that even flies.