I would love to have service providers publicly show the (downsampled!) alarms they actually use for operational excellence (from a read replica, etc.).
Doing so would enforce that you actually have those in place, since they're public and now a marketing point. That said, I get the concern of trolls and competitors trying to get a "low score".
Edit: the subtle difference from the article is that it sounds like you want historical data, rather than present-state data.
The fuck it is.
100% unavailable for 5% of your customers is very different than dropping 5% of requests in a uniform distribution.
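To make that difference concrete, here is a minimal sketch. The customer counts, request counts, and function names are all made up for illustration. Both scenarios drop 5% of requests overall, but only one of them leaves a customer completely down.

```python
def aggregate_availability(outcomes):
    """outcomes maps customer id -> list of booleans (True = request succeeded)."""
    total = sum(len(results) for results in outcomes.values())
    succeeded = sum(sum(results) for results in outcomes.values())
    return succeeded / total


def worst_customer_availability(outcomes):
    """Availability as seen by the worst-affected customer."""
    return min(sum(results) / len(results) for results in outcomes.values())


def build_uniform(customers=20, requests=100, drop_every=20):
    # Hypothetical scenario: every customer loses every 20th request (5% each).
    return {
        f"c{i}": [(r % drop_every) != 0 for r in range(requests)]
        for i in range(customers)
    }


def build_concentrated(customers=20, requests=100, broken=1):
    # Hypothetical scenario: 1 of 20 customers (5%) fails every single request.
    return {f"c{i}": [i >= broken] * requests for i in range(customers)}


uniform = build_uniform()
concentrated = build_concentrated()

print(aggregate_availability(uniform), worst_customer_availability(uniform))
print(aggregate_availability(concentrated), worst_customer_availability(concentrated))
```

The aggregate number is identical in both cases, so an SLA written only against the aggregate cannot tell them apart.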
I've found that AWS is MUCH better at dealing with problems than internal teams, since AWS has many more resources to throw at a problem.
Additionally, you can find ways to mitigate that failure from occurring or being as destructive in the future.
It’s important to factor that into decisions when choosing between a cloud provider and on-prem solutions.
With on-prem, you can choose to defer deploying that significant (and therefore higher risk) change when it's crunch time for your business. That can reduce regression impact considerably.
They track their metrics by p50 (the median, i.e. the performance/reliability a typical request sees) but also by p99, p99.9, etc., which capture the performance/reliability of the extreme edge cases, such as exactly what the author is describing. They already evaluate their systems from the perspective of how they're performing for the worst-affected customers. Again, maybe the issue is the contract itself, sure, but they do already try their best to prevent a small handful of customers from getting overly affected by something.
You should be exposing the maximum metric from your app; computing a percentile from an aggregated histogram is lossy.
[edit: Found the link, "How NOT to Measure Latency" by Gil Tene]
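A rough sketch of why the histogram route is lossy. The bucket bounds and latency samples here are invented, and real metric systems interpolate within buckets rather than returning the bucket's upper bound, but the underlying problem is the same: once the samples are bucketed, the exact tail values are gone.

```python
import bisect
import math

# Hypothetical bucket upper bounds in milliseconds; the last bucket is open-ended.
BUCKET_BOUNDS_MS = [10, 50, 100, 500, 1000, float("inf")]


def to_histogram(latencies_ms):
    """Count samples per bucket; a sample lands in the first bucket whose bound >= value."""
    counts = [0] * len(BUCKET_BOUNDS_MS)
    for value in latencies_ms:
        counts[bisect.bisect_left(BUCKET_BOUNDS_MS, value)] += 1
    return counts


def percentile_from_histogram(counts, q):
    """Return the upper bound of the bucket containing the q-quantile (0 < q <= 1)."""
    target = q * sum(counts)
    running = 0
    for bound, count in zip(BUCKET_BOUNDS_MS, counts):
        running += count
        if running >= target:
            return bound
    return BUCKET_BOUNDS_MS[-1]


def exact_percentile(latencies_ms, q):
    """Nearest-rank percentile computed from the raw samples."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Made-up sample: mostly fast, a few slow, a handful of very slow requests.
latencies = [20] * 980 + [750] * 15 + [4000] * 5
counts = to_histogram(latencies)

print("exact p99:", exact_percentile(latencies, 0.99))
print("histogram p99:", percentile_from_histogram(counts, 0.99))
print("exact max:", max(latencies))
print("histogram max:", percentile_from_histogram(counts, 1.0))
```

The histogram can only say the worst requests took "more than 1000 ms", whereas a max gauge exported by the app would have kept the 4000 ms.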
If a vendor who can completely stop my operation has an outage, and the SLA says they owe me that 10% as a refund, I'm still having to deal with the 10x I'm losing because one of my vendors is having a bad day.
Those guarantees - if they even honor them, and if you can spare the time to chase them down - are still a quick road to bankruptcy.
So at the end of the day I probably have to raise my costs 10% in order to guarantee that no single vendor can drop me to 0%. And if those two vendors share a vendor, I may still be screwed.
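Back-of-the-envelope version of that math, with purely hypothetical numbers (a 10% service credit, and an outage that costs 10x the monthly vendor bill). The function name and parameters are illustrative, not taken from any real SLA.

```python
def net_outage_cost(monthly_vendor_spend, credit_percent, business_loss):
    """Loss left over after the SLA credit is applied.

    credit_percent is the share of the vendor bill refunded, e.g. 10 for 10%.
    """
    credit = monthly_vendor_spend * credit_percent / 100
    return business_loss - credit


# Hypothetical: $10k/month vendor bill, 10% credit, outage costs 10x the bill.
spend = 10_000
loss = 10 * spend
print(net_outage_cost(spend, 10, loss))
```

Under those assumptions the credit covers about 1% of the damage, which is why the refund is not much of a safety net.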
Reading the fine print on the SLAs is extremely important, because they often do not say what you think they say.
https://aws.amazon.com/legal/service-level-agreements/ https://www.microsoft.com/licensing/docs/view/Service-Level-... https://cloud.google.com/terms/sla/
I have seen refunds on the order of hundreds of thousands of dollars. It's cold comfort if the impact to you was on the order of millions of dollars, but still it is something. As you can see, it's not a free-money-a-thon; it's generally a % of your spend on the services which were not available.
There typically is a defined process for submitting a refund ticket, which will result in an availability review. This documented process is not always easy to find.
The only one I could easily find is for Microsoft:
https://learn.microsoft.com/en-us/partner-center/request-cre...
(It's just a support topic when you're submitting a support ticket)
I suspect the economics of being a CSP might not be so favorable if SLA refunds were automatic and you didn't have to work for them.
`/assets/app_bundle.js` failing will most likely be visible immediately and make everything else useless, unless you've been clever and only used JS for upgrading the website/app experience, rather than replacing it.
`/metrics/user-activity` failing won't (shouldn't) have any impact on the user experience.
`/stripe/payment-succeeded-callback` failing could have disastrous impacts on the user, but not immediately be visible when it's failing.
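One way to act on that is to give each endpoint its own target instead of judging a single overall success ratio. A minimal sketch follows; the targets and request counts are invented for illustration, and only the three paths come from the examples above.

```python
# Hypothetical per-endpoint success targets; stricter where failure hurts more.
ENDPOINT_TARGETS = {
    "/assets/app_bundle.js": 0.999,
    "/metrics/user-activity": 0.95,
    "/stripe/payment-succeeded-callback": 0.9999,
}


def overall_success_ratio(counts):
    """counts maps path -> (successes, total requests)."""
    succeeded = sum(ok for ok, _ in counts.values())
    total = sum(n for _, n in counts.values())
    return succeeded / total


def breached_endpoints(counts, targets=ENDPOINT_TARGETS):
    """Return the paths whose own success ratio is below their own target."""
    return [
        path
        for path, (ok, total) in counts.items()
        if total and ok / total < targets[path]
    ]


# Made-up traffic: the payment callback is failing 1% of the time.
counts = {
    "/assets/app_bundle.js": (100_000, 100_000),
    "/metrics/user-activity": (49_000, 50_000),
    "/stripe/payment-succeeded-callback": (990, 1_000),
}

print(overall_success_ratio(counts))
print(breached_endpoints(counts))
```

The overall ratio still looks healthy because the low-traffic payment callback barely moves it, while the per-endpoint check flags exactly the failure that is invisible but expensive.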
That's how monopolies work. They need not fear their customers.
In time, this becomes Orwell's "If you want a picture of the future, imagine a boot stamping on a human face – forever." Ask anyone who's had a dispute with the Apple app store.