Your nines are not my nines (2019)

(rachelbythebay.com)

106 points · thewarpaint · 2y ago · 31 comments

28 comments · 8 top-level

hughesjj · 2y ago · 6 in thread

Hot take:

I would love for service providers to publicly show the (downsampled!) alarms they actually use for operational excellence (served from a read replica, etc.).

Doing so would enforce that you actually have those in place, since they're public and now a marketing point. That said, I get the concern of trolls and competitors trying to get a "low score".

compumike · 2y ago

Do you mean something like this page: https://heiioncall.com/status ? We use all of these internally at Heii On-Call https://heiioncall.com/ , and get paged when any of these triggers are alerting.

Edit: the subtle difference from the article is that it sounds like you want historical data, rather than present-state data.

hinkley · 2y ago

Few things burn my butt more than chatting with five people (possibly not even working for the same company) who all see a service down while their status page shows green.

The fuck it is.

100% unavailable for 5% of your customers is very different than dropping 5% of requests in a uniform distribution.

gfv · 2y ago

The shining example is https://grafana.wikimedia.org .

hinkley · 2y ago

Any time my coworkers start acting like we are amazing for how many requests we handle per second I send them to wikimedia.org. That'll smack the smug right outta most people.

2023throwawayy · 2y ago

Interesting: their fastest backend responses seem to range from around 500ms up to a full second.

dilyevsky · 2y ago

You can show it delayed by like a week maybe

tekla · 2y ago · 5 in thread

I don't really get why cloud matters here. The exact same dynamic exists for on-prem services.

glitchc · 2y ago

In on-prem, at least one part of the business can raise alarms and if it is a big chunk of the revenue, then the rest of the business tends to sit up and take notice. Here, unless your business is Uber or DoD, it's too small for the cloud providers to sit up and take notice.

tekla · 2y ago

I don't find that to be the case at all. My place is a startup and we have a very close relationship with our Account Managers. We can have product leads on the phone within 24 hours if we encounter AWS internal issues.

I've found that AWS is MUCH better at dealing with problems than internal teams, since AWS has many more resources to throw at a problem.

reidjs · 2y ago

It’s a matter of control. You have practically zero direct control over the vendor-provided service because you are too small for them to care. If you control the system on-prem, you can at least attempt to fix it by hiring someone able to fix it or by diverting resources you already have.

Additionally, you can find ways to keep that failure from recurring, or to make it less destructive, in the future.

It’s important to factor that into decisions when choosing between a cloud provider and an on-prem solution.

aidenn0 · 2y ago

I think the point of TFA is, unless you have hundreds of on-prem services, one service going down for hours will significantly move the needle in your monitoring.

rlpb · 2y ago

> The exact same dynamic exists for on-prem services.

With on-prem, you can choose to defer deploying that significant (and therefore higher risk) change when it's crunch time for your business. That can reduce regression impact considerably.

thegrim33 · 2y ago · 3 in thread

Sure, there's the issue of what your contract says and what the guarantee is, but all these companies do already track their metrics in ways that at least attempt to detect and respond to the problems the author describes.

They track their metrics by p50 (the median, i.e. the typical performance/reliability across everyone) but also by p99, p99.9, etc., which capture the performance/reliability of the extreme edge cases, such as exactly what the author is describing. They already do evaluate their systems from the perspective of how they're performing for the worst-affected customers. Again, maybe the issue is the contract itself, sure, but they do already try their best to prevent a small handful of customers from getting overly affected by something.
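To make the p50-versus-tail distinction concrete, here is a minimal sketch using a nearest-rank percentile over made-up latency samples. The numbers and the `percentile` helper are assumptions for illustration, not anything a specific provider uses.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies: 985 fast requests and 15 that hit a broken path.
latencies_ms = [100] * 985 + [5000] * 15

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
mean = statistics.mean(latencies_ms)

print(f"p50={p50}ms p99={p99}ms mean={mean}ms")
```

In this toy data the median looks healthy while p99 lands squarely on the slow requests, which is why the tail percentiles are the ones worth alerting on.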

remram · 2y ago

I remember seeing a talk years ago about percentiles and how they lie: https://www.youtube.com/watch?v=lJ8ydIuPFeU

You should be exposing the maximum metric from your app; computing a percentile from an aggregated histogram is lossy.

[edit: Found the link, "How NOT to Measure Latency" by Gil Tene]
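A rough sketch of why that is lossy, assuming fixed-bucket histograms that are merged across hosts. The bucket bounds, host data, and helper names below are all hypothetical; real metrics systems differ in detail.

```python
import bisect
import math

# Hypothetical bucket upper bounds in milliseconds.
BUCKET_UPPER_BOUNDS_MS = [10, 50, 100, 500, 1000, 5000, 60000]


def to_histogram(samples):
    """Count samples per bucket; anything above the last bound is clamped into it."""
    counts = [0] * len(BUCKET_UPPER_BOUNDS_MS)
    for sample in samples:
        idx = bisect.bisect_left(BUCKET_UPPER_BOUNDS_MS, sample)
        counts[min(idx, len(counts) - 1)] += 1
    return counts


def estimate_percentile(counts, p):
    """Return the upper bound of the bucket that contains the p-th percentile."""
    rank = max(1, math.ceil(p * sum(counts) / 100))
    seen = 0
    for upper, count in zip(BUCKET_UPPER_BOUNDS_MS, counts):
        seen += count
        if seen >= rank:
            return upper
    return BUCKET_UPPER_BOUNDS_MS[-1]


# Two hypothetical hosts; one of them served a single 45-second request.
host_a = [20] * 99 + [45000]
host_b = [20] * 100

merged = [a + b for a, b in zip(to_histogram(host_a), to_histogram(host_b))]

estimated_p99 = estimate_percentile(merged, 99)
estimated_p100 = estimate_percentile(merged, 100)
true_max = max(host_a + host_b)

print(estimated_p99, estimated_p100, true_max)
```

The merged p99 reports a bucket edge that says nothing about the 45-second request, and even the top bucket only gives its upper bound rather than the real worst case. Exporting the maximum alongside the histogram keeps that information.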

hinkley · 2y ago

Here's the thing though. If I'm selling a product and I'm sending more than 10% of the money to a single vendor I have several problems.

If a vendor who can completely stop my operation has an outage, and the SLA says they owe me that 10% as a refund, I'm still having to deal with the 10x I'm losing because one of my vendors is having a bad day.

Those guarantees - if they even honor them, and if you can spare the time to chase them down - are still a quick road to bankruptcy.

So at the end of the day I probably have to raise my costs 10% in order to guarantee that no single vendor can drop me to 0%. And if those two vendors share a vendor, I may still be screwed.
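The arithmetic behind that 10x can be sketched as follows. All figures and the `outage_exposure` helper are hypothetical, and the example assumes the worst case where the vendor's outage stops the whole business for the billing period and the SLA credits the full vendor bill.

```python
def outage_exposure(period_revenue, vendor_spend, outage_fraction, credit_fraction):
    """Return (lost_revenue, sla_credit) for one billing period.

    outage_fraction: share of the period during which the business could not operate.
    credit_fraction: share of the vendor bill the SLA refunds (hypothetical terms).
    """
    lost_revenue = period_revenue * outage_fraction
    sla_credit = vendor_spend * credit_fraction
    return lost_revenue, sla_credit


# Hypothetical business: 100,000 in revenue, 10% of it paid to one critical vendor.
lost, credit = outage_exposure(
    period_revenue=100_000,
    vendor_spend=10_000,
    outage_fraction=1.0,
    credit_fraction=1.0,
)
ratio = lost / credit

print(f"lost={lost} credit={credit} ratio={ratio}")
```

Even with a full refund of the vendor bill, the loss is ten times the credit under these assumptions; a partial credit only makes the gap wider.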

lrem · 2y ago

Google loves to talk about billions of users. That is quite a few nines. Obviously there are fewer users of cloud than search. But an engineer can only care about so many before they need to save their sanity. Human attention is the one thing that’ll never scale.

RajT88 · 2y ago · 2 in thread

The way it works with cloud providers is that you can file for a refund for an SLA breach. After all, those SLAs are at a service level for the customer. If you're yelling at support or engineering on the phone, you're likely getting the nines treatment the author describes. That is the wrong forum for holding the provider accountable, unless you're yelling about mitigation time (then, best of luck to you!).

Reading the fine print on the SLAs is extremely important, because they often do not say what you think they say.

https://aws.amazon.com/legal/service-level-agreements/
https://www.microsoft.com/licensing/docs/view/Service-Level-...
https://cloud.google.com/terms/sla/

I have seen refunds on the order of hundreds of thousands of dollars. It's cold comfort if the impact to you was on the order of millions of dollars, but still it is something. As you can see it's not a free-money-a-thon, it's generally a % of your spend of the services which were not available.

There typically is a defined process for submitting a refund ticket, which will result in an availability review. This documented process is not always easy to find.

The only one I could easily find is for Microsoft:

https://learn.microsoft.com/en-us/partner-center/request-cre...

(It's just a support topic when you're submitting a support ticket)

smcleod · 2y ago

Leaving the work to the victim isn’t exactly great, and getting a refund of some credits that you were going to spend with them anyway often doesn’t come close to the reputational loss + time spent on the issue. The incentives for the large players are all based on making more year-on-year profit.

RajT88 · 2y ago

I mean, yeah.

I suspect the economics of being a CSP may not be so favorable if SLA refunds were automatic and you didn't have to work for them.

hinkley · 2y ago · 2 in thread

There's an old joke that goes something like, "Most of the people chasing five nines uptime achieved five eights."

hn_go_brrrrr · 2y ago

I know of a case where an engineer asked for "nine fives" of reliability. The recipient naturally misread it.

dieselgate · 2y ago

Is the moral of the story people should start by chasing "one nine at a time" or something?
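For a sense of scale on the joke, here is a small sketch converting an availability percentage into allowed downtime per year. It assumes a 365-day year and uses `Decimal` to avoid floating-point noise; the helper name is made up for illustration.

```python
from decimal import Decimal

MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def downtime_minutes_per_year(availability_percent):
    """Minutes of downtime per year implied by an availability percentage given as a string."""
    unavailable = (Decimal(100) - Decimal(availability_percent)) / Decimal(100)
    return unavailable * MINUTES_PER_YEAR


five_nines = downtime_minutes_per_year("99.999")
five_eights = downtime_minutes_per_year("88.888")

print(f"five nines:  {five_nines} minutes/year")
print(f"five eights: {five_eights} minutes/year")
```

Five nines allows a little over five minutes of downtime a year; five eights works out to roughly forty days.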

sjsdaiuasgdia · 2y ago · 1 in thread

This is a concept I've had to explain to entirely too many teams over the years, that 0.001% of requests failing as a (mostly) random distribution of all requests is very different than a 0.001% subset of requests that will fail (nearly) every time until the underlying issue is mitigated. They look the same on a high level dashboard but they are completely different conditions in terms of how the customer will feel it, and understanding which kind of problem you have also guides the investigation and troubleshooting process.
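A minimal sketch of the two conditions, using 1% instead of 0.001% to keep the numbers small. The request data and the `error_rates` helper are hypothetical; the point is only that the aggregate rate is identical while the per-customer view is not.

```python
from collections import Counter


def error_rates(requests):
    """requests: iterable of (customer_id, ok) pairs. Returns (overall_rate, per_customer_rates)."""
    total = Counter()
    failed = Counter()
    for customer, ok in requests:
        total[customer] += 1
        if not ok:
            failed[customer] += 1
    overall = sum(failed.values()) / sum(total.values())
    per_customer = {customer: failed[customer] / total[customer] for customer in total}
    return overall, per_customer


# 100 hypothetical customers, 100 requests each.
# Spread out: every customer loses exactly one request.
spread_out = [(c, i != 0) for c in range(100) for i in range(100)]
# Concentrated: customer 0 loses every request, everyone else loses none.
concentrated = [(c, c != 0) for c in range(100) for i in range(100)]

overall_spread, per_customer_spread = error_rates(spread_out)
overall_conc, per_customer_conc = error_rates(concentrated)

worst_spread = max(per_customer_spread.values())
worst_conc = max(per_customer_conc.values())

print(overall_spread, worst_spread)
print(overall_conc, worst_conc)
```

Both datasets show the same 1% on a top-level dashboard, but tracking the worst per-customer rate separates "everyone retries once" from "one customer is completely down".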

capableweb · 2y ago

In addition, some requests are more important than others.

`/assets/app_bundle.js` failing will most likely be visible immediately and make everything else useless, unless you've been clever and only used JS for upgrading the website/app experience, rather than replacing it.

`/metrics/user-activity` failing won't (shouldn't) have any impact on the user experience.

`/stripe/payment-succeeded-callback` failing could have disastrous impacts on the user, but not immediately be visible when it's failing.
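One way to act on this is to track error rates per route with a threshold that depends on how much the route matters. The sketch below uses the three routes from the comment; the criticality tiers, thresholds, and request counts are assumptions made up for illustration.

```python
# Hypothetical criticality tiers for the routes mentioned above.
ROUTE_CRITICALITY = {
    "/assets/app_bundle.js": "critical",
    "/metrics/user-activity": "low",
    "/stripe/payment-succeeded-callback": "critical",
}

# Hypothetical maximum tolerated error rate per tier.
MAX_ERROR_RATE = {"critical": 0.001, "low": 0.05}


def routes_breaching(counts):
    """counts: {route: (failed, total)}. Returns routes whose error rate exceeds their tier's limit."""
    breaching = []
    for route, (failed, total) in counts.items():
        tier = ROUTE_CRITICALITY.get(route, "low")  # unknown routes default to the low tier
        if total and failed / total > MAX_ERROR_RATE[tier]:
            breaching.append(route)
    return sorted(breaching)


# Hypothetical traffic: the payment callback is rare but failing half the time.
counts = {
    "/assets/app_bundle.js": (0, 50_000),
    "/metrics/user-activity": (400, 40_000),
    "/stripe/payment-succeeded-callback": (50, 100),
}

total_failed = sum(failed for failed, _ in counts.values())
total_requests = sum(total for _, total in counts.values())
aggregate_error_rate = total_failed / total_requests

breaching = routes_breaching(counts)
print(f"aggregate={aggregate_error_rate:.4f} breaching={breaching}")
```

The aggregate error rate here is about half a percent and dominated by the harmless metrics route, while the low-volume payment callback is the only route actually in trouble.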

ChrisArchitect · 2y ago · 1 in thread

(2019)

ChrisArchitect · 2y ago

lots of discussion then: https://news.ycombinator.com/item?id=20451714

Animats · 2y ago

"You are the bug on the windscreen of the locomotive. The train has no idea you were ever there." - Rachel by the Bay.

That's how monopolies work. They need not fear their customers.

In time, this becomes Orwell's "If you want a vision of the future, imagine a boot stamping on a human face – forever." Ask anyone who's had a dispute with the Apple app store.
