0 points · johngalt · 9y ago · 0 comments

Sysadmin: I can forgive outages, but falsely reporting 'up' when you're obviously down is a heinous transgression.

Somewhere a sysadmin is having to explain to a mildly technical manager that AWS services are down and affecting business critical services. That manager will be chewing out the tech because the status site shows everything is green. Dishonest metrics are worse than bad metrics for this exact reason.

Any sysadmin who wasn't born yesterday knows that service metrics are gamed relentlessly by providers. Bluntly, there aren't many of us, and we talk. Message to all providers: sysadmins losing confidence in your outage reporting has a larger impact than you think, because we will be the ones called to the carpet to explain why <services> are down when <provider> is lying about being up.

26 comments · 11 top-level

carbocation · 9y ago · 9 in thread

People were joking about this but it turns out to be true: they host the status icons on their service: https://twitter.com/awscloud/status/836656664635846656

devy · 9y ago

Due to HN's flaky Cloudflare 503 Bad Gateway error, I noticed that Cloudflare is also being affected by S3 being down in a similar but subtle way. See the broken logo in the upper left-hand corner of their status page.[1] It was actually linking directly to an S3 URL: https://s3.amazonaws.com/statuspage-production/pages-transac...

[1]: https://www.cloudflarestatus.com/

swearfu · 9y ago

Saw that too, sounds like a convenient excuse for being caught in a lie.

AWS Employee #1: Hey, people are catching on that our status page isn't accurate

AWS Employee #2: Tell them it's cause of S3

paulddraper · 9y ago

The icons aren't hosted there (or if they are, they are cached). https://status.aws.amazon.com/images/status3.gif

The status information is hosted there.

sirn · 9y ago

While the status icon being hosted on S3 is funny, I think it's unlikely that the icon itself is what kept the status page from being updated. More likely, the service information used to generate the status page (say, a JSON file) was stored on S3. The banner could probably be configured locally, so they chose to update that for the time being (e.g. while moving the status bucket somewhere else).

buryat · 9y ago

> Update at 11:35 AM PST: We have now repaired the ability to update the service health dashboard.

jpwgarrison · 9y ago

I like how HN (and others) handle this - there should be a static link to a 3rd party source, like a twitter feed, at the top of any status page.

taobility · 9y ago

If it's that simple, why didn't the text description under Details reflect the incident either?

jaequery · 9y ago

Is there any service that distributes your files to multiple cloud services at the same time? With this recent S3 outage, I'm now uneasy about storing files on S3 for mission-critical apps.
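
To make the idea concrete, here is a minimal application-level sketch of writing each file to several providers and reading from the first one that answers. It is an illustration under stated assumptions, not a recommendation of any product: `MemoryBackend` and `ReplicatedStore` are hypothetical names, and the in-memory backend stands in for whatever real provider clients you would wrap.

```python
from typing import Dict, List, Optional, Protocol


class Backend(Protocol):
    """Minimal interface a provider wrapper would need (hypothetical)."""
    name: str

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...


class MemoryBackend:
    """In-memory stand-in for a real provider client.

    Set `available = False` to simulate a provider outage.
    """

    def __init__(self, name: str) -> None:
        self.name = name
        self.available = True
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        if not self.available:
            raise ConnectionError(f"{self.name} is unavailable")
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        if not self.available:
            raise ConnectionError(f"{self.name} is unavailable")
        return self._objects[key]


class ReplicatedStore:
    """Writes to every backend, reads from the first one that responds."""

    def __init__(self, backends: List[Backend]) -> None:
        self.backends = list(backends)

    def put(self, key: str, data: bytes) -> List[str]:
        """Write everywhere; return the names of backends that failed."""
        failed = []
        for backend in self.backends:
            try:
                backend.put(key, data)
            except ConnectionError:
                failed.append(backend.name)
        if len(failed) == len(self.backends):
            raise ConnectionError("no backend accepted the write")
        return failed

    def get(self, key: str) -> bytes:
        """Try each backend in order until one returns the object."""
        last_error: Optional[Exception] = None
        for backend in self.backends:
            try:
                return backend.get(key)
            except (ConnectionError, KeyError) as exc:
                last_error = exc
        raise ConnectionError(f"no backend could serve {key!r}") from last_error
```

The sketch leaves out the hard part: a backend that missed a write during its outage has to be repaired afterwards, which is why `put` reports which copies failed.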

kangman · 9y ago

ins3ption

MaxfordAndSons · 9y ago · 2 in thread

It's unbelievable that the status page is still showing green checkmarks, almost what, 2 hours into the outage?

edit: oh, it is actually because of the outage! So if they can't get a fresh read on the service status from s3, they just optimistically assume it's green... even though the service failing to provide said read... is one of the services they're optimistically showing as green XD

purplecones · 9y ago

They can't change it due to the S3 issue. See their twitter post: https://twitter.com/awscloud/status/836656664635846656

sitkack · 9y ago

Same related flaw as Three Mile Island. Fail closed and measure the output, not the intent.
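
A rough sketch of what "fail closed" could mean for a status dashboard: the displayed state comes from the result of an actual end-to-end check, and missing or stale data is shown as unknown rather than green. The names (`ProbeResult`, `dashboard_status`) and the 120-second freshness window are assumptions for illustration; nothing here describes how the AWS dashboard really works.

```python
from dataclasses import dataclass
from typing import Optional

GREEN, RED, UNKNOWN = "green", "red", "unknown"


@dataclass
class ProbeResult:
    ok: bool           # did the end-to-end check succeed?
    checked_at: float  # Unix timestamp of when the check ran


def dashboard_status(result: Optional[ProbeResult], now: float,
                     max_age: float = 120.0) -> str:
    """Fail closed: no data, or stale data, is never displayed as green."""
    if result is None:
        # The status source itself could not be read.
        return UNKNOWN
    if now - result.checked_at > max_age:
        # The last successful read is too old to trust.
        return UNKNOWN
    return GREEN if result.ok else RED
```

The optimistic version of this function would return green in the first two branches, which is the failure mode described above.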

dragonwriter · 9y ago · 1 in thread

> Because we [sysadmins] will be the ones called to the carpet to explain why <services> are down when <provider> is lying about being up.

But isn't that the whole point of lying: to the less technical manager (often the only person whose view matters at major customers), the status board saying "up" means the problem is the sysadmins, not the vendor.

mjcl · 9y ago

That works in the vendor's favor in the short term, but can screw them in the long term because you get staff who go the extra mile to avoid the vendor in the future, including structuring requirements to avoid them.

For example, by experience and gossip I know Windstream has awful reliability, but they handwave that away. By including a requirement I knew they couldn't meet (dynamic E911), they were knocked out of a 200 site VoIP RFP early.

rdtsc · 9y ago · 1 in thread

> but falsely reporting 'up' when you're obviously down is a heinous transgression.

When SLAs are in play, along with job performance scores and bonuses, there is probably a strong incentive to fudge numbers. It can be done officially ("Ah but sub-chapter 3 of chapter X in the fine print explains this wasn't technically an outage") or unofficially.

vocatus_gate · 9y ago

When I worked in Antarctica any outage affecting users that lasted over 50 minutes was considered an official "outage" and had to be reported to mission command. So of course ALL maintenance was rolled back/backed out if it came anywhere even close to 50 minutes, just so we wouldn't have to fill out the stupid outage paperwork.

primitivesuave · 9y ago · 1 in thread

Thank you for the insight. Could you and/or any sysadmin on here elaborate on what a "nail in the coffin" situation might look like? For example, is this current outage with inaccurate status updates enough to seriously consider migrating to another CDN provider? If so, which one would you migrate to?

i336_ · 9y ago

Disclaimer, not a job-toting sysadmin quite yet, but here's my 2¢:

- Architectural SPOFs (single points of failure) need to be carefully weighed up in any design, and "ALL our files are on $single_provider" is one such huge red flag. Unfortunately these considerations are all too frequently drowned out by the ease of going with the path of least resistance.

For example GitHub occasionally goes down, which breaks a remarkable amount of infrastructure: a huge number of people don't know how to use Git, do full clones from scratch each time, and have no idea how to work without a server (even though Git is built to work locally); CI systems tend to want to do green-field rebuilds, so start out with empty directory trees and need to do full clones each build (I'm not sure if any CI systems come with out-of-the-box Git caching); GH-powered authentication systems fall apart; etc. Kinda crazy, scary and really annoying, but yeah.

In terms of "nail in the coffin", that depends on a lot of factors, including a subjective analysis of how much local catastrophe was caused by the incident; subjective opinions about the provider's reaction to the issue, what they'll do to mitigate it, perhaps how transparent they are about it; etc.

Ultimately, the Internet likes to pretend that AWS and cloud computing are basically rock-solid. Unfortunately they're not, and stuff goes down. There were some truly redundant architecture experiments in the 80s (for example, the Tandem Nonstop Computer, one of which was recently noted to have been running continuously for 24 years: https://news.ycombinator.com/item?id=13514909) but x86 never really went there, and superscalar computing is built on a sped-up version of the same ideas that connect desktop computers together, so while there are lots of architectural optical illusions, well, stuff falls apart.

- Everyone in this thread is talking about Google Compute Engine, but it really depends on your usage patterns and requirements. GCE is pretty much the single major competitor to AWS, although the infrastructure is _completely_ different - different tools, different APIs, different pricing infrastructure. The problem is that it's not like MySQL vs PostgreSQL or Ubuntu vs Debian; it's like SQL vs Redis, or Linux vs BSD. Both work great, but you basically have to do twice the integration work, and map things manually. With this said, if you don't have particularly high resource usage, VPS or dedicated hosting may actually work out more cost-effectively.

TL;DR: you go back to the SPOF problem, where _you_ have to foot the technical debt for the reliability level you want. Yay.

jnordwick · 9y ago · 1 in thread

Why is it always the manager that is the bad guy in these scenarios? Haven't we grown up yet?

johngalt (OP) · 9y ago

The manager is not the bad guy. They are doing everything they should do in the scenario I presented: checking into an outage affecting a critical system, and criticizing the sysadmin's findings based on the evidence that Amazon's status page disagrees. I don't expect a non-technical party to believe me over Amazon.

The bad guys are the providers who report false positives to preserve metrics.

paulddraper · 9y ago

Hurry, look now, so you can tell your grandchildren!!!

Greenish ELB, RDS.

Yellow EC2, Lambda.

Red S3, Auto Scaling.

EDIT: A few dozen services in us-east-1 are down/degraded.

discreditable · 9y ago

This is why I always set up my own monitoring for services in addition to the provider's status page. Simple SmokePing graphs have saved me a ton of time when it comes to troubleshooting provider outages. It especially helps when I can show them exactly when there are problems.
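
As a rough illustration of independent monitoring, the sketch below records a timestamped result for each check of an endpoint. It is a plain HTTP probe using Python's standard library, not SmokePing (which graphs latency over time), and the names `probe` and `http_fetch` and the example URL are assumptions, not anything from a specific tool.

```python
import time
import urllib.request
from typing import Callable, Optional


def http_fetch(url: str, timeout: float) -> int:
    """GET the URL and return the HTTP status code.

    urlopen raises on network errors and on 4xx/5xx responses.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def probe(url: str,
          fetch: Callable[[str, float], int] = http_fetch,
          timeout: float = 5.0) -> dict:
    """Run one check and return a timestamped record of what happened."""
    checked_at = time.time()    # wall-clock time, for the evidence trail
    started = time.monotonic()  # monotonic clock, for the elapsed time
    status: Optional[int] = None
    error: Optional[str] = None
    try:
        status = fetch(url, timeout)
        ok = 200 <= status < 300
    except Exception as exc:    # any failure counts as "down"
        ok = False
        error = repr(exc)
    return {
        "url": url,
        "ok": ok,
        "status": status,
        "error": error,
        "checked_at": checked_at,
        "elapsed": time.monotonic() - started,
    }


if __name__ == "__main__":
    # Hypothetical endpoint; replace with a URL your service depends on.
    print(probe("https://example.com/"))
```

Run from cron or a loop and appended to a log, records like these give you your own timeline of when the provider was actually unreachable.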

rabidonrails · 9y ago

Just commenting here because hopefully people can see: AWS status page updated: 1:44 CST

jnordwick · 9y ago

Why is it always the manager that is the bad guy in these scenarios? Haven't we grown up yet?

rdiddly · 9y ago

It's not a lie, it's an "alternative fact" about how totally like awesome AWS is!
