AWS status updates not working due to S3

(twitter.com)

126 points · joshua_wold · 9y ago · 47 comments

simplehuman 9y ago · 9 in thread

It baffles me that AWS, a leader in cloud computing, can make such a rudimentary mistake. Seriously, I interviewed there and they asked me to write a B+ tree and I failed. And then you see fundamental errors like this, which cannot possibly be made by people who had the smarts to write B+ trees in 15 minutes...

I want to take this opportunity to complain about the interview system. Hire people who care about the product and company. Such mistakes cannot be made by people who care.

wheaties 9y ago

Writing a B+ tree from memory and making sure your infrastructure isn't doing something stupid are fundamentally different skills. One requires that you regurgitate the contents of a textbook on a whiteboard, the other that you can engineer a solution. I wish them well with an interview set up for hiring the former; I try to hire the latter.

simplehuman 9y ago

> Writing a B+ tree from memory and making sure your infrastructure isn't doing something stupid are fundamentally different skills.

It's funny. You know it. I know it. The entire HN knows it. And yet _no_ interview follows any such common-sense rules. Just go to a Google/FB interview and they ask you all sorts of questions. It doesn't matter what you are interviewing for. In fact, in many cases they don't even tell you which group/team/project you will be assigned to, since they will "assess" where you fit best.

dragonwriter 9y ago

> cannot be made by people who had the smarts to write b+ trees in 15 minutes...

Writing B+ trees at the drop of a hat is probably more a signal of memorization and recency of taking a data structures class than smarts, particularly the smarts necessary to develop and maintain robust distributed infrastructure.

gdulli 9y ago

1. The number of people at Amazon who need to be able to write a b+ tree is negligible.

2. The number of people at Amazon who need to be able to choose a search/sort/etc. algorithm or data structure and understand why one is more appropriate than another for a given use case is much higher.

3. The number of people at Amazon who need to demonstrate common sense is very high. This skill is much more closely related to #2 than #1.

burntrelish1273 9y ago

Yup, it's BS in lieu of real problems.

CS fundamentals are nice to know, but how often does one implement something custom like BigTable/Colossus from scratch vs. buy/use OTS? The support/scalability/technical debt/unforeseen costs of implementing something entirely new are typically much greater than those of using adequate "lego" that already exists.

Judgement of cost/benefit DIY vs. OTS can be gained (hopefully) without too much wasted effort, time, money, morale & business life-expectancy.

dvnguyen 9y ago

I doubt tech companies have data to back up this type of interview, such as a correlation between people who do well in whiteboard binary-tree-inversion interviews and people who do well in real-world jobs. Or they just do it because that's the way Google does it.

kevan 9y ago

>Such mistakes cannot be made by people who care.

That's a very naive assertion. Humans make mistakes, they always have and they always will, no matter how smart they are and how much they care. That's why pilots have checklists that they go through before they're even allowed to leave the gate.

simplehuman 9y ago

This statement (it starts with 'such') was about a very specific mistake. This is not some hard engineering problem. People who 'care' about the authenticity of the status page will not make the error of basing it on the infrastructure it monitors.

burntrelish1273 9y ago

Because of bureaucracy, siloization, Tragedy of the Commons, and likely a massive infrastructure with tons of technical debt and an inability to make major changes except by gradual incrementalism, often too late. Infrastructure needs active Simian Army-style breakage finding without "sacred cows", accepting less than 100% uptime across all services, to ferret out outage edge cases.

cwmma 9y ago · 5 in thread

So the obvious answer would be to host it on like azure or google cloud storage but I can just imagine the institutional push back that would get trying to do that.

idlewords 9y ago

What if I told you you could make a red dot without hosting an image anywhere?

doubleplusgood 9y ago

Red dot as a service?

cwmma 9y ago

What if I told you that the status text didn't update?

heavymark 9y ago

Seems like another commenter beat me to the punch, but to me the obvious answer would seem to be not hosting images at all, as they could create a red circle in CSS.

Or, as others noted, reverse the logic so that it shows red icons by default and only replaces them with green icons while the services are working. When those external services go down, it would fall back to the red icon.
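
A minimal sketch of both ideas in this comment, assuming hypothetical status strings ("ok", "degraded") and made-up colors; it is an illustration, not how the AWS dashboard is built. The dot is drawn with inline CSS so no image has to be fetched, and anything other than an explicit "ok" or "degraded" report falls back to red:

```python
from html import escape

# Hypothetical status names and colors; only an explicit report earns a non-red dot.
COLORS = {"ok": "#2e8b57", "degraded": "#e6a700"}
DEFAULT_COLOR = "#cc0000"  # red: used when the status is missing or unrecognized


def render_status_row(service, status=None):
    """Return one self-contained HTML list item; the dot is pure CSS, no image request."""
    color = COLORS.get(status, DEFAULT_COLOR)
    dot = (
        '<span style="display:inline-block;width:12px;height:12px;'
        f'border-radius:50%;background:{color}"></span>'
    )
    return f"<li>{dot} {escape(service)}</li>"


print(render_status_row("S3", "ok"))
print(render_status_row("S3"))  # no report received, so the dot stays red
```

The output is plain static HTML, so it could be written to a file and served by any web server with no other assets.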

jvolkman 9y ago

The status text also didn't update. Seems the S3 dependency is more than just icon hosting.

Cafey 9y ago · 3 in thread

This should be the official anti-pattern when designing a status page.

qeternity 9y ago

It is... it's literally the reason products like statuspage.io exist, because if your status page has any dependencies on the services for which it provides statuses, then it's not really a useful status page.

simplehuman 9y ago

And yet, I cannot find any obvious information on where statuspage is hosted.

BrailleHunting 9y ago

The status page should work with just IPv4 or IPv6, BGP and round-robin on a bunch of location-diverse, simple, real metal, web-boxes that only serve status.

brational 9y ago · 3 in thread

Drone crashes into my living room with groceries. Receive email that my package was successfully delivered.

throwaway29292 9y ago

I would hate to imagine drones dropping out of the sky if S3 went down in the future.

burntrelish1273 9y ago

Meanwhile, on InfoWars "S3: NWO AI mothership ready to hijack everything" ;)

Rapzid 9y ago

Cracker! ROTFL

ohstopitu 9y ago · 3 in thread

Just to be clear...best practices with designing status pages:

1. ensure it does not depend on your infra (if your api server goes down - it should not take down your status api with it)

2. make sure your service reports to your status page instead of your status page looking for the service.

3. redundancy for your status page?

Anything anyone else wants to add?

mschuster91 9y ago

> 1. ensure it does not depend on your infra (if your api server goes down - it should not take down your status api with it)

Many people forget DNS in the equation.

If it's on a subdomain of your regular site, it will go down in case the domain is accidentally/maliciously transferred or legal authorities seize/block it (we're seeing the extremely long arm of the US law enforcement with Mr. Dotcom, as well as Erdogan and other dictators or the Chinese firewall).

If it's on a different domain that's on the same DNS hoster (e.g. Amazon's Route 53, or for that matter your own hoster!) you're screwed if the DNS fails.

If it's via the same registrar, you're screwed if someone obtains access to your registrar account (this once again includes law enforcement).

Obviously this also holds true for the TLD itself: if Verisign (which runs .com and .net) has problems, you want your status page on a .info, for example.

Conclusion: different datacenter/provider for the HTTP server part, different DNS provider(s), different TLD. For the datacenter and DNS provider level you can use high-availability (multiple different NS entries, multiple different servers), this can also protect from legal overreach.

Also, your status page may have a negligible load as long as your service is operating fine, but people tend to go to status pages and manically press Cmd+R until there's a green light - so best use nginx/lighttpd with static pages and minimal assets only.

If you're running HTTPS on your main site and you do choose to name it "status.mydomain.com", also deploy HTTPS on your status page - else people visiting status.mydomain.com may transmit session cookies in cleartext in case you forgot the SECURE flag or the client does not honor this (for whatever reason).

Oh, and do buy a separate HTTPS cert instead of using your usual wildcard cert or your primary cert with the status page as SAN, so your status page stays up when your primary cert expires...
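
The checklist in this comment can be sketched as a small audit function. Everything here is an assumption made for illustration: the dict keys (`host`, `nameservers`, `registrar`, `hosting`), the example hostnames, and the crude rule that a DNS provider is identified by the last two labels of its nameserver name. It does no network lookups; it only compares values you supply.

```python
def tld(hostname):
    """Last DNS label, e.g. 'com' for 'status.example.com'."""
    return hostname.rstrip(".").rsplit(".", 1)[-1].lower()


def provider_of(nameserver):
    """Crude provider key: the last two labels of a nameserver hostname."""
    return ".".join(nameserver.rstrip(".").lower().split(".")[-2:])


def shared_dependencies(main, status):
    """List the failure domains the status page shares with the main site.

    `main` and `status` are dicts with the hypothetical keys
    'host', 'nameservers', 'registrar' and 'hosting'.
    """
    problems = []
    if tld(main["host"]) == tld(status["host"]):
        problems.append("same TLD")
    main_ns = {provider_of(ns) for ns in main["nameservers"]}
    status_ns = {provider_of(ns) for ns in status["nameservers"]}
    if main_ns & status_ns:
        problems.append("shared DNS provider")
    for key in ("registrar", "hosting"):
        if main[key] == status[key]:
            problems.append(f"same {key}")
    return problems


main_site = {
    "host": "www.example.com",
    "nameservers": ["ns1.dns-a.example", "ns2.dns-a.example"],
    "registrar": "registrar-a",
    "hosting": "cloud-a",
}
status_site = {
    "host": "status.example.com",
    "nameservers": ["ns1.dns-a.example"],
    "registrar": "registrar-a",
    "hosting": "cloud-b",
}
print(shared_dependencies(main_site, status_site))
# ['same TLD', 'shared DNS provider', 'same registrar']
```

An empty list would mean the two sites share none of the failure domains checked here; it says nothing about dependencies the function does not know about, such as a shared certificate.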

janywer 9y ago

Not sure about 2 - What are your arguments for this?

If the status page relies on getting updated information from the service, it may not even notice when the whole thing just crashes and goes down in flames. Attempting to do some predefined calls to the service to evaluate whether it is working correctly appears like a better solution?

djsumdog 9y ago

Yea I was wondering about that comment too. I mean you can do both. Your status page should be static, updated by a service which both polls and accepts information from your services. You ideally want to go yellow if one of the two fails.

But yes, in general, the status page and status services should be entirely on their own independent infrastructure; and in a different data centre. A number of providers offer independent status page services. If your entire company runs off Digital Ocean, your status page/services should probably be running on Linode or AWS or whatever.
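
One way to read the "do both, go yellow if one of the two fails" rule, as a sketch. The function name and the green/yellow/red strings are assumptions; how the push report and the poll result are obtained is left out entirely.

```python
def combine(push_ok, poll_ok):
    """Fold two independent health signals into one status color.

    push_ok: the service's own latest report said it was healthy
    poll_ok: an external probe of the service succeeded
    A missing signal (None) is treated the same as a failure.
    """
    if push_ok and poll_ok:
        return "green"
    if push_ok or poll_ok:
        return "yellow"  # the two signals disagree: one of them failed
    return "red"


for push, poll in [(True, True), (True, False), (False, True), (None, None)]:
    print(push, poll, combine(push, poll))
```

Yellow here means "the signals disagree", which covers both a service that reports healthy but cannot be reached and one that answers probes but has stopped reporting.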

BrailleHunting 9y ago · 2 in thread

This is a problem of "monoculture" dependencies and failure to implement HA by using multiple services. All Github releases are down, atom downloads are down and so on. Companies, including Amazon, should be using other CDNs for HA purposes, even if NIH.

It's a similar mistake of making DNS a dependency for monitoring/control infrastructure when DNS is down.

ghaff 9y ago

Assuming that it actually makes business sense to do so. There are certainly cases where you can make a perfectly rational business decision to depend on someone else's services and you're OK with your uptime not being any better than their uptime.

snuxoll 9y ago

> It's a similar mistake of making DNS a dependency for monitoring/control infrastructure when DNS is down.

This one is relatively easy to "fix", at least: it's nothing that having multiple DNS providers for public records, plus redundancy for your internal DNS services, can't handle. Bonus points if you run your own recursive resolver so you aren't dependent on some other party not screwing up somehow.

joshuak 9y ago · 1 in thread

So now you know that a deadman switch is the better way to report availability. The logic was backwards for this signal. The default condition is failed. Not failed requires proof.

It's interesting how easy it is to accidentally invert logical operations. I see it in code all the time. A condition will test that A is true when what they really need to know is if B and C are both false. It's like some kind of cognitive tic.
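
A minimal sketch of the deadman-switch idea described above, with assumed names (`DeadmanSwitch`, `heartbeat`, `ttl`). The default answer is "failed"; "ok" is only returned while a recent heartbeat proves the service is alive. The clock is injectable so the logic can be exercised without waiting.

```python
import time


class DeadmanSwitch:
    """Reports 'failed' unless a heartbeat arrived within the last `ttl` seconds."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock      # injectable so the logic can be tested
        self.last_beat = None   # no proof of life yet

    def heartbeat(self):
        """Called on behalf of the monitored service to prove it is still alive."""
        self.last_beat = self.clock()

    def status(self):
        if self.last_beat is None:
            return "failed"     # default condition: not failed requires proof
        age = self.clock() - self.last_beat
        return "ok" if age <= self.ttl else "failed"


# Usage with a fake clock, so no real time has to pass.
now = [0.0]
switch = DeadmanSwitch(ttl=30, clock=lambda: now[0])
print(switch.status())  # failed (no heartbeat yet)
switch.heartbeat()
print(switch.status())  # ok
now[0] = 31.0
print(switch.status())  # failed (heartbeat is older than ttl)
```

Note that nothing has to actively mark the service as down: if the heartbeats stop for any reason, including the reporting path itself breaking, the status decays to "failed" on its own.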

ryanbrunner 9y ago

That's good practice, sure, but their problem was even more fundamental than that. Their status page was dependent on the service it was reporting on being up. That fails the most basic requirement of a status page.

fred256 9y ago · 1 in thread

Looks like they've fixed it now. (The status page, not s3)

joshua_wold (OP) 9y ago

yup - https://twitter.com/awscloud/status/836662601090134017

alpb 9y ago · 1 in thread

> The dashboard not changing color is related to S3 issue.

I don't understand this. The icon URL is in the HTML. Both icons https://status.aws.amazon.com/images/status0.gif and https://status.aws.amazon.com/images/status3.gif have been working for us all along. Plus clearly they are able to update the status page contents, because they added the "increased error rates" message there too. I don't want to believe it but is it fair to assume they did not want to replace status0.gif with status3.gif in HTML? Please correct me if I'm not getting this straight.

In any case, it's a bad day for AWS folks, and I'm feeling their pain too. Being a cloud provider is a tough business to be in and the pressure is really high.

ryanbrunner 9y ago

One explanation might be that they use an internal tool to update the status page definitions, and parts of that tool are hosted on S3. Or that the status definitions themselves are hosted on S3 (and then read and transformed into the HTML page everyone sees)

paulpauper 9y ago

Ironic: the status update, doomed by its own downtime.
