I want to take this opportunity to complain about the interview system. Hire people who care about the product and company. Such mistakes cannot be made by people who care.
It's funny. You know it. I know it. All of HN knows it. And yet _no_ interview follows any such common-sense rules. Just go to a Google/FB interview and they ask you all sorts of questions. It doesn't matter what you are interviewing for. In fact, in many cases they don't even tell you which group/team/project you will be assigned to, since they will "assess" where you fit best.
Writing B+ trees at the drop of a hat is probably more a signal of memorization and recency of taking a data structures class than smarts, particularly the smarts necessary to develop and maintain robust distributed infrastructure.
2. The number of people at Amazon who need to be able to choose a search/sort/etc. algorithm or data structure and understand why one is more appropriate than another for a given use case is much higher.
3. The number of people at Amazon who need to demonstrate common sense is very high. This skill is much more closely related to #2 than #1.
CS fundamentals are nice to know, but how often does one implement something custom like BigTable/Colossus from scratch vs. buy/use OTS? The support/scalability/technical debt/unforeseen costs of implementing something entirely new are typically much greater than those of using adequate "lego" pieces that already exist.
Judgement of cost/benefit DIY vs. OTS can be gained (hopefully) without too much wasted effort, time, money, morale & business life-expectancy.
That's a very naive assertion. Humans make mistakes, they always have and they always will, no matter how smart they are and how much they care. That's why pilots have checklists that they go through before they're even allowed to leave the gate.
Or, as others noted, reverse the logic so that it shows red icons by default and replaces them with green icons only while the services are confirmed to be working. When those external services are down, it would go back to a red icon.
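To make the "red by default" idea concrete, here is a minimal sketch. It is not how any real status page is built: `icon_for`, the `reports` mapping and the icon file names are all hypothetical. The point is only that green requires positive confirmation, and anything else (a failure, a missing report, a `None`) stays red.

```python
from typing import Mapping, Optional

# Hypothetical icon file names, for illustration only.
RED_ICON = "red.gif"
GREEN_ICON = "green.gif"


def icon_for(service: str, reports: Mapping[str, Optional[bool]]) -> str:
    """Return the green icon only when the service was explicitly confirmed healthy."""
    healthy = reports.get(service)  # None when no report arrived at all
    # Anything other than an explicit True (False, None, missing) stays red.
    return GREEN_ICON if healthy is True else RED_ICON
```

With this shape, a broken reporting pipeline shows up as red instead of as a stale green.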
1. ensure it does not depend on your infra (if your api server goes down - it should not take down your status api with it)
2. make sure your service reports to your status page instead of your status page looking for the service.
3. redundancy for your status page?
Anything anyone else wants to add?
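Point 2 in the list above (the service reports to the status page, not the other way round) could look roughly like this. It is a sketch under assumptions: `HeartbeatBoard`, `report`, `is_up` and the 60-second default are made-up names and values, and the timestamps are passed in explicitly so the behaviour is easy to follow. The important part is the staleness check: a service that stops reporting is treated as down.

```python
import time
from typing import Dict, Optional


class HeartbeatBoard:
    """Status board fed by pushes from the services (hypothetical design)."""

    def __init__(self, max_age_seconds: float = 60.0) -> None:
        self.max_age_seconds = max_age_seconds
        self._last_seen: Dict[str, float] = {}

    def report(self, service: str, now: Optional[float] = None) -> None:
        """Called by the service itself to record that it is alive."""
        self._last_seen[service] = time.time() if now is None else now

    def is_up(self, service: str, now: Optional[float] = None) -> bool:
        """A service counts as up only if it reported recently; silence means down."""
        now = time.time() if now is None else now
        last = self._last_seen.get(service)
        return last is not None and (now - last) <= self.max_age_seconds
```

For example, after `board.report("api", now=100)`, `board.is_up("api", now=130)` is true, while `board.is_up("api", now=161)` is false because the last heartbeat is older than 60 seconds.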
Many people forget DNS in the equation.
If it's on a subdomain of your regular site, it will go down in case the domain is accidentally/maliciously transferred or legal authorities seize/block it (we're seeing the extremely long arm of US law enforcement with Mr. Dotcom, as well as Erdogan and other dictators or the Chinese firewall).
If it's on a different domain that's on the same DNS hoster (e.g. Amazon's Route 53, or for that matter your own hoster!) you're screwed if the DNS fails.
If it's via the same registrar, you're screwed if someone obtains access to your registrar account (this once again includes law enforcement).
Obviously this also holds true for the TLD itself - e.g. imagine Verisign (holding .com and .net) has problems, you want a .info, for example.
Conclusion: different datacenter/provider for the HTTP server part, different DNS provider(s), different TLD. For the datacenter and DNS provider level you can use high-availability (multiple different NS entries, multiple different servers), this can also protect from legal overreach.
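The conclusion above can be turned into a small checklist function. This is only a sketch: `Deployment`, `shared_failure_points` and all provider names are hypothetical, the inputs are filled in by hand (nothing is looked up via DNS or WHOIS), and the TLD is naively taken as the last label of the domain.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Deployment:
    domain: str
    registrar: str
    dns_providers: FrozenSet[str]
    hosting_provider: str


def tld(domain: str) -> str:
    """Naive TLD: the last dot-separated label, lowercased."""
    return domain.rstrip(".").rsplit(".", 1)[-1].lower()


def shared_failure_points(main: Deployment, status: Deployment) -> List[str]:
    """List what the status page still has in common with the main site."""
    problems: List[str] = []
    if status.domain == main.domain or status.domain.endswith("." + main.domain):
        problems.append("status page lives on the main domain")
    if tld(status.domain) == tld(main.domain):
        problems.append("same TLD")
    if status.registrar == main.registrar:
        problems.append("same registrar")
    # Flag only when the status page has no DNS provider of its own.
    if status.dns_providers <= main.dns_providers:
        problems.append("no independent DNS provider")
    if status.hosting_provider == main.hosting_provider:
        problems.append("same hosting provider")
    return problems
```

An empty result means none of the listed single points of failure are shared, at least as far as the hand-entered data goes.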
Also, your status page may have a negligible load as long as your service is operating fine, but people tend to go to status pages and manically press Cmd+R until there's a green light - so best use nginx/lighttpd with static pages and minimal assets only.
If you're running HTTPS on your main site and you do choose to name it "status.mydomain.com", also deploy HTTPS on your status page. Otherwise people visiting status.mydomain.com may transmit session cookies in cleartext in case you forgot the Secure flag or the client does not honor it (for whatever reason).
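For reference, setting the Secure flag on a session cookie looks roughly like this with Python's standard `http.cookies` module. The cookie name `session` and the helper `session_cookie_header` are made up for the example. Note the sketch also leaves out the `Domain` attribute: per RFC 6265 that makes the cookie host-only, so a compliant browser would not send it to status.mydomain.com in the first place.

```python
from http.cookies import SimpleCookie


def session_cookie_header(session_id: str) -> str:
    """Build the value of a Set-Cookie header for a session cookie."""
    cookie = SimpleCookie()
    cookie["session"] = session_id
    cookie["session"]["secure"] = True    # only sent over HTTPS
    cookie["session"]["httponly"] = True  # not readable from JavaScript
    cookie["session"]["path"] = "/"
    # No Domain attribute on purpose: the cookie stays bound to the exact host.
    return cookie["session"].OutputString()
```

This is belt and braces, not a replacement for HTTPS on the status page itself.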
Oh, and do buy a separate HTTPS cert instead of using your usual wildcard cert or your primary cert with the status page as SAN, so your status page stays up when your primary cert expires...
If the status page relies on getting updated information from the service, it may not even notice when the whole thing just crashes and goes down in flames. Attempting to do some predefined calls to the service to evaluate whether it is working correctly appears like a better solution?
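The "predefined calls" approach could be sketched like this. `run_probes` and the `Check` type are hypothetical; each check is just a callable that returns True when its call to the service behaved as expected. In practice a check would wrap a real request with a timeout, but that part is left out here so the sketch stays self-contained. Any exception (timeout, connection refused, whatever) counts as down.

```python
from typing import Callable, Dict

# A check performs one predefined call and returns True if the result looked right.
Check = Callable[[], bool]


def run_probes(checks: Dict[str, Check]) -> Dict[str, bool]:
    """Run every check; a check that raises is recorded as a failure."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A crashed or unreachable service must not crash the prober.
            results[name] = False
    return results
```

The prober itself should of course run outside the infrastructure it is probing.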
But yes, in general, the status page and status services should be entirely on their own independent infrastructure; and in a different data centre. A number of providers offer independent status page services. If your entire company runs off Digital Ocean, your status page/services should probably be running on Linode or AWS or whatever.
It's similar to the mistake of making DNS a dependency of your monitoring/control infrastructure, which then fails you exactly when DNS is down.
This one is relatively easy to "fix" at least: having multiple DNS providers for public records handles it, along with ensuring redundancy for your internal DNS services. Bonus points if you run your own recursive resolver so you aren't dependent on some other party not screwing up somehow.
It's interesting how easy it is to accidentally invert logical operations. I see it in code all the time. A condition will test that A is true when what they really need to know is if B and C are both false. It's like some kind of cognitive tic.
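A tiny illustration of how "B and C are both false" gets inverted by accident. The function names are made up; the slip shown is one common variant, not a claim about what happened in this incident.

```python
def both_false(b: bool, c: bool) -> bool:
    """True only when b and c are both false (De Morgan: not (b or c))."""
    return (not b) and (not c)


def common_slip(b: bool, c: bool) -> bool:
    """Looks similar, but means "not both true": passes when only one is false."""
    return (not b) or (not c)
```

The two agree when b and c have the same value and disagree whenever exactly one of them is true, which is why the bug can survive casual testing.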
I don't understand this. The icon URL is in the HTML. Both icons https://status.aws.amazon.com/images/status0.gif and https://status.aws.amazon.com/images/status3.gif have been working for us all along. Plus clearly they are able to update the status page contents, because they added the "increased error rates" message there too. I don't want to believe it but is it fair to assume they did not want to replace status0.gif with status3.gif in HTML? Please correct me if I'm not getting this straight.
In any case, it's a bad day for AWS folks, I'm feeling their pain too. Being a cloud provider is a tough business to be in and the pressure is really high.