Status: http://azure.microsoft.com/en-us/status/#current
Twitter: https://twitter.com/search?f=realtime&q=azure&src=typd
(yes we do have AWS, too)
Sigh.
'Unable to Submit Request We are unable to complete the incident submission process at this time. Please refer to this page for phone numbers to call for Azure support.'
We're changing that now, will need to replicate across different cloud providers, too. We're changing a lot because of last night's outage.
Alternately, can't you just have multiple A records to distribute your load across cloud platforms and just drop the one for whichever platform is having an outage?
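A rough sketch of that idea, assuming you run your own health checks and can regenerate the A record set from them. The provider names, the addresses (documentation-range IPs) and the `healthy_a_records` helper are all made up for illustration; this is not any DNS provider's API.

```python
from typing import Dict, List


def healthy_a_records(records: Dict[str, str], is_up: Dict[str, bool]) -> List[str]:
    """Return the A record IPs to publish: only providers currently reported up.

    records maps a provider name to the IP it serves from (hypothetical values).
    is_up maps a provider name to the result of your own health check.
    A provider with no health result is treated as down.
    """
    return sorted(ip for provider, ip in records.items() if is_up.get(provider, False))


# Hypothetical setup: one address per cloud platform.
records = {"azure": "192.0.2.10", "aws": "198.51.100.20"}
published = healthy_a_records(records, {"azure": False, "aws": True})
```

Note that clients and resolvers that already cached the dropped address keep using it until the record's TTL expires, so this only helps quickly if the TTL is short.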
It's still pretty obnoxious, because they basically said "there's something suspicious going on, we won't tell you what, but you better fix it or we'll shut you down." Gee, thanks. After a week of daily emails they finally responded with a network trace... that showed we're doing some outbound HTTP calls. That's it - they redacted everything except for the ports and the first two octets of the destination IP addresses. So very helpful, and certainly looks suspicious... </sarc>
1. I was never notified of the outage. I noticed it myself when attempting to log into one of my VMs and then started looking for status updates. Sadly, the best status updates I got were here on Hacker News.
2. When my servers did come back up, at least one of my IP addresses had changed, which meant I had to update all of the relevant DNS entries (which, as everyone here no doubt knows, can take up to 48 hours to propagate). I was never notified of this change in any way.
Also this is a PITA if you use the @ entry in your DNS.
Secondly, are you using an IP address and expecting it to be static? The recommended approach is to use a CNAME so you don't hit that issue. Alternatively, you can have up to 5 Reserved-IPs per subscription and attach a Reserved-IP to your VM: New-AzureReservedIP from PowerShell.
Edit : see http://azure.microsoft.com/blog/2014/05/14/reserved-ip-addre...
I think that's been largely dispelled.
I can't find the link right now unfortunately, but I remember a post looking into DNS propagation realities from either this or last year, and they found that the overwhelming majority of DNS servers they tried (99%+) respected the TTLs set exactly as they should. *
My personal rule of thumb is, if it hasn't propagated within an hour, I need to look at it again because I messed up.
Tools like this [1] are invaluable when you're paranoid about whether your new record has propagated.
[1] https://www.whatsmydns.net
* ugh. Does anyone know which post I'm talking about? My google-fu is failing me hard.
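To put the TTL point in concrete terms: if resolvers honor the TTL, a resolver that cached the old record keeps serving it only until that cache entry expires. A minimal sketch, where `seconds_until_refresh` and the numbers are illustrative assumptions rather than measurements:

```python
def seconds_until_refresh(cached_at: float, ttl: int, now: float) -> float:
    """Seconds a TTL-respecting resolver will keep serving its cached answer.

    cached_at and now are timestamps in seconds; ttl is the record's TTL.
    Returns 0.0 once the cache entry has already expired.
    """
    return max(0.0, cached_at + ttl - now)


# Assumed example: record cached at t=0 with a one-hour TTL,
# and the DNS change is made ten minutes later.
remaining = seconds_until_refresh(cached_at=0, ttl=3600, now=600)
```

With a one-hour TTL the worst case is a resolver that cached the record just before the change, which lines up with the one-hour rule of thumb.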
Google App Engine has had numerous outages like this, the only one I can find any public documentation for being a 6 hour outage in 2012: http://googleappengine.blogspot.co.uk/2012/10/about-todays-a... (and let's not forget the old-style Datastore corruption incident, where every App Engine user got to manually merge split-brain database tables after a messed up failover)
The App Engine team has a proactive policy about posting about downtime:
https://groups.google.com/forum/#!forum/google-appengine-dow...
Since the team highlights basically anything that looks like it is impacting customers, the issues don't always warrant a stand-alone blog post, but you'll notice that generally speaking the last post in each thread is a full public post-mortem with diagnosis and remediation.
Let me know if there's more you think might be useful for you as a GAE customer. Thanks!
EDIT2: Now the databases are down, this is costing us a lot of money. EDIT: Just went up again.
It would be great if anyone knows how to mitigate these - what can I do to protect myself against this in the future? (Except leave Azure)
Obviously there is a significant cost associated with engineering this level of cross-platform redundancy, which is why reliability is an important factor in making your platform choices. If you can tolerate some downtime, you can be more flexible; otherwise it will cost you one way or the other.
In any case you should consider having a user notification site set up on a completely different service (or two) so that when things go wrong you can redirect everyone to that site to keep your customers informed. This is especially important when you have partial outages that could create inconsistencies in your database or application state if you were to continue to allow users to interact with it in a degraded state.
Our big site, which is hosted in Europe, is actually working, but our blogs and a news website are both down. We offer a paid service at $600 a year, and if the main site was down it would be very bad for our reputation.
Our DNS points to Azure on all these domains and things are hosted as "Azure Web Site" - how would notifications work if Azure itself is failing? Would I need to proxy the traffic through elsewhere?
Are there any services that solve this problem for me? I really don't mind paying a few dollars every month and not worry about this.
The only manual step was to delay the switch back until our VMs were working fine and had all resources. We do this by changing the Route 53 health check to one that is always failing.
We also had to purge our crashed Mongo nodes because the journal was broken.
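The failback hold described above boils down to something like the following. This is only a sketch of the decision logic under my reading of the comment, not the Route 53 API; `active_endpoint` and the `hold_failback` flag are hypothetical names.

```python
def active_endpoint(primary_healthy: bool, hold_failback: bool) -> str:
    """Pick which endpoint DNS failover should serve.

    hold_failback models the manual step: the primary's health check is
    swapped for one that always fails, so the primary is reported unhealthy
    even after it has actually recovered.
    """
    reported_healthy = primary_healthy and not hold_failback
    return "primary" if reported_healthy else "secondary"


# Primary is back, but we keep traffic on the secondary until we trust it.
during_hold = active_endpoint(primary_healthy=True, hold_failback=True)
# Hold released: traffic returns to the primary.
after_hold = active_endpoint(primary_healthy=True, hold_failback=False)
```

The point of the hold is that failing over is automatic, while failing back stays a deliberate human decision.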
https://auth0.com/availability-trust/img/auth0-infrastructur...
It really depends on how much risk you're willing to accept, and how much that is worth to you. It can be quantified via revenue lost, but reputation is much harder to put a number on.
This is not the first time this has happened in the last two months (after a relatively reliable year). The problem is I'm not sure any other hosting provider would do any better.
We have failover loadbalancing running between multiple datacenters, no issue here!
edit : 99.99%
Do Microsoft say this about Traffic Manager or are you suggesting you have to pay for extra services to get the advertised reliability figure?
Who was selling that to you? Because I'm pretty sure it wasn't Microsoft…
http://azure.microsoft.com/en-us/support/legal/sla/
9 hours of downtime means they are down to at most 98.75% for this cycle.
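For anyone who wants to check the arithmetic, here is a small sketch. It assumes the cycle is a 30-day month (720 hours), which is my assumption and not something taken from the SLA text; the function names are made up.

```python
def availability_percent(downtime_hours: float, period_hours: float) -> float:
    """Availability over a period, as a percentage."""
    return 100.0 * (1.0 - downtime_hours / period_hours)


def allowed_downtime_minutes(sla_percent: float, period_hours: float) -> float:
    """Downtime budget, in minutes, that still meets the given SLA percentage."""
    return period_hours * 60.0 * (1.0 - sla_percent / 100.0)


PERIOD_HOURS = 30 * 24  # assumed 30-day billing cycle

outage = availability_percent(9, PERIOD_HOURS)            # roughly 98.75
budget = allowed_downtime_minutes(99.99, PERIOD_HOURS)    # a bit over 4 minutes
```

So on a 30-day cycle, 99.99% leaves a budget of a little over four minutes, and 9 hours overshoots it by more than two orders of magnitude.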
Also, funny if you try to zoom out in Chrome to see the whole thing, the row headers get out of alignment.
Why would I want to 'X' out specific rows/columns in the table? It was so complicated to begin with, someone thought adding more complication through end-user customization was a good idea? I just noticed, you can even expand some of the rows...
Seriously, a status page should tell you either "It's up" or "What's down". It's not even showing history over time, this is just a snapshot. The text at the top directly contradicts the icons in the table, making the whole thing even more ridiculous.
The footnote at the bottom is the best, "The Australia Regions are available only to customers with billing addresses in Australia and New Zealand." Thanks for that useful nugget! /s
When the cloud is down, all we can do is fiddle our thumbs and hope it doesn't happen again. Or maybe we could send an angry letter to Microsoft, and hope somebody reads it.
If you're using your own servers, or even VPS, you do have control over infrastructure, and can plan for changes and mitigate problems quickly, and you can run for years without downtime if nothing is changing significantly. Depending on your staff, funding, etc that might be attractive or not. Each has its own advantages, and disadvantages.
I wonder how many customers Azure just lost due to their unexpected 2 day fiasco.
Amazon had a number of EBS fiascoes and survived just fine. I'd expect Azure to do the same.
It's obviously not going to destroy anyone's business, but there is a lot more competition than there used to be.
Put your servers in different regions, use Azure/Google, BlueMix/AWS, or even hybrid cloud, do something. Have a DR plan.
If the disaster strikes my region, I probably have better things to do than IT things (like running for my life :-).
But with the cloud the disaster could be thousands of kilometers away and still affect me. That's the problem with the cloud: why should I stop working in my remote French town because there's a landslide in Ireland (or wherever they put the European cloud data centers)?
I don't say the cloud doesn't have its uses (especially as a redundant backup far far away) but the all-cloud model has way more risks than people think ... and vendors don't rush to explain that.
I'm one of those guys who think the future will be more and more harsh for Western civilization (think collapse of the Soviet Union). There will be less money for everything, infrastructure in particular; things will fail and you will have to deal with it locally and the DIY way.
Question - are AWS or GCE better at proactively messaging when there's an outage?
See https://groups.google.com/forum/#!forum/gce-operations and https://groups.google.com/forum/#!forum/google-appengine-dow....
And I find out about it by yelling at Heroku - they told me that Amazon is having issues before Amazon's status turned yellow.
Looking forward to the post mortem.
What does your client/customer think of you being on Azure? That you chose the crappy solution because your low-tech infrastructure still uses Windows, which does not carry a lot of tech cred.
20% of Azure VMs are Linux.
You are not well informed.
More likely they have _something_ which runs on Azure. Fortune 500s are, pretty much by definition, quite large - and probably have tons of departments and sub-departments. And at least one of those departments probably has a task of trying out new things, like Azure, by running something on it.
What surprises me is that nearly 20% of Fortune 500s _don't_ have something running on Azure.
(I wonder what percentage "run on" Amazon)
How much of the user's data would be forever lost in such an event ?
The other aspect is privacy - in theory, all users' data can be stored and accessed forever, e.g. 20 years from now, when the reincarnation of someone like Stalin comes to power.
Anyway, the point I'm trying to make is that we should design our services or apps with this in mind - the cloud can and will fail from time to time, maybe forever. So, if possible, use the cloud as a 'bonus' feature, a means to back up data, and store users' data offline so that when the dark day comes at least the user still has his data.
Is having your stuff stored locally any more secure in that situation? If someone wants your data they'll knock on your door and beat you and your family until you give it to them.
While you're obviously going to be unhappy with downtime, this is a genuine part of the calculation you should have made when you decided to put all your eggs into one outsourced basket.
Seriously considering another layer above azure to mitigate this in the future. Very disappointing to see.
At least initially their status indicated they were handling the problem, but lately it's just been "All Good", and they said on Twitter that they resolved it, but it's not at 100% yet: http://azure.microsoft.com/en-us/status/
Mistakes happen, services go down, I can get over that. What matters is how it's dealt with. At the moment I would not want to be an Azure customer dealing with 9 hours+ downtime whilst MS are saying everything is great. At the very least change it to "Having some issues" or similar!
(not that convenient to copy paste the OP link from a mobile device)
Their error pages are less graceful than mine.
Didn't receive any calls yet, but I don't think that will take long.
Disgusting management interface
Abysmal support
Way to fuck up a mustard sandwich Microsoftie
We moved everything we had away from that Virus named Azure.