Microsoft Azure Outage

(azure.microsoft.com)

168 points · wenbert · 11y ago · 172 comments

Is Azure down?

Status: http://azure.microsoft.com/en-us/status/#current

Twitter: https://twitter.com/search?f=realtime&q=azure&src=typd

172 comments

130 comments · 44 top-level

photorized11y ago· 10 in thread

I feel like an idiot. MS featured my Azure startup today, quoting me about overall stability etc (which has been the case for us, until today). They then proceeded to go down, taking all our production systems with them.

(yes we do have AWS, too)

Sigh.

photorized11y ago

Microsoft, you've got to be kidding me. Just tried opening a billing ticket, completed the forms in detail, attached screenshots, clicked Submit...

'Unable to Submit Request We are unable to complete the incident submission process at this time. Please refer to this page for phone numbers to call for Azure support.'

Encosia11y ago

For what it's worth, I've been watching several Azure-hosted sites that I control and they've been coming back online sequentially (and are all back online now). Whatever they're fixing, it seems to be taking some time, but is progressing steadily at a good pace in the last hour.

razzberryman11y ago

They should run their support system on AWS since they're likely to get a lot of ticket requests if Azure is down. :)

higherpurpose11y ago

Azure has had quite a few outages this year. I'd say it's already lower than that 99.999 percent uptime or w/e they are advertising.

rbanffy11y ago

According to https://cloudharmony.com/status-1year-for-azure, they didn't reach 99.99%.

toyg11y ago

0.1% of 365 days is 8.76 hours. So yeah, shot for the year; but of course they'll do some "hollywood timekeeping" (or just ignore the matter altogether) and keep advertising...
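
The downtime budgets behind these "nines" are quick to check. A minimal sketch, assuming a 365-day year; the function name is made up for illustration and is not taken from any SLA document:

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_budget_hours(availability_percent: float,
                          period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime allowed in a period at the given availability target."""
    return period_hours * (100.0 - availability_percent) / 100.0


for target in (99.9, 99.95, 99.99, 99.999):
    hours = downtime_budget_hours(target)
    print(f"{target}% -> {hours:.2f} h/year ({hours * 60:.1f} min)")
```

At 99.9% the yearly budget is about 8.76 hours; at 99.999% it is only a little over 5 minutes, so a single multi-hour outage exhausts either one.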

cddotdotslash11y ago

And the year isn't even over yet!

cddotdotslash11y ago

Just curious - if you have AWS too, then why did it take everything down? Can't you just swap the DNS?

photorized11y ago

To answer your question: some parts of our SaaS (e.g. data gathering/processing) are on both AWS and Azure, but the customer-facing portal web app is 100% on Azure (in two regions North East and North Central), so we couldn't just swap the DNS.

We're changing that now, will need to replicate across different cloud providers, too. We're changing a lot because of last night's outage.

philwelch11y ago

Does DNS propagate quickly enough to alleviate an outage, or is it just a matter of ensuring that you recover within a few hours rather than waiting on an outage resolution that might take longer?

Alternately, can't you just have multiple A records to distribute your load across cloud platforms and just drop the one for whichever platform is having an outage?
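
The "drop the A record for the failing platform" idea can be sketched as a pure function. This is only an illustration under stated assumptions: the provider names, the documentation-range IPs and the `is_healthy` probe are hypothetical, and publishing the result to a DNS provider is left out entirely.

```python
from typing import Callable, Dict, List


def healthy_records(records: Dict[str, str],
                    is_healthy: Callable[[str], bool]) -> List[str]:
    """Return the IPs that should stay published, dropping providers that fail the probe.

    If every provider fails, keep all records rather than publish an empty set.
    """
    alive = [ip for provider, ip in records.items() if is_healthy(provider)]
    return alive if alive else list(records.values())


# Hypothetical providers with documentation-range IPs (RFC 5737), not real endpoints.
records = {"azure": "192.0.2.10", "aws": "198.51.100.20"}
status = {"azure": False, "aws": True}  # pretend Azure is failing its probe
print(healthy_records(records, lambda provider: status[provider]))
```

How quickly clients follow such a change depends on the TTL of the records, since resolvers may keep serving the old answer until it expires.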

pkorzeniewski11y ago· 9 in thread

Azure support is probably the worst I have ever dealt with. When my account (and the service itself) stopped working, I didn't receive any email. When I tried to sign in, all I got was some generic error saying "There's something wrong with your account". My services of course were down and I couldn't do ANYTHING. I contacted support and learned that my account had been blocked (!) because there was some suspicious (!!) activity going on. What the... No, they couldn't tell me what exactly it was. I exchanged emails back and forth with support for several days and learned nothing new; my account and services were still disabled and I was more than pissed off. Since that day I have hated Azure and I advise anyone against using it, because such a situation is absolutely unacceptable.

breischl11y ago

Interesting, I actually just got one of those "suspicious activity" emails last week. But they just threatened to shut down my services, they haven't actually done it (yet).

It's still pretty obnoxious, because they basically said "there's something suspicious going on, we won't tell you what, but you better fix it or we'll shut you down." Gee, thanks. After a week of daily emails they finally responded with a network trace... that showed we're doing some outbound HTTP calls. That's it - they redacted everything except for the ports and the first two octets of the destination IP addresses. So very helpful, and certainly looks suspicious... </sarc>

Someone123411y ago

Wait, what? Outbound HTTP is enough to get your account shut down? The hell! What if your web-app is utilising any API in the world, the vast majority are over HTTP/S utilising JSON or XML (and sometimes you need back end API access rather than client API access, like updating a product database).

donatj11y ago

I had a similar issue with azure where one of my drives just up and disappeared. Took days(!) to get it back. I'd never use Azure again.

blablabla12311y ago

There is probably a lot to complain about Azure, especially when it comes to proactive communication, dirty hacks and Windows weirdness. But their support was actually really nice to me...

Touche11y ago

I had a similar situation with DigitalOcean. Droplets were taken offline with no notification whatsoever, only answered after I submitted a ticket. This has happened to me twice.

debacle11y ago

Why were they taken offline? I had a droplet taken offline for security reasons once and they were very communicative and responsive.

atmosx11y ago

Days? What do you mean, "days"? Azure was down for like 9 hours.

garretraziel11y ago

I think that he is not referring to this incident.

tiagocesar11y ago

I love how deleted comments actually never disappear.

silverbax8811y ago· 9 in thread

The idea of cloud storage being down is less of an issue - I don't like it, but I understand it. What bothers me about this is:

1. I was never notified of the outage. I noticed it myself when attempting to log into one of my VMs and then started looking for status updates. Sadly, the best status updates I got were here on Hacker News.

2. When my servers did come back up, at least one of my IP addresses had changed, which meant I had to update all of the relevant DNS entries (which, as everyone here no doubt knows, can take up to 48 hours to propagate). I was never notified of this change in any way.

Maarten8811y ago

2. Azure does not guarantee that you keep your ip address by default. You should configure a cname if you use Azure Websites or get a reserved ip address, available with Cloud Services

silverbax8811y ago

Actually, I am supposed to have a fixed IP. And I have CNAMES configured. I will review to determine if there is some other way I can set this up, but my issue is that I would never allow my products to be offline for hours without notifying my customers.

badgersandjam11y ago

Also as I found out, set the TTL on your CNAME very low.

Also this is a PITA if you use the @ entry in your DNS.

shaydoc11y ago

Do you have monitoring set up on your cloud VM? You should have alerts triggering emails to notify you when this is happening.

Secondly, you are using an IP address and expecting that to be static? The recommended approach is to use a CNAME so you don't hit that issue, alternatively, you can have up to 5 Reserved-IPs per subscription and attach that Reserved-IP to your VM : New-AzureReservedIP from powershell

Edit : see http://azure.microsoft.com/blog/2014/05/14/reserved-ip-addre...

silverbax8811y ago

Yes, I do. That's why I was checking my VMs. This still doesn't absolve my cloud server provider from notifying me when they have a global system failure.

yourad_io11y ago

> which, as everyone here no doubt knows, can take up to 48 hours to propagate

I think that's been largely dispelled.

I can't find the link right now unfortunately, but I remember a post looking into DNS propagation realities from either this or last year, and they found that the overwhelming majority of DNS servers they tried (99%+) respected the TTLs set exactly as they should. *

My personal rule of thumb is, if it hasn't propagated within an hour, I need to look at it again because I messed up.

Tools like this [1] are invaluable when you're paranoid about whether your new record has propagated.

[1] https://www.whatsmydns.net

* ugh. Does anyone know which post I'm talking about? My google-fu is failing me hard.
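
A rough way to script the "has it propagated yet?" check from one vantage point. This is a sketch, not a replacement for a multi-location checker: `system_resolve` only asks the local system resolver, and the hostname and IPs in the usage lines are placeholders.

```python
import socket
from typing import Callable, List


def system_resolve(hostname: str) -> List[str]:
    """Resolve IPv4 addresses via the local system resolver (a single vantage point)."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    return sorted({info[4][0] for info in infos})


def has_propagated(hostname: str, expected_ip: str,
                   resolve: Callable[[str], List[str]] = system_resolve) -> bool:
    """True if the resolver already returns the new address for hostname."""
    return expected_ip in resolve(hostname)


# Usage with a fake resolver that still serves the old record (no network needed).
stale = lambda name: ["192.0.2.10"]
print(has_propagated("www.example.com", "198.51.100.20", resolve=stale))  # False
```

If the check still fails well after the old record's TTL has elapsed, the rule of thumb above applies: look at the record again.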

vijaykiran11y ago

Perhaps not the link you are looking for - but there was some discussion on here sometime ago - https://news.ycombinator.com/item?id=3397253

hosay12311y ago

Sadly this is a common situation. They appear to hold off making any public notice (including often on their own "service dashboards") until support forums are screaming with upset users.

Google App Engine has had numerous outages like this, the only one I can find any public documentation for being a 6 hour outage in 2012: http://googleappengine.blogspot.co.uk/2012/10/about-todays-a... (and let's not forget the old-style Datastore corruption incident, where every App Engine user got to manually merge split-brain database tables after a messed up failover)

dewitt11y ago

Hi hosay123,

The App Engine team has a proactive policy about posting about downtime:

https://groups.google.com/forum/#!forum/google-appengine-dow...

Since the team highlights basically anything that looks like it is impacting customers, the issues don't always warrant a stand-alone blog post, but you'll notice that generally speaking the last post in each thread is a full public post-mortem with diagnosis and remediation.

Let me know if there's more you think might be useful for you as a GAE customer. Thanks!

inglor11y ago· 8 in thread

Our sites have been down for more than 3 hours now.

EDIT2: Now the databases are down, this is costing us a lot of money. EDIT: Just went up again.

It would be great if anyone knows how to mitigate these in the future - what can I do to protect myself against this in the future? (Except leave Azure)

joshuak11y ago

Major outages should absolutely weigh into your decisions as to what platform to use. That being said you can mitigate the effect of instability by engineering your app to failover to other availability zones or even to another cloud platform (depending on your app) if the entire platform goes down.

Obviously there is a significant cost associated with engineering this level of cross-platform redundancy, which is why reliability is an important factor in making your platform choices. If you can tolerate some downtime, you can be more flexible; otherwise it will cost you one way or the other.

In any case you should consider having a user notification site set up on a completely different service (or two) so that when things go wrong you can redirect everyone to that site to keep your customers informed. This is especially important when you have partial outages that could create inconsistencies in your database or application state if you were to continue to allow users to interact with it in a degraded state.

inglor11y ago

Thanks! This is very helpful.

Our big hosted site, which is hosted in Europe, is actually working, but our blogs and a news website are both down. We offer a paid service at 600$ a year and if the main site was down it would be very bad for our reputation.

Our DNS points to Azure on all these domains and things are hosted as "Azure Web Site" - how would notifications work if Azure itself is failing? Would I need to proxy the traffic through elsewhere?

Are there any services that solve this problem for me? I really don't mind paying a few dollars every month and not worry about this.

jfroma11y ago

Our main cluster is on Azure West US but we have another cluster on Amazon East and Route 53 on top of that. When the main cluster fails, Route 53 switches to the secondary, so we were not affected at all this time.

The only manual step was to delay the switch back until our VMs were working fine and had all resources. We do this by changing the Route 53 health check to one that is always failing.

We had also to purge our crashed mongo nodes because the journal was broken.

https://auth0.com/availability-trust/img/auth0-infrastructur...
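
The failover-with-delayed-failback logic described in this comment can be modelled in a few lines. This is a toy model under assumptions, not the Route 53 API: `FailoverRouter`, `hold_failback` and the cluster labels are hypothetical names, and the hold flag stands in for swapping in an always-failing health check.

```python
from dataclasses import dataclass


@dataclass
class FailoverRouter:
    primary: str
    secondary: str
    # Manual switch: while set, the primary is treated as unhealthy,
    # much like pointing its health check at a target that always fails.
    hold_failback: bool = False

    def active(self, primary_healthy: bool) -> str:
        """Return the cluster that should receive traffic right now."""
        if primary_healthy and not self.hold_failback:
            return self.primary
        return self.secondary


router = FailoverRouter(primary="azure-west-us", secondary="aws-east")
print(router.active(primary_healthy=False))  # aws-east
router.hold_failback = True
print(router.active(primary_healthy=True))   # aws-east (failback held)
router.hold_failback = False
print(router.active(primary_healthy=True))   # azure-west-us
```

Keeping the hold as an explicit operator decision means traffic does not return to a primary that answers probes but is not yet fully recovered.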

barkingllama11y ago

On site disaster recovery? Off site disaster recovery? Split your hosting between multiple providers?

It really depends on how much risk you're willing to accept, and how much that is worth to you. It can be quantified via revenue lost, but reputation is much harder to put a number on.

inglor11y ago

If I had to quantify this - 3 hours * 3 people who can't work and publish posts + about a week of marketing costs for damaged rep (apologies, PR, ads for exposure). I'd say that at the very least this cost us 1000$ and probably north of 3000$.

This is not the first time this has happened in the last two months (after a relatively reliable year). The problem is I'm not sure any other hosting provider would do any better.

duncans11y ago

Look into Cloudflare. They can act as a kind of reverse proxy to keep static stuff online. Obviously doesn't help if the transactional part of the site/database goes down, but end users will see a friendly message rather than it timing out.

toomuchtodo11y ago

As others have mentioned, multiple cloud providers, service checks, and withdrawing bad providers at DNS.

photorized11y ago

Azure + AWS

sphildreth11y ago· 5 in thread

So much for the idea of 99.999% uptime with the magical "cloud" buzzword. I noticed during this downtime in North America that Word Online wasn't functioning as my daughter tried to use it to do some homework.

shaydoc11y ago

If you want 99.999% uptime, use Azure TrafficManager and set up failover loadbalancing to different datacenters.

We have failover loadbalancing running between multiple datacenters, no issue here!

edit : 99.99%

equoid11y ago

The SLA mentions 99.9% or 99.99% for Database connectivity. Where does 99.999% come from?

Do Microsoft say this about Traffic Manager or are you suggesting you have to pay for extra services to get the advertised reliability figure?

matthewmacleod11y ago

So much for the idea of 99.999% uptime with the magical "cloud" buzzword

Who was selling that to you? Because I'm pretty sure it wasn't Microsoft…

nnx11y ago

They are selling 99.99% availability over a monthly cycle for their Storage service. 99.95% connectivity for their Virtual Machine service.

http://azure.microsoft.com/en-us/support/legal/sla/

9 hours of downtime means they are down to at most 98.75% for this cycle.
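
The 98.75% figure follows directly from the outage length. A minimal check, assuming a 30-day billing cycle (the cycle length is an assumption, not something the SLA page is quoted on here):

```python
def availability_percent(downtime_hours: float, period_hours: float) -> float:
    """Availability achieved over a period, given total downtime in that period."""
    return 100.0 * (1.0 - downtime_hours / period_hours)


month_hours = 30 * 24  # assumed 30-day cycle = 720 hours
print(f"{availability_percent(9, month_hours):.2f}%")  # 98.75%
```

With a 31-day cycle the same 9 hours works out to roughly 98.79%, still far below either advertised level.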

sphildreth11y ago

Oh, it's 99.9% from Microsoft (see http://azure.microsoft.com/en-us/support/legal/sla/), which means they get about 9h of downtime a year and still hit the SLA (see http://uptime.is/).

zaroth11y ago· 4 in thread

That status table with all the randomly located green checks is painful to look at... I guess a green check in the 'Global' column implies a green check in all location specific columns? But what about all the rows which have no Global green check, but most columns are still empty? Are those regions where the service is not deployed? Can we gray out those boxes or something if they are 'N/A'?

Also, funny if you try to zoom out in Chrome to see the whole thing, the row headers get out of alignment.

Why would I want to 'X' out specific rows/columns in the table? It was so complicated to begin with, someone thought adding more complication through end-user customization was a good idea? I just noticed, you can even expand some of the rows...

Seriously, a status page should tell you either "It's up" or "What's down". It's not even showing history over time, this is just a snapshot. The text at the top directly contradicts the icons in the table, making the whole thing even more ridiculous.

The footnote at the bottom is the best, "The Australia Regions are available only to customers with billing addresses in Australia and New Zealand." Thanks for that useful nugget! /s

tawy111y ago

History: http://azure.microsoft.com/en-us/status/#history

unclesaamm11y ago

Running Chrome 38.0.2125.111 m, zoomed out row headers look fine

ballstothewalls11y ago

Are you confusing row headers and column headers? I have the same version as you and the row headers got funky when zoomed out.

razzberryman11y ago

http://i.imgur.com/Uv81QYi.png

coldtea11y ago· 4 in thread

Reality call: ANY and ALL Cloud services, be it Google, Azure, AWS etc, will be down for hours at some point every few years.

ExpiredLink11y ago

Reality call: ANY and ALL services, be it local or remote, will be down for hours at some point every few years.

Drakim11y ago

If it's my own fault, I can at least curse my own lack of knowledge and expertise, and I can strive to do better in the future.

When the cloud is down, all we can do is fiddle our thumbs and hope it doesn't happen again. Or maybe we could send an angry letter to Microsoft, and hope somebody reads it.

grey-area11y ago

From experience, that's not true. Things typically go wrong when something changes, or more rarely when something runs out of space/memory. Big cloud providers change stuff all the time, and typically do break things now and then. When they do, all you can do is wait.

If you're using your own servers, or even VPS, you do have control over infrastructure, and can plan for changes and mitigate problems quickly, and you can run for years without downtime if nothing is changing significantly. Depending on your staff, funding, etc that might be attractive or not. Each has its own advantages, and disadvantages.

rlpb11y ago

When it's local, you can control your own at-risk periods. For example: you can avoid doing risky work when an important company deadline is looming.

jedgrant11y ago· 4 in thread

Regretting the decision to go with Azure. Talk about terrible timing. We have media outlets interested in our site, we send info and the site is dead. Talk about a crap first impression.

craigvn11y ago

It is totally frustrating, but at the end of the day similar outages happen with all cloud providers.

razzberryman11y ago

Looks like Azure has been down more hours in the past week than AWS has been down all year.

https://cloudharmony.com/status-1week-for-azure

https://cloudharmony.com/status-1year-for-aws

ZoF11y ago

Can you point to an example of this happening on AWS in multiple regions simultaneously?

photorized11y ago

I feel your pain. My startup was featured on TechNet today, I was quoted saying how great Azure worked for us so far... lots of folks were checking us out, and then Azure went down hard, taking our production systems with them. Talk about negative publicity.

Beached11y ago· 4 in thread

This may be greater than just West Europe. I personally have servers in US East that are unreachable, and there are a few reports from others in the US regions of partial unavailability for their US-based servers.

I wonder how many customers Azure just lost due to their unexpected 2 day fiasco

ceejayoz11y ago

> I wonder how many customers Azure just lost due to their unexpected 2 day fiasco

Amazon had a number of EBS fiascoes and survived just fine. I'd expect Azure to do the same.

LamaOfRuin11y ago

Each of Amazon's high profile failures did have many people formulating previously non-existent escape plans though, and there are now several alternatives in this space that can offer the same scale.

It's obviously not going to destroy anyone's business, but there is a lot more competition than there used to be.

Beached11y ago

After hours of no response on the Microsoft forums, VMs just started working again. I didn't change a thing, just poof.

chuckouellet11y ago

FYI We had to reboot some virtual machines via the portal due to the issues in East US yesterday but since then they all work.

jmnicolas11y ago· 4 in thread

Judging by how cloud services "frequently" go down when everything is normal, it makes me wonder what would happen in case of a real problem (volcano eruption, social unrest, nuclear disaster, alien invasion ...). I still don't get the cloud infatuation, and no you don't have to get off my lawn, I'm "only" 36 (yeah I know, in IT I'm already a dinosaur).

freehunter11y ago

What would happen to your own datacenter in case of a similar disaster? Your servers would go down and you would spin up from your disaster recovery site. Cloud doesn't mean you don't need a DR plan anymore.

Put your servers in different regions, use Azure/Google, BlueMix/AWS, or even hybrid cloud, do something. Have a DR plan.

jmnicolas11y ago

I'm thinking as the little guy here : not data center but personal computers.

If the disaster strikes my region, I probably have better things to do than IT things (like running for my life :-).

But with the cloud the disaster could be thousands of kilometers away and still affect me. That's the problem with the cloud : why should I stop working in my remote French town because there's a landslide in Ireland (or wherever they put the European cloud data centers) ?

I don't say the cloud doesn't have its uses (especially as a redundant backup far far away) but the all-cloud model has way more risks than people think ... and vendors don't rush to explain that.

I'm one of those guys who think the future will be more and more harsh for western civilization (think collapse of the Soviet Union). There will be less money for everything, infrastructure in particular, things will fail and you will have to deal with it locally and the DIY way.

gregd11y ago

We used to have a locked "oh shit" box. I was supposed to put our DR (which we did actually have) and a host of other things in it (it was suggested we even put a loaded gun in it) to get by with in the case of a total disaster. We were supposed to then ship it off to Iron Mountain. That oh shit box sat empty for years on the premises...

jmnicolas11y ago

You must have really interesting stories to tell, but I guess it's classified.

patwhite11y ago· 3 in thread

So, the worst part about this is that zero communication has come out of Microsoft - we first started seeing issues on Sunday and filed a ticket, had an open ticket while this larger outage happened, and haven't gotten a single email saying there's an outage. I found out about it from, sigh, buzzfeed.

Question - are AWS or GCE better at proactively messaging when there's an outage?

crb11y ago

Google's operations groups are not only regularly updated during an outage, there's a root-cause analysis with remediation and prevention information posted a couple of days after any issue.

See https://groups.google.com/forum/#!forum/gce-operations and https://groups.google.com/forum/#!forum/google-appengine-dow....

nkvoll11y ago

I've never ever received a message from AWS when they've had outages that have affected us significantly. On the contrary, there have been multiple cases where we've experienced issues, contacted them, and it's taken a few hours before they realized they were actually having infrastructure problems. Many of these don't even get an entry on their service status pages. So there's still a lot of room for improvement on AWS's side of things as well.

bad_user11y ago

I can confirm this. I remember once when half of the Internet was down and the status reported for EC2 was yellow - experiencing some minor issues :-)

And I find out about it by yelling at Heroku - they told me that Amazon is having issues before Amazon's status turned yellow.

jread11y ago· 3 in thread

I run a site that monitors cloud service availability. Based on VMs and Blob storage containers I maintain and monitor, the outage affected every US Azure region with 1-2 hours of downtime: https://cloudharmony.com/status-for-azure

razzberryman11y ago

Wow, comparing that to AWS is staggering!

https://cloudharmony.com/status-for-aws

etha11y ago

It's not as if AWS has never gone down (http://aws.amazon.com/message/65648/). It just hasn't had a major outage in the last 30 days.

maxsec11y ago

you need to add in the new Australian Clusters

teovall11y ago· 3 in thread

The postmortem for this should make for a good read. How does storage go down in eleven regions at once?

ohyesyodo11y ago

Just apply same buggy network patch to all DCs at once? They use software networking so causing something like this should be easy. Or mess up network routing for *.blob.core.windows.net which pretty much all of Azure relies on.

icebraining11y ago

Isn't applying the same patch everywhere at once a major anti-pattern?
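
For contrast, a staged rollout looks roughly like this. It is a generic sketch, not a description of how Azure deploys anything: the stage layout, region names, `apply_patch` and `is_healthy` are all hypothetical.

```python
from typing import Callable, List, Sequence


def staged_rollout(stages: Sequence[Sequence[str]],
                   apply_patch: Callable[[str], None],
                   is_healthy: Callable[[str], bool]) -> List[str]:
    """Patch one stage at a time and stop if any region in the stage is unhealthy.

    Returns the regions that were patched.
    """
    patched: List[str] = []
    for stage in stages:
        for region in stage:
            apply_patch(region)
            patched.append(region)
        if not all(is_healthy(region) for region in stage):
            break  # halt: later stages never receive the bad patch
    return patched


stages = [["canary"], ["us-east", "us-west"], ["europe-west", "asia-east"]]
broken = {"us-east"}  # pretend the patch breaks this region
print(staged_rollout(stages,
                     apply_patch=lambda region: None,
                     is_healthy=lambda region: region not in broken))
```

Here the rollout stops after the second stage, so the last two regions keep running the old version while the failure is investigated.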

photorized11y ago

I suspect a human config error. I don't see how else multiple services, in multiple regions, can all be affected at once.

Looking forward to the post mortem.

BeeDunc11y ago· 3 in thread

This outage exposes the clowns that actually chose Azure as their cloud provider. If you use AMZN and it goes down, at least you're in good company, with the likes of Netflix, Twitter, Instagram, and so on. It's like yeah, I'm big like they are. So what, it went down, so is Netflix.

What does your client/customer think of you being on Azure? That you chose the crappy solution because your low-tech infrastructure still uses windows, which does not carry a lot of tech cred.

keithwarren11y ago

Over 80% of the Fortune 500 run on Azure.

20% of Azure VMs are Linux.

You are not well informed.

brongondwana11y ago

"run on" - I suspect you're being fed a unicode pile of poo here.

More likely they have _something_ which runs on Azure. Fortune 500s are, pretty much by definition, quite large - and probably have tons of departments and sub departments. And at least one of those departments probably has a task of trying out new things, like Azure, by running something on it.

What surprises me is that nearly 20% of Fortune 500s _don't_ have something running on Azure.

(I wonder what percentage "run on" Amazon)

nullrouted11y ago

Someone is lying to you. Can you provide your source for this?

codeshaman11y ago· 2 in thread

As more and more services and apps depend on 'the cloud', I'm wondering, how many of them would survive a major cloud outage: the cloud company going bankrupt, stock market crash or economic meltdown, a malware exploiting a major server-side bug (like heartbleed or shellshock, but worse) wiping or encrypting the data on the infrastructure/user machines.

How much of the user's data would be forever lost in such an event ?

The other aspect is privacy - in theory, all user's data can be stored and accessed forever, eg. 20 years from now, when the reincarnation of someone like Stalin comes to power.

Anyway, the point I'm trying to make is that we should design our services or apps with this in mind - the cloud can and will fail from time to time, maybe forever. So, if possible, use the cloud as a 'bonus' feature, a means to back up data, and store the user's data offline so that when the dark day comes, at least the user still has their data.

maccard11y ago

> The other aspect is privacy - in theory, all user's data can be stored and accessed forever, eg. 20 years from now, when the reincarnation of someone like Stalin comes to power.

Is having your stuff stored locally any more secure in that situation? If someone wants your data they'll knock on your door and beat you and your family until you give it to them.

jacalata11y ago

If you have the only copy, you can destroy it.

scoj11y ago· 2 in thread

My VM is still down (US East). Is anyone else still experiencing issues?

photorized11y ago

My VMs appear to be up mostly, but they are primarily in North Central.

scoj11y ago

Thanks, I just restarted it and it took a while (5 minutes or so), but after that, it appears fine.

Nmachine11y ago· 2 in thread

"Everything is running great"

smoyer11y ago

It's obviously "All Good!"

ExpiredLink11y ago

Just a data dump for the NSA. Nothing serious!

gwgwegewg11y ago· 1 in thread

Microsoft are refusing to help us with our downed servers because we don't have a support contract. The outage is their issue not ours!!

ownagefool11y ago

If your app is down, it sounds very much like it's your problem.

While you're obviously going to be unhappy with downtime, this is a genuine part of the calculation you should have made when you decided to outsource all your eggs into one basket.

inglor11y ago· 1 in thread

We've been noticing ups and downs for the last few hours of our VM powering an important database in West Europe.

Seriously considering another layer above azure to mitigate this in the future. Very disappointing to see.

At least initially their status indicated they're handling the problem, but lately it's just been "All Good" and they said they resolved it on Twitter, but it's not at 100% yet: http://azure.microsoft.com/en-us/status/

joshmlewis11y ago

Someone else mentioned routing to divert traffic to working data centers. That might be an option for you.

Varcht11y ago· 1 in thread

Oh no, did we break the status page too? Sorry Azure team, really didn't mean to pile on!

bengali311y ago

keeping the load light? <html><head></head><body>The page cannot be displayed because an internal server error has occurred.</body></html>

elpool211y ago· 1 in thread

Yup! Azure websites and Storage are down in multiple regions.

andrea_s11y ago

VMs too... At least for western europe

us0r11y ago· 1 in thread

My VMs are down. This must be something major.

plasma11y ago

I think because the disks are backed by blob storage.

duedl0r11y ago· 1 in thread

come on! give them some slack.. they probably aren't very experienced at managing their linux servers! ;)

hyperliner11y ago

Clearly neither were the developers of these apps which are now down, who thought of spending as few pennies as possible, saved a few more pennies by skipping load-balancing failover, and then expected magic!

silverbax8811y ago· 1 in thread

This is nice. My web site server IP was changed when the server came back up. So now I have to update all of the site DNS settings.

ohyesyodo11y ago

Hmm. You should be using CNAME records rather than IP addresses. Or are you using the new fixed IP features?

matthewking11y ago

The most damaging part to me is that "All good! Everything is running great." message on the status page.

Mistakes happen, services go down, I can get over that. What matters is how it's dealt with. At the moment I would not want to be an Azure customer dealing with 9 hours+ downtime whilst MS are saying everything is great. At the very least change it to "Having some issues" or similar!

kelvin011y ago

You should try Google's App Engine (paid premium account) tech support when your critical files disappear. Can't be any worse than this ... That's the problem with these hosted cloud solutions, your systems are at the mercy of the bad tech support. Try explaining that to your own customers ...

nnx11y ago

Actual link to status page: http://azure.microsoft.com/en-us/status/#current

(not that convenient to copy paste the OP link from a mobile device)

toddgardner11y ago

Our VMs and websites on USEast are unreachable, however our storage seems to be working fine. There is something very backwards with how they are communicating this outage.

syassami11y ago

Storage, Websites and Visual Studio Online - Multiple Regions - Partial Service Interruption. 5 mins ago: Starting at 19 Nov 2014 00:52 UTC we are experiencing a connectivity issue to Azure Services including Storage, Websites and Visual Studio Online. The next update will be provided in 60 minutes.

wenbertOP11y ago

Well, their status page is telling lies.

bursteg11y ago

Storage is the source of the outage, and most of the services rely on it, so they are all impacted.

plasma11y ago

Still down even 2 hours later, regardless of the status page saying it's OK.

scientist11y ago

iancarroll11y ago

Seems to be back up now, my site (https://ian.sh) was down for a while.

scoj11y ago

There really isn't anything I can do either. My VM isn't back up yet. I'd go to sleep and just expect it to be online in the morning (when it really matters), but I'm afraid a drive won't reattach or something like that. Meanwhile, twiddling thumbs...hit F5...twiddle thumbs...

csbowe11y ago

The page cannot be displayed because an internal server error has occurred.

Their error pages are less graceful than mine.

jsudhams11y ago

This is why I keep a refurbished server-class machine handy as a working backup, so that you can restore if the service is not restored within a few minutes. Or keep another copy of the VM/DB with another provider, like Rackspace or something.

sspies11y ago

Do you run multi-region or maybe multi-provider setups? How do you migrate your instances from failed regions to healthy ones? How do you route users to the healthy regions? DNS? Do you think anycast could be an alternative?

NicoJuicy11y ago

My website, my web application for member management + my clients are down :s, I really don't like this...

Didn't receive any calls yet, but I don't think that will take long.

NinjaTime11y ago

Disgusting Virtual service

Disgusting management interface

Abysmal support

Way to fuck up a mustard sandwich Microsoftie

We moved everything we had away from that Virus named Azure.

Aoyagi11y ago

So did anyone receive a call "The cloud is down"? Or at least an e-mail?

superuser211y ago

Every time this happens, ask yourself... Are you outage-proof? Do you have a rational reason to believe that internally-managed infrastructure would never have a problem like this?

damian200011y ago

I'm guessing this is the reason the site I was trying to load is down ... http://www.dotnetrocks.com/

csbowe11y ago

Maybe they unknowingly upgraded to Intel's latest SSDs in their storage array. https://news.ycombinator.com/item?id=8626928


cddotdotslash11y ago

Just curious - if you have AWS too, then why did it take everything down? Can't you just swap the DNS?

photorized11y ago

We're changing that now, will need to replicate across different cloud providers, too. We're changing a lot because of last night's outage.

philwelch11y ago

Does DNS propagate quickly enough to alleviate an outage, or is it just a matter of ensuring that you recover within a few hours rather than waiting on an outage resolution that might take longer?

Alternately, can't you just have multiple A records to distribute your load across cloud platforms and just drop the one for whichever platform is having an outage?
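
A rough way to reason about the DNS question: if resolvers honour the record's TTL (an assumption; not every resolver does), the worst-case time before clients stop hitting the dead provider is roughly the time needed to detect the failure plus the TTL. A minimal sketch in Python, with a hypothetical function name and parameters:

```python
def worst_case_failover_seconds(check_interval_s, failures_to_trip, ttl_s):
    """Upper bound on how long clients may keep resolving to a dead provider.

    Assumes resolvers honour the TTL and that the record is withdrawn
    as soon as `failures_to_trip` consecutive health checks have failed.
    """
    detection = check_interval_s * failures_to_trip
    return detection + ttl_s


# Checks every 30 s, 3 failures to trip, 60 s TTL.
print(worst_case_failover_seconds(30, 3, 60))     # 150
# Same checks with a one-day TTL: the cached record dominates.
print(worst_case_failover_seconds(30, 3, 86400))  # 86490
```

Under these assumptions a short TTL is what makes dropping an A record useful during an outage; with a long TTL the cached answer outlives most incidents.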

pkorzeniewski11y ago· 9 in thread

breischl11y ago

Interesting, I actually just got one of those "suspicious activity" emails last week. But they just threatened to shut down my services, they haven't actually done it (yet).

Someone123411y ago

donatj11y ago

I had a similar issue with azure where one of my drives just up and disappeared. Took days(!) to get it back. I'd never use Azure again.

blablabla12311y ago

There is probably a lot to complain about Azure, especially when it comes to proactive communication, dirty hacks and Windows weirdness. But their support was actually really nice to me...

Touche11y ago

I had a similar situation with DigitalOcean. Droplets were taken offline with no notification whatsoever, only answered after I submitted a ticket. This has happened to me twice.

debacle11y ago

Why were they taken offline? I had a droplet taken offline for security reasons once and they were very communicative and responsive.

atmosx11y ago

Days? What do you mean, "days"? Azure was down for like 9 hours.

garretraziel11y ago

I think that he is not referring to this incident.

tiagocesar11y ago

I love how deleted comments actually never disappear.

silverbax8811y ago· 9 in thread

The idea of cloud storage being down is less of an issue - I don't like it, but I understand it. What bothers me about this is:

Maarten8811y ago

2. Azure does not guarantee that you keep your ip address by default. You should configure a cname if you use Azure Websites or get a reserved ip address, available with Cloud Services

silverbax8811y ago

badgersandjam11y ago

Also as I found out, set the TTL on your CNAME very low.

Also this is a PITA if you use the @ entry in your DNS.

shaydoc11y ago

Do you have monitoring set up on your cloud VM? You should have alerts triggering emails to notify you when this is happening.

Edit : see http://azure.microsoft.com/blog/2014/05/14/reserved-ip-addre...

silverbax8811y ago

Yes, I do. That's why I was checking my VMs. This still doesn't absolve my cloud server provider from notifying me when they have a global system failure.

yourad_io11y ago

> which, as everyone here no doubt knows, can take up to 48 hours to propagate

I think that's been largely dispelled.

My personal rule of thumb is, if it hasn't propagated within an hour, I need to look at it again because I messed up.

Tools like this [1] are invaluable when you're paranoid about whether your new record has propagated.

[1] https://www.whatsmydns.net

* ugh. Does anyone know which post I'm talking about? My google-fu is failing me hard.

vijaykiran11y ago

Perhaps not the link you are looking for - but there was some discussion on here sometime ago - https://news.ycombinator.com/item?id=3397253

hosay12311y ago

Sadly this is a common situation. They appear to hold off making any public notice (including often on their own "service dashboards") until support forums are screaming with upset users.

dewitt11y ago

Hi hosay123,

The App Engine team has a proactive policy about posting about downtime:

https://groups.google.com/forum/#!forum/google-appengine-dow...

Let me know if there's more you think might be useful for you as a GAE customer. Thanks!

inglor11y ago· 8 in thread

Our sites have been down for more than 3 hours now.

EDIT2: Now the databases are down, this is costing us a lot of money. EDIT: Just went up again.

It would be great if anyone knows how to mitigate these in the future - what can I do to protect myself against this in the future? (Except leave Azure)

joshuak11y ago

inglor11y ago

Thanks! This is very helpful.

Our DNS points to Azure on all these domains and things are hosted as "Azure Web Site" - how would notifications work if Azure itself is failing? Would I need to proxy the traffic through elsewhere?

Are there any services that solve this problem for me? I really don't mind paying a few dollars every month and not worry about this.

jfroma11y ago

The only manual step was to delay the switch back until our VMs were working fine and had all resources. We do this by changing the Route53 health check to one that is always failing.

We had also to purge our crashed mongo nodes because the journal was broken.

https://auth0.com/availability-trust/img/auth0-infrastructur...

barkingllama11y ago

On site disaster recovery? Off site disaster recovery? Split your hosting between multiple providers?

It really depends on how much risk you're willing to accept, and how much that is worth to you. It can be quantified via revenue lost, but reputation is much harder to put a number on.

inglor11y ago

This is not the first time this has happened in the last two months (after a relatively reliable year). The problem is I'm not sure any other hosting provider would do any better.

duncans11y ago

toomuchtodo11y ago

As others have mentioned, multiple cloud providers, service checks, and withdrawing bad providers at DNS.
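
The "service checks, and withdrawing bad providers at DNS" part can be sketched roughly as below. This is only an illustration: the provider names, IP addresses (documentation ranges) and health-check URLs are made up, and the step that actually pushes the surviving records to a DNS host is provider-specific and left out.

```python
import http.client
import urllib.request


def http_probe(url, timeout=5):
    """Return True if `url` answers with an HTTP 2xx status within `timeout`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # Covers URLError/HTTPError, timeouts and malformed responses.
        return False


def records_to_serve(providers, probe=http_probe):
    """Keep only the providers whose health check passes.

    `providers` maps a provider name to (ip_address, health_check_url).
    If every check fails, all records are kept: an empty answer would
    take the site down for certain.
    """
    healthy = {name: ip for name, (ip, url) in providers.items() if probe(url)}
    return healthy or {name: ip for name, (ip, _) in providers.items()}


# Hypothetical setup; a stub probe simulates the outage without network calls.
providers = {
    "azure": ("203.0.113.10", "https://azure.example.com/health"),
    "aws": ("198.51.100.20", "https://aws.example.com/health"),
}
azure_down = lambda url: "azure" not in url
print(records_to_serve(providers, probe=azure_down))  # {'aws': '198.51.100.20'}
```

Running the probes from somewhere outside both providers matters here, otherwise the checker goes down together with the thing it is checking.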

photorized11y ago

Azure + AWS

sphildreth11y ago· 5 in thread

shaydoc11y ago

If you want 99.999% uptime, use Azure TrafficManager and set up failover loadbalancing to different datacenters.

We have failover loadbalancing running between multiple datacenters, no issue here!

edit : 99.99%

equoid11y ago

The SLA mentions 99.9% or 99.99% for Database connectivity. Where does 99.999% come from?

Do Microsoft say this about Traffic Manager or are you suggesting you have to pay for extra services to get the advertised reliability figure?

matthewmacleod11y ago

> So much for the idea of 99.999% uptime with the magical "cloud" buzzword

Who was selling that to you? Because I'm pretty sure it wasn't Microsoft…

nnx11y ago

They are selling 99.99% availability over a monthly cycle for their Storage service. 99.95% connectivity for their Virtual Machine service.

http://azure.microsoft.com/en-us/support/legal/sla/

9 hours of downtime means they are down to at most 98.75% for this cycle.
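
The arithmetic behind the 98.75% figure, as a small Python sketch. It assumes a 30-day billing cycle (720 hours); the function names are just for illustration.

```python
def availability_pct(downtime_hours, period_hours):
    """Percentage of the period during which the service was up."""
    return 100.0 * (1 - downtime_hours / period_hours)


def allowed_downtime_hours(sla_pct, period_hours):
    """Downtime budget that still meets an SLA over the period."""
    return period_hours * (1 - sla_pct / 100.0)


MONTH_HOURS = 30 * 24   # 720, assuming a 30-day billing cycle
YEAR_HOURS = 365 * 24   # 8760

# 9 hours down in one monthly cycle.
print(round(availability_pct(9, MONTH_HOURS), 2))                 # 98.75
# Yearly downtime budget at 99.9%.
print(round(allowed_downtime_hours(99.9, YEAR_HOURS), 2))         # 8.76
# Monthly downtime budget at 99.99%, in minutes.
print(round(allowed_downtime_hours(99.99, MONTH_HOURS) * 60, 1))  # 4.3
```

So a 99.99% monthly target leaves a budget of about 4.3 minutes, which a 9-hour outage exceeds by two orders of magnitude.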

sphildreth11y ago

Oh, it's 99.9% from Microsoft (see http://azure.microsoft.com/en-us/support/legal/sla/), which means they get about 9h of downtime a year and still hit the SLA (see http://uptime.is/).

zaroth11y ago· 4 in thread

Also, funny if you try to zoom out in Chrome to see the whole thing, the row headers get out of alignment.

The footnote at the bottom is the best, "The Australia Regions are available only to customers with billing addresses in Australia and New Zealand." Thanks for that useful nugget! /s

tawy111y ago

History: http://azure.microsoft.com/en-us/status/#history

unclesaamm11y ago

Running Chrome 38.0.2125.111 m, zoomed out row headers look fine

ballstothewalls11y ago

Are you confusing row headers and column headers? I have the same version as you and the row headers got funky when zoomed out.

razzberryman11y ago

http://i.imgur.com/Uv81QYi.png

coldtea11y ago· 4 in thread

Reality call: ANY and ALL Cloud services, be it Google, Azure, AWS etc, will be down for hours at some point every few years.

ExpiredLink11y ago

Reality call: ANY and ALL services, be it local or remote, will be down for hours at some point every few years.

Drakim11y ago

If it's my own fault, I can at least curse my own lack of knowledge and expertise, and I can strive to do better in the future.

When the cloud is down, all we can do is fiddle our thumbs and hope it doesn't happen again. Or maybe we could send an angry letter to Microsoft, and hope somebody reads it.

grey-area11y ago

rlpb11y ago

When it's local, you can control your own at-risk periods. For example: you can avoid doing risky work when an important company deadline is looming.

jedgrant11y ago· 4 in thread

Regretting the decision to go with Azure. Talk about terrible timing. We have media outlets interested in our site, we send info and the site is dead. Talk about a crap first impression.

craigvn11y ago

It is totally frustrating, but at the end of the day similar outages happen with all cloud providers.

razzberryman11y ago

Looks like Azure has been down more hours in the past week than AWS has been down all year.

https://cloudharmony.com/status-1week-for-azure

https://cloudharmony.com/status-1year-for-aws

ZoF11y ago

Can you point to an example of this happening on AWS in multiple regions simultaneously?

photorized11y ago

Beached11y ago· 4 in thread

I wonder how many customers Azure just lost due to their unexpected 2 day fiasco

ceejayoz11y ago

> I wonder how many customers Azure just lost do to their unexpected 2 day fiasco

Amazon had a number of EBS fiascoes and survived just fine. I'd expect Azure to do the same.

LamaOfRuin11y ago

It's obviously not going to destroy anyone's business, but there is a lot more competition than there used to be.

Beached11y ago

After hours of no response on the Microsoft forums, VMs just started working again. I didn't change a thing, just poof.

chuckouellet11y ago

FYI We had to reboot some virtual machines via the portal due to the issues in East US yesterday but since then they all work.

jmnicolas11y ago· 4 in thread

freehunter11y ago

Put your servers in different regions, use Azure/Google, BlueMix/AWS, or even hybrid cloud, do something. Have a DR plan.

jmnicolas11y ago

I'm thinking as the little guy here : not data center but personal computers.

If the disaster strikes my region, I probably have better things to do than IT things (like running for my life :-).

gregd11y ago

jmnicolas11y ago

You must have really interesting stories to tell, but I guess it's classified.

patwhite11y ago· 3 in thread

Question - are AWS or GCE better at proactively messaging when there's an outage?

crb11y ago

Google's operations groups are not only regularly updated during an outage, there's a root-cause analysis with remediation and prevention information posted a couple of days after any issue.

See https://groups.google.com/forum/#!forum/gce-operations and https://groups.google.com/forum/#!forum/google-appengine-dow....

nkvoll11y ago

bad_user11y ago

I can confirm this. I remember once when half of the Internet was down and the status reported for EC2 was yellow - experiencing some minor issues :-)

And I find out about it by yelling at Heroku - they told me that Amazon is having issues before Amazon's status turned yellow.

jread11y ago· 3 in thread

razzberryman11y ago

Wow, comparing that to AWS is staggering!

https://cloudharmony.com/status-for-aws

etha11y ago

It's not as if AWS has never gone down (http://aws.amazon.com/message/65648/). It just hasn't had a major outage in the last 30 days.

maxsec11y ago

you need to add in the new Australian Clusters

teovall11y ago· 3 in thread

The postmortem for this should make for a good read. How does storage go down in eleven regions at once?

ohyesyodo11y ago

icebraining11y ago

Isn't applying the same patch everywhere at once a major anti-pattern?
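
For contrast, the usual alternative to patching everything at once is a staged rollout: patch a small wave, check health, and only then continue. A minimal Python sketch, not a description of how Azure deploys anything; the region names and the two callables are hypothetical stand-ins for real deployment and monitoring hooks.

```python
def staged_rollout(regions, apply_patch, is_healthy, wave_size=1):
    """Patch regions in small waves and stop at the first unhealthy one.

    Returns (patched, halted_at): the regions patched so far, and the
    region whose health check failed, or None if the rollout completed.
    """
    patched = []
    for start in range(0, len(regions), wave_size):
        wave = regions[start:start + wave_size]
        for region in wave:
            apply_patch(region)
            patched.append(region)
        for region in wave:
            if not is_healthy(region):
                return patched, region  # leave the remaining regions untouched
    return patched, None


# Stub callables stand in for real deployment and monitoring hooks.
regions = ["us-east", "us-west", "europe-west", "asia-east"]
broken = {"us-west"}  # pretend the patch misbehaves here
patched, halted_at = staged_rollout(
    regions,
    apply_patch=lambda region: None,
    is_healthy=lambda region: region not in broken,
)
print(patched, halted_at)  # ['us-east', 'us-west'] us-west
```

With one region per wave, a bad patch is contained to the wave that exposed it instead of reaching every region at once.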

photorized11y ago

I suspect a human config error. I don't see how else multiple services, in multiple regions, can all be affected at once.

Looking forward to the post mortem.

BeeDunc11y ago· 3 in thread

What does your client/customer think of you being on Azure? That you chose the crappy solution because your low-tech infrastructure still uses windows, which does not carry a lot of tech cred.

keithwarren11y ago

Over 80% of the Fortune 500 run on Azure.

20% of Azure VMs are Linux.

You are not well informed.

brongondwana11y ago

"run on" - I suspect you're being fed a unicode pile of poo here.

What surprises me is that nearly 20% of Fortune 500s _don't_ have something running on Azure.

(I wonder what percentage "run on" Amazon)

nullrouted11y ago

Someone is lying to you. Can you provide your source for this?

codeshaman11y ago· 2 in thread

How much of the user's data would be forever lost in such an event ?

The other aspect is privacy - in theory, all user's data can be stored and accessed forever, eg. 20 years from now, when the reincarnation of someone like Stalin comes to power.

maccard11y ago

> The other aspect is privacy - in theory, all user's data can be stored and accessed forever, eg. 20 years from now, when the reincarnation of someone like Stalin comes to power.

Is having your stuff stored locally any more secure in that situation? If someone wants your data they'll knock on your door and beat you and your family until you give it to them.

jacalata11y ago

If you have the only copy, you can destroy it.

scoj11y ago· 2 in thread

My VM is still down (US East). Is anyone else still experiencing issues?

photorized11y ago

My VMs appear to be up mostly, but they are primarily in North Central.

scoj11y ago

Thanks, I just restarted it and it took a while (5 minutes or so), but after that, it appears fine.

Nmachine11y ago· 2 in thread

"Everything is running great"

smoyer11y ago

It's obviously "All Good!"

ExpiredLink11y ago

Just a data dump for the NSA. Nothing serious!

gwgwegewg11y ago· 1 in thread

Microsoft are refusing to help us with our downed servers because we don't have a support contract. The outage is their issue not ours!!

ownagefool11y ago

If your app is down, it sounds very much like it's your problem.

While you're obviously going to be unhappy with downtime, this is a genuine part of calculation you should have made when you decided to outsource all your eggs into one basket.

inglor11y ago· 1 in thread

We've been noticing ups and downs for the last few hours of our VM powering an important database in West Europe.

Seriously considering another layer above azure to mitigate this in the future. Very disappointing to see.

joshmlewis11y ago

Someone else mentioned routing to divert traffic to working data centers. That might be an option for you.

Varcht11y ago· 1 in thread

Oh no, did we break the status page too? Sorry Azure team, really didn't mean to pile on!

bengali311y ago

keeping the load light? <html><head></head><body>The page cannot be displayed because an internal server error has occurred.</body></html>

elpool211y ago· 1 in thread

Yup! Azure websites and Storage are down in multiple regions.

andrea_s11y ago

VMs too... At least for western europe

us0r11y ago· 1 in thread

My VMs are down. This must be something major.

plasma11y ago

I think because the disks are backed by blob storage.

duedl0r11y ago· 1 in thread

come on! give them some slack.. they probably aren't very experienced at managing their linux servers! ;)

hyperliner11y ago

silverbax8811y ago· 1 in thread

This is nice. My web site server IP was changed when the server came back up. So now I have to update all of the site DNS settings.

ohyesyodo11y ago

Hmm. You should be using CNAME records rather than IP addresses. Or are you using the new fixed IP features?
