Microsoft Azure data deleted because of DNS outage

(nakedsecurity.sophos.com)

217 points · stonewhite · 7y ago · 84 comments

49 comments · 5 top-level

lenticular · 7y ago · 18 in thread

I just don't hear any good things about Azure. That is unfortunate, because I'd love AWS to have some competition.

outworlder · 7y ago

> I just don't hear any good things about Azure. That is unfortunate, because I'd love AWS to have some competition.

Google Cloud is fair competition – provided they have the service you need. AWS and Azure both beat them in number of services. If Google has it, then it should behave as expected, and some are downright impressive (GKE and VM auto migration on GCE).

Azure is... infuriating. Inconsistent, unreliable APIs, surprising behavior everywhere (attach an internal load balancer, lose internet connectivity!?), lots of restrictions on which features can be used with which SKUs.

I see improvements and it is difficult to beat them in the enterprise, but speaking as an engineer, man Azure is infuriating.

dharmab · 7y ago

> attach an internal load balancer, lose internet connectivity!?

Ah, yes, the "TCP and UDP egress work unless you define an _ingress_ load balancing rule for either protocol, at which point the other protocol breaks until you create a dummy rule for the other protocol that will trigger your security team to send you tickets every few months."

TheIronYuppie · 7y ago

Disclosure: I work at Azure.

We're doing our best, but we're not going to suggest there's not more to do. Every major cloud provider has had issues at one point or another (I formerly worked at Amazon and Google), and I'll just say - we hear you, and we are fiercely committed to earning your trust.

salex89 · 7y ago

I work with Azure daily, my company uses tons of IaaS on all cloud providers, very dynamically. I'm mostly concentrated on Azure.

First I want to say, thank you for your service, keep it up, there is a lot to be done, but I see progress.

Secondly, please, get your teams together and start communicating. We encounter a lot of issues with things that should just work or should be much simpler. Sometimes we contact support and just get handed from team to team without actually finishing something.

Third, please, oh, please, get your SDKs (especially the Python one) fixed. It looks like every new build breaks something, sometimes even the same version on multiple installations since there is a lot of variable versioning done under the hood...

Sometimes I get the feeling a lot of things are "leaking" towards the customer. Wanna change an instance type? You get an "instance not available in cluster" error, or something similarly undocumented. Wanna copy a snapshot between regions? Good luck with that, and hope you've got some retry logic and a hell of a timeout.

Keep pushin'! :)
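
For readers wondering what "retry logic and a hell of a timeout" might look like, here is a minimal sketch: exponential backoff with jitter under one overall deadline. It does not use any Azure SDK. `retry_with_deadline` and its parameters are hypothetical names chosen for illustration, and catching every `Exception` is a simplification (real code would retry only errors known to be transient).

```python
import random
import time


def retry_with_deadline(operation, deadline_s=3600.0, base_delay_s=1.0,
                        max_delay_s=60.0, sleep=time.sleep, clock=time.monotonic):
    """Call operation() until it succeeds or deadline_s seconds have elapsed.

    `operation` is any zero-argument callable that raises on failure.
    `sleep` and `clock` are injectable so the helper can be tested quickly.
    """
    start = clock()
    attempt = 0
    while True:
        try:
            return operation()
        except Exception as exc:  # simplification: retries every error type
            attempt += 1
            elapsed = clock() - start
            if elapsed >= deadline_s:
                raise TimeoutError(f"gave up after {attempt} attempts") from exc
            # Exponential backoff capped at max_delay_s, with full jitter,
            # and never sleeping past the overall deadline.
            ceiling = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            delay = min(random.uniform(0, ceiling), deadline_s - elapsed)
            sleep(delay)
```

A long-running cross-region copy would be wrapped as a zero-argument callable and passed in as `operation`; the generous default deadline reflects the "hell of a timeout" part.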

lenticular · 7y ago

I'm sure you are, and I wish you the best of luck. I'd rather spend money with Microsoft than Amazon, everything else being equal. One way that Azure could really improve on Amazon is docs/usability, for which AWS is terrible.

1 more reply

itsdrewmiller · 7y ago

We use Azure and love it. We've had more problems over the past few years from downstream services being broken by AWS (S3 outage, etc.) than our primary apps being broken by Azure.

gnulinux · 7y ago

> S3 outage

Wait what? S3 is 11 nines. That's like 1 file every 8 PB/year-years.

2 more replies

shanemhansen · 7y ago

Azure's ILBs are still just bizarre to me. It's the first load balancer I've ever worked with where sometimes a member of the pool can't reach the load balancer. At the packet level, given their implementation, this makes sense, but tell me how many times you've ever had to write a stats endpoint where the stats nodes had to do workarounds to send their own stats?

Source: https://docs.microsoft.com/en-us/azure/load-balancer/load-ba...

outworlder · 7y ago

Do you have any information on their implementation? I only noticed that something was amiss once I connected a load balancer and my instances suddenly lost internet access. And also due to the limitation on how many load balancers you can have.

It seems that once you add a load balancer, all traffic gets funneled through it, doesn't matter if it was addressed to it or not. Which is unlike any other load balancer I have ever seen.

Coming from other clouds, this was a shock.

The only thing comparable is AWS's NLB. Because that load balancer is so transparent, clients appear to be connecting directly, with the original source ip. Which caused issues when I wanted to deploy my own Elasticsearch and use an internal NLB for master discovery (whenever a request got routed to the same machine packets got discarded by the kernel). But you can just switch to another load balancer then.

1 more reply

briffle · 7y ago

Try Google Cloud?

random3 · 7y ago

I'm trying it... starting to have second thoughts though. They make it look like it's all ready to go and then you start uncovering bits and pieces of things that seem to be half-done, half-legacy, half-dead at the same time. TBH it looks like they got spread too thin and can see how a good amount of services are solid while many are just there.

5 more replies

hbcondo714 · 7y ago

There are other large software companies with competing cloud offerings, including Oracle[1], IBM[2], Alibaba[3] and more.

[1] https://cloud.oracle.com/home

[2] https://www.ibm.com/cloud/

[3] https://us.alibabacloud.com/

setquk · 7y ago

Why on earth would you want Oracle Cloud? Pay in collected souls?

2 more replies

lenticular · 7y ago

None that can match the maturity or performance of AWS. I always hear really terrible things about Oracle and IBM.

edit: Mostly I've heard it just generally is not great, plus you have to deal with typical Oracle badness.

2 more replies

rossng · 7y ago

I briefly tried to use Oracle Cloud for a small project. Its UI is close to the worst I have ever used: the whole product is inconsistent and confusing. Despite having multiple thousands of dollars free credit available, I quickly switched to GCP!

Dayshine · 7y ago

Well, I'm not sure that's true.

I only hear bad things about AWS and Google Cloud, and I hear nothing much about Azure.

stevenjohns · 7y ago

That's because Azure is a poor service with mediocre support.

My anecdotal experience: I spent a couple of weeks (!) setting up our environment (Bitbucket, Django, Ubuntu, Dockerized) on Azure App Service and Azure Pipelines. Their documentation was incomplete, out-of-date and MS support staff struggled to help if you didn't have a Windows machine (their RDP software doesn't support Linux, Skype for Business doesn't support Linux and normal Skype for Linux doesn't support screen sharing).

Little things like trying to SSH into any machine so that you can execute commands on your docker container (for, say, database migrations or to check logs) are almost impossible. If it wasn't for the help of a lot of people on #docker in Freenode I would probably still be working on it.

I had to use Google Hangouts with a Microsoft support person's personal gmail account, while he was connected over VPN (since he was based in Shanghai), so I could show my issue. The support person was extremely pleasant to deal with and understanding, though, and he went above and beyond to help get my issue resolved even though it turned out to not be from his department.

However, after getting set up, I noticed I was getting 12 second (!) responses from an API I had written just to retrieve a logged-in user's first name, last name and email in JSON. This API resolves locally in 20ms - including layers of authentication.

This turned out to be a known issue when running a managed "Azure Database for PostgreSQL" service and was common on MS support forums.

After reaching out to Microsoft support for Azure Database for PostgreSQL, their response was this, copy-and-pasted:

> As you are currently using Basic Tier (2 vCores, 51200MB), the bad performance is expected.

> When comparing with the performance in your VM, the on-prem is supposed to be better than cloud even within the same hardware environment.

> Please give it a test in higher tier and configure it with a compatible settings compared with your VM. In the meanwhile, you can monitor the slow queries via Query Performance Insight to find out what queries were running at a long time when those API were called.

> Pricing tier information can be found at https://docs.microsoft.com/en-us/azure/postgresql/concepts-p... .

...they tried to upsell me on the higher tier database 3 times in that email chain, believing that this level of performance was acceptable for my database tier.

Of course the next tier up from the $60/month that I was on was $160/month, and since we only have maybe two concurrent users at most it didn't make sense to triple our costs just to avoid 12 second database calls.

I moved the entire service to AWS last week. The set up was painless and swift. Using equivalently priced services, the API now resolves in 50ms.

I don't think I'll ever go back. Not even for free.

rachelagy · 7y ago

I’m sorry you had that experience; your feedback is very helpful to us in pointing to where we can improve. Our storage layer which underlies the Basic tier has variable IOPS, and in this case it sounds like you were on the receiving end of a noisy neighbor. Our General Purpose and Memory Optimized tiers deliver more predictable IO and do not show the same behavior as the Basic tiers. That said we’re always working to improve all of our offerings, including Basic, and hope to provide a better experience for you in the future.

-Rachel, from Azure Database for PostgreSQL

1 more reply

m0zg · 7y ago · 11 in thread

"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." -- Leslie Lamport. I feel this should be rephrased for cloud computing at this point. The more people rely on cloud, the more these global fuck-ups are going to affect them. Makes me feel pretty good about that server rack in my garage that addresses most of my own (and my business') compute needs.

outworlder · 7y ago

If you already have a rack available with spare capacity, and your business is not expected to blow up overnight, and can tolerate failures (and the turnaround time for you or your employees to fix stuff, order spare parts, etc.), sure, why not?

The capacity is there, might as well use it.

That said, if you didn't have said rack, I'm not so sure it would be worth it to even make a purchase order. Sure things outside of your control may break when you are using a cloud. But guess what, things outside your control will also break, on-prem. Particularly hardware, and network connectivity. There is no way your networking can be better than, say, GCP's own networking, or that you can deploy redundant workloads across availability zones (or even regions!) yourself.

By the time a purchase order for a new server can arrive, we can have a production-ready system running, with redundancy across availability zones, automatic failover, CDNs, backups, the works.

Basically, I don't care if someone knocks out power in my block, if someone cuts a network cable, or even if a machine goes up in flames.

One thing I would say is: even if you are very happy with your current setup, if you have some time to automate a similar setup on the cloud (keyword: automate), then I would suggest doing just that, and offload backups to the cloud too. Even if only as a business continuity thing.

m0zg · 7y ago

My networking is good enough for my needs: 10GbE copper. Not 100GbE, but 100GbE is hard to actually fully utilize without jumping through major hoops.

My business is mainly deep learning R&D. Current cloud GPU, networking, and storage pricing gives me ulcers given my compute needs and the size of my datasets.

I do run my website in cloud, with redundancy and all that. I also use cloud storage for backup, and for K8S registry. If I was selling e.g. inference services, I'd be running them in cloud too (passing the costs onto the clients). Most of my local workloads could easily be shifted right across to any decent K8S provider.

But the fact is, my lone rack has been humming along with zero unscheduled downtime for 3 years now. I can count several global outages in each of the three major cloud services in this timeframe (most of them during US work hours, BTW), so I'm inarguably better off with the setup I have now than I would be if I moved it all to the cloud. Not to mention it already paid for itself several times over even though I burn through several hundred dollars a month in electricity.

1 more reply

mattmanser · 7y ago

This shows an utter ignorance of the market, you can get a dedicated server provisioned for you in minutes from most hosting companies.

1 more reply

userbinator · 7y ago

Another perhaps more relevant way to think about this is that security and availability are at odds with each other, and in this case a system designed for security made a secure choice.

The more secure a system is designed to be, the more likely it is to treat unusual conditions as an attack and possibly perform some destructive action to thwart the assumed attacker. Think of phones configured to delete all data after X incorrect password attempts, HSMs with anti-tamper switches, etc.

danillonunes · 7y ago

I don’t quite get what security is being accomplished in this case. You can’t access the key store, so you nuke the encrypted database?

boulos · 7y ago

Disclosure: I work on Google Cloud.

I’ve always enjoyed this quote, but my problem with [the description of] this outage is the third-party dependency.

Packets can’t get from your cloud provider to downstream users of CenturyLink? That’s fair.

Your cloud provider can’t send packets to/from CenturyLink, so they nuke your database? I literally don’t understand.

Is the service described actually a third-party service that’s been white boxed? (I mean this in the most honest way possible. I do not understand the details, and I found the article surprising).

m0zg · 7y ago

Yes, sounds like bad design to me. But the uber-problem here is with complexity. Google had its own share of screwups (most of which users never saw, but some of which they did), albeit not quite to the extent of destroying data at scale (and when GMail did destroy data at scale, it had tape backups, so it was never actually lost).

The root cause of nearly all of these screwups is that large, complex systems can't be fully understood or observed, and that a good chunk of knowledge about such systems is institutional, rather than explicit. So from time to time people _will_ make assumptions that don't match reality, and reality will punish them for it. Which is what, I strongly suspect, happened in this case.

Waterluvian · 7y ago

I feel that while valid for you, this sentiment is highly out of touch with the reality of the needs, resources, and capabilities of most people who need these kinds of systems.

It reminds me of a friend who wonders why his parents don't just install Ubuntu because Windows is so awful.

bdibs · 7y ago

Because nothing could happen to your rack (or garage)?

m0zg · 7y ago

Bad things can happen to it, but at least nobody futzes with it all the time. Most of e.g. Google's major outages (Cloud and Borg) are configuration error or human error. I run my own systems very conservatively. I use conservative technology choices, my kernels are years behind bleeding edge (although I do install security updates, of course), and I don't try to fix or "improve" what's not broken because my paycheck does not depend on that. And in the unlikely event something does get fucked up, I can fix it within less than an hour and I'm the only one affected. It's also extremely unlikely to have any kind of "global" outage.

Don't get me wrong, I use cloud (GCP, if you must know) too, and if my business grew massively, I'd probably use it more. But frankly I'm more satisfied with my own "on prem" solution. Single rack which basically pays for itself every 3 months or so in cloud costs, what's not to like?

2 more replies

mrmondo · 7y ago

I can’t say for sure, but I’m assuming they meant because it is (in comparison) not a distributed (or complex) system.

excalibur · 7y ago · 9 in thread

> The deletions were automated, triggered by a script that drops TDE database tables when their corresponding keys can no longer be accessed in the Key Vault, explained Microsoft in a letter reportedly sent to customers.

By what logic is this NOT a terrible idea?

booi · 7y ago

It sounds like they do this for data security and compliance reasons, but it seems like sloppy engineering not to consider unreachability as a possible temporary error.

sowbug · 7y ago

It's reasonable to delay deleting encrypted data (which can take a long time) and just delete its keys (which is very fast) upon a user request to delete the data. If you believe in encryption, then once you delete the only remaining copies of keys, the encrypted data is as good as deleted.

So that's why it's a great idea to implement data deletion as a two-phase sequence of synchronous key deletion, then asynchronous low-priority block scrubbing (or marking free for reclamation).

But not handling the case where your system is confused whether the keys are deleted (versus just temporarily unavailable) is less of a great idea.
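
The distinction drawn above can be sketched in a few lines. This is an illustration only, not Microsoft's implementation: `KeyStatus`, `VaultUnreachable`, `InMemoryVault` and `reclaim_orphaned` are hypothetical names. The idea is that the reclamation phase acts only on a positive confirmation that the key is deleted, and treats "could not get an answer" as a third state.

```python
from enum import Enum


class KeyStatus(Enum):
    PRESENT = "present"
    DELETED = "deleted"   # the vault positively confirmed the key is gone
    UNKNOWN = "unknown"   # no answer: DNS failure, timeout, partition, ...


class VaultUnreachable(Exception):
    """Hypothetical error for any failure to get an answer from the key vault."""


class InMemoryVault:
    """Toy stand-in for a key vault, used only to exercise the logic below."""

    def __init__(self, keys, reachable=True):
        self.keys = set(keys)
        self.reachable = reachable

    def has_key(self, key_id):
        if not self.reachable:
            raise VaultUnreachable("cannot resolve vault hostname")
        return key_id in self.keys


def key_status(vault, key_id):
    """Return PRESENT or DELETED only on an authoritative answer, else UNKNOWN."""
    try:
        return KeyStatus.PRESENT if vault.has_key(key_id) else KeyStatus.DELETED
    except VaultUnreachable:
        return KeyStatus.UNKNOWN


def reclaim_orphaned(databases, vault):
    """Phase two: pick databases whose key is confirmed deleted; skip the rest."""
    reclaimed = []
    for db_name, key_id in databases.items():
        if key_status(vault, key_id) is KeyStatus.DELETED:
            # A real system would queue low-priority block scrubbing here.
            reclaimed.append(db_name)
    return reclaimed
```

With an unreachable vault, `reclaim_orphaned` returns an empty list instead of marking every database for deletion.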

microtherion · 7y ago

Well, it's certainly one way to guarantee database consistency after a network partition…

cal5k · 7y ago

Hey now, we have guaranteed eventual consistency!

...at the heat death of the universe

tanilama · 7y ago

eventual elimination..

mh8h · 7y ago

At least it needs better error handling. A "not found" and "cannot resolve domain name" should behave differently.
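
As a rough illustration of that point using only Python's standard library: a name-resolution failure surfaces as a `URLError` wrapping `socket.gaierror`, while a "not found" is an `HTTPError` with status 404. `classify_lookup_error` is a hypothetical helper, and treating a 404 as an authoritative "no such key" is itself an assumption that a real service would need to confirm.

```python
import socket
import urllib.error


def classify_lookup_error(exc):
    """Map an exception raised by a key lookup to 'not_found' or 'unknown'."""
    # HTTPError is a subclass of URLError, so it must be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        # The server answered. Only a 404 is read as "the key does not exist";
        # any other status says nothing reliable about the key.
        return "not_found" if exc.code == 404 else "unknown"
    if isinstance(exc, (urllib.error.URLError, socket.gaierror,
                        TimeoutError, ConnectionError)):
        # DNS failure, refused connection, timeout: no answer was received,
        # so nothing destructive should be based on it.
        return "unknown"
    raise exc
```

Only the `"not_found"` outcome should ever feed a cleanup job; `"unknown"` should lead to a retry later.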

rcostin2k2 · 7y ago

Perhaps that was exactly the response they got at that moment.

LoSboccacc · 7y ago

The whole "managed resources have a mandatory public dns and ip" idea is insane

Yeah they come with a firewall but still. Imagine competing with everyone else on a single namespace.

At least for S3 buckets it's justified, because those are meant to be accessible, but the databases?

derefr · 7y ago

If the "public dns" are arbitrarily-nested subdomains, and the "ips" are IPv6 addresses, I don't see what the problem is. We've got essentially-unbounded amounts of both of those.

1 more reply

cddotdotslash · 7y ago · 5 in thread

Sounds like they built a dead-man's switch and then broke the process through which the man and the switch communicate.

ajross · 7y ago

It was garbage collection. They were deleting data that couldn't be accessed (because no key existed in the system to decrypt it), but the DNS failure fooled the detection into thinking that a failed lookup meant "no key exists". Yikes.

To be clear: it was ultimately only a 5 minute loss (and the fact that the DNS outage was simultaneous probably meant there wasn't much data being stored anyway) because they had a regular snapshot facility. So defense in depth saved them.

Still, yikes. That's a pretty disastrous bug.

tybit · 7y ago

I’m really curious as to whether this would be because of REST APIs. Conflating ‘I can’t find that endpoint’ with ‘the endpoint says that resource has been deleted’ via a 404.

dtech · 7y ago

Would be a pretty ineffective dead man's switch if a backup from 5 minutes ago is available.

eridius · 7y ago

It's presumably intended as a way to reclaim space now taken up by effectively-random garbage rather than a security measure.

spydum · 7y ago

The keys didn’t get deleted. Only the encrypted data. If keys are gone - they are GONE. AFAIK you cannot restore the keys

snockerton · 7y ago · 1 in thread

It appears that the SLA guaranteed uptime for Azure SQL Database is 99.9% or 99.99%, depending on tier. That equates to the following allowable downtime per month (which I think is what they base SLA fulfillment on):

99.9%: 43m 49.7s

99.99%: 4m 23.0s

Sounds like they need to cough up some money for their four 9s customers...
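
The two figures above can be reproduced with a short calculation, assuming the "month" is an average Gregorian month (365.2425 days / 12). That averaging choice is an assumption that happens to match the numbers quoted; the actual SLA terms may define the period differently.

```python
from datetime import timedelta

# Assumed period: one average Gregorian month.
AVG_MONTH = timedelta(days=365.2425 / 12)


def allowed_downtime(availability_pct, period=AVG_MONTH):
    """Return the downtime budget for a given availability percentage."""
    return period * (1 - availability_pct / 100)


def fmt(td):
    """Format a timedelta as minutes and seconds, e.g. '4m 23.0s'."""
    minutes, seconds = divmod(td.total_seconds(), 60)
    return f"{int(minutes)}m {seconds:.1f}s"


for pct in (99.9, 99.99):
    print(f"{pct}%: {fmt(allowed_downtime(pct))}")
```

Passing a different `period`, such as `timedelta(days=30)`, shows how much the budget shifts with the definition of a month.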

kthejoker2 · 7y ago

As the article indicates, MSFT is offering 3 months of free service to affected customers.
