Google Cloud is fair competition – provided they have the service you need. AWS and Azure both beat them on the number of services. If Google has it, then it should behave as expected, and some services are downright impressive (GKE and VM auto migration on GCE).
Azure is... infuriating. Inconsistent, unreliable APIs, surprising behavior everywhere (attach an internal load balancer, lose internet connectivity!?), lots of restrictions on which features can be used with which SKUs.
I see improvements, and it is difficult to beat them in the enterprise, but speaking as an engineer: man, Azure is infuriating.
Ah, yes, the "TCP and UDP egress work unless you define an _ingress_ load balancing rule for either protocol, at which point the other protocol breaks until you create a dummy rule for the other protocol that will trigger your security team to send you tickets every few months."
We're doing our best, but we're not going to suggest there's not more to do. Every major cloud provider has had issues at one point or another (I formerly worked at Amazon and Google), and I'll just say - we hear you, and we are fiercely committed to earning your trust.
First I want to say, thank you for your service, keep it up, there is a lot to be done, but I see progress.
Secondly, please, get your teams together and start communicating. We encounter a lot of issues with things that should just work or should be much simpler. Sometimes we contact support and just get handed from team to team without actually finishing something.
Third, please, oh, please, get your SDKs (especially the Python one) fixed. It looks like every new build breaks something, and sometimes even the same version breaks differently on different installations, since there is a lot of variable versioning done under the hood...
Sometimes I get the feeling a lot of things are "leaking" towards the customer. Wanna change an instance type? You get an "instance not available in cluster" error, or something similarly undocumented. Wanna copy a snapshot between regions? Good luck with that, and hope you've got some retry logic and a hell of a timeout.
Keep pushin'! :)
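For what it's worth, the "retry logic and a hell of a timeout" for something like a cross-region snapshot copy can be sketched roughly as below. This is only an illustration: `retry_with_deadline` and its parameters are made-up names, not part of any cloud SDK, and `operation` stands in for whatever call actually starts or polls the copy.

```python
import random
import time


def retry_with_deadline(operation, deadline_seconds, base_delay=1.0,
                        max_delay=60.0, retry_on=(Exception,),
                        sleep=time.sleep, clock=time.monotonic):
    """Call operation() until it succeeds or the overall deadline would pass.

    Hypothetical helper: retries with exponential backoff plus full jitter.
    In real code, narrow retry_on to errors you know are transient.
    """
    start = clock()
    attempt = 0
    while True:
        try:
            return operation()
        except retry_on:
            attempt += 1
            # Exponential backoff, capped at max_delay, with full jitter.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            delay = random.uniform(0, ceiling)
            # Give up (re-raising the last error) rather than sleep past the deadline.
            if clock() - start + delay > deadline_seconds:
                raise
            sleep(delay)


# Usage sketch (copy_snapshot is a placeholder for your own function):
# retry_with_deadline(copy_snapshot, deadline_seconds=4 * 60 * 60)
```

The deadline is for the whole operation rather than per attempt, which is usually what you want for a long-running copy.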
Wait what? S3 is 11 nines. That's like 1 file every 8 PB/year-years.
Source: https://docs.microsoft.com/en-us/azure/load-balancer/load-ba...
It seems that once you add a load balancer, all traffic gets funneled through it, doesn't matter if it was addressed to it or not. Which is unlike any other load balancer I have ever seen.
Coming from other clouds, this was a shock.
The only thing comparable is AWS's NLB. Because that load balancer is so transparent, clients appear to be connecting directly, with the original source IP. That caused issues when I wanted to deploy my own Elasticsearch and use an internal NLB for master discovery (whenever a request got routed to the same machine, the packets got discarded by the kernel). But you can just switch to another load balancer then.
[1] https://cloud.oracle.com/home
edit: Mostly I've heard it just generally is not great, plus you have to deal with typical Oracle badness.
I only hear bad things about AWS and Google Cloud, and I hear nothing much about Azure.
My anecdotal experience: I spent a couple of weeks (!) setting up our environment (Bitbucket, Django, Ubuntu, Dockerized) on Azure App Service and Azure Pipelines. Their documentation was incomplete and out-of-date, and MS support staff struggled to help if you didn't have a Windows machine (their RDP software doesn't support Linux, Skype for Business doesn't support Linux and normal Skype for Linux doesn't support screen sharing).
Little things, like trying to SSH into any machine so that you can execute commands on your Docker container (for, say, database migrations or to check logs), are almost impossible. If it wasn't for the help of a lot of people in #docker on Freenode I would probably still be working on it.
I had to use Google Hangouts with a Microsoft support person's personal gmail account, while he was connected over VPN (since he was based in Shanghai), so I could show my issue. The support person was extremely pleasant to deal with and understanding, though, and he went above and beyond to help get my issue resolved even though it turned out to not be from his department.
However, after getting set up, I noticed I was getting 12 second (!) responses from an API I had written just to retrieve a logged-in user's first name, last name and email in JSON. This API resolves locally in 20ms - including layers of authentication.
This turned out to be a known issue when running a managed "Azure Database for PostgreSQL" service and was common on MS support forums.
After reaching out to Microsoft support for Azure Database for PostgreSQL, their response was this, copy-and-pasted:
> As you are currently using Basic Tier (2 vCores, 51200MB), the bad performance is expected.
> When comparing with the performance in your VM, the on-prem is supposed to be better than cloud even within the same hardware environment.
> Please give it a test in higher tier and configure it with a compatible settings compared with your VM. In the meanwhile, you can monitor the slow queries via Query Performance Insight to find out what queries were running at a long time when those API were called.
> Pricing tier information can be found at https://docs.microsoft.com/en-us/azure/postgresql/concepts-p... .
...they tried to upsell me on the higher tier database 3 times in that email chain, believing that this level of performance was acceptable for my database tier.
Of course the next tier up from the $60/month that I was on was $160/month, and since we only have maybe two concurrent users at most it didn't make sense to triple our costs just to avoid 12 second database calls.
I moved the entire service to AWS last week. The setup was painless and swift. Using equivalently priced services, the API now resolves in 50ms.
I don't think I'll ever go back. Not even for free.
-Rachel, from Azure Database for PostgreSQL
The capacity is there, might as well use it.
That said, if you didn't have said rack, I'm not so sure it would be worth it to even make a purchase order. Sure, things outside of your control may break when you are using a cloud. But guess what: things outside your control will also break on-prem, particularly hardware and network connectivity. There is no way your networking can be better than, say, GCP's own networking, or that you can deploy redundant workloads across availability zones (or even regions!) yourself.
By the time a purchase order for a new server can arrive, we can have a production-ready system running, with redundancy across availability zones, automatic failover, CDNs, backups, the works.
Basically, I don't care if someone knocks out power in my block, if someone cuts a network cable, or even if a machine goes up in flames.
One thing I would say is: even if you are very happy with your current setup, if you have some time to automate a similar setup on the cloud (keyword: automate), then I would suggest doing just that, and offload backups to the cloud too. Even if only as a business continuity thing.
My business is mainly deep learning R&D. Current cloud GPU, networking, and storage pricing gives me ulcers given my compute needs and the size of my datasets.
I do run my website in cloud, with redundancy and all that. I also use cloud storage for backup, and for K8S registry. If I was selling e.g. inference services, I'd be running them in cloud too (passing the costs onto the clients). Most of my local workloads could easily be shifted right across to any decent K8S provider.
But the fact is, my lone rack has been humming along with zero unscheduled downtime for 3 years now. I can count several global outages in each of the three major cloud services in this timeframe (most of them during US work hours, BTW), so I'm inarguably better off with the setup I have now than I would be if I moved it all to the cloud. Not to mention it already paid for itself several times over even though I burn through several hundred dollars a month in electricity.
The more secure a system is designed to be, the more likely it is to treat unusual conditions as an attack and possibly perform some destructive action to thwart the assumed attacker. Think of phones configured to delete all data after X incorrect password attempts, HSMs with anti-tamper switches, etc.
I’ve always enjoyed this quote, but my problem with [the description of] this outage is the third-party dependency.
Packets can’t get from your cloud provider to downstream users of CenturyLink? That’s fair.
Your cloud provider can’t send packets to/from CenturyLink, so they nuke your database? I literally don’t understand.
Is the service described actually a third-party service that’s been white boxed? (I mean this in the most honest way possible. I do not understand the details, and I found the article surprising).
The root cause of nearly all of these screwups is that large, complex systems can't be fully understood or observed, and that a good chunk of knowledge about such systems is institutional, rather than explicit. So from time to time people _will_ make assumptions that don't match reality, and reality will punish them for it. Which is what, I strongly suspect, happened in this case.
It reminds me of a friend who wonders why his parents don't just install Ubuntu because Windows is so awful.
Don't get me wrong, I use cloud (GCP, if you must know) too, and if my business grew massively, I'd probably use it more. But frankly I'm more satisfied with my own "on prem" solution. Single rack which basically pays for itself every 3 months or so in cloud costs, what's not to like?
By what logic is this NOT a terrible idea?
So that's why it's a great idea to implement data deletion as a two-phase sequence of synchronous key deletion, then asynchronous low-priority block scrubbing (or marking free for reclamation).
But not handling the case where your system is confused whether the keys are deleted (versus just temporarily unavailable) is less of a great idea.
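To make the distinction concrete, here is a toy in-memory sketch of that two-phase delete, plus a cleanup pass that refuses to act when it cannot tell "key deleted" from "key store unreachable". All names here (`KeyStore`, `Storage`, `reap_orphans`, and so on) are hypothetical and this is not how any particular provider implements it.

```python
from collections import deque


class KeyServiceUnavailable(Exception):
    """The key store could not be reached; says nothing about whether a key exists."""


class KeyStore:
    """Toy key store; set reachable = False to simulate an outage."""

    def __init__(self):
        self.keys = {}
        self.reachable = True

    def _check(self):
        if not self.reachable:
            raise KeyServiceUnavailable("key store unreachable")

    def exists(self, volume_id):
        self._check()
        return volume_id in self.keys

    def delete(self, volume_id):
        self._check()
        self.keys.pop(volume_id, None)


class Storage:
    def __init__(self, key_store):
        self.key_store = key_store
        self.blocks = {}            # volume_id -> encrypted data
        self.scrub_queue = deque()  # volumes waiting for block scrubbing

    def delete_volume(self, volume_id):
        # Phase 1 (synchronous): destroy the key, making the data unreadable.
        self.key_store.delete(volume_id)
        # Phase 2 (asynchronous): queue the blocks for low-priority scrubbing.
        self.scrub_queue.append(volume_id)

    def reap_orphans(self):
        """Queue volumes whose key is confirmed gone; skip the pass if unsure."""
        try:
            orphans = [v for v in self.blocks if not self.key_store.exists(v)]
        except KeyServiceUnavailable:
            return []  # "can't reach the keys" is not "keys deleted": do nothing
        self.scrub_queue.extend(orphans)
        return orphans

    def scrub_pending(self):
        # Background worker: reclaim blocks for everything queued so far.
        while self.scrub_queue:
            self.blocks.pop(self.scrub_queue.popleft(), None)
```

The only interesting line is the `except` branch: an unreachable key store raises a different signal than a missing key, so an outage never turns into a scrub.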
Yeah they come with a firewall but still. Imagine competing with everyone else on a single namespace.
At least for S3 buckets it is justified, because those are meant to be accessible, but the databases?
To be clear: it was ultimately only a 5 minute loss (and the fact that the DNS outage was simultaneous probably meant there wasn't much data being stored anyway) because they had a regular snapshot facility. So defense in depth saved them.
Still, yikes. That's a pretty disastrous bug.
99.9: 43m 49.7s
99.99: 4m 23.0s
Sounds like they need to cough up some money for their four 9s customers...
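If anyone wants to check those downtime budgets, the arithmetic is roughly as follows. This sketch assumes the figures are per month and that a "month" is one twelfth of an average Gregorian year (365.2425 days), which is what reproduces the numbers quoted above. The function names are mine.

```python
# Assumption: one "month" = 1/12 of an average Gregorian year (365.2425 days).
SECONDS_PER_MONTH = 365.2425 / 12 * 24 * 60 * 60


def allowed_downtime_seconds(availability_percent):
    """Monthly downtime budget, in seconds, for a given availability target."""
    return SECONDS_PER_MONTH * (1 - availability_percent / 100)


def format_downtime(seconds):
    """Render seconds as 'Xm Y.Ys', rounded to a tenth of a second."""
    tenths = round(seconds * 10)
    minutes, rest = divmod(tenths, 600)
    return f"{minutes}m {rest / 10:.1f}s"


for target in (99.9, 99.99):
    print(f"{target}: {format_downtime(allowed_downtime_seconds(target))}")
# 99.9: 43m 49.7s
# 99.99: 4m 23.0s
```

So a 5-minute data-loss window already exceeds the monthly budget for four nines, though whether an SLA credit applies depends on how the provider's SLA defines downtime.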