Horrors of Using Azure Kubernetes Service in Production

(movingfulcrum.com)

371 points · pdeva1 · 7y ago · 136 comments

QiKe7y ago· 27 in thread

(Eng lead for AKS here) While lots of people have had great success with AKS, we're always concerned when someone has a bad time. In this particular case the AKS engineering team spent over a day helping identify that the user had over scheduled their nodes, by running applications without memory limit, resulting in the kernel oom (out of memory) killer terminating the Docker daemon and kubelet. As part of this investigation we increased the system reservation for both Docker and kubelet to ensure that in the future if a user over schedules their nodes the kernel will only terminate their applications and not the critical system daemons.

wpietri7y ago

Does it seem weird to anybody else that a vendor would semi-blame the customer in public like this? I can't imagine seeing a statement like this from a Google or Amazon engineer.

It also seems to ignore a number of the points, especially how support was handled. I think it's bad form to only respond to the one thing that can be rebutted, ignoring the rest. And personally, I would have apologized for the bad experience here.

ummonk7y ago

While it might be phrased in a way that implies the customer is partly to blame, the actual details would indicate the main problem was with Azure Kubernetes Service. Critical system daemons going down because the application uses too much memory is not a reasonable failure mode (and the AKS team rightfully fixed it).

ickler97y ago

Well, this is on the front page, the top comment is misinformation, the posters left out details that made them look bad, and they seem to be going on a smear campaign out of spite on every platform they have. at what point is any of this in good faith?

shaklee37y ago

I would prefer a vendor responds publicly rather than request a private message. It's possible that one side was angry, and writing a blog post that makes it on HN will surely get a ton of negative attention. If that's the case, they should have the right to clear anything they'd like. I didn't read it as blame, but explanation.

brudgers7y ago

Steve Jobs told customers they were holding their phones wrong, so, to me, not really.

wizardofmysore7y ago

No. If the customer is at fault there is no problem in blaming them especially if they run a smear campaign.

jschwartzi7y ago

Well, we really don't know what was said as the blog didn't actually provide any of the original communications. It's a he said she said thing at this point. Frankly the author comes across as having a huge axe to grind. That may be with good reason but it's hard for me to judge the quality of the Azure support when we never see any of their communications, just paraphrases.

po7y ago

AKS engineering team spent over a day helping identify that the user had over scheduled their nodes, by running applications without memory limit, resulting in the kernel oom (out of memory) killer terminating the Docker daemon and kubelet.

I'm a bit confused why the cluster nodes don't come configured like this out of the box... kubernetes users aren't supposed to have to worry about OOM of the underlying system killing ops-side processes are they?

dilyevsky7y ago

they do if the cluster admin didn't set up proper system-reserved and kube-reserved (both are kubelet flags) and configure enforcement.
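
For anyone unfamiliar with those flags, here is a rough sketch of the arithmetic behind them. The kubelet's documented Node Allocatable formula subtracts kube-reserved, system-reserved, and the hard eviction threshold from node capacity, and the scheduler only places pods against what is left. The function name and the sample numbers are illustrative assumptions, not AKS's actual settings.

```python
def node_allocatable_mib(capacity_mib, kube_reserved_mib,
                         system_reserved_mib, eviction_hard_mib):
    """Memory (MiB) left for pods after the node's reservations.

    Mirrors the documented formula:
    allocatable = capacity - kube-reserved - system-reserved - eviction threshold
    """
    allocatable = (capacity_mib - kube_reserved_mib
                   - system_reserved_mib - eviction_hard_mib)
    if allocatable < 0:
        raise ValueError("reservations exceed node capacity")
    return allocatable


# Hypothetical 7 GiB node with made-up reservations.
print(node_allocatable_mib(7168, 1024, 512, 100))  # 5532
```

With no reservations configured, allocatable is close to full capacity, which leaves nothing set aside for the kubelet and the container runtime when pods fill the node.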

wokwokwok7y ago

'my stuff didn't work on AKS' is one thing; 'my stuff brought AKS and the dashboard down' is a fundamental failure that is in no way mitigated by this comment, and it feels very dishonest to try to redirect the blame for it.

My experience with azure has been reasonably positive, but even I've seen some weird stuff where things randomly don't work (AAD) or the dashboard just refuses to show anything for a while.

That this is a widespread endemic problem in Azure seems entirely plausible...

pdeva1OP7y ago

it is unclear what this response hopes to achieve. it is mentioned in the post that our containers do crash. that should under no condition cause the underlying node to go down. this has even been pointed out by others responding to this thread. it is interesting though that none of the other issues in the blog post are brought up.

HelloNurse7y ago

Setting aside the workarounds and safety margins discussed in other comments, I would expect a reasonable operating system to allow explicitly prioritizing processes so that the important ones can only run out of memory after all user processes have been preemptively terminated to reclaim their memory. I would also expect a good container platform to restart system processes reliably, even if they crash.

baaym7y ago

My guess is that the system reservation change was very welcome for me as well.

Note that a service like AKS also draws in new customers that may not yet have years of Kubernetes experience. I'm one of those for example, and I created an AKS cluster so we could deploy short-lived environments for branches of our product. We're using GitLab and the 'Review Apps' integration with Kubernetes.

The instability experienced by the author of this article is something I experienced as well, and I have spent a lot of time draining, rebooting, and scaling nodes to try and find out what is happening. I would not have been able to guess the absence of resource limits could possibly kill a node.

Fortunately these instabilities disappeared a couple of weeks ago after a redeployment of the AKS instance, and it has been stable ever since. I guess the system reservation change was included there? From my perspective that was also the moment AKS truly started feeling like a GA product.

crunchlibrarian7y ago

Sounds like you're still beta testing

exikyut7y ago

Ah, and Hyper-V supports dynamic memory, so the system reservation backing can effectively be thin provisioned. That's nice. (Hm, dynamic memory probably got switched on from the start.)

Thanks for posting this here. It would be cool for there to be a way to hold application users to account without needing to chase viral Internet posts and do your best to pin some accurate reporting on slightly after the fact. A tricky general problem.

If there's one thing I miss with Azure (and AWS), it's the perpetually-free 600MB RAM KVM VM GCloud gives everyone to play with. It only has 1GB outbound, but inbound bandwidth is free, and I can do pretty much whatever I want with it. But anyways...

AaronFriel7y ago

I don't think Azure ever uses dynamic memory for VMs - if I SSH into a VM I see the full allocation of whatever size it was supposed to be right off the bat.

I think this has to do with cgroups and ensuring the OOM killer doesn't target what is essentially the `init` process of a Kubernetes cluster - the docker daemon or kubelet.

specialp7y ago

This is a pretty bad mistake from the customer if this is true. If not done already it would probably be good to expose Prometheus metrics on CPU/Memory usage per node.

markbnj7y ago

Yes, this is true on its face: it's bad to deploy containers to k8s without appropriate resource limits. However, this should in no way affect the operation of the node, so the implied transfer of responsibility for this incident from AKS to the customer is invalid imo.
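
As a minimal sketch of what checking for missing limits could look like, the helper below walks a pod spec (represented as a plain dict that follows the `containers[].resources.limits.memory` layout of a Kubernetes pod spec) and lists containers with no memory limit. The helper name and the example spec are hypothetical.

```python
def containers_without_memory_limit(pod_spec):
    """Return names of containers whose resources.limits.memory is unset."""
    missing = []
    for container in pod_spec.get("containers", []):
        # "resources" and "limits" may be absent entirely, so default to {}.
        limits = (container.get("resources") or {}).get("limits") or {}
        if "memory" not in limits:
            missing.append(container.get("name", "<unnamed>"))
    return missing


# Hypothetical pod spec: only "web" declares a memory limit.
example_pod_spec = {
    "containers": [
        {"name": "web", "resources": {"limits": {"memory": "256Mi"}}},
        {"name": "worker", "resources": {"requests": {"memory": "128Mi"}}},
        {"name": "sidecar"},
    ]
}
print(containers_without_memory_limit(example_pod_spec))  # ['worker', 'sidecar']
```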

dilyevsky7y ago

lol so aks forgot to provision enough resources and possibly set up enforcement and you are blaming the user? the user should be able to run as close to edge of "allocatable" as possible or even go over it and be oom kill'ed without bringing down the entire node. this functionality is even built into kubelet already. there's no way you can twist this to make it into user error.

QiKe7y ago

We are indeed working on more convenient container monitoring and logging on Azure portal.

bengale7y ago

Last time I tried to use AKS I just got cryptic errors about the size of VMs available in Europe so I gave up and used GCP.

trhway7y ago

>the AKS engineering team spent over a day helping identify that the user had over scheduled their nodes, by running applications without memory limit, resulting in the kernel oom (out of memory) killer terminating the Docker daemon and kubelet.

sounds like a bunch of people have just learned for the first time about OOM killer. I mean the production systems with overcommits and the running loose OOM killer and I bet without swap ... And they blame the customer. Sounds like a PaaS MVP quickly slapped together by an alpha state startup. You may want to look into man pages, in particular oom scoring and the code -17.

praseodym7y ago

Actually Kubelet should already be adjusting OOM scores to make sure that user pods (containers) get killed over Kubelet or the Docker daemon. Why didn't that work here?
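
For context, a sketch of the OOM score policy being referred to, using the values as I recall them from the Kubernetes documentation on node out-of-memory behavior (worth verifying against the version you run): Guaranteed pods get -997, BestEffort pods get 1000, and Burstable pods get a value scaled by their memory request, clamped to the range 2 to 999. Higher scores are killed first. The function name is an assumption made for illustration.

```python
def pod_oom_score_adj(qos_class, memory_request_bytes=0, node_memory_bytes=1):
    """Approximate oom_score_adj the kubelet assigns to a pod's containers."""
    if qos_class == "Guaranteed":
        return -997
    if qos_class == "BestEffort":
        return 1000
    if qos_class == "Burstable":
        # Larger requests relative to node memory yield a lower (safer) score.
        raw = 1000 - (1000 * memory_request_bytes) // node_memory_bytes
        return min(max(2, raw), 999)
    raise ValueError(f"unknown QoS class: {qos_class}")


GIB = 1024 ** 3
print(pod_oom_score_adj("BestEffort"))                 # 1000
print(pod_oom_score_adj("Burstable", 1 * GIB, 8 * GIB))  # 875
print(pod_oom_score_adj("Guaranteed"))                 # -997
```

Pods without requests or limits land in BestEffort, so under this policy they should be the first candidates for the OOM killer, ahead of system daemons that run with a strongly negative score.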

technofiend7y ago

Mental note - create a new cgroup for docker and kubelet.

h4b4n3r07y ago

At Google you can’t even run anything on Borg until you specify how much memory it will use. You also have to specify how many cores you need and how much local (ephemeral) disk. And memory limit is hard: your task is killed without any warning if it attempts to exceed the limit. I was actually puzzled to discover that these limits are not required on k8s. Not only does this lead to screwups like this one, it also makes it impossible to optimally schedule workloads, because you simply don’t know how much of each resource each job is going to use.
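
To illustrate why declared requests matter for placement, here is a toy first-fit placement by memory request. This is a deliberately simplified sketch, not the actual Borg or Kubernetes scheduling algorithm; the function name and the pod sizes are made up.

```python
def first_fit(pods, node_capacity_mib, node_count):
    """Place each (name, request_mib) pod on the first node with room.

    Returns a dict of pod name -> node index, or None if nothing fits.
    """
    free = [node_capacity_mib] * node_count
    placement = {}
    for name, request_mib in pods:
        placement[name] = None
        for index in range(node_count):
            if request_mib <= free[index]:
                free[index] -= request_mib
                placement[name] = index
                break
    return placement


# Hypothetical workload on two 4000 MiB nodes.
pods = [("a", 3000), ("b", 3000), ("c", 2000), ("d", 1000)]
print(first_fit(pods, 4000, 2))  # {'a': 0, 'b': 1, 'c': None, 'd': 0}
```

A pod that declares no request would look like a request of zero here, so it always "fits", which is how a node ends up overcommitted.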

dilyevsky7y ago

that's not actually how this works on Borg these days (and by "these days" i mean past 5+ years) and there's nothing about k8s not requiring limits by default that led to this.

JediPig7y ago

you just killed any hope for azure running k8s. seriously. killed it with that statement.

summarity7y ago· 20 in thread

My $DAYJOB is leading a team which develops applications and gateways (for the 1k+ employee B2B market) that integrate deeply with Azure, Azure AD and anything that comes with it. We do have Microsoft employees (who work on Azure) on our payroll, too.

I can tell you, as I'm sure anyone in my team can, that Azure is one big alpha-stage amalgamation of half-baked services. I would never ever recommend Azure to literally any organization no matter the size. Seeing our customers struggle with it, us struggle with it, and even MS folks struggle with even the most basic tasks gets tiring really fast. We have so many workarounds in our software for inconsistency, unavailability, questionable security and general quirks in Azure that it's not even funny anymore.

There are some days where random parts of Azure completely fail, like customers not being able to view resources, role assignments or even their directory config.

An automatic integration test of one of our apps, which makes heavy use of Azure Resource Management APIs, just fails dozens of times a week not because we have a bug, but because state within Azure didn't propagate (RBAC changes, resource properties) within a timeout of more than 15 minutes!

Two weeks back, the same test managed to reproducibly produce a state within Azure that completely disabled the Azure Portal resource view. All "blades" in Azure just displayed "unable to access data". Only an ultra-specific sequence of UI interactions and API calls could restore Azure (while uncovering a lot of other issues).

That is the norm, not the exception. In 1.5 years worth of development, there has never been a single week without an Azure issue robbing us of hours of work just debugging their systems and writing workarounds.

/rant

On topic though, we've had good experiences with these k8s runtimes:

- GKE

- Rancher + DO

- IBM Cloud k8s (yeah, I know!)

blablabla1237y ago

Haha... I have experience with Azure as well, seen both good and bad things. As I read the title, I was already quite sure to read such a post. When Kubernetes became popular, I tried it with Azure and both scripts and documentation were broken. When I found out, I stopped trying.

Regarding Azure in general: Azure Websites is c*. Having used Heroku and App Engine for some time before, this feels like a joke. Deployments sometimes work, sometimes they don't. Have to deal with node-gyp? Don't, just don't. If you ever are forced to use Azure Websites (free startup package? ;)), learn Ansible as soon as possible and convince your team to switch to VMs.

The VMs are okay, you can't do much wrong with them. I don't really know where the complexity of Azure Websites comes from, maybe from the fact that it runs on Windows, but this cannot be the full explanation. I have seen people work with node on Windows (even without Ubuntu on Windows) and they were fine. For anyone interested, this is the Azure Websites backend: https://github.com/projectkudu/kudu

Disclaimer: my long adventure with it was years ago, maybe the service has changed 100% but I doubt it

RoadieRoller7y ago

My Team uses CosmosDB heavily (so far, and not too far anymore) and it is another half-baked service. The support people are Indians (Microsoft outsourced Azure support to an Indian company - MindTree) and are not very knowledgeable on the CosmosDB service. They always point us to the URI of a web article (of course from Microsoft), say everything will work if you follow the article, and close the ticket. Over time, we understood that we know more than them on CosmosDB, and used to ignore their replies, but nevertheless raise tickets to make sure that they are aware.

megaman227y ago

I generally like Microsoft, but their official support channels are pretty terrible. Just go looking for something in the MSDN forums - a large percentage of the posts from "Microsoft" people are telling the customers/developers that they posted in the wrong forum (often incorrectly), or suggesting some inane thing that the original post already specified and then closing the thread. GitHub issues are slightly better, although if you get off the beaten path of the new hotness, responses get very thin.

You're much better off trying to find some backchannels via MVPs on Twitter or through blogs, or figure out the developers or evangelists that give talks on this kind of stuff and contact them directly.

gamblor9567y ago

My company uses Azure quite heavily, but not Kubernetes.

No crashes, ever. Way more reliable than AWS ever was. (GCP is our failover.)

So it seems that your experience is, from my POV, the exception. Maybe there's something wrong with the way you guys have Azure set up?

politician7y ago

Which provisioning model are you using -- ASM or ARM? The last time I used Azure we used the (deprecated) ASM model which was pretty stable instead of the newer and often broken ARM model. We ended up staying on the deprecated model until we moved to AWS (for unrelated reasons).

jgalentine0077y ago

This has been my experience as well, I was surprised to read about some of the issues that were posted. Been using azure app service for about 50 .net/core services for over a year with 100% uptime. Guess I'm just lucky!

partiallypro7y ago

In most cases the cause is on the DevOps team and not on Azure, GCS or AWS. I can attest to that in screwing up some configs early on. That being said this is a very new offering from Microsoft and it is possible it has some kinks to work out.

Spoom7y ago

The biggest red flag I saw when I was working with Azure is that I noticed that a lot of their CLI commands ("az do-a-thing") actually have the pattern of "retry until it works"... and the first few tries often actually fail!
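
The "retry until it works" pattern described above looks roughly like the sketch below: a generic retry loop with exponential backoff. I don't know how the `az` CLI actually implements its retries, so the helper name, attempt count, and delays are assumptions for illustration only.

```python
import time


def retry(operation, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call operation() until it succeeds or attempts run out.

    Waits base_delay, then 2x, 4x, ... between failed attempts and
    re-raises the last error if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as error:  # broad on purpose for this sketch
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error


# Hypothetical flaky call: fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


print(retry(flaky, sleep=lambda seconds: None))  # ok
```

Retries like this hide transient failures from the user, which is convenient, but they also hide how often the first attempt fails.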

brianafrank7y ago

Thanks for the shout out to IKS (ibm cloud). you have no idea how obsessed we are with this service so it's always great to see someone noticed. :)

merqurio7y ago

We were very surprised with the quality of IBM's Kubernetes service too! We had a cluster there for almost a year and everything ran very smoothly.

We missed having more instance types to choose from, but it was a nice experience.

danberg7y ago

We are constantly working to add more machine-types based on customer requests. Recently we have added several new flavors including bare metal systems. You may want to check them out. https://console.bluemix.net/docs/containers/cs_clusters.html...

mijoharas7y ago

So, just to understand, are you writing the software because the large business clients are already tied to azure and gonna keep using it? Trying to understand why there is a market, and why people would want to use it.

ripberge7y ago

Can't speak to the Azure Resource Management APIs but I've had a very different experience with Web Apps, Azure SQL, Storage, Traffic Manager. Other than a few short-lived bouts of missed writes into Service Bus and Cosmos (their worst product, IMO) the platform has achieved 100% uptime for us for several years. Pretty amazing. From what you're describing I'm guessing you're REALLY not their average user and they de-prioritize quality on your use cases. Probably better off on another platform if you can help it.

blablabla1237y ago

Some stuff works really well, I had very good experiences with the Table and Blob Storage, Traffic Manager as well. But seriously, these are pretty basic services. ;-) What did you host on Web Apps if I can ask?

lovich7y ago

I've had wildly different results. My shop wasn't large by any means but Azure worked pretty much perfectly for us. The only issue we ever had was when changing the size on an Azure SQL DB went from a 20 minute operation to taking up to an hour sometimes. Other than that it let us scale as we wanted and have the engineers duplicate environments with their local changes arbitrarily. Gave us a 400 dollar/month bill for something that would have taken a full devops to handle with 10 engineers.

Bombthecat7y ago

By DO, do you mean DigitalOcean?

lloeki7y ago

I guess so since RancherOS is available as a fully supported option (as well as Fedora Atomic and CoreOS).

FWIW we're running a simple, custom cluster made of Debian droplets set up using kubeadm.

alexpi7y ago

I think DO stands for DC/OS in this context

eip7y ago

> That is the norm, not the exception.

Have you never used any other Microsoft software?

I mean it is the same software company that made Windows ME, Vista, 7, and 10, along with countless other chocolate covered turds.

JediPig7y ago

i rarely comment. I was going to give a thumbs up on testing azure for k8s. I am removing azure from the list perm. After reading these horror stories, azure just killed itself as a cloud provider. GC & AWS. AWS support engineers bend over backwards for us.

mgalgs7y ago· 5 in thread

FWIW, Amazon's hosted Kubernetes offering (EKS) isn't stable either (DNS failures, HPA is known to be broken, etc.).

shamsalmon7y ago

I know HPA is a legit issue but DNS failures seem to be fairly normal in kubernetes. Scaling up kube-dns has helped us resolve that particular issue as well as moving away from Alpine and into minimal Debian images. Alpine has its own DNS issues that caused us much pain.

atombender7y ago

We've had issues with KubeDNS, too. Lots of retries and timeouts on the client side, and lots of conntrack entries.

Libc has pretty slow retries (5s, I think) by default, and until 1.11 hits you can't easily set up resolver configs, though you can inject an envvar separately into each. And musl-based distros like Alpine don't even support some of libc's options, iirc.

We ended up scaling up KubeDNS to 2 replicas and moving them to a dedicated nodepool just to make sure they weren't competing with other nodes. That fixed our issues for now.

praseodym7y ago

Kube-dns (or CoreDNS in newer clusters) is pretty stable in my experience. It's still a very good idea to run more than one replica so that you can tolerate a single node failure, but if DNS failures are "fairly normal" that definitely warrants some additional investigation.

ivelichkovich7y ago

EKS HPA workaround https://medium.com/eks-hpa-workaround/k8s-hpa-controller-6ac...

ageitgey7y ago· 4 in thread

Here's a fun fact about Azure Kubernetes:

1. Deploy your Linux service on k8s with redundant nodes

2. Create a k8s VolumeClaim and mount it on your nodes to give your application some long-lived or shared disk storage (i.e. for processing user-uploaded files).

3. Wait until the subtle bugs start to appear in your app.

Because persistent k8s volumes on Azure are provided by Azure disk storage service behind the scenes, lots of weird Windows-isms apply. And this goes beyond stuff like case insensitivity for file names.

For example, if a user tries to upload a file called "COM1" or "PRN1", it will blow up with a disk write error.

Yes, that's right, Azure is the only cloud vendor that is 100% compatible with enforcing DOS 1.0 reserved filenames - on your Linux server in 2018!

manigandham7y ago

You're not using Azure Disks because they are attached to your VM as a block device and have no knowledge of the file system. PVs in AKS using Azure Disks can only be attached to a single node, as clearly stated in the documentation: https://docs.microsoft.com/en-us/azure/aks/azure-disks-dynam...

>> An Azure disk can only be mounted with Access mode type ReadWriteOnce, which makes it available to only a single AKS node. If needing to share a persistent volume across multiple nodes, consider using Azure Files.

So you must be using a file share across multiple nodes using Azure Files, which is an SMB file share service that may have compatibility issues with the Samba protocol as described in the (arguably hard to find) docs: https://docs.microsoft.com/en-us/rest/api/storageservices/na...

>> Directory and file names are case-preserving and case-insensitive.

>> The following file names are not allowed: LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, LPT9, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, PRN, AUX, NUL, CON, CLOCK$, dot character (.), and two dot characters (..).
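
A minimal sketch of guarding uploads against the names quoted above, assuming the application wants to reject them before writing to the share. The helper name is hypothetical, the comparison is case-insensitive because the quoted docs say names are case-insensitive, and it only checks the exact names listed (it does not try to cover variants such as names with extensions).

```python
# Reserved names taken from the documentation excerpt quoted above.
RESERVED_NAMES = (
    {f"LPT{i}" for i in range(1, 10)}
    | {f"COM{i}" for i in range(1, 10)}
    | {"PRN", "AUX", "NUL", "CON", "CLOCK$", ".", ".."}
)


def is_reserved_name(filename):
    """True if filename matches one of the quoted reserved names, ignoring case."""
    return filename.upper() in RESERVED_NAMES


for candidate in ["COM1", "com1", "report.txt", "COM10"]:
    print(candidate, is_reserved_name(candidate))
```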

colemickens7y ago

The only way this would make any sense is if you were using Azure Files, rather than Azure Disks. There's virtually never a time when it makes sense to use Azure Files over Azure Disks, and even when it does, a change in the application would be better advised than using Azure Files.

>Yes, that's right, Azure is the only cloud vendor that is 100% compatible with enforcing DOS 1.0 reserved filenames - on your Linux server in 2018!

This is hyperbolic bordering on flatly false. This is more reasonable and accurate:

"Azure is the only cloud vendor that serves their Samba product from Windows boxes, and thus leak Win/NTFS-isms into their Samba shares [that shouldn't be used anyway]."

How would an ext4 filesystem, mounted under Linux, attached as a block device to a VM, be subjected to Windows-isms? What you're implying doesn't even make sense.

parasubvert7y ago

I really would have to disagree with the statement one should never use Azure Files over Azure disks.

1. Most Azure VM types have very stringent limits on attached disks; a K8s worker can easily blow past this limit.

2. You have tremendous complexity to deal with: pick Azure managed disks vs unmanaged disks on storage accounts (you can’t mix them on the same cluster). You have to understand the trade-off of standard vs premium storage and how they bill (premium rounds up and charges by capacity, not consumption). And you need the right VM types for premium.

3. Managed disks each create a resource object in your resource group. A resource group last I checked had hard limits on the number of resources (like 4000?). Each VM is minimum 3 to 4 resources (with a NIC, image, and disk)... at scale this gets difficult.

4. Azure disks require significant time to create, mount and remount. A StatefulSet pod failure will sometimes take 3-5 minutes for its PV to move to a different worker. And worse when your Azure region has allocation problems. Azure files are near instantaneous unmount/remount.

5. Azure disks are block storage and thus only ReadWriteOnce. Azure files are RWM.

So, sure, if you’re running a cluster database with dedicated per node PVs and limited expected redeployments... use Azure disks. If you need a PV for any other reason... especially for application tiers that churn frequently.. use Azure files.

AaronFriel7y ago

Is this with the managed-disk volume (which IIRC is formatted ext4) or with `AzureFiles`, which is essentially SMB/CIFS?

curiousDog7y ago· 4 in thread

This is sad. From what I hear, one of the founders of k8s works on AKS.

Only a matter of time before GCP becomes the #1/2 cloud provider.

h4b4n3r07y ago

I’ve seen at least 4 “founders of Kubernetes” by now. How many are there in total?

rhencke7y ago

At least three to five.

https://en.wikipedia.org/wiki/Kubernetes#History

cheeze7y ago

Agree. The quality of their software is so much better than the other major players.

swozey7y ago

Yeah, Brendan Burns is at ms

rcconf7y ago· 4 in thread

Hmm, this isn't great. Currently using Azure Kubernetes Service and we haven't had many issues so far, but we just made the shift.

Hope I don't have to move over to Google cloud.

RantyDave7y ago

It's heaps and heaps better than Azure.

zeroone1017y ago

You'll be fine. We run a number of AKS clusters and they have all been rock solid. I think the problem is people hear "managed cluster" and so they don't think they need to understand how k8s works. Follow best practices (resource limits, etc.) and you'll be just fine. We've even tested out the upgrade flow on a live production cluster and it was butter smooth.

outworlder7y ago

Why? GKE works perfectly over there.

specialp7y ago

Many people (me included) don't trust Google to stay with any business besides advertising. There's been too many times that they have ended services with not much time to get off.

spicyusername7y ago· 3 in thread

This is probably why they're releasing OpenShift on Azure. To let Red Hat engineers manage the kubernetes part.

https://azure.microsoft.com/en-us/blog/openshift-on-azure-th...

netdur7y ago

I like openshift, I am not sure why developers are not hot about it.

wnsire7y ago

> I am not sure why developers are not hot about it.

Because it's designed entirely for enterprise customers. If you have a startup you have very little reason to choose OpenShift compared to Heroku or AWS honestly.

I still love Redhat tho.

federicoponzi7y ago

What do you like about openshift that lacks in kubernetes?

taherchhabra7y ago· 2 in thread

Had a similar experience with azure cosmos graph API. The API is half baked. Doesn't support all gremlin operations. Even supported operations give non standard output. Switched to aws Neptune immediately when they launched

wnsire7y ago

> The API is half baked

Doesn't surprise me. Cosmos was too good to be true:

- Serverless

- Infinite Scalability

- Mongo API or Gremlin API or SQL API

It's obvious that it can't hold up to all its promises.

the_new_guy_297y ago

"No one can give you what I can promise you"

AaronFriel7y ago· 2 in thread

There are definitely growing pains with using Kubernetes on Azure. I've wondered a few times if other platforms have similar issues and have seen more than a few complaints about EKS.

Microsoft has some great people working on Azure, but I do feel like AKS was released to GA too soon. Without a published roadmap and scant acknowledgment of issues, I'm not sure I could recommend it to my clients or employer. It's disappointing, because I've had few issues with other Azure services.

Full disclosure: I receive a monthly credit through a Microsoft program for Azure.

markbnj7y ago

> I've wondered a few times if other platforms have similar issues and have seen more than a few complaints about EKS.

I can't speak to EKS but we've been running production workloads on GKE for over a year with very good results. There have been a very few really troublesome "growing pains" type issues (an early example: loadbalancer services getting stuck with a pending external IP assignment for days) but Google has been awesome about support, even to the extent of getting Tim Hockin and Brendan Burns on the phone with us at various times to gather information about stuff like the example I gave above. I give them high marks and would recommend the service without hesitation.

curiousDog7y ago

Brendan Burns is the lead engineer on AKS AFAIK

partiallypro7y ago· 2 in thread

It's a very new offering, the Linux App Services are still in beta, I have no idea why you would roll it into production expecting no hiccups. AWS is also new on this. Give it 6 months and let the kinks work out before migrating workloads. Seems like common sense.

verst7y ago

App Service on Linux is an unrelated service that actually runs on top of Service Fabric (a stateful microservice and container orchestration platform).

apurvajo7y ago

App Service on Linux is not in Beta, it has been a GA product for over a year now with an SLA of 99.95%. It does not use Service Fabric in the backend, it uses its own custom Orchestrator (which essentially removes the quirks of learning about orchestration away from the user)
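
To put the 99.95% figure in perspective, a quick sketch of the downtime such an SLA still permits. The function name is made up, and the 30-day and 365-day periods are assumptions chosen for illustration; how the provider actually measures the SLA period is not stated here.

```python
def allowed_downtime_minutes(sla_percent, period_days):
    """Minutes of downtime permitted by an availability SLA over a period."""
    return (1 - sla_percent / 100) * period_days * 24 * 60


print(f"{allowed_downtime_minutes(99.95, 30):.1f} min per 30 days")    # 21.6
print(f"{allowed_downtime_minutes(99.95, 365):.1f} min per 365 days")  # 262.8
```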

paxys7y ago

Worth noting that both Microsoft and Amazon's Kubernetes offerings are very new (literally weeks since GA). While "officially" ready, it is pretty naive to rely on them for production-critical workloads just yet, at least compared to Google Kubernetes Engine which has been running for years.

If you absolutely need managed Kubernetes, stick to GCP for now.

nojvek7y ago

I believe this is a cultural problem with Microsoft. Probably similar to other companies but it was very evident at Microsoft. People responsible to allocate resources (The management chain) rarely dogfood the product.

While the Engineers and PM would complain a lot about quality issues, management wants to prioritize more features. It was a running joke at Microsoft: No one gets promoted for improving existing things, if you want a quick promo, build a new thing.

So when you see a bazillion half baked things in Azure. That’s because someone got promoted for building each of those half baked things and moving on to the next big thing.

Going from 0-90% is the same amount of work as 90-99% and the same amount of work as 99.0% - 99.99%. Making things insanely great is hard and requires a lot of dedicated focus and a commitment to set a higher bar for yourself.

hb3b7y ago

I joined a healthcare startup in 2014 that had a small infrastructure on Azure. Back then AWS weren't signing BAAs and Azure was the only player in town. Being an early startup, the company didn't purchase a support plan from Azure. One day Azure suffered a major outage (may have been storage related) and over an hour later, I reached out to Microsoft for written confirmation that we could forward to customers. Since we didn't have a support plan they flat-out refused to provide any documentation whatsoever about the issue. They wanted $10,000.

Azure - never again. Company moved to AWS within a quarter.

parasubvert7y ago

DNS failures were almost certainly related to all the k8s system services on the cluster not having CPU or memory reservations, and KubeDNS was flaking.

In general AKS is a vanilla k8s cluster and expects you know what you’re doing. MS arguably should enforce some opinions about how things like system services have reservations, etc, but none of this is vanilla. The trouble is that K8s defaults are pretty poor from a security (no seccomp profiles or apparmor/se profiles) and performance perspective (no reservations on key system DaemonSets).

We’ve had this interesting industry pendulum swing between extreme poles of “we hate opinionated platforms! Give me all the knobs!” And “this is too hard, we need opinions and guard rails!”. I think the success of K8s is exposing people to the complexity of supplying all of the config details yourself and we will see a new breed of opinionated platforms on top of it very shortly. It reminds me of the early Linux Slackware and SLS and Debian days where people traded X11 configs and window manager configs like they were treasured artifacts before Red Hat, Gnome and KDE, SuSE, and eventually Ubuntu, started to force opinions.

FlorianRappl7y ago

We have had a large migration project going on for months. So far not a single failure has occurred, and our TEST environment has been fully migrated (quite responsive and rock solid) for 2 weeks now.

However, I do agree that Azure has released a lot of half-baked features and services lately (over the last 1.5 to 2 years). I hope this trend does not continue.

stefanatfrg7y ago

Couple of questions to the OP:

1. What version of docker / container runtime is being used?

2. What base image is being used for your containers? E.g. Alpine has known DNS issues [1]

[1] https://www.youtube.com/watch?v=ZnW3k6m5AY8

bsaul7y ago

Side question: what are the best practices for development? Are you supposed to run a local Kubernetes deployment (it looks pretty hard to set up), or do you run everything outside of containers when developing and then deal with k8s packaging and deployment as a completely separate issue (which looks like it could lead to discovering a lot of issues in the preproduction environment)?

gercheq7y ago

Azure is not bad, but there are definitely some rough edges. We're having trouble with their BizSpark Sponsorship billing: https://news.ycombinator.com/item?id=17698948

rdl7y ago

Key Vault (their HSM product) is even worse.

ubuntunero7y ago

interesting. thanks

QiKe7y ago· 27 in thread

wpietri7y ago

Does it seem weird to anybody else that a vendor would semi-blame the customer in public like this? I can't imagine seeing a statement like this from a Google or Amazon engineer.

ummonk7y ago

2 more replies

ickler97y ago

1 more reply

shaklee37y ago

1 more reply

brudgers7y ago

Steve Jobs told customers they were holding their phones wrong, so, to me, not really.

1 more reply

wizardofmysore7y ago

No. If the customer is at fault there is no problem in blaming them especially if they run a smear campaign.

jschwartzi7y ago

po7y ago

dilyevsky7y ago

they do if the cluster admin didn't set up proper system-reserved and kube-reserved (both are kubelet flags) and configure enforcement.
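For anyone following along: the Kubernetes docs describe a node's allocatable memory as capacity minus kube-reserved, minus system-reserved, minus the hard eviction threshold. A rough Python sketch of that arithmetic, where the node size and reservation values are made-up illustrations and not AKS defaults:

```python
def allocatable_mib(capacity_mib, kube_reserved_mib, system_reserved_mib, eviction_hard_mib):
    """Memory left for pods after the kubelet-level reservations are subtracted."""
    allocatable = capacity_mib - kube_reserved_mib - system_reserved_mib - eviction_hard_mib
    if allocatable < 0:
        raise ValueError("reservations exceed node capacity")
    return allocatable


# Hypothetical 8 GiB node; all four numbers are assumptions for illustration.
print(allocatable_mib(8192, 1024, 512, 100))  # 6556
```

If the reservations are zero, the scheduler sees the whole node as available to pods, which is the situation being described in this thread.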

1 more reply

wokwokwok7y ago

My experience with Azure has been reasonably positive, but even I've seen some weird stuff where things randomly don't work (AAD) or the dashboard just refuses to show anything for a while.

That this is a widespread endemic problem in Azure seems entirely plausible...

pdeva1OP7y ago

HelloNurse7y ago

1 more reply

baaym7y ago

My guess is that the system reservation change was very welcome for me as well.

crunchlibrarian7y ago

Sounds like you're still beta testing

exikyut7y ago

Ah, and Hyper-V supports dynamic memory, so the system reservation backing can effectively be thin provisioned. That's nice. (Hm, dynamic memory probably got switched on from the start.)

AaronFriel7y ago

I don't think Azure ever uses dynamic memory for VMs - if I SSH into a VM I see the full allocation of whatever size it was supposed to be right off the bat.

I think this has to do with cgroups and ensuring the OOM killer doesn't target what is essentially the `init` process of a Kubernetes cluster - the docker daemon or kubelet.

specialp7y ago

This is a pretty bad mistake from the customer if this is true. If not done already it would probably be good to expose Prometheus metrics on CPU/Memory usage per node.

markbnj7y ago

1 more reply

dilyevsky7y ago

1 more reply

QiKe7y ago

We are indeed working on more convenient container monitoring and logging on Azure portal.

bengale7y ago

Last time I tried to use AKS I just got cryptic errors about the size of VMs available in Europe so I gave up and used GCP.

trhway7y ago

praseodym7y ago

Actually Kubelet should already be adjusting OOM scores to make sure that user pods (containers) get killed over Kubelet or the Docker daemon. Why didn't that work here?
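On Linux the per-process bias lives in `/proc/<pid>/oom_score_adj`, ranging from -1000 (never kill) to 1000 (kill first). A minimal sketch for inspecting it on a node, assuming you have shell access; the helper names and the `proc_root` parameter are hypothetical, and the adjustment is only a bias, since the kernel's final choice also depends on memory usage:

```python
from pathlib import Path


def read_oom_score_adj(pid, proc_root="/proc"):
    """Return the oom_score_adj value (-1000..1000) for one process."""
    text = Path(proc_root, str(pid), "oom_score_adj").read_text()
    return int(text.strip())


def most_likely_victims(pids, proc_root="/proc"):
    """Order pids so that those the OOM killer is most biased toward come first."""
    scored = []
    for pid in pids:
        try:
            scored.append((read_oom_score_adj(pid, proc_root), pid))
        except (FileNotFoundError, ProcessLookupError, PermissionError):
            continue  # process exited or is not readable
    return [pid for _score, pid in sorted(scored, reverse=True)]
```

Running this against the kubelet and Docker daemon pids would show whether they really are scored below the user containers on a given node.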

1 more reply

technofiend7y ago

Mental note - create a new cgroup for docker and kubelet.

h4b4n3r07y ago

dilyevsky7y ago

that's not actually how this works on Borg these days (and by "these days" i mean past 5+ years) and there's nothing about k8s not requiring limits by default that led to this.

1 more reply

JediPig7y ago

you just killed any hope for azure running k8s. seriously. killed it with that statement.

summarity7y ago· 20 in thread

There are some days where random parts of Azure completely fail, like customers not being able to view resources, role assignments or even their directory config.

/rant

On topic though, we've had good experiences with these k8s runtimes:

- GKE

- Rancher + DO

- IBM Cloud k8s (yeah, I know!)

blablabla1237y ago

Disclaimer: my long adventure with it was years ago, maybe the service has changed 100% but I doubt it

RoadieRoller7y ago

megaman227y ago

gamblor9567y ago

My company uses Azure quite heavily, but not Kubernetes.

No crashes, ever. Way more reliable than AWS ever was. (GCP is our failover.)

So it seems that your experience is, from my POV, the exception. Maybe there's something wrong with the way you guys have Azure set up?

politician7y ago

1 more reply

jgalentine0077y ago

partiallypro7y ago

Spoom7y ago

brianafrank7y ago

Thanks for the shout out to IKS (ibm cloud). you have no idea how obsessed we are with this service so it's always great to see someone noticed. :)

merqurio7y ago

We were very surprised by the quality of IBM's Kubernetes service too! We had a cluster there for almost a year and everything ran very smoothly.

We missed having more instance types to choose from, but it was a nice experience.

danberg7y ago

mijoharas7y ago

ripberge7y ago

blablabla1237y ago

1 more reply

lovich7y ago

Bombthecat7y ago

By DO, do you mean DigitalOcean?

lloeki7y ago

I guess so since RancherOS is available as a fully supported option (as well as Fedora Atomic and CoreOS).

FWIW we're running a simple, custom cluster made of Debian droplets set up using kubeadm.

alexpi7y ago

I think DO stands for DC/OS in this context

1 more reply

eip7y ago

> That is the norm, not the exception.

Have you never used any other Microsoft software?

I mean it is the same software company that made Windows ME, Vista, 7, and 10, along with countless other chocolate covered turds.

JediPig7y ago

mgalgs7y ago· 5 in thread

FWIW, Amazon's hosted Kubernetes offering (EKS) isn't stable either (DNS failures, HPA is known to be broken, etc.).

shamsalmon7y ago

atombender7y ago

We've had issues with KubeDNS, too. Lots of retries and timeouts on the client side, and lots of conntrack entries.

We ended up scaling up KubeDNS to 2 replicas and moving them to a dedicated nodepool just to make sure they weren't competing with other nodes. That fixed our issues for now.

praseodym7y ago

2 more replies

ivelichkovich7y ago

EKS HPA workaround https://medium.com/eks-hpa-workaround/k8s-hpa-controller-6ac...

ageitgey7y ago· 4 in thread

Here's a fun fact about Azure Kubernetes:

1. Deploy your Linux service on k8s with redundant nodes

2. Create a k8s VolumeClaim and mount it on your nodes to give your application some long-lived or shared disk storage (i.e. for processing user-uploaded files).

3. Wait until the subtle bugs start to appear in your app.

For example, if a user tries to upload a file called "COM1" or "PRN1", it will blow up with a disk write error.

Yes, that's right, Azure is the only cloud vendor that is 100% compatible with enforcing DOS 1.0 reserved filenames - on your Linux server in 2018!
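If you are stuck with such a share, one defensive option is to screen user-supplied filenames before writing them. A minimal sketch, assuming the share rejects the reserved device names Microsoft documents for Windows (CON, PRN, AUX, NUL, COM1-COM9, LPT1-LPT9, with or without an extension); exactly which names the Azure share refuses is an assumption here, and `is_reserved_name` and `safe_name` are hypothetical helpers:

```python
# Reserved device names as documented for Windows; treated as an assumption for the share.
WINDOWS_RESERVED = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)


def is_reserved_name(filename):
    """True if the part before the first dot is a reserved device name (case-insensitive)."""
    stem = filename.split(".", 1)[0].rstrip(" ").upper()
    return stem in WINDOWS_RESERVED


def safe_name(filename, prefix="_"):
    """Prefix reserved names so the write does not fail; other names pass through unchanged."""
    return prefix + filename if is_reserved_name(filename) else filename


print(safe_name("COM1"))       # _COM1
print(safe_name("photo.jpg"))  # photo.jpg
```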

manigandham7y ago

>> Directory and file names are case-preserving and case-insensitive.
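In practice that means two names differing only in case land on the same file. A small sketch for spotting such collisions before copying files onto a case-insensitive share; `case_collisions` is a hypothetical helper, and using `str.casefold()` is an approximation of whatever comparison the share actually applies:

```python
from collections import defaultdict


def case_collisions(names):
    """Group names that are identical when compared case-insensitively."""
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    return [group for group in groups.values() if len(group) > 1]


print(case_collisions(["Report.txt", "report.TXT", "notes.md"]))
# [['Report.txt', 'report.TXT']]
```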

colemickens7y ago

>Yes, that's right, Azure is the only cloud vendor that is 100% compatible with enforcing DOS 1.0 reserved filenames - on your Linux server in 2018!

This is hyperbolic bordering on flatly false. This is more reasonable and accurate:

"Azure is the only cloud vendor that serves its Samba product from Windows boxes, and thus leaks Win/NTFS-isms into its Samba shares [that shouldn't be used anyway]."

How would an ext4 filesystem, mounted under Linux, attached as a block device to a VM, be subjected to Windows-isms? What you're implying doesn't even make sense.

parasubvert7y ago

I really would have to disagree with the statement one should never use Azure Files over Azure disks.

1. Most Azure VM types have very stringent limits on attached disks; a K8s worker can easily blow past this limit.

5. Azure disks are block storage and thus only ReadWriteOnce. Azure files are RWM.

1 more reply

AaronFriel7y ago

Is this with the managed-disk volume (which IIRC is formatted ext4) or with `AzureFiles`, which is essentially SMB/CIFS?

curiousDog7y ago· 4 in thread

This is sad. From what I hear, one of the founders of k8s works on AKS.

Only a matter of time before GCP becomes the #1/2 cloud provider.

h4b4n3r07y ago

I’ve seen at least 4 “founders of Kubernetes” by now. How many are there in total?

rhencke7y ago

At least three to five.

https://en.wikipedia.org/wiki/Kubernetes#History

cheeze7y ago

Agree. The quality of their software is so much better than the other major players.

swozey7y ago

Yeah, Brendan Burns is at ms

rcconf7y ago· 4 in thread

Hmm, this isn't great. Currently using Azure Kubernetes Service and we haven't had many issues so far, but we just made the shift.

Hope I don't have to move over to Google cloud.

RantyDave7y ago

It's heaps and heaps better than Azure.

zeroone1017y ago

outworlder7y ago

Why? GKE works perfectly over there.

specialp7y ago

Many people (me included) don't trust Google to stay with any business besides advertising. There have been too many times when they ended services without giving much time to get off.

2 more replies

spicyusername7y ago· 3 in thread

This is probably why they're releasing OpenShift on Azure. To let Red Hat engineers manage the kubernetes part.

https://azure.microsoft.com/en-us/blog/openshift-on-azure-th...

netdur7y ago

I like OpenShift; I am not sure why developers are not hot about it.

wnsire7y ago

> I am not sure why developers are not hot about it.

Because it's designed entirely for enterprise customers. If you have a startup you have very little reason to choose OpenShift compared to Heroku or AWS, honestly.

I still love Red Hat tho.

federicoponzi7y ago

What do you like about OpenShift that Kubernetes lacks?

taherchhabra7y ago· 2 in thread

wnsire7y ago

> The API is half baked

Doesn't surprise me. Cosmos was too good to be true:

- Serverless

- Infinite scalability

- Mongo API or Gremlin API or SQL API

It's obvious that it can't hold up to all its promises.

the_new_guy_297y ago

"No one can give you what I can promise you"

AaronFriel7y ago· 2 in thread

There are definitely growing pains with using Kubernetes on Azure. I've wondered a few times if other platforms have similar issues and have seen more than a few complaints about EKS.

Full disclosure: I receive a monthly credit through a Microsoft program for Azure.

markbnj7y ago

> I've wondered a few times if other platforms have similar issues and have seen more than a few complaints about EKS.

curiousDog7y ago

Brendan Burns is the lead engineer on AKS AFAIK

1 more reply

partiallypro7y ago· 2 in thread

verst7y ago

App Service on Linux is an unrelated service that actually runs on top of Service Fabric (a stateful microservice and container orchestration platform).

apurvajo7y ago

paxys7y ago

If you absolutely need managed Kubernetes, stick to GCP for now.

nojvek7y ago

So when you see a bazillion half-baked things in Azure, that’s because someone got promoted for building each of those half-baked things and then moved on to the next big thing.
