Kubernetes Failure Stories | Better HN

236 comments

m0zg · 7y ago · 28 in thread

It's not for everyone and it has significant maintenance overhead if you want to keep it up to date _and_ can't re-create the cluster with a new version every time. This is something most people at Google are completely insulated from in the case of Borg, because SREs make infrastructure "just work". I wish there was something drastically simpler. I don't need three dozen persistent volume providers, or the ability to e.g. replace my network plugin or DNS provider, or load balancer. I want a sane set of defaults built-in. I want easy access to persistent data (currently a bit of a nightmare to set up in your own cluster). I want a configuration setup that can take command line params without futzing with templating and the like. As horrible and inconsistent as Borg's BCL is, it's, IMO, an improvement over what K8S uses.

Most importantly: I want a lot fewer moving parts than it currently has. Being "extensible" is a noble goal, but at some point cognitive overhead begins to dominate. Learn to say "no" to good ideas.

Unfortunately there's a lot of K8S configs and specific software already written, so people are unlikely to switch to something more manageable. Fortunately if complexity continues to proliferate, it may collapse under its own weight, leaving no option but to move somewhere else.

aprdm · 7y ago

In places I've worked, we usually had a VMware cluster, a load balancer, NFS for shared data when necessary, and DNS set up (e.g. through Consul).

This setup is very, very simple and scalable. There is very little to gain IMO from moving to Kubernetes.

Consul, vSphere and load balancers have APIs and you can write tools to do everything that K8s does.

merb · 7y ago

How do you load balance? I mean load balance the "public ip".

In some networks DNS failover is really not that great, so at least a virtual IP needs to be used.

dilyevsky · 7y ago

Scalable NFS, riiite.

threeseed · 7y ago

> leaving no option but to move somewhere else

Many of the major infrastructure/platform vendors are rolling out their own distribution of Kubernetes, either as a cloud service (e.g. AWS, Azure, GCP) or on-premises (e.g. Red Hat).

So I suspect they are going to try and differentiate on features and ease of use and make it as hard as possible to move anywhere else.

cookiecaper · 7y ago

k8s is meant to be hard to use. You're supposed to rent space on a k8s cluster from Google. Google has been pumping millions into marketing k8s as a mechanism to improve GCP adoption and establish a foothold in the cloud provider space.

malkia · 7y ago

It took me a while to get comfortable in Borg, and in general with the idea that your binary can take hundreds of verbosely written command-line arguments (coming from gamedev, I was in a bit of a state of shock for a while). But then I got used to it. Still, I felt I could never fully internalize the evaluation rules, but the other tooling (diffing) really helped in that respect.

One thing I've really appreciated was how one could enable/disable things based on the binary version rolled in, and if it's rolled back the state goes back.

Basically something like this:

    {
       new_exp_feature = (binary_compiled_after_changelist( 123456789 ) || binary_compiled_with_cherrypicks( { 123456795, 1234567899 } ))
    }

Since Piper is changelist-based (like Perforce/SVN), each "CL" goes up atomically, so you can use this to say: this specific flag should get turned ON only if my binary has been compiled with base CL > 123456789, or if it was compiled at an earlier CL but had these cherry-picks (e.g. individual changelists) built with it. But this was heavily integrated with the whole system: each binary would basically be built at some @base_cl, and additional @{cherry_pick_cl1, cherry_pick_cl2, ..} may be applied. For example, the team decides to release with version @base_cl, but bugs are found during the release, and rather than rolling to a new @base_cl, just individual cherry-picks may be pushed. You can then control (in your configuration) how to act (configuration could be pushed independently of your binary, though some systems would bundle them together). And then if you have to roll back, Borgcfg would re-evaluate all this and decide to flip the switch back. That switch would simply emit something like --new_exp_feature=true or --new_exp_feature=false (or --no-new_exp_feature; it was a long time ago, so I could be wrong).
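For readers who have not seen this kind of rule, here is a rough Python sketch of the evaluation described above. It is not Borgcfg or BCL: the `BinaryBuild` type and the helper names are made up for illustration, and it assumes that "with cherrypicks" means every listed changelist was applied to the build.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BinaryBuild:
    """Hypothetical description of a release build (not a real Borg type)."""
    base_cl: int                            # changelist the binary was built at
    cherry_picks: frozenset = frozenset()   # extra changelists applied on top


def compiled_after(build, changelist):
    # Follows the comment's wording: base CL strictly greater than the changelist.
    return build.base_cl > changelist


def compiled_with_cherrypicks(build, changelists):
    # Assumption: all listed changelists must have been cherry-picked.
    return set(changelists) <= build.cherry_picks


def new_exp_feature_flag(build):
    """Return the command-line flag the config would emit for this build."""
    enabled = (compiled_after(build, 123456789)
               or compiled_with_cherrypicks(build, {123456795, 1234567899}))
    return "--new_exp_feature=" + ("true" if enabled else "false")


if __name__ == "__main__":
    old = BinaryBuild(base_cl=123456000)
    patched = BinaryBuild(base_cl=123456000,
                          cherry_picks=frozenset({123456795, 1234567899}))
    new = BinaryBuild(base_cl=123456800)
    for build in (old, patched, new):
        print(build.base_cl, sorted(build.cherry_picks), new_exp_feature_flag(build))
```

Because the flag is a pure function of the build, rolling back to the `old` build and re-evaluating flips the flag back to false without anyone editing the configuration.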

With git/hg - you no longer have such monotonic order, but also that monotonic order worked best with monorepos (or maybe I'm just too narrow-sighted here)...

akhilcacharya · 7y ago

From this comment thread I’m beginning to think I’m one of the few people on HN that hasn’t used Borg.

All of this seems way more complicated than the tools we use at my company. Is there a specialized need here I’m not seeing?

justicezyx · 7y ago

You seem to confuse Borg and borgcfg.

The evaluation rules are merely a borgcfg artifact.

Disclaimer: I maintain borgcfg.

trhway · 7y ago

An open source project with the complexity of an enterprise monster: this is what generates the half million plus salaries. It's an old enterprise software trick. Simplifying it would serve the interests of nobody in a position to do the simplification.

m0zg · 7y ago

I'm going to voice a contrarian viewpoint here and say that "half million plus" salaries are actually a good thing. Rising tide lifts all boats and since a lot of techies live in areas with exorbitant cost of living, that money re-enters the economy at a rapid clip anyway. But such salaries are only good if commensurate value is being delivered for the money. Which in a large IT shop it might be, but as a small business owner K8S is a hard slog, hence my suggestion to simplify. I'm pretty sure 80/20 breakdown still applies, and 80% of K8S complexity could be removed without affecting anything much. One might suggest that I use GKE and bypass the problem entirely, but I need to run a lot of GPUs 24x7, and the pricing on those in any cloud is insane.

verisimilidude · 7y ago

Simplified Kubernetes is a thing that exists. OpenShift (and the open source version, OKD) jumps out as the immediate example. There are other non-k8s tools that cover some of the same territory, like Docker Swarm or Cloud Foundry.

There's still a learning curve, but it's much more humane than Kubernetes.

I think you meant to write "(and the upstream community version, OKD)", because OpenShift is also fully open source.

justicezyx · 7y ago

Hmm, I think you and many others do not get how complex a general purpose infrastructure can and should be.

Kubernetes is very simple. And it will become much more complex with the growing hardware, network, and applications it's trying to manage.

What's missing is that there is a layer of complexity on top of k8s that is still left to figure out. And I think the operator pattern is the right abstraction for service jobs. Some kind of framework is still needed to handle the batch/offline workloads though.

Quite the opposite, I want it to be flexible and pluggable for other use cases other than the most simple. I've gotten a lot of benefit from adding custom features.

I'm not sure but would something like docker swarm qualify?

dilyevsky · 7y ago

Re configuration: ksonnet is an option (although I personally find jsonnet a “lipstick on a pig” kind of solution).

There’s some work going on to have something more user-friendly (think Google’s Piccolo) - https://github.com/stripe/skycfg (disclaimer - I contributed to this project)

atombender · 7y ago

There's also Kubecfg [1], which uses Jsonnet, but has a much smaller surface area than Ksonnet.

[1] https://github.com/ksonnet/kubecfg

Could you describe Piccolo a bit? Can't find anything on it.

wires · 7y ago

> I wish there was something drastically simpler

Have you tried Nomad?

https://www.nomadproject.io

merb · 7y ago

> _and_ can't re-create the cluster with a new version every time

Actually, I used kubeadm, and the higher the version got, the better it worked for major upgrades.

So far, with the new master upgrade methods, I have not had any problems on two clusters.

Sadly, I created my cluster with an "external" etcd, even though it is actually internal, and also tried to maintain my own certificates, which is now a PITA (at the time, cert handling wasn't as good in kubeadm as it is now).

Also I have a CloudConfig/Ignition Config creator which can bootstrap all necessary configs to bootstrap a kubeadm cluster on ContainerLinux/Flatcar Linux. So if I really have time I can just recreate a new cluster and move everything over. (I.e. the only thing which is problematic in "moving" over is the database created with kubedb)

Also you can use keepalived as your kubeadm load balancer.

lykr0n · 7y ago

Nomad is drastically simpler than Kubernetes. All you need is Consul and Nomad to get a running cluster.

I think Istio (https://istio.io) is a nice effort to create both an abstraction on top of k8s and to package a set of commonly needed functionality out of the box. Unsure of its production status or overhead though.

Also I'd only go with a managed k8s solution and I'm not sure I'd consider k8s for older or non-microservice/containerized architectures. In the latter case though I don't think there's anything better out there in terms of orchestration.

nvarsj · 7y ago

I have pretty mixed feelings about Istio. It's trying to solve a lot of fundamental problems by introducing yet another layer of stuff. It's basically the middleware box all over again.

bvm · 7y ago

Lots of magic for me, I've broken a (dev) k8s cluster by installing Istio via Gitlab k8s integration. The overhead appeared to be non-negligible, but I noped-out-of-there pretty quick, so I don't have the data to back that up.

thetechlead · 7y ago

we used to maintain our own k8s cluster and it's a pain in the ass given we have no dedicated ops. the cluster crashed every one or two months and we never tried to bring it up to date.

I suggest every startup use a hosted k8s solution, which takes care of most things like authentication, networking, monitoring, updating, etc.

also keep away from templating systems such as jsonnet, which are a huge overkill. you will end up writing a lot of code you will hate to read later. instead write your own yaml builder in CI, together with parts that do docker image building, and code that deploys the microservices
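As a rough idea of what such a home-grown builder might look like, here is a minimal Python sketch that assembles a Deployment manifest from ordinary function arguments instead of a template. The service name, image reference and replica count are made up, and it emits JSON rather than YAML only to stay within the standard library (kubectl accepts JSON manifests as well).

```python
import json


def build_deployment(name, image, replicas=2):
    """Return a Kubernetes apps/v1 Deployment manifest as a plain dict."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the labels on the pod template.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }


def render(manifest):
    # Serialize deterministically so CI diffs stay readable.
    return json.dumps(manifest, indent=2, sort_keys=True)


if __name__ == "__main__":
    # In CI the image tag would come from the image build step; this one is invented.
    print(render(build_deployment("web", "registry.example.com/web:abc123", replicas=3)))
```

In a pipeline the output could be piped straight into `kubectl apply -f -` after the image build step.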

imo Google made a really smart move by open sourcing k8s, as a latecomer among cloud providers. now infrastructure has become so insignificant since everything runs on docker and pods.

ldng · 7y ago

Rancher 1.6 with Cattle was the sweet spot for us. Rancher 2 went full Kubernetes. Probably makes sense for their customers. We're looking for a replacement in that sweet spot.

rawoke083600 · 7y ago

Very true !!! There is no greater culprit responsible for "complex systems" than the act of "extensible/future proof" in software design !

I think that the unix philosophy of focused and relatively simple tools that are easy to glue together is a better way to future-proof. Yet to do that you need to have a stable substrate to provide the basis of composition. In k8s's case it seems that k8s _is_ the basis upon which the composition is to happen.

dvnguyen · 7y ago · 20 in thread

Having used Docker Compose/Swarm for the last two years, I remember having problems with them twice. One of them involved an MTU setting that I never really understood, but overall I was relatively happy with them. Since Kubernetes seems to have won, I decided to learn it but got some disappointments.

The first disappointment is setting up a local development environment. I failed to get minikube running on a MacBook Air 2013 and an Ubuntu ThinkPad. Both have VT-x enabled and Docker and VirtualBox running flawlessly. Their online interactive tutorial was good though, good enough for learning purposes.

Production setup is a bigger disappointment. The only easy and reliable ways to have a production grade Kubernetes cluster are to lock yourself into either a big player cloud provider, or an enterprise OS (Red Hat/Ubuntu), or introduce a new layer on top of Kubernetes [1]. Locking myself into enterprise Ubuntu/Red Hat is expensive, and I'm not comfortable with adding a new, moving, unreliable layer on top of Kubernetes, which is built on top of Docker. One thing I like about the Docker movement is that they commoditize infrastructure and reduce lock-ins. I can design my infrastructure so it can utilize an open source based cloud product first and easily move to others or self-host if needed. With Kubernetes, things are going the other way. Even if I never moved out of the big 3 (AWS/Azure/GCloud), the migration process could be painful since their Kubernetes may introduce further lock-ins for logging, monitoring, and so on.

[1]: https://kubernetes.io/docs/setup/pick-right-solution/

> The only easy and reliable ways to have a production grade Kubernetes cluster are to lock yourself into either a big player cloud provider, or an enterprise OS (Redhat/Ubuntu), or introduce a new layer on top of Kubernetes [1]

I think you might have misunderstood that page. The standard and universal way to deploy Kubernetes onto either your own bare metal or any cloud provider is to use kubeadm. However, if you would like a simpler and more automated solution and/or one backed by a vendor, you are welcome to pick any of the hosted platforms, distributions, or installers. CNCF has certified 70 conformant solutions: https://www.cncf.io/certification/software-conformance/

> Even if I never moved out of the big 3 (AWS/Azure/GCloud), the migration process could be painful since their Kubernetes may introduce further lock-ins for logging, monitoring, and so on.

If you choose open source solutions for logging and monitoring like Fluentd and Prometheus, then you can avoid locking into anyone's value added services and remain completely portable. If you decide to go with a vendor's solution, you may trade convenience for higher switching costs.

[1]: https://kubernetes.io/docs/setup/pick-right-solution/

Disclosure: I'm executive director of CNCF and run the conformance program.

The point is not about the minimum conformance, but rather the lock-in provided by the maximum configuration / extensions of each vendor.

Take AWS EKS as an example. Their feature page[1] does mention conformance. Then it mentions 20 other non-conformance focused features that create an effective lock-in.

k8s is becoming like OpenStack in this regard. You need to embrace a vendor version of k8s in order to have a functional cluster without a massive team.

[1] - https://aws.amazon.com/eks/features/

DavidWoof · 7y ago

I've had small-ish docker swarms in production for a couple of years as well, and I really don't understand why it doesn't seem to be popular at all. I feel like I need to move to K8S just because swarm seems to be going away, but I'm really not seeing the technical advantages at all.

If someone could point me to an article explaining why k8s is so much better than swarm, I'd really appreciate it. Are the big advantages only at 100-node scales?

I'm also constantly surprised at how unpopular docker swarm is given that everyone already uses docker itself. Why do you think swarm is going away though? I love the idea of just using my docker compose file as my deployment config.

zerotolerance · 7y ago

Swarm is not going anywhere.

romeisendcoming · 7y ago

Or you could just learn how to manage infrastructure the old fashioned way, which was never broken for small business and mid-sized enterprise environments. The only time you need the complexity and overhead of something like kubernetes is when you are truly large or when you have caught the in-fashion disease.

It's quite simple for a 20 year SA to stand up a highly integrated environment with modular monitoring, directory service, virtualization and hybrid cloud options for all services in a week. Why don't you hire one of these for the job instead of recipe/containering yourself into 'doesn't work, I dunno' posts.

>Why don't you hire one of these for the job instead of recipe/containering yourself into 'doesn't work, I dunno' posts.

Because every.single.one of these "integrated environments" I've ever come across was an objective mess, poorly documented and littered with tech-debt.

It was clear the "20-year SA" had forced 20-year old administration abstractions and ideas on top of modern infrastructure and application concerns. It was cheaper/better/easier to throw it out and rebuild on something like k8s than to make any attempt at "scaling" the existing solution.

You're simply trading the "in-fashion" disease for the "I'm a 20-year Linuxbeard I know best and no one tells me different" disease.

manigandham · 7y ago

Why not hire an SA? Because K8S is free and runs well from small to large and is available on every cloud where the IT infrastructure already is.

Why is it better to spend money to rebuild a fraction of K8S with a patchwork of infrastructure put together by a single person?

hjacobs (OP) · 7y ago

I never used Docker Swarm (so can't compare), but I don't fully understand your point about Kubernetes cloud lock-in. Certainly there are important differences in networking, load balancing, persistent volumes, and other cloud features, but that's not something any platform can just hide/eliminate (e.g. think about AWS ELB/ALB/NLB vs Google Load Balancer). The Kubernetes concepts (Deployment, Ingress, Service) still work mostly the same for the user across clouds. Some other details like non-standardized Ingress annotations are obviously due to not having them agreed in Kubernetes core API (nginx ingress supports other annotations than say Google LB or Skipper).

fhrow4484 · 7y ago

> I don't fully understand your point about Kubernetes cloud lock-in

The kubernetes folks describe a tentative solution to cloud lock-in here: https://kubernetes.io/docs/concepts/cluster-administration/f... OP isn't the only one with those concerns.

It would be nice if you could switch your cluster load between any of the cloud providers and your own on-prem setup as you go. For instance, I could see people wanting to have a default small cluster on their on-prem setup, and be ready to scale on cloud when needed.

Most of the Kubernetes toolchain provides nice support for delineating the requirements from separate cloud providers, too. Compared to most alternatives, a little Helm magic to support hybrid cloud installations is a piece of cake.

jordanbeiber · 7y ago

The kubeadm API for “phases” going beta in 1.12 -> 1.13 has actually made rolling k8s clusters (almost) a breeze.

It used to be really clunky, but these days all you need is a simple bash script or ansible play (or whatever you’re comfortable with) to get going.

But yeah, no unix philosophy vibes from k8s as a whole...

Kubernetes does not provide functionality like logging and monitoring. That is handled entirely by a bunch of open source solutions like Prometheus and Fluentd.

Actually, I have barely seen any Kubernetes cloud provider offer a meaningful service that could lock me in; they are basically managed Kubernetes clusters with their cloud services as plugins. You can verify this by comparing GKE/AKS/EKS: you'll find they are almost the same thing.

Out of curiosity, why do you need RHEL or subscription from Canonical for the production Kubernetes setup? What's wrong with plain Ubuntu or CentOS?

parasubvert · 7y ago

It’s common to pay for things to make them easier to configure/manage.

Red Hat OpenShift on RHEL, Pivotal Container Service on Ubuntu, Red Hat’s nextgen CoreOS based Kubernetes, Canonical’s Charmed Kubernetes Distribution on Ubuntu, etc. all have different config management, install, upgrade, and patching mechanisms that vary from Ansible, to Terraform, to BOSH, to Juju. Some handle PXE bare metal, some don’t. Etc.

There usually are free / no pay versions of the above that you can use self-supported, but then you’ll also need to coordinate your own upgrades and use community forums for q&a rather than being able to contractually have someone looking out for you and answering your questions.

If you’d prefer to avoid lock-in, you would otherwise have to configure and script all of that plumbing yourself with your chosen toolchain plus the newer “k8s small tools” like kubeadm, kops, Kubespray, etc.

As the old saying goes, open source is only free (as in beer) if your time has no value.

y4mi · 7y ago

You probably don't want to configure Kubernetes manually... Kops is a thing though, but there's still a lot of potential for error if you want to go to production.

tyingq · 7y ago

Digital Ocean's K8S offering is out of beta now: https://www.digitalocean.com/products/kubernetes/

Migrated my very small cluster from GKE to DigitalOcean's K8s a few weeks ago. I was using 3 nodes on GKE with 1 core & 3.75GB RAM per node, and the cost was around $100 per month including load balancer for the cheapest region, `us-central1-a`. Now, on DigitalOcean, I have 3 nodes with 1 core & 2GB RAM per node. The cost is exactly $40 including load balancer.

I am a pretty basic user. I started using k8s on this project as a learning exercise, and $100 was too much to pay for learning, but now on DO I get a similar cluster for less than half of the GKE price and I feel like it is worth it, considering all the simplicity and observability of deployments. Also, DO allows me to select regions without any price difference, so I was able to select Amsterdam to get 10 times better latency from where I live. My setup is quite basic: my app with around 8-10 pods, plus additional stuff such as cert-manager and Prometheus.

YMMV, but so far I am really happy with DO's offering, in terms of both performance and simplicity. I am not a power user and definitely operate at no scale, but using DO in general is much simpler than using GCP with GKE.

hjacobs (OP) · 7y ago

DO K8s is pretty neat, but last time I checked it did not have the metrics server (CPU/mem metrics, also "kubectl top") yet.

clvx · 7y ago

It’s good, but the storage layer has some bugs. For example, if you create a PVC and then resize the volume according to their docs, the new size isn’t reflected in k8s. Also, you can create PVs manually and they won’t show up in the dashboard.

manigandham · 7y ago · 10 in thread

I don't understand all the negative comments here, K8S solves many problems regardless of scale. You get a single platform that can run namespaced applications using simple declarative files with consolidated logging, monitoring, load-balancing, and failover built-in. What company would not want this?

this is too broad. i think that may actually be the problem: in theory it can do a lot of things, but in the real world it’s hard to get all those theoretical benefits.

for me, if you’re in the cloud you don’t need k8s. your favorite cloud provider has already figured out logging and monitoring and the basic things you need to get going. (another story if you run on bare metal)

if you’re not running a legacy app you don’t really need containers either. containers are great for legacy apps, for poorly written software or if you like overengineering. the abstraction you need is called a vm. use it. (again if you are in the cloud).

your app/service/thing is not as complicated as you think it is (or at least it should not be). I see a lot of people feeling like they need to experiment with new technology, on the job, on whatever they are doing now. actually building something that works and is simple as fuck seems to take a backseat and these types of people will create a narrative around using the new flashy thing. this is how you end up with production systems leveraging tools in beta and you end up closing shop when you finally figure out that you don’t have the resources to understand and maintain what you’ve created.

there is a time and place to experiment and learn. on small projects or on your own time. it takes experience to understand the hype cycle and to distinguish good tech from the hype.

as for k8s? yes, it solves some problems but it also creates others. do you like basically spending the time you’ve saved on setup and deployment to maintain/troubleshoot/upgrade your cluster? knock yourself out.

manigandham · 7y ago

There is a very big gap between IaaS and PaaS. K8S is an abstraction on top of VMs so you can have a customizable PaaS that runs on YAML code. It has nothing to do with how complex your app is because K8S is about running it with less work in a declarative fashion. I'm currently in and have worked with dozens of startups that have saved lots of time by removing all the ops overhead with K8S because it runs the servers and we can just deploy our apps.

It seems like most of the problems are actually about installing and running K8S software itself, but then 95% of companies won't be doing that and using the managed offerings instead. This is no different than companies using the cloud over running their own DCs.

From a developer's perspective, k8s feels like a Holy Grail. Having fully embraced it with my latest Rails app, I can say confidently that I've never had a more straightforward and enjoyable experience than with K8s. It's absolutely the correct abstraction layer for me; it gives me all the power I could ask for in as concise a definition as I could possibly expect.

I think a lot of the complaints against K8s are from the ops side of things. In my org, I don't actually run or upgrade the K8s cluster myself, so those pain points aren't mine to bear. When you're running your own k8s, the operational complexity of managing the cluster itself is not trivial and the change in mindset for traditional sysadmin types is a substantial hurdle.

My own take: K8s (or something very much like it) is absolutely the future, but the operational challenges of migrating to it at this time should not be ignored if you want to run it yourself and have existing ops experience. This will only get easier over time as tooling improves and sysadmins start seeing that this is the future they have to embrace.

manigandham · 7y ago

Agree, although I don't think the Ops portion is that hard either, at least certainly not that different from all the other complex software that used to be installed and maintained. I feel it's just the usual pushback against change and general commoditization of IT that's leading to most of the complaints.

I very much agree that kubernetes is useful in an environment that doesn’t need to scale, but do tell how it enables consolidated logging and monitoring, since my medium/small shop is spending quite some time setting up our own infrastructure for it.

013a · 7y ago

Installing a managed log ingestor is stupidly easy in Kubernetes. For example, on GCP here's the guide to getting it done [1]. Two kubectl commands, and you get centralized logging across hundreds of nodes in your cluster and thousands of containers within them. Most other platforms (like Datadog) have similar setups.

Infrastructure level monitoring is also very easy. For example, if you're on Datadog, you flip KUBERNETES=true as an environment variable in the datadog agent, and you'll instantly get events for stopped containers, with stopped reason (OOM, evictions, etc), which you can configure granular alerting on.

Let's say you're in a service-oriented environment and you want detailed network-level metrics between services (request latency, status codes, etc). No problem, two commands and you have Istio [2]. Istio has Jaeger built-in for distributed tracing, with an in-cluster dashboard, or you can export the OpenTracing spans to any service that supports OpenTracing. You can also export these metrics to Datadog or most other metrics services you use.

[1] https://kubernetes.io/docs/tasks/debug-application-cluster/l...

[2] https://istio.io/docs/setup/kubernetes/quick-start/

I run a Filebeat container with privileges to read stdout/stderr of all other pods, which then forwards to Elasticsearch (https://www.elastic.co/guide/en/beats/filebeat/master/runnin...). It's fairly straightforward; then Kibana + Watcher can ship alerts to PagerDuty based on log patterns / queries / limits, etc. I think Watcher is open-source/free now?

I also have Prometheus + grafana, which similarly collects lots of stats from around the cluster, but I'm fairly sure I'm the only person who uses that dashboard, since the only things hooked up to Prometheus are databases and such, no internal applications (yet!).

Being able to aggregate stdout/stderr across dozens of machines previously would have cost either a lot of Chef setup time or a contract with some provider. Now I get a fairly straightforward open-source stack that can be refined over time, and the yaml can be re-used very easily in any cluster. Plus, the metadata collected from Kubernetes about each log line is extremely useful (for example, out of the box you can query by Kubernetes labels for your graphs etc.).

manigandham · 7y ago

Are you running it yourself? The K8S dashboard gives you logs and basic monitoring out of the box, or you can get logs directly from kubectl.

The typical approach is to set up Fluentd for logging. You set it up as a DaemonSet and have it mount /var/docker from the host. That gives it access to all container logs, which you then stream to your desired store.

Logging and monitoring are not built in to K8S, at least not something you would rely on for operational purposes.

I believe most people use an EFK/ELK stack for centralized logging and Prometheus for Monitoring.

cygned · 7y ago · 9 in thread

I am a developer and I find k8s frustrating. To me, its documentation is confusing and scattered among too many places (best example: overlay networks). I have read multiple books and gazillions of articles and yet I have the feeling that I am lacking the bigger picture.

I was able to set it up successfully a couple of times, with more or less time required. Last time, I gave up after four days because I realized that what I needed was an "I just want to run a simple cluster" solution, and while k8s might provide that, its flexibility makes it hard for me to use.

FridgeSeal · 7y ago

Have you used other Google products? I find their documentation routinely incomprehensible and difficult.

innocentoldguy · 7y ago

Agreed! I am an engineer and have written documentation off and on throughout my career. I'm continuously dismayed at the incomprehensible documentation generated by most companies. Google's documentation is particularly bad though.

I have a theory that the type of people who make it past the Google interview are smart people who are bad at teaching. They get all the concepts, algos, etc., but when it comes to distilling it into an Explain-Like-I'm-5 tutorial, it just goes to hell very quickly.

What they need to do is hire some people who are great teachers, explainers, etc. Avoid people who rely on already attained technical knowledge, design patterns, algos, etc. to pattern match on new tech and instantly grok it. Hire the 'noob' people who question the engineers who designed the tools and ask a ton of dumb questions about how it works so they can then translate it into everyday tutorial paragraphs.

jacques_chester · 7y ago

Kubernetes has always had an identity crisis.

Who is it aimed at, app developers or platform operators? Clear, obvious contracts between the two roles are valuable, even if you decide to combine them.

I'm moderately hopeful that Knative will help in that regard, as it is more conclusively oriented towards the developer. But I am wary that since it leaves the implementation details completely visible, it may not achieve that goal.

Disclosure: I work for Pivotal, we have products based on both of these.

> app developers or platform operators

Definitely not the former. The YAML-based configuration is not a pleasant app deployment experience. Companies end up needing to do some sort of auto-generation for it to make it sane for app devs.

App developers want experiences similar to heroku. They want to git push and have applications safely roll out without downtime or configuration.

cygned7y ago

Achieving that is a difficult task, though. Personally, I’d not like to rely on an abstraction on top of a system with that level of complexity for production because I expect to run into situations that can only be solved with deep knowledge of k8s.

hjacobsOP7y ago

Kelsey has the answer for you: "Kubernetes is a platform for building platforms. It's a better place to start; not the endgame." (https://twitter.com/kelseyhightower/status/93525292372179353...)

As an application developer, you probably also don't work with the Kernel and syscalls directly (anymore), so I guess you can expect higher abstractions and a smoother experience for Kubernetes in the future.

Kubernetes doesn't specify anything about overlay networks. That's up to the CNI provider. Are you referring to flannel's documentation?

quickthrower27y ago

Perhaps there is a good Pluralsight course? Sounds like a tech you have to invest many hours to learn.

tnolet7y ago· 9 in thread

I'd be interested in a related "microservices failure stories". Must be a big overlap with this.

I have two. One was caused by data inconsistency between services and regions. One is more hypothetical: the microservices had gotten to the point that no one knew how to start the system if all services are down, and it's possible that services have circular dependencies to the point that it would be incredibly hard to do a cold start.

nicobn7y ago

I've actually seen your hypothetical in action, but the bug was even more subtle. Assume service A, B and C. A and C both need information from each other which is usually cached. Normally, you'd deploy one service at a time so the call chain would go A -> B -> C -> A or A -> C then A -> B -> C but in this particular instance, A and C's caches were cold, causing an explosion of service calls that took both services down.

kronin7y ago

You handle the hypothetical by validating a new environment can be built and bootstrapped. Doing this on a regular basis, either by tearing down and rebuilding dev, or in a separate environment just for this purpose, is not a "nice to have". This same problem exists with monoliths with complicated dependencies, nothing new here.

Just like backups, if infrastructure as code isn't tested, it's worthless.

sgt1017y ago

I think that if there is a genuine circular dependency then the services won't start ever. But I think it is possible to introduce services that assume other services are up and have an apparent dependency circularity. The trick is to have all your services resilient to their start requirements not being met: basically the service has to back off and wait if information it needs isn't yet in the environment, and then ask again, so that when other services are up everything syncs and comes up.
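The "back off and wait, then ask again" idea above can be sketched roughly as follows. This is a minimal illustration, not anything from the thread: the function name `wait_for_dependency`, the `check` callback, and the delay values are all assumptions chosen for the example.

```python
import time


def wait_for_dependency(check, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Poll check() until it returns True, doubling the wait after each miss.

    Returns the number of attempts used; raises TimeoutError if the
    dependency never becomes ready. All names here are hypothetical.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        if check():
            return attempt
        if attempt < max_attempts:
            sleep(delay)   # back off before asking again
            delay *= 2     # exponential backoff
    raise TimeoutError(f"dependency not ready after {max_attempts} attempts")
```

A service would call something like this at startup for each thing it needs, instead of crashing or assuming the other side is already up. The `sleep` parameter is injectable only so the behaviour can be exercised without real waiting.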

andyidsinga7y ago

> One is more hypothetical: the microservices had gotten to the point that no one knew how to start the system if all services are down,

This is a good one, and I have been thinking about it myself, even with smallish projects that might have tens of container-based applications / services. In my case I end up with what is essentially a 3 tier architecture with each tier being a group of containers/machines with their own rules for startup/shutdown.

parasubvert7y ago

I’d think the best way to handle a cold start is to have each microservice fail fast and be wrapped in a supervisor. They will converge to uptime.

Unless you have a truly circular dependency, at which point they probably should be collapsed into a single service.
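To make the "fail fast under a supervisor and converge" claim concrete, here is a small simulation sketch. It assumes a simplified model that is not from the thread: each round every supervisor restarts its service, and a service stays up only if everything it needs was already up. The function name and the example dependency maps are hypothetical.

```python
def simulate_cold_start(deps, max_rounds=10):
    """Simulate supervised, fail-fast services coming up from a cold start.

    deps maps a service name to the set of services it needs.
    Returns (rounds_used, services_up). If a round makes no progress,
    the remaining services are stuck (for example, a circular dependency).
    """
    up = set()
    for round_no in range(1, max_rounds + 1):
        # Services whose dependencies are all up survive this restart.
        started = {s for s, needs in deps.items() if s not in up and needs <= up}
        if not started:
            return round_no - 1, up   # no progress: the rest never start
        up |= started
        if len(up) == len(deps):
            return round_no, up       # everything converged
    return max_rounds, up
```

With a simple chain such as `{"db": set(), "api": {"db"}, "web": {"api"}}` the model converges in three rounds, while a pair of services that each require the other never comes up, which matches the point about collapsing truly circular services into one.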

coredog647y ago

Microservices failure stories? “All of them. The End”

tnolet7y ago

I'm not convinced all microservices ventures are failures. Having worked in the space a bit as a founder/CTO of a vendor in that space I've just seen many examples of misguided attempts due to fashion / CV-driven development / hype driven development. The pattern is valid, just not for everyone at every time.

I'm consulting on a micro services back end right now with mostly prior experience with monoliths. What is the selling point that drives companies down this direction? It's insane, and my client keeps trying to hire new developers and bring on more consultants to build this thing, but the amount of knowledge required is more than any one person can handle. I have similar issues with their choice of db (nosql) and its inflexibility.

bdcravens7y ago· 8 in thread

I've started the planning phase of a Kubernetes course, geared toward developers more so than the enterprise gatekeepers. As I read stories like these, I jump between different thoughts and feelings:

1) no matter what I think I know, there's too many dark corners to create an adequate course

2) K8S is such a dumpster fire that I shouldn't encourage others

3) there's a hell of an opportunity here

Thoughts? Worth pursuing? Anything in particular that should be included that usually isn't in this kind of training?

parasubvert7y ago

All three. It’s a gold rush, but as with any gold rush, conditions are hard going - that’s why there’s an opportunity.

Best way to think of Kubernetes is that it was designed to be a successful open source project that was widely adopted as a standard foundation to build products. It wasn’t designed to be a useable product on its own.

We are at the equivalent of the Slackware, SLS, Debian, and Red Hat pre-1.0 stages of GNU/Linux distros circa 1994. Red Hat eventually ran away with most of the money by the late 90s, but in the meantime there was lots of opportunity to fill an unmet need.

romeisendcoming7y ago

Don't forget SuSE the sole surviving competitor. Best Buy SuSE Linux gecko box 2.2.14 kernel veteran.

BossingAround7y ago

As a person who loves tech writing, who owes his career to free coursera courses and online tutorials, and who is eager to teach people, this is an insanely difficult thing to get right.

Writing an ok tutorial isn't good enough. Writing an amazing tutorial is fine, if it is on a platform people know (such as LinuxAcademy, Pluralsight, or something similar).

I once wrote an article on getting started with a static website generator. I received a ton of praise in the comments, saying how great the step-by-step instructions are, and I felt great... Only to discover that I made a typo in one of the commands, and that if you actually went through the tutorial, there's no way you'd get past that one step, unless you knew what you were doing (in which case you wouldn't go through a getting started guide, most likely).

All I'm saying is, unless you can write amazing content on a platform where people go to learn and advance their career, no one's gonna use it, I'm afraid.

plasma7y ago

I think a good course would be setting up a k8s cluster for a simple "hello world" production app, which then includes topics perhaps about monitoring, upgrading, etc all the kind of stuff you want to know for getting an app up and running.

A better example would be a Wordpress installation. Hello World means skipping much of the important stuff, like storage and persistence.

As a bonus, show how to use Gitlab for deploying and managing the app. Gitlab + Kubernetes could be the holy grail for modern, self-hosted development, however a good, complete tutorial/documentation is very hard to come by. One has to pick the pieces from a lot of different places with sometimes conflicting information.

I'd happily pay 100 Euros for such a course.

swampthinker7y ago

If it was easy, there would be dozens of courses out there already! Sounds like you've found a pain point you can solve.

I think it's ok not to know something to make a course for it. Even if only to learn it better. But I'd be careful given 2, pursuing this could lead to burnout.

There is an opportunity for anything infrastructure related.

meddlepal7y ago

The answer is always 3.

stunt7y ago· 7 in thread

Kubernetes solves a problem that most of the companies don't have. That is why I don't understand why the hype around it is so big.

For the majority, it adds little value compared to the added infrastructure complexity, the cost of the learning curve, and the ongoing operation and maintenance.

013a7y ago

I disagree, almost entirely. Kubernetes solves problems that every single cloud-based software company has.

What's the alternative? We spin up VMs, templated with AMIs, provisioned with an ASG? That works fine. But we want centralized logging. We want graceful restarts. We want automated rollbacks. The list goes on. These are not Google scale desires, these are "cost of doing business" asks for any cloud company. You can start building all of this on that core architecture of AMIs, or your cloud provider's equivalent, but all you're going to do is re-invent what Kubernetes does, probably worse.

Kubernetes' problem isn't that it solves problems most companies don't have. The problem is that these problems most companies have could be solved in a simpler way than Kubernetes, because most companies have the exact same problems.

threeseed7y ago

> Kubernetes solves a problem that most of the companies don't have

Actually most medium to large companies do have this problem.

There are often a lot of different languages, libraries, versions, deployment methods etc. And the appeal of Docker was that you can treat them all as black boxes. And the appeal of Kubernetes is that you have this rich support infrastructure to run them all hands-off at scale.

It definitely solves a problem. Just not particularly well.

jordanbeiber7y ago

In my experience most companies lack common conventions and automations.

Kubernetes "done right" is almost a part of your application. It becomes this "machine" that you throw stuff into and good stuff happens.

You'll need a team to integrate it into the pieces you require (auth, secrets, loadbalancers, permissions/app identities, monitoring and logging) but many places lack bits and pieces, and in my opinion k8s gives you a fast track to create a uniform application delivery platform.

What I don't like is that it kind of is the opposite of "the unix philosophy" and in that regard I prefer the hashicorp stack.

romeisendcoming7y ago

Those are your startups and web/app tier shops. Yes, they suck at sysadmin routinely and they need to be bottle fed a solution that fits the scatter/gather shape of their business. They don't want OPs discipline. They want a programmable solution that performs systems magic with a single toolset to learn.

itronitron7y ago

>> I don't understand why the hype around it is so big

Probably because it started at Google, if it was created by IBM then we'd only hear about it on TV ads.

koffiezet7y ago

Don't agree. My current client is rather small, but I just had a meeting which would have been concluded with "we'll have to set up 15 vm's before you can start developing and have 5 alignment meetings before they're correctly set up", versus "I just created a namespace for you guys to do whatever you want in, and mailed you the link to the docs for the CI/CD and deploy guidelines".

They run a self-hosted OpenShift cluster, which is managed internally by a team of 4. Not only does this make it a lot easier to spin up new environments, it also forces devs to include the ops team from the start for stuff they don't know, and corrections can be made early on.

manigandham7y ago

Because you can run your apps using simple YAML files with monitoring, logging, rolling updates, load balancing, failover, persistence built in?

awinter-py7y ago· 6 in thread

Beyond strictly runtime failures, 2018 feels like the year that most of my friends tried kube but not everybody stayed on.

The adoption failures are mostly networking issues specific to their cloud. Performance and box limits vary widely depending on cloud vendor and I still don't quite understand the performance penalty of the different overlay networks / adapters.

lykr0n7y ago

Network is a high performance system, and each layer you add adds latency.

Consider a traditional monolithic application. In comes your HTTP request in one end, a bunch of cross thread communication happens, and database queries come out the other end. With that, you have 2 points of network communication.

Now with a micro-service, you might have 4 or 5 applications that are needed to replace the above monolith. Throw in a service mesh on top of your cloud provider's SDN, and you've turned 2 points of network communication into 20 or more: the 5 micro-services talking to each other and the service meshes talking to each other. Add on top the additional processing overhead of maybe 1 to 2ms, and you've just added at best 10ms round trip time to get to your databases, plus some more CPU. And to what benefit? TLS? You can do this in your application, or trust your private network is private. Tracing? You can do this with PID matching and watching the kernel's networking stack.
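One way to read the arithmetic in that comment is as hops on the request path multiplied by per-hop overhead. The sketch below only restates that multiplication; the function name is hypothetical, and the hop count and millisecond figures are the commenter's rough numbers, not measurements.

```python
def added_latency_ms(sequential_hops, per_hop_overhead_ms):
    """Extra latency if each hop on the request path adds a fixed overhead.

    Assumes hops are sequential and overheads simply add up, which is a
    simplification; real meshes overlap work and vary per hop.
    """
    return sequential_hops * per_hop_overhead_ms


# Five services in sequence, at the quoted 1 to 2 ms of overhead each.
low = added_latency_ms(5, 1)
high = added_latency_ms(5, 2)
```

Under these assumptions the added latency lands between 5 and 10 ms, which is in the same range as the "10ms round trip" figure quoted above.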

devereaux7y ago

So true. For some low latency applications, anything above the bare minimal virtualization is not acceptable.

For what I do, in theory, many things should not impact results. In practice, anything that upon measurement impacts results is stripped away. Think A/B testing but for every single component, including the major version of, say, the Python interpreter.

That's how you end up running many things baremetal.

I'll say the future is not serverless but cloudless

kronin7y ago

Trusting that your private network is private turns that whole network into a candy store once a beachhead in that network has been established.

Defense in depth exists for a reason.

hjacobsOP7y ago

> adoption failures are mostly networking issues specific to their cloud

Do you have any pointers/write-ups with more information or plans in this direction? I would be interested to learn more.

awinter-py7y ago

this article is about feature completeness of the different managed kubes

https://kubedex.com/google-gke-vs-microsoft-aks-vs-amazon-ek...

It's pretty easy to dig up speed tests of the overlay networks, but a lot of these are just rating userspace overlay networks. The new hotness is the plugins provided by the cloud vendor which integrate with their SDN, and I haven't seen a good benchmark for those yet.

Most interesting reading will be to look up managed kube networking plugins on github and look for open/closed issues with lots of stars.

ravedave57y ago

A team at my work has spent a stupid amount of time trying to nail down networking issues with hand rolled k8s in AWS. Had to move away from using node ports to fix it. Total pain in the ass.

hjacobsOP7y ago· 3 in thread

Christian already followed the example and created a similar list for Serverless: https://github.com/cristim/serverless-failure-stories

gspetr7y ago

Is there also a list for Docker failure stories?

hjacobsOP7y ago

IMHO this would be less interesting, some people already run other container runtimes such as containerd with Kubernetes (e.g. Datadog: https://www.youtube.com/watch?v=2dsCwp_j0yQ) --- so Docker might stay as some user interface for local development, but I would not know what "Docker failures" would be in the future.

SlowRobotAhead7y ago

The third example on that list is a nice little short story. Simple mistake, gets right to the point.

I’m setting up a lambda test right now so I find it perfectly timed!

peterwwillis7y ago· 3 in thread

Dang. I wish I had my SRE Wiki up and running already, or I'd add a "public postmortems" section.

hjacobsOP7y ago

Looking forward to your public postmortems.. (either yours or whatever you find in the wild)

alien_7y ago

Just put it on Github like this and the Serverless one I also created after I saw this.

alien_7y ago

Just saw it already exists: https://github.com/danluu/post-mortems

nisa7y ago· 2 in thread

The k8s hype feels like the Hadoop hype from a few years ago. Both solve problems that most don't have and there is a lot of complexity - some due to the nature of the problem, some because everything is new and moving.

Of course it's 2019 and you have to migrate Hadoop to run on k8s now :)

My impression is that if you are a small shop and have the money, use k8s on google and be happy, but don't attempt to set it up for yourself.

If you only have a few dedicated boxes somewhere just use Docker Swarm and something like Portainer.

lugg7y ago

Docker swarm is really nice. I wish it had more traction. I fear it's going to be dropped and leave me holding a bag full of bugs.

BretFisher7y ago

Swarm isn’t going anywhere. It has a growing community and the team is actively working in the repos. See my updates: https://www.bretfisher.com/the-future-of-docker-swarm/

AaronFriel7y ago· 2 in thread

I just went through all of the post-mortems for my own company's purposes of evaluating Kubernetes. I've been running Kubernetes clusters for about a year and a half and have run into a few of these, but here's what I found striking:

* About half of the post-mortems involve issues with AWS load balancers (mostly ELB, one with ALB)
* Two of the post-mortems involve running control plane components dependent on consensus on Amazon's `t2` series nodes

This was pretty surprising to me because I've never run Kubernetes on AWS. I've run it on Azure using acs-engine and more recently AKS since its release, and on Google Cloud Platform using GKE; and it's a good reminder not to run critical code on T series instances because AWS can and will throttle or pause these instances.

hjacobsOP7y ago

Nice observation, I haven't done statistics on the linked postmortems myself yet. Please note that your observation might also be due to the fact that AWS has a far larger market share and did not provide managed Kubernetes until recently (so people roll their own). We can therefore assume that any random sample of Kubernetes postmortems would be biased towards seeing more incidents with Kubernetes on AWS (compared to other cloud providers).

AaronFriel7y ago

That's a good point. In 2017 there weren't widely available managed Kubernetes deployments, and now each platform has their own and much more reliable integrations.

stonewhite7y ago

I managed multiple mesos+marathon clusters in production for a little over 1.5 years, and when I switched over to K8s the only thing that felt like an improvement was the kubectl cli.

I really liked/missed the beauty of simplicity in marathon: everything was a task, the load balancer, autoscaler, app servers, everything. I think it failed because of difficult provisioning, a lack of first-class integrations with cloud vendors, and horrible, horrible documentation.

Kind of sad to see it lost the hype battle, and since then even Mesosphere had to come up with a K8s offering.

dcomp7y ago

I run a single node cluster at home. To handle updates, I just wipe the cluster with kubeadm reset, then run kubeadm init, followed by a simple bash script which loops over files in nested subdirectories, applying yaml configs. I only have to make sure I only ever edit the yaml files and never mess with kubectl edit etc.

for f in /.yaml ...

with a directory structure of:

  drwxrwsrwx+ 1 root 1002 176 Jan 20 21:15 .
  drwxrwsrwx+ 1 root 1002 194 Nov 17 20:06 ..
  drwxrwsrwx+ 1 root 1002  68 Jan 20 20:50 0-pod-network
  drwxrwsrwx+ 1 root 1002 104 Nov  1 11:18 1-cert-manager
  drwxrwsrwx+ 1 root 1002  34 Jul 11  2018 2-ingress
  -rwxrwxrwx+ 1 root 1002  93 Jan 20 21:15 apply-config.sh
  drwxrwsrwx+ 1 root 1002  22 Jul 14  2018 cockpit
  drwxrwsrwx+ 1 root 1002  36 Jul  3  2018 samba
  drwxrwsrwx+ 1 root 1002  76 Jul  6  2018 staticfiles
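The script itself is truncated in the comment, so the following is only a guess at its shape, written in Python rather than bash. It assumes the yaml files sit exactly one directory below the config root and that sorted directory names (0-pod-network, 1-cert-manager, ...) give the intended apply order; the function name is hypothetical. It builds `kubectl apply -f` commands without running them.

```python
from pathlib import Path


def kubectl_apply_commands(root):
    """Build one `kubectl apply -f <file>` command per yaml file.

    Assumption: files live one level below root, and sorting the paths
    puts number-prefixed directories (0-..., 1-...) first.
    """
    files = sorted(Path(root).glob("*/*.yaml"))
    return [["kubectl", "apply", "-f", str(f)] for f in files]
```

Each returned list could be handed to `subprocess.run(cmd, check=True)` after `kubeadm init` completes. Printing the commands first is a cheap way to confirm the order before touching the cluster.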

hjacobsOP7y ago

There is now a Kubernetes podcast episode with me about the topic: https://kubernetespodcast.com/episode/038-kubernetes-failure...
