Our nightmare on Amazon ECS

(appuri.com)

270 points · maslam · 9y ago · 127 comments

98 comments · 19 top-level

cddotdotslash9y ago· 32 in thread

What I don't understand is why AWS squandered this opportunity. Given the popularity of Lambda, they clearly saw the market for completely managed services. They could have designed a platform where users upload containers and AWS runs them. No servers, no crazy settings, etc. Instead, they created this entire platform where you still have to run the entire EC2 infrastructure, there is no service discovery, etc. They essentially created a half baked Mesos or Kubernetes clone. I'm still shocked when I hear companies going "all in" on ECS.

derefr9y ago

I think you misunderstand the purpose of AWS (as do most people.) AWS isn't for greenfield projects (or, at least, that's not where they make the majority of their revenue); AWS is literal "virtual infrastructure", in the sense that it's the stuff you hand an ops team to manage. You can take an experienced internal ops team that was managing physical infrastructure for some BigCorp (servers, switches, etc.), and migrate them to AWS instead, and there will be a 1:1 mapping between their problems and AWS's offerings.

ECS isn't for people who wanted 'managed infrastructure' in the "so we don't have to have a dedicated ops-team at all" sense. ECS is for ops-teams who were previously managing e.g. an OpenStack Nova cluster with an LXC-driver backend, and now want the virtual equivalent of that. Just like EC2 is built for ops-teams who were using VSphere or Xen.

dmourati9y ago

I disagree with your assessment. I say this as an ops person, actively migrating workload from on premise physical infrastructure to the public cloud.

EC2 is not built for ops teams at all. It is built for developers. The VSphere/Xen analogy doesn't hold up. Amazon is explicitly abstracting away the physical and software stack that ops teams would build and maintain onsite. AWS gives you back an API and a reference client. Nothing to install, no physical hosts to buy, no network switches to configure. Simply focus on your compute requirements and go from there.

derefr9y ago

Maybe "ops" is an imprecise designation for the organizational role AWS is intended to target, but "developer" is definitely incorrect.

My position is that AWS is designed for a particular usage pattern: one where a large enterprise organization has a particular department that consumes, configures, and manages AWS's virtual infrastructure (just like an ops team consumes, configures, and manages physical infrastructure), and then, in turn, provides an internal IT-services abstraction to the rest of their organization, using AWS as the "backend" for (some or all of) those services.

None of AWS's services are created from the perspective of use by application developers; they don't present application-level interfaces, nor do they expect (unlike PaaS services) that the application can be rewritten to conform to the shape of the infrastructure. The application is taken as a given. IaaS APIs are created such that an internal IT services department can receive a request from an application developer to provision a cluster for their (fixed) application, and can respond to that request by hitting AWS APIs, instead of by paging ops staff.

In smaller organizations, these lines blur under "devops" sensibilities, where the developers effectively do their own IT services. But in organizations where this boundary is clear, AWS lives inside it, not anywhere where developers can see or touch it. AWS's 'idiomatic' use is to be a transparent drop-in backend replacement that the application developers in the company won't even be aware that ops has switched over to—except to notice that there are now likely fewer ops people overall.

jamiesonbecker9y ago

> ... there will be a 1:1 mapping between their problems and AWS's offerings

I don't know if you meant it that way, but I laughed out loud when I read that!

010a9y ago

I disagree completely. How do you reconcile products like Lambda, Dynamo, Elastic Beanstalk, IoT, Lumberyard... They have a laundry list of products which do meet your criteria of being designed for devops people to bring existing infra in, but they have plenty which clearly have app level components that devs would need to interface with. AWS has something for everyone.

dozzie9y ago

> You can take an experienced internal ops team that was managing physical infrastructure for some BigCorp (servers, switches, etc.), and migrate them to AWS instead, and there will be a 1:1 mapping between their problems and AWS's offerings.

Well, you should have said: "you take an experienced internal ops team that was managing physical infrastructure and software running on it, migrate them to AWS, and then still need an experienced ops team, just without managing the hardware". It's not that you migrate to AWS and magically don't need to manage software (or that your programmers magically learn how to do that).

skywhopper9y ago

Two things, I think. One, AWS is putting a huge bet on getting huge Enterprise customers to migrate their datacenters to the cloud. So a lot of the new services in the last couple of years have been centered around making it easier to transition big Fortune 500 companies' infrastructure before MS or Google acquires those companies and the tens of billions of future ongoing guaranteed revenue they represent.

But number two, I get the impression that AWS has some severe underlying technical debt that comes from original architecture decisions from 10 years ago, and that they are pouring a lot of current resources into addressing those problems, and getting customers off the old crusty stuff. For example, there's a major transition coming to make EC2 IDs much longer so they can remain universally unique, but that's a painful transition. Then there's weird stuff like the "classic" non-VPC accounts where security groups are identified by name. And all the services that don't have tags, and the fact that it's easier to operate multiple AWS accounts than to partition permissions within one account. They are working on fixes for all these things, but a lot of them are buried so deep in the architecture that it's taking some time. In a couple of years, when some of the new things on the horizon come to fruition, I think we'll see more aggressive iteration on managed serverless infrastructure options.

maslamOP9y ago

There are lots of cases where AWS builds services that are not revolutionary but purely evolutionary. Consider Amazon Redshift, their data warehouse. You would _think_ Redshift, with its poor elasticity and tight coupling of compute and storage, would suck compared to BigQuery, which is a truly new way of doing things. However, the combination of "it's just Postgres(-ish)" and rapid feature iteration makes it a pretty good system to use. Most importantly, the cost model is very good. TL;DR it really depends on the service.

yeukhon9y ago

I actually find it interesting that you would use Redshift as the example of AWS's evolutionary service offerings, and also cite the cost. Redshift is very expensive to run, even with reserved instances.

maslamOP9y ago

@yeukhon - for the performance you get, Redshift is quite affordable. Keep in mind it's an always-on data warehouse - there's really no way to make it much cheaper than AWS does because they're incurring compute costs. What are you comparing it to? Hadoop?

boulos9y ago

Amusingly RedShift was licensed:

> It is built on top of technology from the massive parallel processing (MPP) data warehouse ParAccel by Actian.

so I think it's hard to draw conclusions from it.

[0] quote is from the Wikipedia entry (https://en.m.wikipedia.org/wiki/Amazon_Redshift)

madeofpalk9y ago

Have you seen ElasticBeanstalk's ECS setup? We went 'all in' on that and it's been fairly successful for us so far. We use a combination of long-lived servers (spin up one stack which lasts for multiple versions/deployments) for our test environments, and then we do blue-green deployments for production, which gives it a completely new stack for every deploy.

We've been running a bunch of services on it for a few months and it's been fine.

idunno2469y ago

We ran fast from Beanstalk to pure ECS. Security groups, RDS, IAM policies: if we were doing all that in Terraform anyway, then launching an ASG/ELB too is pretty trivial. And there were problems. For a failed deploy, they were just "shrug, we don't know what version is deployed". And when there were failures, the Beanstalk interface doesn't really tell you why, and you end up digging through random log files to find the right one. Was it a Beanstalk error? An ECS error? Having another layer that could go wrong just didn't make sense to us.

sheeshkebab9y ago

Second on beanstalk...

I'm not sure why anyone would even bother with raw ECS - just let AWS manage it all using Beanstalk. It certainly still relies on ELB/DNS and whichever env vars for "discovery" stuff, but it beats rolling your own infrastructure management using ECS/CloudFormation etc.

jon-wood9y ago

Beanstalk is great when you're at a scale that can justify one or more instances per application, but raw ECS works better when you've got lots of low traffic services that don't saturate even a single small instance.

santoriv9y ago

+1 for ElasticBeanstalk Docker deployments. I looked at ECS first about a year ago and ran into a lot of the same messes the OP did. ElasticBeanstalk was way more straightforward and the CLI is pretty decent.

One thing we did run into was that very occasionally environment variables wouldn't show up inside deployed containers, so we had to bake all the application secrets into the build process. A little annoying but nothing too terrible. Otherwise everything has been pretty great.

maslamOP9y ago

@madeofpalk, I haven't seen that, actually. I'll look into it.

org789y ago

We had that conversation with them too, and they just show you how every feature you'd get in Kubernetes, ECS can also do in a roundabout way, or will do once another AWS service is updated. But for some reason they are cagey when it comes to solving for "just run this artifact somewhere for me". It's as though they see their model as being a competitive advantage over Kubernetes. *shrug*

codemac9y ago

A good friend told me that he felt that Google Cloud and AWS had "a severe lack of imagination".

Imagine what you could do if you didn't even assume a process model? All app state just resident in memory, but magically persisted? Who needs object storage, re-invent the pointer!

We could have lived in the future, now it seems we're permanently wed to the past.

nikanj9y ago

I heartily recommend this essay http://scholar.harvard.edu/files/mickens/files/thenightwatch... , for gems like "Pointers are real. They’re what the hardware understands. Somebody has to deal with them. You can’t just place a LISP book on top of an x86 chip and hope that the hardware learns about lambda calculus by osmosis."

codemac9y ago

I love James Mickens with all my heart.

All of it.

gtaylor9y ago

> A good friend told me that he felt that Google Cloud and AWS had "a severe lack of imagination".

I don't know about that. Google Container Engine (hosted Kubernetes) is actually pretty awesome and imaginative. It's feeling like GCP's niche is going to have a sizable containerization element. If you browse around their docs, you'll find that GKE/containers have started creeping into the examples for seemingly unrelated services. They're not just dipping a toe in.

More generally, I feel like GCP's container strategy is just leagues ahead of AWS' at this point. While this article was thin on substance, ECS is definitely difficult to set up and maintain. If I'm going to go through all of that trouble, I might as well run my own Kubernetes or Mesos setup and not be locked into ECS.

_asummers9y ago

I know you can deploy Kubernetes on AWS, though I have not tried it myself. If you have tried it, what is it lacking compared to the GCP version?

count9y ago

We're starting to get to the point where these giants can innovate like that.

It wasn't 3 years ago that 'nobody serious' was 'trusting' cloud providers like AWS/GCE with anything important. This is still the very early days, as evidenced by the ridiculous growth numbers being posted YoY.

jacques_chester9y ago

> Imagine what you could do if you didn't even assume a process model?

We have that world, it's called single-process apps. And it's awful from the point of view of security, scalability and disaster recovery.

> All app state just resident in memory, but magically persisted?

You need transactions or this ends unhappily. Some languages truly grok transactional updates to state. Most do not. In the meantime, you've rate limited the entire system to the slowest component.

> We could have lived in the future, now it seems we're permanently wed to the past.

Your friend overlooks that extremely intelligent people have looked into these things. They usually had extremely important disadvantages, which sufficiently offset the advantages that mere momentum kept the majority approaches as the majority approaches.

Your friend also seems to have left it as an exercise for the reader on how one is meant to deal with distributed systems. The short answer is: it is hard, and trying to create the seamless illusion that the network doesn't exist hasn't really panned out.

The speed of light is a cruel limit.

mschuster919y ago

> Imagine what you could do if you didn't even assume a process model? All app state just resident in memory, but magically persisted? Who needs object storage, re-invent the pointer!

Take your usual Java, NodeJS or Ruby payload, enjoy your memory leaks eating up your space.

moosingin3space9y ago

ECR might be the only good thing about ECS, but even that is still clunky!

gtaylor9y ago

Google's GCE isn't much better. CoreOS's Quay is the best that I've seen. Nice UI, and a head start over Docker Hub's image security scanning.

lobster_johnson9y ago

Do you know if Quay (or anyone else) solves the compilation issue?

The issue is that if you compile anything in your Dockerfile, you end up installing the compiler as well as producing unnecessary build artifacts, which will still remain as a layer that must be downloaded even if you uninstall the compiler and clean up after yourself. In other words, a bunch of unnecessary cruft. This applies not just to compiled languages, but to any language (Node.js, Ruby) that relies on a build phase as part of getting dependencies.

The proper fix is to perform the compilation outside of the main container (for example, by starting a throwaway build container that you only use for compilation) and then copy the final artifacts into the final container. But I don't know of any hosted solutions that support that workflow.
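
A minimal sketch of that throwaway-build-container workflow, driven from Python. This is an illustration and not from the thread: the file names `Dockerfile.build` and `Dockerfile`, the artifact path `/out/app`, and the image/container names are all hypothetical. It only uses the standard `docker build`, `docker create`, `docker cp`, and `docker rm` commands, and it prints the plan rather than running it.

```python
import shlex


def build_plan(app, artifact="/out/app"):
    """Return the docker commands for a two-step build.

    Step 1 compiles inside a throwaway builder image; step 2 copies only
    the finished artifact out and builds a slim runtime image around it.
    All names here (Dockerfile.build, /out/app, ...) are assumptions.
    """
    builder_image = f"{app}-builder"
    builder_container = f"{app}-build-tmp"
    return [
        # Build the image that contains the compiler and produces the artifact.
        ["docker", "build", "-f", "Dockerfile.build", "-t", builder_image, "."],
        # Create (but do not start) a container so files can be copied out.
        ["docker", "create", "--name", builder_container, builder_image],
        # Copy the compiled artifact from the container to the host.
        ["docker", "cp", f"{builder_container}:{artifact}", "./app"],
        # Throw the builder container away.
        ["docker", "rm", builder_container],
        # Build the final image; its Dockerfile only needs to COPY ./app.
        ["docker", "build", "-f", "Dockerfile", "-t", app, "."],
    ]


if __name__ == "__main__":
    for cmd in build_plan("myservice"):
        print(shlex.join(cmd))
```

To actually execute the plan, each command could be passed to `subprocess.run(cmd, check=True)` on a machine with Docker installed.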

gtaylor9y ago

Whoops, I'd edit this but apparently my mobile app doesn't support such a thing. GCE in the parent should be GCR (Google Container Registry).

SEJeff9y ago

Have you by chance seen Red Hat's OpenShift? It has some nice features built with Kubernetes as the core.

010a9y ago

And ECR is pretty much just hosted Docker Registry. Try using it versus something like GCR or just Dockerhub and even it starts feeling antiquated.

tjholowaychuk9y ago· 11 in thread

I do think they need to put more effort into CLIs etc., instead of relying on OSS to fulfill this niche, or at the very least put more effort into supporting OSS.

Lambda is similar, we have 'Serverless' and I'm hacking on Apex (https://github.com/apex/apex) just to make it usable. I get that they want to create building blocks, but at the end of the day consumers just want something that works, you can still have building blocks AND provide real workable solutions.

I was part of the team migrating Segment's infra to ECS, and for us at least it went pretty well, some issues with agents disconnecting etc I sort of wrote off since ECS was so new at the time.

Another annoying thing not mentioned in the article is that the default AMI used for ECS is not at all production ready, you really have to bake your own images if you want something usable. I suppose this is maybe because there's subjectively no "good" defaults, I'm not sure, but it's a bit of a pain.

ELB for service discovery is fine if you can afford it, I had no issues with that, ELB + DNS keeps things very simple. I'm not a huge fan of all these complex discovery mechanisms, in most cases I think they're completely unnecessary unless you're just looking to complicate your infrastructure.

I also think that in many cases not propagating global config (env) changes is a good thing, depending on your taste. Scoping to the container gives you nice isolation and more flexibility if you need to redirect a few to a new database, for example. You don't have to ask yourself "shit, which containers use this?". It's much like using flags in the CLI: if we _all_ used environment variables in place of every flag, it would be a complete mess.

EDIT: I forgot to mention that the ELB zero-downtime stuff was awesome, if you try and re-invent that with haproxy etc, then... that's unfortunate haha. No one should have to implement such a critical thing.

rdtsc9y ago

> instead of relying on OSS to fulfill this niche, or at very least put more effort into supporting OSS.

From what I hear from people working there, OSS is king, but there is also little contribution back to OSS, so that fits with what you mentioned.

(But I only know about a few AWS services, maybe it is different for others).

nathanboktae9y ago

> I also think in many cases not propagating global config (env) changes. ... You don't have to ask your-self "shit, which containers use this?",

I agree, and we (dev lead at Appuri here) achieve the best of both worlds with Kube: in the secrets section of a deployment definition, we specify which secrets we need, but not their values. So we know which services need a secret, and it's updated in one place. That's just for the secret store, though we could also put non-secrets in a secret to use the same mechanism.

cpitman9y ago

Have you looked at ConfigMaps? They're a newer feature of Kubernetes that is meant for storing non-secrets, but in general works pretty similarly to how secrets work (create config map, mount in container).

velkyk9y ago

Sure, we use configMaps, secrets, mounting EBS, you name it. Implementing k8s felt like Jack in Titanic getting into first class :). Nice to know they won't lock you up when the boat is going to sink :).

configMaps is nice, but we use it in a limited way because it's so much easier to update pods by editing env vars. Note: we are using deployments, so if you need to change an env var, you run `kubectl edit deployment <name>`, edit/save/close the file that opens in your $EDITOR, and watch the magic happen.

nathanboktae9y ago

Oh I'll have to check it out, thanks!

maslamOP9y ago

>Another annoying thing not mentioned in the article is that the default AMI used for ECS is not at all production ready, you really have to bake your own images if you want something usable. I suppose this is maybe because there's subjectively no "good" defaults, I'm not sure, but it's a bit of a pain.

We ran into this as well - I forgot to add this to the post. The Amazon Linux AMI for ECS has _very specific defaults_ that need tweaking.

velocitypsycho9y ago

Could you elaborate more on the issues with the default AMI?

shinzui9y ago

Could you elaborate on the problem with the default AMI?

tjholowaychuk9y ago

Typical stuff like fd limits, network configuration, etc. With a light load it would be fine; it's just a shame that you can't boot up an ECS instance and know it'll scale with you out of the box.

That said this does fit the rest of their services involving EC2, so I guess it's not much different there, but as a consumer I just want the thing to work.

scrollaway9y ago

Apex looks great, can you talk a little more about it?

Lambda doesn't currently support Python 3 (only Python 2.7) and that has been a massive pain in the arse to deal with. I've heard it's possible to get Python 3 working on it by shipping a custom executable and serializing/deserializing state but I figure it's a fairly significant performance hit.

xrjn9y ago

How would one ship this custom executable with python 3? I have played around with aws lambda and zappa, this was a major frustration for me.

maslamOP9y ago· 8 in thread

HN, I'm a co-founder at Appuri. Happy to answer questions! PS: We LOVE most AWS services like Amazon Redshift. Just not ECS ;)

robbles9y ago

Did you run ECS on a custom AMI, or use the stock one?

We've been running with vanilla EB + ECS for months and haven't seen this at all.

From an outside perspective, it sounds like the primary issue you referenced here (the agent disconnecting) could have been due to a mismatch in configuration between the agent and docker, or maybe just a permissions issue. IIRC, the ECS agent tries to clean up containers every few hours so perhaps not being configured correctly caused it to get stuck?

tmacie9y ago

We used the stock AMI. I would not be surprised if we had a configuration issue, but we spent a lot of time trying to debug it and were never able to find the root cause of the issue.

dstroot9y ago

Did you deploy K8S on AWS? If so can you add any details about how? Or are you using K8S elsewhere? I love AWS but planning on spinning up on GCE this weekend to play with K8S.

velkyk9y ago

I am ops at Appuri. We deployed k8s on AWS since we are using other services like Redshift and RDS inside the VPC. We're also happy with how EC2 works, plus of course we have Reserved Instances, so we haven't looked into GCP yet. We're running kube on CentOS 7; we bootstrap nodes using cloud-init (user-data) to set up k8s, which we then use to run everything else. I would love to give you more details, and I might write a blog post about our kube setup decisions later. Kelsey wrote a nice manual for setting up k8s - https://github.com/kelseyhightower/kubernetes-the-hard-way which is definitely on my to-read list this weekend :)

lobster_johnson9y ago

The kube-up.sh script can start a complete Kubernetes cluster on AWS in one fell swoop. It's pretty smooth.

tmacie9y ago

We deployed K8S on AWS (I'm a dev at Appuri). Like Bilal mentioned we run pretty much everything on AWS, so it was an easy decision.

ntumlin9y ago

Off topic from the article but just wanted to let you know that I love the design of your blog.

maslamOP9y ago

Thank you!

cyberferret9y ago· 6 in thread

Excuse me while I pick myself up off the floor when I read "leaks environment variables"... What?? That is incredibly scary for us, as we just went through an audit process of our code on about 6 different web apps to ensure that all secrets were placed in environment variables on our Elastic Beanstalk configs and not in the main codebase... If this now results in LESS security (all our code is in private Git repositories) than before, then we have essentially taken a step backwards!

dorfsmay9y ago

GitHub private repos have been made public by mistake before. Git repos are cloned on dev laptops; do you enforce laptop encryption?

The right thing to do is to use some form of a vault.

cyberferret9y ago

We use BitBucket here, rather than Github - similar risks, I know, but we have predetermined repositories which are all set as private. 3 dev machines which are kept on premises at all times.

Still not optimal as far as security goes, but it seems that we have roughly the same exposure if AWS leaks our keys and passwords to other third party trackers...

fletchowns9y ago

Be careful when modifying user access to a private BitBucket repository. Their autosuggest for the username input field will show all bitbucket users. Makes it incredibly easy to accidentally grant somebody outside of your organization access to a repository.

Ixiaus9y ago

Use kms and dynamodb with key enveloping, or this tool: https://github.com/fugue/credstash

Don't initialize into env vars and don't store in repos, even private ones.

fapjacks9y ago

This is really the best suggestion. I've been using Hashicorp's Vault for storing sensitive configuration information and can't recommend it enough. The rest of Hashicorp's stack of software for Docker is also extremely nice.

StreamBright9y ago

The best audit-proof solution to this is using an encrypted database where you can securely push down secrets that you share only with certain users on particular nodes. This is what we used earlier at a bigger company to pass all audits. The ability to read secrets is tied to a Unix user and a fleet (identified by IP ranges, for example). The secrets are encrypted and the nodes have the key to decrypt them. Keys are distributed securely at provisioning time. I am not aware of any open-source platform that would do this for you out of the box, though.

dperfect9y ago· 5 in thread

> ECS doesn't have a way to pass configuration to services

I believe this is the recommended way:

ECS container instances automatically get assigned an IAM role[1], with credentials accessible via instance metadata (169.254.169.254) [2]. Containers can access that metadata too. The AWS SDK automatically checks that metadata and configures itself with those credentials, so all you have to do is give your IAM role access to a private S3 bucket with configuration data and load that configuration when booting up your app.

That way there's no need to copy/paste variables, and no leaking secrets in ENV variables. You do have to be careful though (as with any EC2 instance) not to allow outside access to that instance metadata endpoint, e.g., in a service that proxies requests to user-defined hosts on the network (but if you're doing that, you've got a lot more to worry about anyway).

[1] http://docs.aws.amazon.com/AmazonECS/latest/developerguide/i...

[2] http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles...
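
A minimal sketch of that approach in Python, added for illustration and not taken from the thread. It assumes the configuration is a JSON document in S3; the function names `load_config` and `default_client` are hypothetical. `boto3.client("s3")` and `get_object(Bucket=..., Key=...)` are the standard boto3 calls, and with no explicit keys boto3 falls back to its credential chain, which includes the instance role credentials from the metadata endpoint.

```python
import json


def load_config(s3_client, bucket, key):
    """Fetch a JSON config document from S3 and return it as a dict.

    `s3_client` is anything with a boto3-style get_object method, which
    keeps this function testable without AWS access.
    """
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return json.loads(response["Body"].read())


def default_client():
    """Build a real S3 client (requires boto3 to be installed).

    No access keys are passed: on an ECS container instance the SDK is
    expected to pick up the IAM role credentials on its own.
    """
    import boto3  # imported lazily so load_config works without boto3

    return boto3.client("s3")


# Example use at app boot (bucket and key are placeholder names):
#   config = load_config(default_client(), "my-private-config", "app/prod.json")
```

Passing the client in as a parameter also limits how much of the app touches the AWS SDK directly.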

embiggen9y ago

One reason I am hesitant to go this route is because I don't want to hard-code Amazon's APIs into my apps.

dperfect9y ago

I understand the reluctance to add extra dependencies (especially environment-specific ones), but in the case of a typical Ruby app, it amounts to the 'aws-sdk' gem and 1 or 2 lines in an initializer.

For my own purposes, I weighed that against the alternatives[1], and it seems like a fairly reasonable compromise[2]. That won't be the case for everyone, obviously.

[1] http://elasticcompute.io/2016/01/21/runtime-secrets-with-doc...

[2] I'm referring specifically to passing secrets (or other static values) into a container, since that seems to be what the author was talking about. For configuration requiring more complexity, of course other tools are probably more appropriate. In that case, it's outside the scope of what I would reasonably expect ECS to do.

dozzie9y ago

You need to look into the sysadmin's toolbelt, then. Sysadmins have used configuration management systems (CFEngine, for example) for quite a long time now. The only thing you need is to put in a post-create script (however it is called in AWS), which would install and deploy such a system and let the system configure the machine.

I know it's not sexy for developers to take advice from sysadmins, but at the end of the day, it gets the job done reliably and elegantly.

moca9y ago

Have you considered using a centralized configuration store (such as S3 or anything else) with access control and an audit trail? That makes it easier to update configs without restarting all the servers.

velkyk9y ago

This was a no-go for us, since most of our apps are minimal golang images. IMO it is just a good example of bad design :)

hosh9y ago· 4 in thread

I put something into production with ECS as well. I ran into the same missing components too -- lack of service discovery, and such. Kubernetes works a lot better. As it stands right now, I wouldn't take a gig involving putting ECS into production.

Now if ECS 2.0 was really AWS hosted Kubernetes, I would be very interested in hearing about that...

tantalic9y ago

Google Container Engine (GKE) is certainly the easiest way to set up a Kubernetes cluster. We have been running it for a couple of months now and couldn't be happier. If you want to stick with AWS, I have always heard great things about the work CoreOS has been doing in this space: https://github.com/coreos/coreos-kubernetes.

alex-mohr9y ago

It's great to hear GKE is meeting your needs so well! (Yes, I work on it.)

For 1.4, the Kubernetes Cluster Lifecycle and Ops SIGs are working on making the install and setup process much easier, including on AWS [1]. That won't magically turn it into Kubernetes as a Service, of course, but we hope it'll help users on other platforms.

[1]: https://github.com/kubernetes/community/blob/master/sig-clus...

hosh9y ago

That is what I keep hearing. With PetSets rolled out in 1.3, GKE is getting more competitive. At my current job (startup), we're probably going to move towards that.

moondev9y ago

That's exactly what GKE is on GCP and I love it.

huslage9y ago· 4 in thread

Environment Variables are NEVER private. Please don't think that you can hide information in there as all of that information ends up in the process table which is public across the entire machine.

zeroxfe9y ago

Wait, what? How do environment variables end up in the public part of process table? There's no way one user can peek into the environment of another (without permissions, notwithstanding bugs) -- that's part of the design of Unix systems.

nathanboktae9y ago

And the machine (vm) runs in a private VPC. So it's private.

nnutter9y ago

How are they "public" across the entire machine?

phil219y ago

I suppose it depends on your definition of public.

Any environment for a process will be accessible via /proc/<pid>/environ on a Linux system. Of course other users cannot read these files; however, in the case of something like a Docker image, all processes likely share a username and this could be a risk (especially for a public webapp that may one day leak information or allow remote command execution).

At least that's my immediate take on it.
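
To make the /proc point concrete, here is a small Python sketch, added for illustration and not from the thread. The function names are hypothetical. It relies on the documented Linux format of /proc/<pid>/environ: NUL-separated `KEY=VALUE` entries holding the environment the process was started with. Reading another process's file typically succeeds only for the same user or root.

```python
def parse_environ(raw):
    """Parse the NUL-separated KEY=VALUE blob found in /proc/<pid>/environ."""
    env = {}
    for entry in raw.split(b"\0"):
        if b"=" in entry:
            # Split on the first "=" only; values may contain "=" themselves.
            key, _, value = entry.partition(b"=")
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def read_environ(pid="self"):
    """Read the initial environment of a process (Linux only).

    Raises PermissionError when the caller is not allowed to inspect
    the target process.
    """
    with open(f"/proc/{pid}/environ", "rb") as f:
        return parse_environ(f.read())
```

Inside a container where every process runs as the same user, `read_environ(<pid>)` from a compromised process would likely expose the secrets of its neighbours, which is the risk described above.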

graffitici9y ago· 4 in thread

Does anybody have insights about using Docker Swarm? I imagine Kubernetes has been battle-tested way more in production, especially by the likes of Google. But from what I understand, Docker is really pushing Swarm. I'd be curious to hear if others even considered Swarm before choosing K8s.

lobster_johnson9y ago

There's not really any comparison. Docker is clearly beefing up Docker/Swarm to be more like Kubernetes, but in its current state, Swarm is just a glorified Docker Compose.

For example, it does not handle services (K8s can automatically provision a load balancer against all your containers), there's no volume handling, no centralized logging, no label-based targeting, it has very limited scheduling (K8s uses cAdvisor to help scheduling, can automatically ensure that pods are spread out across multiple AZs, etc.), etc.

It'll be interesting to see what happens as Docker starts pushing into Kubernetes' space. Given the multiple points of overlap/contention between K8s and Docker (you have to disable Docker's built-in networking and iptables management; Kubelet has to continually monitor Docker for orphaned containers and volumes and so on; etc.) I wouldn't be surprised if Google one day decides to eliminate the Docker daemon as a dependency entirely, by writing a bare-bones container engine into Kubelet.

nakagi9y ago

Really? I also think Docker Swarm Mode is still behind K8S, but as far as I can tell from the docs, they support:

- load balancing between containers
- volume handling https://docs.docker.com/engine/tutorials/dockervolumes/
- label-based constraints

I know some features are not so sophisticated compared with K8S and there is no AZ awareness, but Swarm may try to catch up with it.

lobster_johnson9y ago

I recommend looking into the Kubernetes design to understand how different its design is.

A good example is volume management. With Kubernetes, you can tell a pod to use an AWS EBS volume; when the pod needs the volume, Kubernetes will automagically mount it, and handle the state management for you.

If you define what's called a persistent volume, your pod can declare that it needs, say, 1GB, and Kubernetes will automatically allocate 1GB from the volume; you can have lots of pods working off this shared volume, and Kubernetes will know which pods have "claimed" which parts of the volume.
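As a rough illustration of the claim mechanism, the Python sketch below builds the two manifests involved: a PersistentVolumeClaim asking for storage, and a pod that mounts it. The object names, the image, and the mount path are hypothetical. The field names follow the core v1 API as I understand it, and the dictionaries can be dumped with `json.dumps` since kubectl accepts JSON manifests as well as YAML.

```python
def persistent_volume_claim(name: str, size: str) -> dict:
    """Build a PersistentVolumeClaim manifest requesting `size` of storage."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
        },
    }


def pod_using_claim(name: str, image: str, claim: str, mount_path: str) -> dict:
    """Build a Pod manifest that mounts the named claim at `mount_path`."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{
                "name": name,
                "image": image,
                # The pod only names the claim; it never mentions EBS or any backend.
                "volumeMounts": [{"name": "data", "mountPath": mount_path}],
            }],
            "volumes": [{
                "name": "data",
                "persistentVolumeClaim": {"claimName": claim},
            }],
        },
    }
```

For example, `persistent_volume_claim("app-data", "1Gi")` plus `pod_using_claim("app", "example/app:1.0", "app-data", "/var/lib/app")` describes a pod that needs 1Gi of storage without saying where it comes from.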

Another good example is config and secrets. In Kubernetes, you declaratively create configuration objects ("configmaps") and secrets. If a pod needs, say, access to an external API, you can store the keys in a secret and declaratively give the pod access to the secret, which will be mounted into a folder (or, alternatively, assigned to an environment variable, though that's not as secure).
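A small Python sketch of the secret-as-folder approach, under the assumption that the manifests are built as plain dictionaries and submitted as JSON. The secret name, key, image, and mount path are made up for illustration. Values in a Secret's `data` field are base64-encoded, which is an encoding and not encryption.

```python
import base64


def secret_manifest(name: str, values: dict[str, str]) -> dict:
    """Build an Opaque Secret; the API expects base64-encoded values in `data`."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "data": {
            key: base64.b64encode(value.encode()).decode()
            for key, value in values.items()
        },
    }


def pod_with_secret_volume(name: str, image: str, secret: str, mount_path: str) -> dict:
    """Build a Pod that sees each secret key as a file under `mount_path`."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{
                "name": name,
                "image": image,
                "volumeMounts": [
                    {"name": "creds", "mountPath": mount_path, "readOnly": True}
                ],
            }],
            "volumes": [{"name": "creds", "secret": {"secretName": secret}}],
        },
    }
```

With this layout the application reads, say, `/etc/creds/api-key` from disk, so the key never appears in the process environment.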

Yet another example is service management. You can tag a service (which is another type declaration that says "port X on some unique cluster IP should be routed to every pod tagged with these labels") as load-balanced, and if you're running in a cloud environment (AWS, GCE, etc.), K8s can automatically create an external load balancer for you that exposes the service publicly.
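The service declaration described here can be sketched in Python as follows. The service name, labels, and ports are hypothetical, and `selects` is only a local illustration of equality-based label matching, not a Kubernetes API call.

```python
def load_balanced_service(name: str, selector: dict[str, str],
                          port: int, target_port: int) -> dict:
    """Build a Service that asks the cloud provider for an external load balancer."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "type": "LoadBalancer",
            "selector": dict(selector),
            "ports": [{"port": port, "targetPort": target_port, "protocol": "TCP"}],
        },
    }


def selects(service: dict, pod_labels: dict[str, str]) -> bool:
    """Return True if every selector key/value pair is present on the pod."""
    selector = service["spec"]["selector"]
    return all(pod_labels.get(key) == value for key, value in selector.items())
```

A pod labelled `{"app": "web", "tier": "frontend"}` is matched by a service selecting `{"app": "web"}`, while a pod labelled `{"app": "worker"}` is not, so traffic follows labels and not individual container addresses.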

Kubernetes is best described as a sophisticated state machine that takes declarative objects ("manifests") that describe your world — i.e. which containers should be running, which services should be exposed, etc. — and then attempts to continuously reconcile reality with that declaration, managing all sorts of state in the process.
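The reconcile idea can be shown with a toy Python loop. This is a deliberately simplified sketch, not how Kubernetes controllers are implemented: desired state is a map from a workload name to a replica count, and actual state is just a list of what is running.

```python
from collections import Counter


def reconcile(desired: dict[str, int], running: list[str]) -> list[tuple[str, str]]:
    """Return the actions needed to move `running` toward `desired`."""
    actual = Counter(running)
    actions: list[tuple[str, str]] = []
    for name in sorted(set(desired) | set(actual)):
        diff = desired.get(name, 0) - actual.get(name, 0)
        if diff > 0:
            # Too few instances: start the missing ones.
            actions += [("start", name)] * diff
        elif diff < 0:
            # Too many (or no longer declared): stop the surplus.
            actions += [("stop", name)] * -diff
    return actions
```

Run repeatedly, the loop returns an empty list once reality matches the declaration, and produces new actions whenever a container dies or the declaration changes.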

Perhaps most important is the ability to abstract resources from pods. A pod just declares the image to run and the resources — volumes, configs, secrets, CPU/memory constraints, etc. — to make available to it. K8s's state machinery takes care of the rest.

As far as I know, Docker Swarm has none of this, and you'd have to build these things (e.g. REXRay for volumes) on top of Swarm yourself.

1 more reply

smarterclayton9y ago

Not really Google driving this, but there is active work on integrating the OCI runtime (the "standard" evolution of libcontainer from Docker, used in Docker 1.12) as a container runtime to Kube. The desire is to reduce some of the overlap between container daemons on the nodes, but also support a wider array of container runtime setups (being able to run VMs via hyper, rkt containers, OCI containers, Docker containers, etc). Each of those mentioned technologies is being sponsored by different parts of the Kubernetes community, but the goal is to have more power and flexibility at runtime. Docker will continue to be a primary part of the story.

nzoschke9y ago· 1 in thread

Thanks for the shoutout to Convox! I'm on the core team.

I understand these challenges. I wrote about a lot of them here:

https://convox.com/blog/ecs-challenges/

But we have been having tons of success on ECS both for our own stuff and for hundreds of users.

I see the agent disconnection problem too. Convox automatically marks those as unhealthy and the ASG replaces them.

It's happening more than I'd like but I'm seeing little to no service disruption. One of the root causes is the docker daemon hanging.

Glad Kubernetes is working well for you. Many roads lead to success as the cloud matures.

maslamOP9y ago

That's a great blog post. Thanks for sharing!

rjurney9y ago· 1 in thread

Running DCOS (data center operating system) on AWS is a snap, and solves all these problems. It makes running docker images a no-brainer compared to all other solutions, and this includes docker images that interact with one another (not just 100 apache servers or whatever). It is the best software I have ever used, hands down. It is the second coming of zeus buddha jesus belly. It makes scaling anything in the cloud easy. No, I do not work there. No, I am not exaggerating. Yes, I spent a month fucking with swarm and service discovery before deploying a large cluster of my service in two days on DCOS.

Docker is stuck in the 'one image on one machine' mindset. DCOS is taking over at the higher levels of the stack. Mark my words.

https://dcos.io/

maslamOP9y ago

@rjurney - we started with ECS right around when DCOS was coming out of alpha (?). Anyway, it looks slick!

advisedwang9y ago· 1 in thread

[off topic] Author, if you are reading this be aware that when viewed in a narrow browser window the sharing icons overlap the text, even though 40% of the screen is taken up by the right hand sidebar/empty space.

maslamOP9y ago

Thanks @advisedwang. We're looking into it.

justicezyx9y ago· 1 in thread

“No central config. ECS doesn't have a way to pass configuration to services (i.e. Docker containers) other than with environment variables. Great, how do I pass the same environment variable to every service?”

Would packaging the configurations together with the Docker image make more sense? That enables more hermetic deployment.

velkyk9y ago

Do you mean hard-coding configs into the Docker image? I wouldn't support this; IMO this is the worst-case setup :)

Imagine you need to change a single config value. For this you would need to update the image, build, push, and redeploy, which can take some time depending on your deployment.

With k8s you only run `kubectl edit configmaps <name>`, restart the pods that are using it, and you are done.

Also, no need to create per-stage images...
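For readers who have not used configmaps, here is a minimal Python sketch of the objects involved. The configmap name, keys, and variable names are hypothetical. One shared ConfigMap holds the values, and each container's `env` list points at keys in it, so the same setting reaches every service from a single place.

```python
def config_map(name: str, data: dict[str, str]) -> dict:
    """Build a ConfigMap holding plain-text configuration values."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name},
        "data": dict(data),
    }


def env_from_config_map(config_map_name: str, mapping: dict[str, str]) -> list[dict]:
    """Build a container `env` list; `mapping` is ENV_VAR_NAME -> configmap key."""
    return [
        {
            "name": env_name,
            "valueFrom": {
                "configMapKeyRef": {"name": config_map_name, "key": key}
            },
        }
        for env_name, key in mapping.items()
    ]
```

Values injected as environment variables are read when the container starts, which is why the pods need a restart after the edit.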

SteveWatson9y ago· 1 in thread

Article text is obscured by icons.

maslamOP9y ago

@SteveWatson - thanks for reporting, should be fixed now.

cxmcc9y ago

Our experience with ECS (at Instacart) has not been the best, but we managed to get it to work.

Here is how we get around the issues mentioned in the article:

* Service discovery: built our own with RabbitMQ (we used that before ECS anyway).

* Configs: pass an S3 tarball URL as an environment variable and download it in the containers.

* CLI: built our own with the help of CloudFormation.

* Agent disconnecting: we did not see a situation where all agents disconnected. We use a large pool of instances, and there was never an issue starting containers because of agents.

In addition to these, we also do the following to make ECS work as we want it to:

* built our blue-green deploy solution (structure provided by ECS is very limited)

* built our own solution to integrate with ELB (ELB allows only one port per ELB)
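The "Configs" workaround above could look roughly like the Python sketch below. It is not Instacart's code: the variable name `CONFIG_TARBALL_URL` is an assumption, and the URL is assumed to be fetchable without extra credentials (for example a pre-signed S3 URL). A container entrypoint would call this before starting the app.

```python
import io
import os
import tarfile
import urllib.request
from pathlib import Path


def fetch_config(dest: str, env_var: str = "CONFIG_TARBALL_URL") -> list[str]:
    """Download the tarball named by `env_var` and unpack it under `dest`.

    Returns the sorted names of the regular files in the archive.
    """
    url = os.environ[env_var]
    with urllib.request.urlopen(url) as response:
        payload = response.read()

    dest_path = Path(dest).resolve()
    dest_path.mkdir(parents=True, exist_ok=True)
    with tarfile.open(fileobj=io.BytesIO(payload), mode="r:*") as archive:
        members = archive.getmembers()
        for member in members:
            target = (dest_path / member.name).resolve()
            # Refuse entries whose path would land outside the destination.
            if target != dest_path and dest_path not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        archive.extractall(dest_path)
    return sorted(m.name for m in members if m.isfile())
```

The path check only covers entry names; a hardened version would also reject symlinks and device files in the archive.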

jbaviat9y ago

We have been running Sqreen production on ECS since October 2015, and we have been pretty happy with it the whole time. Of course ECS was very minimal at the beginning, but then many things improved, allowing for easier deploys, easier logging, and finally easier auto-scaling. When ECR (the AWS managed registry) was added to our region, it was quite a party @Sqreen :) I see no point in leaving it for something else today.

A remaining issue is that you cannot spawn two containers speaking to a given ELB (AWS load balancer) on the same host if they need to bind the same port.

Ixiaus9y ago

Probably want to use a secret management tool and just not initialize into environment variables...

https://github.com/fugue/credstash

x0rg9y ago

I hope the future of cloud will really be managed OSS as a service. Google is doing a great job with Kubernetes and GKE and I hope the other providers will understand that. Microsoft is on the right track with DCOS as a service; Amazon is just not there yet.

siliconc0w9y ago

We evaluated ECS and Beanstalk but ended up writing a tool around building CoreOS/Fleet clusters (not currently open source, but I'm trying).

We ran into similar complaints. CoreOS comes with etcd, which, though initially unstable, is now solid and incredibly handy for service discovery and configuration. We're using https://github.com/MonsantoCo/etcd-aws-cluster to configure it dynamically. We use etcd+confd to drive nginx containers for routing. All in all it works well. Our biggest problems are related to Docker bugs, and those we can generally handle by just terminating the node and letting autoscaling heal the cluster.

pbkhrv9y ago

We switched from ECS to Docker Cloud and never looked back.
