ECS isn't for people who wanted 'managed infrastructure' in the "so we don't have to have a dedicated ops team at all" sense. ECS is for ops teams who were previously managing, e.g., an OpenStack Nova cluster with an LXC-driver backend, and now want the virtual equivalent of that, just as EC2 is built for ops teams who were using vSphere or Xen.
EC2 is not built for ops teams at all. It is built for developers. The vSphere/Xen analogy doesn't hold up. Amazon is explicitly abstracting away the physical and software stack that ops teams would build and maintain onsite. AWS gives you back an API and a reference client. Nothing to install, no physical hosts to buy, no network switches to configure. Simply focus on your compute requirements and go from there.
My position is that AWS is designed for a particular usage pattern: one where a large enterprise organization has a particular department that consumes, configures, and manages AWS's virtual infrastructure (just like an ops team consumes, configures, and manages physical infrastructure), and then, in turn, provides an internal IT-services abstraction to the rest of the organization, using AWS as the "backend" for (some or all of) those services.
None of AWS's services are created from the perspective of use by application developers; they don't present application-level interfaces, nor do they expect (unlike PaaS services) that the application can be rewritten to conform to the shape of the infrastructure. The application is taken as a given. IaaS APIs are created such that an internal IT services department can receive a request from an application developer to provision a cluster for their (fixed) application, and can respond to that request by hitting AWS APIs, instead of by paging ops staff.
In smaller organizations, these lines blur under "devops" sensibilities, where the developers effectively do their own IT services. But in organizations where this boundary is clear, AWS lives inside it, not anywhere developers can see or touch it. AWS's 'idiomatic' use is as a transparent drop-in backend replacement: the company's application developers won't even be aware that ops has switched over, except to notice that there are now likely fewer ops people overall.
I don't know if you meant it that way, but I laughed out loud when I read that!
Well, you should have said: "you take an experienced internal ops team that was managing physical infrastructure and software running on it, migrate them to AWS, and then still need an experienced ops team, just without managing the hardware". It's not that you migrate to AWS and magically don't need to manage software (or that your programmers magically learn how to do that).
But number two, I get the impression that AWS has some severe underlying technical debt that comes from original architecture decisions from 10 years ago, and that they are pouring a lot of current resources into addressing those problems and getting customers off the old crusty stuff. For example, there's a major transition coming to make EC2 IDs much longer so they can remain universally unique, but that's a painful transition. Then there's weird stuff like the "classic" non-VPC accounts where security groups are identified by name. And all the services that don't have tags, and the fact that it's easier to operate multiple AWS accounts than to partition permissions within one account. They are working on fixes for all these things, but a lot of them are buried so deep in the architecture that it's taking some time. In a couple of years, when some of the new things on the horizon come to fruition, I think we'll see more aggressive iteration on managed serverless infrastructure options.
> It is built on top of technology from the massive parallel processing (MPP) data warehouse ParAccel by Actian.
so I think it's hard to draw conclusions from it.
[0] quote is from the Wikipedia entry (https://en.m.wikipedia.org/wiki/Amazon_Redshift)
We've been running a bunch of services on it for months and it's been fine.
I'm not sure why anyone would bother with raw ECS; just let AWS manage it all using Beanstalk. It still relies on ELB/DNS and whichever env vars for "discovery" stuff, but it beats rolling your own infrastructure management using ECS/CloudFormation etc.
One thing we did run into was that very occasionally environment variables wouldn't show up inside deployed containers, so we had to bake all the application secrets into the build process. A little annoying but nothing too terrible. Otherwise everything has been pretty great.
Imagine what you could do if you didn't even assume a process model? All app state just resident in memory, but magically persisted? Who needs object storage, re-invent the pointer!
We could have lived in the future, now it seems we're permanently wed to the past.
All of it.
I don't know about that. Google Container Engine (hosted Kubernetes) is actually pretty awesome and imaginative. It's feeling like GCP's niche is going to have a sizable containerization element. If you browse around their docs, you'll find that GKE/containers have started creeping into the examples for seemingly unrelated services. They're not just dipping a toe in.
More generally, I feel like GCP's container strategy is just leagues ahead of AWS' at this point. While this article was thin on substance, ECS is definitely difficult to set up and maintain. If I'm going to go through all of that trouble, I might as well run my own Kubernetes or Mesos setup and not be locked into ECS.
It wasn't 3 years ago that 'nobody serious' was 'trusting' cloud providers like AWS/GCE with anything important. This is still the very early days, as evidenced by the ridiculous growth numbers being posted YoY.
We have that world, it's called single-process apps. And it's awful from the point of view of security, scalability and disaster recovery.
> All app state just resident in memory, but magically persisted?
You need transactions or this ends unhappily. Some languages truly grok transactional updates to state. Most do not. In the meantime, you've rate limited the entire system to the slowest component.
> We could have lived in the future, now it seems we're permanently wed to the past.
Your friend overlooks that extremely intelligent people have looked into these things. Those approaches usually had serious disadvantages, enough to offset their advantages, so momentum alone kept the majority approaches in the majority.
Your friend also seems to have left it as an exercise for the reader on how one is meant to deal with distributed systems. The short answer is: it is hard, and trying to create the seamless illusion that the network doesn't exist hasn't really panned out.
The speed of light is a cruel limit.
Take your usual Java, Node.js or Ruby payload and enjoy watching your memory leaks eat up your space.
The issue is that if you compile anything in your Dockerfile, you end up installing the compiler as well as producing unnecessary build artifacts, which still remain as a layer that must be downloaded even if you uninstall the compiler and clean up after yourself. In other words, a bunch of unnecessary cruft. This applies not just to compiled languages, but to any language (Node.js, Ruby) that relies on a build phase as part of getting dependencies.
The proper fix is to perform the compilation outside of the main container (for example, by starting a throwaway build container that you only use for compilation) and then copy the final artifacts into the final container. But I don't know of any hosted solutions that support that workflow.
Lambda is similar, we have 'Serverless' and I'm hacking on Apex (https://github.com/apex/apex) just to make it usable. I get that they want to create building blocks, but at the end of the day consumers just want something that works, you can still have building blocks AND provide real workable solutions.
I was part of the team migrating Segment's infra to ECS, and for us at least it went pretty well. There were some issues with agents disconnecting etc., which I sort of wrote off since ECS was so new at the time.
Another annoying thing not mentioned in the article is that the default AMI used for ECS is not at all production ready, you really have to bake your own images if you want something usable. I suppose this is maybe because there's subjectively no "good" defaults, I'm not sure, but it's a bit of a pain.
ELB for service discovery is fine if you can afford it, I had no issues with that, ELB + DNS keeps things very simple. I'm not a huge fan of all these complex discovery mechanisms, in most cases I think they're completely unnecessary unless you're just looking to complicate your infrastructure.
I also think that in many cases not propagating global config (env) changes is a good thing, depending on your taste. Scoping to the container gives you nice isolation and more flexibility if you need to redirect a few containers to a new database, for example. You don't have to ask yourself "shit, which containers use this?" It's much like using flags in the CLI: if we _all_ used environment variables in place of every flag, it would be a complete mess.
EDIT: I forgot to mention that the ELB zero-downtime stuff was awesome, if you try and re-invent that with haproxy etc, then... that's unfortunate haha. No one should have to implement such a critical thing.
From what I hear from people working there, OSS is king, but there is also little contribution back to OSS, so that fits with what you mentioned.
(But I only know about a few AWS services, maybe it is different for others).
I agree, and we (dev lead at Appuri here) get the best of both worlds from Kube by specifying, in the secrets section of a deployment definition, which secrets we need but not their values. So we know which services need each secret, and it's updated in one place. That only covers the secret store, but we could put non-secret values in secrets to use the same mechanism.
ConfigMaps are nice, but we use them in a limited way because it's so much easier to update pods by editing env vars. Note: we are using Deployments, so if you need to change an env var, you run `kubectl edit deployment <name>`, edit/save/close the file that opens in your $EDITOR, and watch the magic happen.
We ran into this as well - I forgot to add this to the post. The Amazon Linux AMI for ECS has _very specific defaults_ that need tweaking.
That said this does fit the rest of their services involving EC2, so I guess it's not much different there, but as a consumer I just want the thing to work.
Lambda doesn't currently support Python 3 (only Python 2.7) and that has been a massive pain in the arse to deal with. I've heard it's possible to get Python 3 working on it by shipping a custom executable and serializing/deserializing state but I figure it's a fairly significant performance hit.
We've been running with vanilla EB + ECS for months and haven't seen this at all.
From an outside perspective, it sounds like the primary issue you referenced here (the agent disconnecting) could have been due to a mismatch in configuration between the agent and docker, or maybe just a permissions issue. IIRC, the ECS agent tries to clean up containers every few hours so perhaps not being configured correctly caused it to get stuck?
The right thing to do is using some form of a vault.
Still not optimal as far as security goes, but it seems that we have roughly the same exposure if AWS leaks our keys and passwords to other third-party trackers...
Don't initialize into env vars and don't store in repos, even private ones.
I believe this is the recommended way:
ECS container instances automatically get assigned an IAM role[1], with credentials accessible via instance metadata (169.254.169.254) [2]. Containers can access that metadata too. The AWS SDK automatically checks that metadata and configures itself with those credentials, so all you have to do is give your IAM role access to a private S3 bucket with configuration data and load that configuration when booting up your app.
That way there's no need to copy/paste variables, and no leaking secrets in ENV variables. You do have to be careful though (as with any EC2 instance) not to allow outside access to that instance metadata endpoint, e.g., in a service that proxies requests to user-defined hosts on the network (but if you're doing that, you've got a lot more to worry about anyway).
[1] http://docs.aws.amazon.com/AmazonECS/latest/developerguide/i...
[2] http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles...
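To make the metadata mechanism concrete, here is a minimal stdlib-only Python sketch of what the SDK does under the hood for an instance role: ask the metadata service for the role name, then fetch that role's temporary credentials. It assumes IMDSv1-style plain GETs against 169.254.169.254 (instances that enforce IMDSv2 need a session-token PUT first, which this skips), and it doesn't cover ECS task-level roles, which use a different endpoint (169.254.170.2 plus AWS_CONTAINER_CREDENTIALS_RELATIVE_URI). In practice you'd let the SDK do this; `fetch` is injectable only so the parsing can be tried without AWS.

```python
import json
import urllib.request

# EC2 instance metadata path that lists (and then serves) IAM role credentials.
CREDS_PATH = "http://169.254.169.254/latest/meta-data/iam/security-credentials/"


def http_get(url, timeout=2.0):
    """Plain GET; only reachable from an EC2 instance or a container on one."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def instance_role_credentials(fetch=http_get):
    """Fetch the temporary credentials of the instance's IAM role.

    The first request returns the role name; the second returns a JSON
    document with AccessKeyId, SecretAccessKey, Token and Expiration.
    """
    role_name = fetch(CREDS_PATH).strip().splitlines()[0]
    doc = json.loads(fetch(CREDS_PATH + role_name))
    return {
        "access_key_id": doc["AccessKeyId"],
        "secret_access_key": doc["SecretAccessKey"],
        "session_token": doc["Token"],
        "expiration": doc["Expiration"],
    }
```

The credentials expire, which is exactly why the SDK re-reads the endpoint periodically instead of caching them forever.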
For my own purposes, I weighed that against the alternatives[1], and it seems like a fairly reasonable compromise[2]. That won't be the case for everyone, obviously.
[1] http://elasticcompute.io/2016/01/21/runtime-secrets-with-doc...
[2] I'm referring specifically to passing secrets (or other static values) into a container, since that seems to be what the author was talking about. For configuration requiring more complexity, of course other tools are probably more appropriate. In that case, it's outside the scope of what I would reasonably expect ECS to do.
I know it's not sexy for developers to take advice from sysadmins, but at the end of the day, it gets the job done reliably and elegantly.
Now if ECS 2.0 was really AWS hosted Kubernetes, I would be very interested in hearing about that...
For 1.4, the Kubernetes Cluster Lifecycle and Ops SIGs are working on making the install and setup process much easier, including on AWS [1]. That won't magically turn it into Kubernetes as a Service, of course, but we hope it'll help users on other platforms.
[1]: https://github.com/kubernetes/community/blob/master/sig-clus...
The environment of any process is readable via /proc/<pid>/environ on a Linux system. Other users cannot read these files, of course, but in something like a Docker image all processes likely share a username, and this could be a risk (especially for a public webapp that may one day leak information or allow remote command execution).
At least that's my immediate take on it.
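A small Python sketch of what that exposure looks like: the file holds the process's initial environment as NUL-separated KEY=VALUE records, so anything that can open it recovers every secret passed in via env vars. It's Linux-only, and the read succeeds only where the kernel's permission checks allow it (typically the same user, or root).

```python
def parse_environ(raw):
    """Parse the NUL-separated KEY=VALUE records of /proc/<pid>/environ."""
    env = {}
    for record in raw.split(b"\0"):
        if not record:
            continue  # the trailing NUL leaves an empty last record
        key, sep, value = record.partition(b"=")
        if sep:
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def read_process_environ(pid):
    """Return the initial environment of `pid` (an int or "self"), Linux only.

    Raises PermissionError when the kernel denies access, e.g. for another
    user's process; same-user processes can typically read it.
    """
    with open(f"/proc/{pid}/environ", "rb") as f:
        return parse_environ(f.read())
```

Calling `read_process_environ("self")` inside a container is a quick way to see exactly what a compromised process running as the same user could read.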
For example, it does not handle services (K8s can automatically provision a load balancer against all your containers), there's no volume handling, no centralized logging, no label-based targeting, it has very limited scheduling (K8s uses cAdvisor to help scheduling, can automatically ensure that pods are spread out across multiple AZs, etc.), etc.
It'll be interesting to see what happens as Docker starts pushing into Kubernetes' space. Given the multiple points of overlap/contention between K8s and Docker (you have to disable Docker's built-in networking and iptables management; Kubelet has to continually monitor Docker for orphaned containers and volumes and so on; etc.) I wouldn't be surprised if Google one day decides to eliminate the Docker daemon as a dependency entirely, by writing a bare-bones container engine into Kubelet.
I know some features are not so sophisticated compared with K8S and there is no AZ awareness, but Swarm may try to catch up with it.
A good example is volume management. With Kubernetes, you can tell a pod to use an AWS EBS volume; when the pod needs the volume, Kubernetes will automagically mount it and handle the state management for you.
If you define what's called a persistent volume, your pod can declare that it needs, say, 1GB, and Kubernetes will automatically allocate 1GB from the volume; you can have lots of pods working off this shared volume, and Kubernetes will know which pods have "claimed" which parts of the volume.
Another good example is config and secrets. In Kubernetes, you declaratively create configuration objects ("configmaps") and secrets. If a pod needs, say, access to an external API, you can store the keys in a secret and declaratively give the pod access to the secret, which will be mounted into a folder (or, alternatively, assigned to an environment variable, though that's not as secure).
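Here's a minimal Python sketch of the consuming side when a secret is mounted as a folder: Kubernetes writes one file per key, named after the key, with the value as the file's contents. The mount path you'd pass in (say, a hypothetical /etc/secrets/api) and the fail-fast `required` check are illustrative choices, not something Kubernetes does for you.

```python
from pathlib import Path


def load_mounted_secrets(mount_dir, required=()):
    """Read a secret volume: one file per key, file contents are the value.

    Raises KeyError if any key in `required` is missing, so a misconfigured
    pod fails at boot instead of at first use.
    """
    secrets = {}
    for entry in Path(mount_dir).iterdir():
        # Kubernetes also keeps hidden bookkeeping entries (e.g. ..data) in
        # the mount; skip anything starting with a dot or not a regular file.
        if entry.name.startswith(".") or not entry.is_file():
            continue
        # Stripping one trailing newline is a choice: values created from
        # files often end in "\n" by accident.
        secrets[entry.name] = entry.read_text().rstrip("\n")
    missing = [key for key in required if key not in secrets]
    if missing:
        raise KeyError(f"missing secrets: {', '.join(missing)}")
    return secrets
```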
Yet another example is service management. You can tag a service (which is another type declaration that says "port X on some unique cluster IP should be routed to every pod tagged with these labels") as load-balanced, and if you're running in a cloud environment (AWS, GCE, etc.), K8s can automatically create an external load balancer for you that exposes the service publicly.
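A simplified Python model of how a service picks its pods with an equality-based label selector. The pod dictionaries are a hypothetical stand-in for real pod objects, and real endpoint selection also considers readiness, which this reduces to a phase check.

```python
def selector_matches(selector, labels):
    """Equality-based selector: every key/value pair must appear in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def service_endpoints(selector, pods):
    """Return "ip:port" for each running pod whose labels match the selector."""
    if not selector:
        return []  # a Service without a selector gets no automatic endpoints
    return [
        f"{pod['ip']}:{pod['port']}"
        for pod in pods
        if pod["phase"] == "Running" and selector_matches(selector, pod["labels"])
    ]
```

Because matching is by labels rather than by name, new pods carrying `app: api` join the service automatically, with no change to the service definition.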
Kubernetes is best described as a sophisticated state machine that takes declarative objects ("manifests") that describe your world — i.e. which containers should be running, which services should be exposed, etc. — and then attempts to continuously reconcile reality with that declaration, managing all sorts of state in the process.
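The reconcile idea fits in a few lines. This is a toy Python model, not Kubernetes code: desired state is a hypothetical map of name to replica count, and one pass starts or stops replicas until reality matches. A real controller runs passes like this continuously, triggered by watches and periodic resyncs, against far richer objects.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy 'reality': running replica counts per name, plus an action log."""
    running: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def start(self, name):
        self.running[name] = self.running.get(name, 0) + 1
        self.log.append(f"start {name}")

    def stop(self, name):
        self.running[name] -= 1
        if self.running[name] == 0:
            del self.running[name]
        self.log.append(f"stop {name}")


def reconcile(desired, cluster):
    """One pass: converge running replica counts toward the declared counts."""
    for name, want in desired.items():
        have = cluster.running.get(name, 0)
        for _ in range(want - have):
            cluster.start(name)
        for _ in range(have - want):
            cluster.stop(name)
    # Anything running that is no longer declared at all gets torn down.
    for name in list(cluster.running):
        if name not in desired:
            for _ in range(cluster.running[name]):
                cluster.stop(name)
```

The important property is idempotence: running the same pass again when reality already matches does nothing, which is what makes it safe to loop forever.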
Perhaps most important is the ability to abstract resources from pods. A pod just declares the image to run and the resources — volumes, configs, secrets, CPU/memory constraints, etc. — to make available to it. K8s's state machinery takes care of the rest.
As far as I know, Docker Swarm has none of this, and you'd have to build these things (e.g. REXRay for volumes) on top of Swarm yourself.
I understand these challenges. I wrote about a lot of them here:
https://convox.com/blog/ecs-challenges/
But we have been having tons of success on ECS both for our own stuff and for hundreds of users.
I see the agent disconnection problem too. convox automatically marks those as unhealthy and the ASG replaces them.
It's happening more than I'd like but I'm seeing little to no service disruption. One of the root causes is the docker daemon hanging.
Glad Kubernetes is working well for you. Many roads lead to success as the cloud matures.
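As a rough Python sketch of that detect-and-replace pattern (not convox's actual code): the ECS DescribeContainerInstances response includes an `agentConnected` flag per container instance, so you can pick out the EC2 instance IDs whose agent has dropped and hand each one to a callback. With boto3, that callback could wrap the Auto Scaling client's `set_instance_health(InstanceId=..., HealthStatus="Unhealthy")`, after which the ASG replaces the instance; here the callback is injected so the logic runs without AWS.

```python
def disconnected_instance_ids(describe_response):
    """EC2 instance IDs whose ECS agent reports agentConnected == False.

    `describe_response` has the shape of an ECS DescribeContainerInstances
    response: {"containerInstances": [{"ec2InstanceId": ..., "agentConnected": ...}]}.
    """
    return [
        ci["ec2InstanceId"]
        for ci in describe_response.get("containerInstances", [])
        if ci.get("agentConnected") is False
    ]


def replace_disconnected(describe_response, mark_unhealthy):
    """Flag each disconnected instance so the Auto Scaling group replaces it."""
    ids = disconnected_instance_ids(describe_response)
    for instance_id in ids:
        mark_unhealthy(instance_id)  # e.g. a wrapper around set_instance_health
    return ids
```

Only an explicit `False` counts as disconnected, so a response missing the field doesn't trigger a replacement.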
Docker is stuck in the 'one image on one machine' mindset. DCOS is taking over at the higher levels of the stack. Mark my words.
Would packaging the configuration together with the Docker image make more sense? That enables more hermetic deployment.
Imagine you need to change a single config value: you would need to update the image, build, push, and redeploy, which can take some time depending on your deployment.
With k8s you do only `kubectl edit configmaps <name>`, restart pods that are using it and you are done.
Also, there's no need to create per-stage images...
Here is how we get around the issues mentioned in the article:
* Service discovery: built our own with RabbitMQ (we were using it before ECS anyway).
* Configs: pass an S3 tarball URL as an environment variable and download it in the containers.
* CLI: built our own with the help of CloudFormation.
* Agent disconnecting: we never saw a situation where all agents disconnected. We use a large pool of instances, and starting containers was never a problem because of agents.
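The config-tarball approach might look like the following minimal Python sketch. The variable name CONFIG_TARBALL_URL is hypothetical, and the sketch assumes the URL can be fetched with a plain GET (e.g. a presigned S3 URL); from a private bucket you'd more likely use an SDK with the instance's IAM role. It rejects absolute and `..` paths before extracting.

```python
import io
import os
import tarfile
import urllib.request


def fetch_config_tarball(dest_dir, env_var="CONFIG_TARBALL_URL"):
    """Download the tarball named by an env var and unpack it into dest_dir.

    Returns the extracted member names. The env var name is hypothetical.
    """
    url = os.environ[env_var]
    with urllib.request.urlopen(url) as resp:
        data = resp.read()
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:*") as tar:
        members = []
        for member in tar.getmembers():
            # Refuse absolute paths and ".." components before extracting.
            if member.name.startswith("/") or ".." in member.name.split("/"):
                raise ValueError(f"unsafe path in tarball: {member.name}")
            members.append(member)
        # Newer Pythons also offer a built-in "data" extraction filter.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(dest_dir, members=members, **kwargs)
    return [member.name for member in members]
```

Since the URL is the only thing baked into the task definition, pointing a few containers at a different config is just a different env value, which matches the per-container scoping argument above.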
In addition to these, we also do the following to make ECS work as we want it to:
* built our blue-green deploy solution (structure provided by ECS is very limited)
* built our own solution to integrate with ELB (ELB allows only one port per ELB)
A remaining issue is that you cannot spawn two containers speaking to a given ELB (AWS load balancer) on the same host if they need to bind the same port.
We ran into similar complaints. CoreOS comes with etcd, which, though initially unstable, is now solid and incredibly handy for service discovery and configuration. We're using https://github.com/MonsantoCo/etcd-aws-cluster to configure it dynamically. We use etcd+confd to drive nginx containers for routing. All in all it works well. Our biggest problems are Docker-bug related, and those we can generally handle by just terminating the node and letting autoscaling heal the cluster.
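The confd step boils down to rendering a template from etcd keys and then reloading nginx. confd itself uses Go templates and runs a configured reload command; the Python sketch below only shows the transformation, assuming a hypothetical key layout where each key under a service prefix holds one "host:port" backend.

```python
def render_upstream(name, kv, prefix):
    """Render an nginx upstream block from etcd-style keys under `prefix`.

    Each key under the prefix is one backend whose value is "host:port".
    """
    backends = sorted(value for key, value in kv.items() if key.startswith(prefix))
    if not backends:
        # nginx rejects an upstream block with no servers, so fail early.
        raise ValueError(f"no backends under {prefix}")
    lines = [f"upstream {name} {{"]
    lines += [f"    server {backend};" for backend in backends]
    lines.append("}")
    return "\n".join(lines) + "\n"
```

Sorting keeps the output stable, so the rendered file only changes (and nginx only needs a reload) when backends actually come or go.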