How we reduced deployment times by 95%

(blog.plaid.com)

84 points · bjacokes · 6y ago · 57 comments

marcinzm · 6y ago · 13 in thread

My read seems to be: don't use ECS at large scale or you'll need some really convoluted hacks.

otterley · 6y ago

(Disclaimer: I work for AWS, but any opinions expressed here are solely my own and not necessarily the company's.)

I don't know about that. ECS works fine at large scale, but it's not going to replace 4000 tasks immediately. And doing so could potentially be a huge shock to your customers if you tried. (You have to take the time to gracefully drain existing connections, etc.) Nor did the author want to implement a blue-green deployment, which would use AWS's elasticity to its fullest potential, citing unspecified cost concerns.

It's not clear to me that Kubernetes would fare significantly better here. It's not without its own particular performance bottlenecks, and the issues about safely draining connections and blue/green deployments would remain the same there. The issues aren't really related to the particular deployment orchestrator so much as the fact that 4000 containers is a pretty large number by any measure for a single unit of deployment.

maerF0x0 · 6y ago

Actually OP gave the exact reason ECS does not work for them:

> The rate at which we can start tasks restricts the parallelism of our deploy. Despite us setting the MaximumPercent parameter to 200%, the ECS start-task API call has a hard limit of 10 tasks per call, and it is rate-limited. We need to call it 400 times to place all our containers in production.

That call limit needs to scale with the cluster size.
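
To make the arithmetic in the quote concrete, here is a small sketch. The 10-tasks-per-call cap and the 4,000-task fleet come from the post and this thread; the calls-per-second figure is purely an assumed placeholder, since the actual rate limit isn't stated anywhere here.

```typescript
// Number of API calls needed to place `totalTasks` when each call
// can start at most `tasksPerCall` tasks.
function callsNeeded(totalTasks: number, tasksPerCall: number): number {
  return Math.ceil(totalTasks / tasksPerCall);
}

// Lower bound on the seconds spent just issuing the calls under a rate
// limit of `callsPerSecond`. The rate is an assumption, not from the post.
function minSecondsToIssue(calls: number, callsPerSecond: number): number {
  return calls / callsPerSecond;
}

const calls = callsNeeded(4000, 10); // figures quoted from the post
const assumedCallsPerSecond = 1; // hypothetical rate limit, for illustration only
console.log(calls, minSecondsToIssue(calls, assumedCallsPerSecond));
```

Whatever the real rate limit is, the number of calls grows linearly with fleet size while the per-call cap stays fixed, which is the scaling complaint being made here.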

pojzon · 6y ago

Can I ask about the bottlenecks you mentioned in the case of Kubernetes?

I'm really curious about this one. What bottlenecks have you faced so far?

AYBABTME · 6y ago

Why exactly is it an expensive task? Where's the bottleneck? I worked on this problem before, so curious how it's handled (or not) at AWS.

jbob2000 · 6y ago

You could also read this as "don't run 4,000 of anything".

4,000 node processes? Wow. Just wow. Surely there is a better solution for this. Maybe this is one of those weird edge cases where you would craft a bespoke infrastructure tool, rather than trying to hack a bunch of off-the-shelf products.

bjacokes (OP) · 6y ago

Glad you pointed this out! We could have gone into a lot more detail on the reasons that we've arrived at our current system architecture, but it would've distracted from the root problem we were solving in the post. Happy to go into it a bit here.

Node event loop blockages are the primary reason we have so many processes running. We have enough integrations and iterate on them quickly enough that our infrastructure essentially treats them as untrusted/breakable. We want to avoid ReDoS-style bugs from affecting more than the current request, so we handle one request per process. A little inelegant, but we've still been able to horizontally scale the system, and frankly the extra infrastructure cost hasn't been enough to be worth the effort to change it.

To get around the start-task rate limit, we've tried running multiple identical containers per ECS task. However, they need to be marked as "essential" in CloudFormation to make sure our capacity doesn't degrade on container exits, and this means that one container exiting will also exit other containers in the same task.

Multiple processes per container is another interesting approach. We've used Node subprocesses in the past, but we found them tricky for reasons that are unrelated to deploy speed.

One thing we've really liked about rolling our own approach is that we decide when to declare a deploy complete. ECS is pretty conservative about not declaring a deploy complete until the final container has finished draining, which can take minutes for some of our requests. With our fast deploys, we declare a deploy complete when the final container running old code stops accepting new requests, which is significantly sooner. This makes follow-on deploys and rollbacks much smoother.

crgwbr · 6y ago

You don’t even need a bespoke tool. Just make each container run 4–8 processes by means of something like PM2. That cuts your CI container count down to 500. No reason a container should mean 1 process.

marcinzm · 6y ago

I'd expect any decent infrastructure tool to be able to handle 4000 processes/containers. Kubernetes, for example, claims to support 300000 containers across 5000 nodes. Mesos supported 50k+ from what I understand. Hadoop scales to over 10000 nodes.

ravenstine · 6y ago

ECS can be good for really small things but, even when I worked in public radio, we discovered that ECS is pretty hot garbage without writing a bunch of stuff to deal with its quirks. Scaling worked kinda wonky, and ECS has its own jargon for terminology that already exists with Docker in general, making it unnecessarily confusing. Why they call a container a "task" when the term "container" already exists is beyond me. We ended up using Rancher to manage Kubernetes.

otterley · 6y ago

ECS tasks are not containers. Tasks are a collection of one or more containers. They're the equivalent of a Pod in Kubernetes.

weberc2 · 6y ago

We use ECS via Fargate and I'm mostly happy. Deployments are pretty slow, but that's our only qualm. We PoCed EKS, but it has no integration with CloudWatch or IAM for all intents and purposes. They had roadmap items to address those things, but you absolutely have to reinvent a lot of wheels with Kubernetes compared to the things you get for free with ECS/Fargate.

viraptor · 6y ago

> Why they call a container a "task" when the term "container" already exists

Likely for the same reason they call it ECS and not EDS. Making everything Docker-specific could be a bad idea if more container tech becomes popular and they want to allow you to start Frobnicator tasks as well as Docker tasks on the same platform.

bjacokes (OP) · 6y ago

In a way you're right, which is why we are exploring Kubernetes. At the same time, ECS has worked really well for us as we've grown to this point. It's worth noting that we have some other motivating reasons to switch off ECS beyond deploy limitations, such as poor observability of ECS agent logs.

It's inevitable that early systems are going to bend and break at various points of the scaling process, and the interesting question is how to navigate that: fight fires for a while or fix things immediately, hack together a stopgap solution or do the full fix, etc. We thought ours might be an interesting data point to share.

bsaul · 6y ago · 4 in thread

Side question: what’s the current best practice for ensuring that a server (Node or anything) isn’t currently processing some important information before you shut it down?

Is it a mix of waiting for request handlers to finish upon receiving a SIGTERM, then ending the current process (and timing out after a while)? Does Kubernetes handle that kind of thing (waiting for a given process to stop before trashing the VM), or is there another layer or tool to do so?

rohansingh · 6y ago

While graceful shutdown is important, I think the higher priority should be ensuring that you can gracefully recover.

Because eventually, something is going to die while a request is in-flight. So your batch processing needs to be able to recover from the "somebody tripped on the power cord" scenario.

bjacokes (OP) · 6y ago

When we start processing a request, we take the JS Promise that represents the result of that operation and put it into an array in our internal "server state" object. When the server gets a shutdown request, it basically awaits Promise.all(this.inFlightRequests) before exiting. Depending on the LB/routing layer on top of your service, you might need to do this in a loop in case work comes in after you do your first await. In other languages you can join a thread pool, wait on a wait group, or use whatever other synchronization tools are provided.
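
A minimal sketch of the bookkeeping described above, not Plaid's actual code. The class and method names are hypothetical, and it uses a `Set` rather than an array so settled promises can be removed cheaply; rejections are swallowed only for the purpose of waiting, so one failed request doesn't cut the drain short.

```typescript
// Tracks in-flight request promises so shutdown can wait for them.
// All names here are hypothetical.
class InFlightTracker {
  private readonly inFlight = new Set<Promise<unknown>>();

  // Register a request's promise; it is removed once it settles.
  track<T>(work: Promise<T>): Promise<T> {
    this.inFlight.add(work);
    const remove = (): void => {
      this.inFlight.delete(work);
    };
    work.then(remove, remove);
    return work;
  }

  get size(): number {
    return this.inFlight.size;
  }

  // Wait for everything in flight, looping in case new work was
  // registered while the previous batch was still being awaited.
  async drain(): Promise<void> {
    while (this.inFlight.size > 0) {
      const pending = Array.from(this.inFlight);
      // Ignore rejections here; the request's own caller handles them.
      await Promise.all(pending.map((p) => p.catch(() => undefined)));
    }
  }
}
```

A shutdown handler would then call `await tracker.drain()` before exiting, with each request handler wrapping its work in `tracker.track(...)`.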

Note that there's an ECS_CONTAINER_STOP_TIMEOUT parameter in ECS that sets a hard upper limit on how long containers have to exit before getting SIGKILLed, and it defaults to 30 seconds. If you want to allow requests to drain for longer than that, you'll need to update that parameter.

(For services with fast request processing time, the draining process is often a lot simpler. You can just route traffic away from the server, then wait a short period of time before telling the process to exit. It's not quite as precise, but works well for many use cases and requires no bookkeeping.)

benburleson · 6y ago

Remove the instance from whatever routes requests to it, without killing it -- wait some predefined amount of time -- kill it.

otterley · 6y ago

I've built services that can gracefully shut down, and my typical design is as follows:

First, maintain an active request counter that you increment when you accept a request and decrement when the request has been fully delivered (including output buffer drained).

Second, implement a TERM signal handler. The handler shuts down the listening socket (so no new connections will be accepted), starts a countdown timer, and waits for the active-request counter to reach 0. When either the countdown timer expires or the active-request counter reaches 0, exit.
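
The two steps above can be sketched roughly as follows. This is an illustration under assumptions, not the commenter's implementation: `ShutdownGate` and its methods are hypothetical names, and the signal handling and socket shutdown are left to the caller so the sketch stays independent of any particular server library.

```typescript
// Counts active requests and resolves when the count reaches zero or a
// deadline passes, whichever comes first. Names are hypothetical.
class ShutdownGate {
  private active = 0;
  private accepting = true;
  private onIdle: (() => void) | null = null;

  // Call when a request is accepted; returns false once shutdown has begun.
  begin(): boolean {
    if (!this.accepting) return false;
    this.active += 1;
    return true;
  }

  // Call when the response has been fully delivered.
  end(): void {
    this.active -= 1;
    if (this.active === 0 && this.onIdle) this.onIdle();
  }

  // Stop accepting, then wait for idle or for `timeoutMs` to elapse.
  // Resolves to "idle" or "timeout" so the caller can log which happened.
  shutdown(timeoutMs: number): Promise<"idle" | "timeout"> {
    this.accepting = false;
    return new Promise<"idle" | "timeout">((resolve) => {
      if (this.active === 0) {
        resolve("idle");
        return;
      }
      const timer = setTimeout(() => {
        this.onIdle = null;
        resolve("timeout");
      }, timeoutMs);
      this.onIdle = () => {
        clearTimeout(timer);
        resolve("idle");
      };
    });
  }
}
```

In a Node service, the TERM handler would typically close the listening socket first (for example with the HTTP server's `close()`), then await `gate.shutdown(...)` and exit once it resolves either way.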

benologist · 6y ago · 3 in thread

Whenever I see these posts I feel like Heroku narrowly missed out on shaping the rest of the cloud just by staying proprietary and expensive.

yebyen · 6y ago

Compare to Deis, who built a Heroku analogue that you can run on your own compute resources. It was wildly popular for a little while, and then when it turned out there wasn't really a market for it, they simply had to leave it behind. (Today, that's known as Hephy Workflow; full disclosure: I'm one of the maintainers of this FOSS project.)

Workflow and previous versions of Deis, as well as Dokku and friends, also depend heavily on Heroku's Open Source contributions, so it's somewhat misguided to say that Heroku stays proprietary. They have been "leading the pack" when it comes to Open Source implementations of buildpacks, including the Cloud Native form, which you can find at buildpacks.io. They also set the original bar for a developer-friendly PaaS experience, which is still basically unrivaled even in the current era of Kubernetes product explosion.

We often ask about Workflow: "why isn't this particular technology several orders of magnitude more popular" and struggle to explain it, since it makes perfect sense to us and of course whenever people try it, they usually love it.

But Deis saw the writing on the wall and shifted gears to become some of the early Kubernetes experts when it started to become clear who was going to win. They invented and open sourced Helm, the K8s package manager, which is still very popular by comparison and is one of the leading ways to package stuff for K8s. And then Microsoft scooped them up.

privateSFacct · 6y ago

Agreed.

I think Heroku did a number of things almost better in terms of a deployment / dev story? But getting past the pricing was super hard.

shay_ker · 6y ago

The pricing is way worth it if you do the math on the number of engineers you won't need anymore.

cagataygurturk · 6y ago · 3 in thread

Going to EKS would take less time than exploring hacks.

marcinzm · 6y ago

From personal experience, EKS has had networking layer issues in the past, and I've heard from good engineers that their CNI layer code (which is open source) is not very good. That would make me concerned about how well EKS (which is different from Kubernetes itself) can scale and what edge cases you're liable to hit.

rohansingh · 6y ago

Yeah, I've had the same issues. Eventually a node would run out of some network resource which would make scheduling containers onto it impossible until it was rebooted.

There seem to be some other intermittent network issues as well. I just moved some stuff for a client off EKS and back onto ECS because it wasn't worth troubleshooting.

That said, I am not a fan of ECS at all. The scheduler is slow and it does really just take way too long for it to even start acting on new deployments.

nrmitchi · 6y ago

EKS may be a different implementation in some components (CNI), and have some set defaults (EBS for PVs, etc), but at its core it is still a relatively vanilla Kubernetes.

Making a switch to EKS doesn't mean that you are stuck with EKS forever, and can set you up nicely to make the transition to a more complex/customized cluster implementation/deployment (likely still on AWS, think kops) easier if you find it's necessary for your use case in the future.

nrmitchi · 6y ago · 2 in thread

If I'm reading this right, then this approach takes away any real safety in terms of deployment. There would be no easy rollback mechanism, and no real assurances that the new code version will actually run.

I understand that the main goal here seemed to be avoiding time spent in ECS rollouts, but this solution seems to be sacrificing many of the guarantees that the rollout process is designed to provide.

The root problem is explicitly called out (slow ECS deployments), and is tied to rate limiting of the ECS `start-task` API call. The post mentions the hard cap on the number of tasks per call, but I'm curious if the actual _rate limit_ could have been increased on the AWS side. I.e., 400 calls would still be needed, but they could be pushed through much faster.

ahmcb · 6y ago

Hey, great questions, some of the same questions we (the SRE/Platform Team at Plaid) had.

Plaid's rollback job still works the same way for the service using this new deployment, so thankfully, nothing new for engineers at Plaid to learn there.

We also have metrics in Prometheus to indicate which versions of code are running so we can easily verify what is deployed.

WRT the rate limit: we have a great relationship with AWS, and Plaid pushes hard on limits, which AWS is most often happy to increase for us. This one, however, was a hard limit that could not be raised at the time, though I'm sure they are working on it.

MuffinFlavored · 6y ago

State:

Version 1.0.0 in prod, serving requests

You want to deploy 1.0.1

You spin up 1.0.1, leaving traffic pointed at 1.0.0

What mechanism actually shifts the traffic from the 1.0.0 instances to the 1.0.1 instances, waiting for all traffic to stop on 1.0.0 before bringing the instances down without causing abrupt connection hangups?

maerF0x0 · 6y ago · 2 in thread

From reading other comments it makes me wonder if you (Plaid) tried batching the tasks into N containers? Like if a task had 50 containers, then you'd reduce the task call rate limiting by 50x...

bjacokes (OP) · 6y ago

Yeah, I mentioned this in a comment in a different thread. The duplicate containers in the task definition need to be marked as "essential" in CloudFormation to make sure our capacity doesn't degrade on container exits, and this means that one container exiting will also exit other containers in the same task. So we have a bleed-over effect where OOM in one request could cause N-1 other requests to fail.

The essential vs non-essential container designation is a little confusing. The standard use case for multi-container tasks seems to be that all containers are marked as essential, i.e. they essentially represent different services that are operating in concert on the same machine. This is definitely not the situation we're in, where each container is totally independent.

So it seems like we'd be a perfect use case for non-essential containers. However, (1) at least one container _must_ be marked as essential, and (2) non-essential containers which exit don't get restarted or replaced. This means we would still have a limited bleed-over effect (if the essential container exits, the other ones do too), and more importantly, we can't guarantee that our capacity will be robust to process exits.

ahazred8ta · 6y ago

> non-essential containers which exit don't get restarted or replaced

That's why you have one or two essential watchdog containers which relaunch the workers. You keep a large number of them in an "idle, but hot" status to allow for bursts?

testuser5559191 · 6y ago · 1 in thread

Slightly off topic:

Does Plaid still operate via screen scraping? I'm a little perplexed as to why banks don't have easy to use APIs, especially given recent regulation. It seems against their best interests to allow a third party to screen scrape and provide a service which the banks themselves could easily reproduce.

What am I missing? Is a bank with an easy to use API not a sound business decision from the bank's perspective?

I know Monzo (challenger bank in UK) has/had an API, though I haven't heard of anyone using it.

derefr · 6y ago

All a bank is, is a UX on top of low-level financial APIs like ACH. If the bank then exposed an API, you could just use it to build another bank (without all the effort the original bank went through) and so compete with them on lower margins (because you don't have nearly the capital costs to recover that they do.)

Basically, it's the same reason that phone companies would never have allowed MVNOs to exist without legal regulation forcing them. The MVNOs outcompete the infrastructure-building phone companies, because MVNOs don't have to build infrastructure!

crb002 · 6y ago · 1 in thread

Google "checkpoint restart". HPC community has had these tools for years, many in userspace. Can't wait to see a Java or C# shop doing the same hot boots.

viraptor · 6y ago

Java has had targeted hot reloads (going further than a full reboot) for quite a while. See JRebel for example.

sailfast · 6y ago

Thanks for sharing these lessons!

I don't use ECS at the moment but this is a well laid out post on how to avoid some performance issues that could have a huge impact.

EDIT: Downvoted for expressing appreciation for someone taking the time to note lessons learned?.. OK.

fcolas · 6y ago

- How did you guys scale that much w/o a bootloader before?

That's what I don't get. All the design patterns are those of Unix. You boot the kernel with a ... bootloader. Then you have the kernel with all the system's params (call it ECS). Then each process is a child of the root process. And when you get the news, by whatever means, that your app's source code has changed, you pull that code and start running it while still keeping the old one live. Once the fork of the new code returns a proper response code, you kill the old one and set the new app live; otherwise you stay live with the old version.
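
The swap described above could look something like the sketch below. It is only one reading of the comment: the `AppProcess` interface, `swapIfHealthy`, and the notion of a `healthy()` probe standing in for "returns a proper response code" are all assumptions introduced for illustration.

```typescript
// A running copy of the app. The interface is hypothetical; `healthy`
// stands in for "the new code returned a proper response code".
interface AppProcess {
  version: string;
  healthy(): Promise<boolean>;
  stop(): Promise<void>;
}

// Start the new version alongside the old one. Promote it only if it
// reports healthy; otherwise discard it and keep serving the old version.
async function swapIfHealthy(
  current: AppProcess,
  startCandidate: () => Promise<AppProcess>
): Promise<AppProcess> {
  const candidate = await startCandidate();
  let ok = false;
  try {
    ok = await candidate.healthy();
  } catch {
    ok = false; // a probe that throws counts as unhealthy
  }
  if (ok) {
    await current.stop(); // old version is retired only after the check passes
    return candidate;
  }
  await candidate.stop();
  return current;
}
```

The property that matters is ordering: the old process is stopped only after the new one has proven itself, so a bad build never leaves the service without a live version.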

swiftcoder · 6y ago

> Engineers would spend at least 30 minutes building, deploying, and monitoring their changes through multiple staging and production environments, which consumed a lot of valuable engineering time

Man, startups have no idea how good they have it. It took a solid week to deploy a change at AWS.

evantahler · 6y ago

Pretty cool! Actionhero uses the ‘require cache’ trick in development mode to hot-reload your changes as you go. It’s risky in that even though you’ve changed the required file, you may not have recreated all your objects again. For that reason Actionhero doesn’t allow this if NodeEnv is anything besides development.

evantahler · 6y ago

Cool! I’m curious if this is something that nodemon/pm2 could do as task runners. You could call “npm update” and then HUP your process...

This is sort of how Capistrano handled deployments: changing a symlink to all project deps and then signaling the process to reload.

shay_ker · 6y ago

After all these years, how is deploying solely on AWS still worse than Heroku & Render?

mylampisawesome · 6y ago

Just FYI, your "We're Hiring!" link is broken.
