We 30x'd our Node parallelism

(blog.plaid.com)

202 points · bjacokes · 6y ago · 243 comments

140 comments · 24 top-level

7777fps6y ago· 24 in thread

> We were running 4,000 Node containers (or "workers") for our bank integration service. The service was originally designed such that each worker would process only a single request at a time. This design lessened the impact of integrations that accidentally blocked the event loop, and allowed us to ignore the variability in resource usage across different integrations. But since our total capacity was capped at 4,000 concurrent requests, the system did not gracefully scale.

I can't be the only person who reads stories like this and wonders how they arrived at that solution in the first place?

Failing to scale because their previous approach to scaling was a worker per request, a model which was roundly moved away from, because that's how CGI and Apache modules worked and it didn't scale well.

I thought one of the key selling points with Node was a fully async standard library, enabling better scaling in process.

But then you read stories like this, and I find it hard to relate to the original problem.

bjacokesOP6y ago

There are a couple of reasons that the legacy scaling model was viable for us. As mentioned in the post, only 1/10 of our traffic was from the API, which gave us a roundabout way to scale by diverting resources. And it's only viable to use this model of scaling when the business value of a request is high – we were originally quite happy to spin up more containers when we reached our scaling limit. That's the pragmatic reason why we were processing one request per container.

In terms of what issues caused us to move away from parallelism in the first place, it was all the CPU-bound stuff that you might expect: ReDoS-style issues, post-processing arrays in very large edge cases, programmer error, etc.
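A minimal sketch of a ReDoS-style issue, assuming a deliberately bad pattern (`/^(a+)+$/`) that is not taken from the post. The helper name `timeMatch` is hypothetical. On a non-matching input, each extra character roughly doubles the matching time, and the event loop is blocked for the whole duration.

```javascript
// Nested quantifiers force the regex engine to try exponentially many ways
// of splitting the run of "a"s before it can report a non-match.
const vulnerable = /^(a+)+$/;

// Hypothetical helper: time a single synchronous match.
function timeMatch(input) {
  const start = process.hrtime.bigint();
  const matched = vulnerable.test(input);
  const ms = Number(process.hrtime.bigint() - start) / 1e6;
  return { matched, ms };
}

// Input sizes are kept small on purpose; much larger values of n would
// stall the process for a long time.
for (const n of [10, 14, 18, 22]) {
  const { matched, ms } = timeMatch('a'.repeat(n) + '!');
  console.log(`n=${n} matched=${matched} took ${ms.toFixed(2)}ms`);
}
```

Nothing else in the process can run while `vulnerable.test` is executing, which is why one such request can stall every other request sharing the worker.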

abalone6y ago

> In terms of what issues caused us to move away from parallelism in the first place, it was all the CPU-bound stuff that you might expect: ReDoS-style issues, post-processing arrays in very large edge cases, programmer error, etc.

But these are not parallelism problems. These are single-threading problems, which is the core problem with Node.js, not parallelism in general. Hence I think the question stands: why did you choose Node for this?

crazygringo6y ago

Yeah, I don't get it either, at all. The original poster wrote below:

But it's trivial (a single line) in Node to place breaks in CPU processing to allow the event loop to fire, and as for "programmer error"... many commenters below are also complaining async programming is too hard or finicky.

But that's like complaining about C because pointers are hard, or Java because OOP is hard, or databases because planning indexes is hard.

Once you "get" async, pointers, OOP, or indexes, it's easy. And it's part of your job as a professional programmer to get it. Async is no trickier than anything else.

The setup in the first place makes absolutely no sense to me, using a language exactly opposite of how it's meant to be.
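One way to read the "single line" mentioned above is an awaited `setImmediate` between slices of CPU-bound work. This is a sketch under that assumption; `yieldToEventLoop` and `mapInChunks` are hypothetical names, and the chunk size is arbitrary.

```javascript
// Resolves on a later turn of the event loop, so pending I/O callbacks and
// timers get a chance to run before the next slice of work starts.
const yieldToEventLoop = () => new Promise((resolve) => setImmediate(resolve));

// Hypothetical helper: apply fn to every item, yielding between chunks.
async function mapInChunks(items, fn, chunkSize = 1000) {
  const out = [];
  for (let i = 0; i < items.length; i += chunkSize) {
    for (const item of items.slice(i, i + chunkSize)) {
      out.push(fn(item));
    }
    // The "single line": hand control back to the event loop.
    await yieldToEventLoop();
  }
  return out;
}

mapInChunks([1, 2, 3], (x) => x * 2, 2).then((result) => console.log(result));
// → [ 2, 4, 6 ]
```

This only helps for code you control. A single long synchronous call inside a third-party package still blocks until it returns.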

bjacokesOP6y ago

In one case I cited elsewhere in the comments, an engineer had called ramda.uniq on an array of nested objects which was occasionally very large. When calling into external packages, I don't think we have as much control over yielding to the event loop, but I could be wrong. I know that there are some JSON/regex libraries that give you some protection on this front.

I agree that it would be nice if all developers were infallible – I'm reminded of a friend describing their company, where "we don't write tests because we all write good code". At a certain point, you have to look for processes – linters, monitoring, testing, language choices [1] – where people can't shoot themselves in the foot. (Code reviews being only moderately less fallible than a single engineer.) It's not enough to just say "be better" whenever bad code is written.

I think when the decision was made (years ago) to handle a single request per container, they couldn't find such a process to prevent event loop blockages, other than migrating an already-large codebase away from Node. As others have pointed out, maybe such a migration is necessary – after all, event loop blockages are still an inherent risk because of how Node works. It's just a lower risk than it was a year or two ago, because we've significantly improved our usage of the event loop, and also have tooling in place to catch blockages before they become an issue.

[1] https://news.ycombinator.com/item?id=18564643

markandrewj6y ago

The issue is that many developers that are coming from synchronous programming don't get asynchronous programming. They could both improve the code by not writing blocking code, and also using something like the cluster module (https://nodejs.org/api/cluster.html).

sergiotapia6y ago

At some point you have to take a step back and realize you've grown beyond your tech and reach out for something else. Elixir sounds like a great fit for these problems.

For example Discord reached out to Rust and built tiny Rust components that are called from Elixir for their server user list. Some servers have 200,000+ people online, and Elixir wasn't cutting it performance wise. Rust, boom now it works.

eshyong6y ago

I feel like this article is missing a crucial piece of information: why was integration code blocking the event loop in the first place?...

rynop6y ago

Agreed. This is the second scratch my head moment from the Plaid engineering team blog recently.

tluyben26y ago

> I can't be the only person who reads stories like this and wonders how they arrived at that solution in the first place?

No, you are not. I wonder which CTO would allow this. As for everyone here, the exact case is not really clear to me (or at least why this is a great solution for it), but it sounds like a weird (and expensive) solution to some issue. I really don't understand these 'solutions', and I am almost 100% sure I (with a team, but the point is that this is not the best solution for the problem) could whip up something far simpler and more efficient. But of course there are problems that it might fit?

ilaksh6y ago

They didn't actually understand Node very well at first and then later they figured it out.

liveoneggs6y ago

Pretty sure apache + cgi would scale better :)

praseodym6y ago

Related, from the article:

> We hypothesized that increasing the Node maximum heap size from the default 1.7GB may help. To solve this problem, we started running Node with the max heap size set to 6GB [..], which was an arbitrary higher value that still fit within our EC2 instances.

Sounds like they were utilizing their EC2 instances very poorly. Why not run more workers per instance, or switch to an instance type with less RAM (or more CPUs)?

tracker16y ago

They were using ECS, it also looks like they had to work through a couple bottlenecks to get multiple requests per node working well... I think they could get further by using the newer workers api, since gRPC doesn't work with the cluster module.
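Assuming "the newer workers api" refers to Node's `worker_threads` module, a minimal sketch of moving a CPU-bound step off the main thread could look like this. The helper `uniqInWorker` is hypothetical, and de-duplicating by `JSON.stringify` is a simplification: it treats objects with the same keys in a different order as distinct, so it is not equivalent to `ramda.uniq`.

```javascript
const { Worker } = require('worker_threads');

// The worker source is kept inline (eval: true) only so the sketch fits in
// one file; a real service would more likely load a separate worker file.
const workerSource = `
  const { parentPort, workerData } = require('worker_threads');
  // CPU-bound step: de-duplicate nested objects by their JSON form.
  const seen = new Set();
  const unique = workerData.filter((item) => {
    const key = JSON.stringify(item);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
  parentPort.postMessage(unique);
`;

// Hypothetical helper: run the de-duplication in a worker thread so the
// main event loop stays free to serve other requests.
function uniqInWorker(items) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, { eval: true, workerData: items });
    worker.once('message', resolve);
    worker.once('error', reject);
    worker.once('exit', (code) => {
      if (code !== 0) reject(new Error(`worker exited with code ${code}`));
    });
  });
}

uniqInWorker([{ a: 1 }, { a: 1 }, { a: 2 }]).then((unique) => console.log(unique));
// → [ { a: 1 }, { a: 2 } ]
```

Starting a thread per call has a cost of its own, and the input is copied into the worker, so this trade only pays off when the work is large.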

api6y ago

I wonder what percentage of the massive compute power of huge cloud data centers is spent just chugging away on ugly clunky hacks to run bad code?

asdfman1236y ago

Quite a large percentage. I worked on a site that got 1 request per second and they were able to handle it by spinning up like 20 VMs. Turns out they were just using Entity Framework wrong. Whoops.

But also, though, you have to consider that most places aren't Plaid, and most places developer time is more expensive than throwing an extra machine at the problem.

Crystalin6y ago

I got the same feeling. We use node and usually split to 500 concurrent requests per process.

Still interesting...

Vesuvium6y ago

The way it always happens: many people working toward a solution, not agreeing on one, and then compromising on something in the middle, even if it makes no sense.

phoe-krk6y ago

> I thought one of the key selling points with Node was a fully async standard library, enabling better scaling in process.

We still have an event loop that is trivially blocked by very simple programmer errors, destroying the whole advantage that you describe here.

The fact that Node ships a fully asynchronous standard library doesn't in any way fix the fact that Node is a runtime for a language that itself is a mistake.

nicoburns6y ago

> We still have an event loop that is trivially blocked by very simple programmer errors, destroying the whole advantage that you describe here.

So they fixed the issue that some requests blocked... by making all requests blocking.

duxup6y ago

>by very simple programmer errors

I can't help but also feel that is also an issue and in another given language this issue might not happen ... but they'd hit another.

It's so easy to say "don't do X because problem Y won't happen" but hard to predict what happens when you move from (language, platform, or whatever) X to (language, platform, or whatever) Z.... and I suspect people often hit issues and realize that maybe Y wasn't the problem.

I see it all the time and I feel like "Wait guys, I'm not sure we're fixing the right thing!?!?!"

This article raises a lot more questions than answers IMO.

earthboundkid6y ago

Async is just modern cooperative multitasking, and just like the 90s, it's easy to accidentally lock the whole system.

davedx6y ago

> trivially blocked by very simple programmer errors

Can you give an example please?

I think it's much easier to block a thread with C#'s async programming model than node's...
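A minimal illustration of the kind of error being discussed, with a busy loop standing in for any long synchronous call (a large sort, a slow regex, a deep comparison). The names `busyWait` and `observedDelayMs` are hypothetical.

```javascript
// A timer asked to fire after 10ms...
const scheduledAt = Date.now();
let observedDelayMs = null;
setTimeout(() => {
  observedDelayMs = Date.now() - scheduledAt;
  console.log(`timer fired after ${observedDelayMs}ms (asked for 10ms)`);
}, 10);

// ...cannot run until this synchronous code hands control back.
function busyWait(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // simulate CPU-bound work
  }
}
busyWait(200);
```

The timer fires only after the loop finishes, so every other request handled by the same process waits at least as long.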

techterrier6y ago

I wonder if all this was, at root, picking up jobs from a message queue, and they wanted each process to have only one job in flight at once.

asdfman1236y ago

> I can't be the only person who reads stories like this and wonders how they arrived at that solution in the first place?

Here's how it probably worked: they liked Node, they liked containers, they put Node into containers and it worked, and they stuck with it as the user base grew.

kevstev6y ago· 22 in thread

I was building scalable node applications a few years ago for a very large e-commerce player- millions of customers. I think node.js is a great platform, but its apparent simplicity means there are hordes, and I mean like 90+% of the community, that can "just get things done" without understanding what is going on under the hood at all. And to be fair, for most startupy types of companies that need to iterate fast, that is what you want to optimize for.

My interview screening question was pretty simple- "Is node.js single threaded or multithreaded?" And to most, they spit back the blogspam headline- "Single threaded!" I think the most correct answer is "it's complicated" but would accept that because most people would say that is the "right" answer. So I would follow up with- "what exactly happens in a default installation if we have say... 5 requests come in at exactly the same time to just return some static content from disk?" (Node's default threadpool size is 4.) And here is where you could see their understanding just fell apart. Some would say they would be handled entirely synchronously, others completely in parallel- but then had no idea what the cause of the parallelism was. Very few actually understood that node is an event loop executing javascript backed by a threadpool for async operations.
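A rough way to observe the threadpool limit, sketched with `crypto.pbkdf2` rather than file reads: it also runs on the libuv threadpool and is slow enough to time. The helper names and the iteration count are arbitrary assumptions. On a machine with at least four free cores, four of the five tasks typically finish at about the same time and the fifth noticeably later; exact timings vary.

```javascript
const crypto = require('crypto');

// crypto.pbkdf2 does its hashing on libuv's threadpool (default size 4,
// configurable with the UV_THREADPOOL_SIZE environment variable).
function hashOnce(id, startedAt) {
  return new Promise((resolve, reject) => {
    crypto.pbkdf2('secret', 'salt', 100000, 64, 'sha512', (err) => {
      if (err) return reject(err);
      resolve({ id, ms: Date.now() - startedAt });
    });
  });
}

// Start five tasks at once; only four threadpool threads exist by default.
function runFive() {
  const startedAt = Date.now();
  return Promise.all([1, 2, 3, 4, 5].map((id) => hashOnce(id, startedAt)));
}

runFive().then((results) => {
  for (const { id, ms } of results) {
    console.log(`task ${id} finished after ${ms}ms`);
  }
});
```

Running the same script with `UV_THREADPOOL_SIZE=5` set in the environment should let all five proceed together.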

Before reading this post, I was like eh this is a waste of time- it's typical Medium bullshit- they almost certainly found they were doing some blocking call in the event loop and then removed it and voila, 30x speedup. It was interesting because it was a lot worse! They spent all this time and hard work figuring out everything but what was taking so long in the event loop, and it seems that was the last place they actually looked.

Anyway, node can be a highly scalable platform (https://changelog.com/podcast/116) but you need to understand it or else it will bite you in the foot. When I was last doing this stuff, upwards of 80% of our time was being spent essentially just JSON.parse()'ing, and we were looking to move to protobufs to avoid that.

NathanKP6y ago

This is why I recommend that anyone running Node in production use a tracing tool like New Relic. It's super easy to see what is blocking the event loop. Just choose a duration (say 10ms) and look for any execution spans that are longer than that duration.

Ideally you want to be yielding back to the event loop at least every 1 ms. Anything that takes too long without yielding will show up as a latency delay before your code is able to start handling a new request (technically a background thread in Node.js will pick up the request, but your code won't start executing in response to it until you yield back to the event loop again).
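Without a tracing product, a crude in-process approximation is to measure how late a repeating timer fires. This is a hypothetical sketch (`startLagProbe` is not part of any APM tool) and it only reports that the loop was blocked, not which code blocked it.

```javascript
// Hypothetical lag probe: a timer that should fire every intervalMs.
// Any extra time between firings means the event loop was busy.
function startLagProbe({ intervalMs = 100, thresholdMs = 10, onLag = console.warn } = {}) {
  let last = process.hrtime.bigint();
  const timer = setInterval(() => {
    const now = process.hrtime.bigint();
    const elapsedMs = Number(now - last) / 1e6;
    const lagMs = Math.max(0, elapsedMs - intervalMs);
    last = now;
    if (lagMs > thresholdMs) onLag({ lagMs });
  }, intervalMs);
  timer.unref(); // monitoring alone should not keep the process alive
  return () => clearInterval(timer); // call to stop the probe
}

// Usage: const stop = startLagProbe({ thresholdMs: 10 });
```

The 10 ms default threshold mirrors the duration suggested above; a tracing tool is still needed to attribute the delay to a specific span.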

To be honest the more difficult thing to diagnose sometimes is event loop overburdening. If each of your execution spans are taking 1ms, then you can only do a max of 1000 of them per second (assuming there was no delay between executions, but there is). So if you are trying to handle a large number of requests per second the event loop may end up with say 1005 execution spans per second that it needs to execute to handle that request volume. Because you can't do 1005ms of work in 1000ms the extra work will queue up.

So gradually you will end up with 5 backlogged execution spans stacking up per second. Each second you will get 5ms more latency. The overall request latency will just gradually increase and increase as work gets further and further delayed in the queue.
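The arithmetic can be written out as a small model. It assumes every span costs the same fixed time and ignores the gaps between executions; `backlogAfter` is a hypothetical name.

```javascript
// Model: the loop can do at most 1000ms of work per second. Anything
// beyond that queues up and is carried into the following seconds.
function backlogAfter(seconds, spansPerSecond, msPerSpan = 1) {
  const demandMs = spansPerSecond * msPerSpan; // work arriving each second
  const surplusMs = Math.max(0, demandMs - 1000); // work that cannot fit
  return {
    queuedSpans: (surplusMs / msPerSpan) * seconds,
    addedLatencyMs: surplusMs * seconds,
  };
}

console.log(backlogAfter(60, 1005));
// → { queuedSpans: 300, addedLatencyMs: 300 }
```

With 1005 spans of 1 ms arriving per second, one minute of sustained load leaves 300 spans queued and adds about 300 ms of latency.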

Overall I just think of Node.js as a fancy CPU scheduler. As long as you give it even, decently sized chunks of work to schedule, and you don't give it too many to schedule, you will be fine. Anyway, I'm a huge fan of Node.js, but yeah, it's easy to fall into some gotchas if you don't study how it works. The simplicity is a bit misleading.

inglor6y ago

Elastic APM is better than New Relic in how it traces node and it is completely free and open source (you can use a cloud product).

Disclosure and bias: I work on Node core and always hear ranting about incorrect usage in async_hooks in anyone but Elastic APM in core meetings. I used both products and have no affiliation to other companies.

neebz6y ago

How exactly do you use NewRelic to see what is blocking the event loop? I thought we always needed Flame Graphs for it (which NR doesn't provide)

inglor6y ago

> Very few actually understood that node is an event loop executing javascript backed by a threadpool for async operations.

This is true, as is the fact that JavaScript is mostly a synchronous programming language with host environments that can provide asynchronicity.

A caveat though is that the most important part of I/O is network I/O (tcp/udp sockets) and Node uses real async operations there rather than a threadpool.

FS is just really hard to get right in a cross platform way and that's why it's on the threadpool. Some other stuff like dns is also famously on the threadpool but tcp sockets are not - it's a big part of why Node is fast.

specialist6y ago

At my last gig, I maintained a handful of existing, and created some new nodejs things. I had previously done a lot of Java and even hacked on Apache. I had no prior nodejs experience.

The first thing that really bothered me about our use of nodejs was no one could say why stuff would fail in production. So many moving parts. One of my team members figured out some edge case interactions between nodejs and nginx (used for HTTPS), which I would have never figured out on my own. It wouldn't have even occurred to me to look there. But other crashes, caused by apparent leaks, were mystifying.

The second, and bigger, thing that really bothered me about nodejs, and expressjs in particular, was the notion of back pressure is completely missing. If it's in there, I couldn't find it. So our endpoints were still accepting new socket connections without processing responses from backend services (eg redis, other nodejs endpoints, auth services), which would either zombie or ABEND those backends. And no one could figure out why.
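One common way to add back pressure at the admission level is load shedding: cap the number of in-flight requests and answer the rest with 503 so callers retry later. The sketch below uses only Node's built-in `http` module; `createGate`, `createServer`, `callBackend`, and the limit of 100 are all hypothetical.

```javascript
const http = require('http');

// Hypothetical load-shedding gate: admit at most maxInFlight requests.
function createGate(maxInFlight) {
  let inFlight = 0;
  return {
    tryEnter() {
      if (inFlight >= maxInFlight) return false;
      inFlight += 1;
      return true;
    },
    leave() {
      inFlight = Math.max(0, inFlight - 1);
    },
  };
}

// callBackend is a stand-in for redis, an auth service, another endpoint...
function createServer(callBackend, maxInFlight = 100) {
  const gate = createGate(maxInFlight);
  return http.createServer(async (req, res) => {
    if (!gate.tryEnter()) {
      // Shed load instead of piling more work onto struggling backends.
      res.writeHead(503, { 'Retry-After': '1' });
      return res.end('busy');
    }
    try {
      res.end(await callBackend(req));
    } catch (err) {
      res.writeHead(502);
      res.end('backend error');
    } finally {
      gate.leave();
    }
  });
}
```

The server still accepts the socket, but it refuses to start backend work it cannot finish, which is the part that was overwhelming the backends in the scenario described.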

I only understood what was happening because I'd already been through all that "architecture" madness a decade earlier with Java services.

I guess what I'm saying is while I LOVE nodejs' closeness to the metal, I didn't like going back in time 10-15 years.

Also, npm is crap.

pier256y ago

Here is an SO answer that expands a bit more on kevstev's excellent comment:

https://stackoverflow.com/a/22644735/816478

golergka6y ago

> When I was last doing this stuff, upwards of 80% of our time was being spent essentially just JSON.parse()'ing, and we were looking to move to protobufs to avoid that.

It's only tangentially related to your comment, but I can't help but ask: why do people use JSON instead of protobufs at all?

I'm mostly a client-side developer, and most of my server-side experience is in hobby projects; still, I always used protobufs and loved it. They never damaged my feature velocity, apart from an hour to set up the build system in the beginning, and type safety helped me quite a few times when I forgot to sync changes in protocol on client and server side. Are there some secret advantages of going with json that I don't see because of limited experience?

kevstev6y ago

JSON existed before protobufs is really it. When I left the node world 3 years ago, protobufs were the new hot thing. Any new project should start with them over JSON imho.

There is some friction to them though, and I think a lot of it is that most tutorials and beginner books like to keep things as simple as possible, and people start their little project, it gets traction, and then they figure out they need protobufs but now it's hard to introduce. In most projects, even today, it seems that it's the version 2.0 that gets protobufs; v1.0 keeps JSON for simplicity, unless you have a bunch of seasoned devs involved.

x86_64Ubuntu6y ago

What does happen with the 5th request?

NathanKP6y ago

It waits in a queue in a background thread in the node.js http library until your code can execute to handle it. So if your code takes a long time before returning back to the event loop the request will just wait in that queue for a long time before the next opportunity for the event loop to execute code in response to the event.

kevstev6y ago

It will get queued, until one of the 4 requests in front of it has its task, to return the file, complete.

rclayton6y ago

Is it really only 4? When I look at my VM stats I seem to recall it having something like 15. Of course, this could be a config change on the Node Alpine container.

Kiro6y ago

I have a Node service where I get tens of thousands of requests a second and I still thought Node was single threaded. Where can I read about this?

rclayton6y ago

The event loop is single threaded. Async tasks are executed on I/O threads which are configurable. So if your app is I/O bound, the event loop will typically dequeue tasks pretty quickly allowing lots of requests.

avip6y ago

That's a great interview question (especially if you're not so much into hiring :))

Another one is: what happens when a node process completes execution?

  // node ex.js
  function foo() {
    // something async here
  }
  foo()
  console.log('bye...')

This is a fun question to discuss (I think some consider this a bug in node).
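One reading of the question, sketched with a timer as the "something async" (an assumption; the original snippet leaves it unspecified). The `order` array exists only to make the sequence visible.

```javascript
// node ex.js
const order = [];

function foo() {
  // Stand-in for "something async": a timer that fires after 100ms.
  setTimeout(() => {
    order.push('async done');
    console.log('async done');
  }, 100);
}

foo();
order.push('bye...');
console.log('bye...');
// Prints "bye..." first, then "async done". The process exits only after
// the timer has fired, because a pending timer keeps the event loop alive.
```

Reaching the last line of the script does not end the process; Node exits once nothing is left that could schedule more work.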

z3t46y ago

Node.js is a good abstraction layer. In my experience, everything gets leaky once you get hundreds of concurrent users.

gameswithgo6y ago

>I think node.js is a great platform,

I'm curious as to why. For large scale applications like this, you have other options that offer higher performance ceilings, have more safety and correctness features, and are likely more productive as well. What is the attraction to node?

A guy has to invent a scripting language for browsers in 9 days -> he decides on a lisp -> management says no, it has to look like java -> he comes up with something -> it's dynamically typed -> let's run a huge banking infrastructure on this

wat

NohatCoder6y ago

The real killer feature is async. Since a modern web request typically spends most of its time waiting for database calls, file system requests or similar, a naively coded server in most languages can handle relatively few requests per thread, so you scale up the number of threads to something like 100 per core, and now the overhead of running and switching between these threads is limiting the performance.

Being used to Node, I was flabbergasted when writing C for Linux*. The file system commands just leave my thread hanging while the result is being generated; if I use it on a network drive it might hang for a minute before timing out, so I have to make a thread for each file system command, solely so that it can stall without bringing down the whole application.

* I have no delusions that Windows is any better, Linux is just what I have first hand experience with.

kevstev6y ago

My background was doing low level C++ in HFT/Algorithmic trading for years, with a bit of Java interspersed, before doing this whole complete right turn of doing webdev in js for ecommerce. In C++, doing web stuff was very difficult, build times were long, JSON support existed but was awkward, iterating was just very difficult, even when you had your whole build/deploy setup going, there was still a lot of work if you wanted to build a CRUD app just marshalling and unmarshalling objects to the DB, etc. It was a drag at best and a real pain at worst. Java... was a bit better, and I don't really have a problem with Java as a language. Java programmers, however, seem to really delight in building architectural monuments and get paid by the abstraction. In every java system I have jumped into, you are always neck deep in xml, and massive object hierarchies and factory factories and its just like where the F is the code that actually does stuff?! I remember working on one project, where I was building the "engine" and another guy was building the web interface for it, and was using Spring when it was relatively new, and he happily declared "all I need to do is wire up my configs now, I am pretty much done." Narrator: A week later, he still wasn't done... and this wasn't unusual in my experience, in most java apps "config" was just as complex and problematic as code, and the mindset of "its just config" made config changes more likely to cause a production outage of some sort.

Then, you take a look at node. You look at a getting started tutorial. Its javascript on the front, and on the back. The JSON in between is "native" and is convenient and easy to use, easy to read, lightweight, and just makes a lot of intuitive sense- especially when I had found myself neck deep in XML in previous jobs for the same tasks. I had a nice looking HTML5 web app running in a few minutes- my mind was blown. Then you take a look at the frameworks- express and hapi, and the vast module ecosystem- and how easy it was to build a simple CRUD website with leveldb, or mysql, or really an endless array of storage options. And people were using those options! It wasn't just the bog standard RDBMS being used every place, with your only real choice being mysql, postgres, or if you had money, Oracle. Building endpoints with routes in these frameworks made your code so easy to divide up along clear lines, and there just wasn't the endless miles of boilerplate/scaffold code, and ugly syntax and type systems to fight with and plan ahead of. Things Just Worked. Turning around a code change was a matter of seconds, not a minutes long build process- I had never felt so productive- and writing code was fun again! Deploys were easy, restarts were fast. Rollbacks, when necessary, were painless. There was a plugin/module for everything (too much in hindsight).

Now, this was 6 years ago. Go was around, but still kind of a blip on the radar; Ruby/Python were probably the closest real contenders. Ruby had lost steam; I honestly took some cursory looks at it, but it didn't seem to have traction. Python suffered from its single-threadedness and GIL, and its popularity with the ML crowd. Flask and such existed, but were pretty rudimentary compared to what Express/Hapi were offering, and no one seemed that interested in those projects. I like Go a lot, and for a pure backend service, it might be my go-to today, as one of the original arguments for Node was "it's the same language on the front and the back, no more delineation between FE and BE developers, anyone can jump in and fix the bugs!" Which, along the lines of my original comment, doesn't really work out in reality, at least not on larger systems. People drawn to FE work usually have never done real systems development and don't understand how things work under the hood, which isn't a problem until one day it is, and then it's a huge one.

The dynamic typing argument... is somewhat valid, but I found that enforcing api contracts with hapi/joi gave you the equivalence of type safety at your interface borders, while still giving you the flexibility of dynamic typing within your code. In fact, Joi went even farther than just type checking, it could check that your int was within range for the field, that your dates were formatted properly, etc... In mega large codebases, this will come back to bite you, but I found the plugin architecture of Hapi really discouraged that kind of crap from leaking in and it was easy to build truly modularized code.

The performance ceilings aren't that different, and not that impactful, at least not until you get to FANG scale, and I mean literally only FANG scale. We were running a billion dollar business on 8 fairly small VMs for the API layer, which handled all of the ecommerce transaction handling. I remember at one point we encountered a memory leak of some sort in node, and the instances were falling over and dying about once an hour, but restarting and recovering- this was causing a few % error rates to our customers. I was insistent that we get all hands on deck to figure this out ASAP, and our head of Ops type person said "kevstev, we can throw hardware at this problem to meet SLOs until you get it under control. Your monthly server costs are less than my studio apartment cost me per month in Jersey City 15 years ago."

You just have to have a basic understanding of what's going on at an architectural level, something a few hours of doing the right reading and experimenting can get you if you have the proper background. The number of gotchas to avoid to get that performance was an order of magnitude, if not more, smaller than in a language like C++ (which I feel has actually gotten so complicated and difficult to grok it's become a parody of itself- and I say that as someone who used it and adored it for 15 years).

DanHulton6y ago

You're handily skipping over 15 years of improvement and iteration between the last and second-last points there.

a13n6y ago

> safety and correctness features

You can achieve safety and correctness features for node via good lint rules and typescript/flow.

JMTQp8lwXL6y ago

Most bugs encountered in production systems aren't type based issues. Types are more useful for developer productivity (e.g., intellisense) than any other purpose.

GordonS6y ago· 22 in thread

I don't like to be overly negative, especially when a company/team is being transparent about what they're doing and giving insight into their engineering practices - but has anyone else's estimation of Plaid's engineering team just gone down the toilet?

This blog post gives me the impression that Plaid is filled with either junior or incompetent engineers - to scale to 4k containers serving 1 request each for an API workload is absolute insanity.

These engineers are building stuff for banking. Banking!! There is literally no way I'm going near Plaid with a very long bargepole after reading this.

If I were someone senior at Plaid, I'd be pulling this blog post before it harms their reputation any further.

yjftsjthsd-h6y ago

I mean, that's always the thing, isn't it? If a company publishes about the problems it has, the question is whether other companies have the same problems and just hide it, or whether this company is actually worse. This comes up a lot with gitlab, for instance; remember the time they discovered they had no backups? At most companies, customers would never find out about that, so I'm not sure that them telling us about it usefully informs my view of their competence. Similarly, here, the only way we'd know about this if they didn't say anything would be poor performance, which would be... less than surprising, on a financial website, in my experience. So maybe they suck, or maybe they're equal with others but a little more open; I don't know how to tell.

GordonS6y ago

Thanks for a reasoned response to what I realise was a very negative comment. I do agree with what you've said, and I do feel a little bad for slamming them when they're being transparent.

OTOH, I do still feel this is so bad they need to be called out on it, and it really does scare me off using them. Given they're being transparent, it boggles the mind that they've tried to justify this, rather than just owning it and admitting it was the result of letting a junior do some resume-driven-development (or however it came about).

bjacokesOP6y ago

Hi, Plaid engineer here (not the author, but I helped with the post).

I don't think we've tried to assert that the old system is perfect. We went into some detail in the post about why it took us this far. Certainly, the single request per container approach wouldn't scale if our unit economics were different. We didn't get into this too much in the post, but the Node service sits behind a couple of layers of Go services, so we had more control over scaling API traffic than it might appear.

Likewise, I hope we didn't give the impression that the new system is perfect. We've explored other languages for integrations in the past (even Haskell, at one point), and are continuing to do so. A migration away from our years-old Node integrations codebase would be a massive undertaking at this point. Absent that, it doesn't seem consistent to say "you're incompetent for handling 1 request per container" and also "you're incompetent for writing this post" – if you believe the former then it makes sense to be an advocate for this project, at least until a language migration can be done.

I think the set of hoops we had to jump through in order to add concurrent requests without adding latency is a good demonstration of why we didn't do this sooner. It wasn't a massive undertaking by any means, but it wasn't trivial. At any rate, we're not really looking for a gold star here – just putting this out there and hoping this will be useful for others who are, as other commenters have put it, building their own "Frankensteins" :)

danudey6y ago

I mean, my read of this is:

1. We used a system which uses event loops to achieve great concurrency, but we turned that off because we don't trust it.
2. Instead, we spent $300k/yr rolling out one-process-per-API as though we were using Apache 1.3.
3. We used an arbitrary JSON library without knowing anything about its performance characteristics, which it turns out were inordinately bad.

It's not that this wasn't a great exercise in engineering and problem-solving, or that it's not a great demonstration of how to solve scaling problems at scale, those are definitely true. It's more that "we spent $300k/yr more than we needed to so our engineers didn't need to learn how to use our technology stack properly."

I'm not meaning to be harsh, I've kludged enough garbage into production in my lifetime, but more that the fact that you got into that situation in the first place gives a poor impression of either your development team or your development processes.

GordonS6y ago

Reading the article, I'd fully expected a post-mortem at the end, describing how architecture and code review processes were going to be tightened up to ensure a monstrosity like this never happened again - that would have been transparent, interesting, and given me confidence in Plaid's engineering.

Instead, you've peppered this thread with comments that kind-of, sort-of justify the approach taken.

I'm sorry, but this approach cannot be justified - it's overly complex, and far from the simplest or most obvious approach. I'm truly shocked that Plaid has produced an architecture like this, and doubly so that Plaid would try to justify it. My guess here (and given the attempts at justification, this is me being really charitable) is that a junior dev was given too much leeway, and did some resume-driven-development, just so they could say they'd worked with 4k containers.

RSZC6y ago

It's important to keep in mind that efficiency isn't usually particularly important for a startup. I'm sure they knew when they initially set up this system that it wasn't performant...but it was nice and quick and easy and gets the feature out the door. Why should they worry about $100k or whatever when they're funded for > $350M? Their bottleneck is engineer hours, not dollars.

Instead the rational thing to do is build something quick and dirty and optimize later, and that's exactly what they've done.

GordonS6y ago

I understand that, and plenty times myself I've "done the simplest" thing - sometimes you need to ship an MVP, fast.

The difference here is that what they did wasn't even the simplest thing - it was a crazy, insanely wasteful thing that just happened to work for a while. Being honest, for me, it's an indefensible approach.

> Why should they worry about $100k or whatever when they're funded for > $350M? Their bottleneck is engineer hours, not dollars

Arg, but this rubs me up the wrong way! Any half-way competent engineer could have built something simpler and much more performant, and likely in many fewer hours too. Sometimes stopping, thinking and discussing for a few minutes or hours will save numerous hours. I mean, how many hours did they spend on this "diagnosis" alone?

rockostrich6y ago

>Their bottleneck is engineer hours, not dollars.

Their bottleneck was software being able to scale past a hard stop. I guess having a known breaking point of scalability is a good thing? But building things in a way where you either have to overhaul your development runtime or not be able to scale past a certain point is pretty terrible.

It seems like the only reason they did this was because they really felt the pain of it from the business and dev side and they were lucky enough that they had traffic spikes to raise these issues. If they had more consistent day-to-day traffic then this would have just hit a breaking point one day and they would've been fucked until it was fixed.

mirekrusin6y ago

Frankly the reality is so surreal there, it's actually surprising it doesn't mention uploading csv files over ftp, generating excel files, having cameras pointing at monitors of legacy systems that read data (no, this is not a joke) or spawning a promise for Martha to cross check something and click ok somewhere behind two bastion hosts, three firewalls and one and a half soap integrations. I'm not defending "4k containers because event loop can be blocked" which is silly, just reminding of the context - in other words you can do the shittiest automated thing there and you're a hero. Next year-or-two hero is going to be somebody shrinking it by another X orders of magnitude.

kevstev6y ago

I kind of alluded to it in my reply, but I tend to agree- they spent a lot of time and hard work- looking in all the wrong places! It's hard to imagine how they missed the forest for the trees so badly here.

Worse is- they never really explain where that 30x improvement came from- or if they even understand it themselves? They talk a lot about getting their memory issues under control, but hardly at all about actual parallelism- and it seems that even then they confuse it with merely speeding up operations that are blocking.

I kind of expected this post to be "We did a whoops and had a blocking call to a DB/fs/compression call/whatever. This was all happening in the event loop and not being farmed out to the threadpool by libuv. We fixed it and now look like heroes to our CTO!"
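For anyone who hasn't run into this distinction, a minimal sketch (an illustration only, not code from the Plaid post; the password, salt and iteration count are made-up values): `crypto.pbkdf2Sync` does its hashing on the main thread and stalls every other request in the process, while `crypto.pbkdf2` hands the same work to libuv's threadpool and returns control to the event loop.

```javascript
const crypto = require('crypto');

// Hypothetical parameters, chosen only so that the work is noticeable.
const PASSWORD = 'secret';
const SALT = 'salt';
const ITERATIONS = 50000;
const KEY_LENGTH = 32;

// Blocking variant: the hashing runs on the main thread, so no other
// callback in this process can run until it returns.
function deriveKeySync() {
  return crypto.pbkdf2Sync(PASSWORD, SALT, ITERATIONS, KEY_LENGTH, 'sha256');
}

// Non-blocking variant: the same work is queued on libuv's threadpool and
// the result is delivered later through a promise.
function deriveKeyAsync() {
  return new Promise((resolve, reject) => {
    crypto.pbkdf2(PASSWORD, SALT, ITERATIONS, KEY_LENGTH, 'sha256', (err, key) => {
      if (err) reject(err);
      else resolve(key);
    });
  });
}
```

Both functions produce the same key; the only difference is which thread pays for it. Pure JavaScript work (a slow regex, a deep comparison over a big array) has no such async twin, which is presumably why the thread keeps coming back to CPU-bound code.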

tracker16y ago

What they talk about are issues that blocked them from parallelism per node and how they resolved the issues. I'm not sure what additional information you're expecting?

Though I'm somewhat surprised they didn't use Worker patterns per node with self monitoring for health above and beyond what they already did.

tracker16y ago

For banking... accurate, simple, safe, reliable are more important than performance/throughput. IMHO optimizing the above and for developer efficiency should be the first priority and for scale or max throughput later.

The simplest solution is to scale to one worker per node initially if you're doing anything compute intensive... once you've done that, and/or you need better performance for any number of reasons including cost, then you can do more. Now, I'm not sure I would have gotten to 4k nodes before I started to re-evaluate parallelism or better scaling options, but the initial implementation is absolutely fine.

GordonS6y ago

> For banking... accurate, simple, safe, reliable are more important than performance/throughput. IMHO optimizing the above and for developer efficiency should be the first priority and for scale or max throughput later.

I get it, but come on - this was not a "performance optimisation" issue, but one of bad architecture; an architecture that certainly doesn't inspire confidence in the priorities you mention: accuracy, simplicity, safety.

ubu77376y ago

Where I work compliance is job #1.

That doesn't prevent us from thinking about performance. GTFO with this nonsense.

z3t46y ago

Linus Torvalds quote: "You need to grow thick skin". I also got a knee-jerk reaction when reading the first part, but the article explained it well, given that they probably don't want to give out too much information.

So how would you have engineered it? I would just send the data uncompressed granted that the receiving server is probably in the same data-center with switches capable of handling Tbit's of data per second.

I liked the article, but would have wanted more details. I love optimizations, it's such a drug, the rush when you make something x times faster. This article doesn't give me a bad impression. Contrary I'm thinking about sending an application.

mharroun6y ago

In their defense, it looks like they have over 400 employees and raised over 350 million in funding. On all things that truly matter currently they seem like a very successful company.

I can guarantee you a VPE or CTO who can say they helped do that... but ran into a scaling issue from their success will have no issue with employment and no reason to be ashamed. All the more impressive if it was just a bunch of junior engineers.

sicromoft6y ago

This comment says more about you than it does about Plaid. Their "insane" design met business requirements successfully enough to grow them into a multi-billion dollar company.

Did you consider the likely (and more charitable) explanation that they were aware their design was "bad", but had higher priorities until now?

If I were you, I'd be pulling your comment before it harms your reputation any further. :)

rockostrich6y ago

>multi-billion dollar company

WeWork is a "multi-billion dollar company" in the same way that Plaid is. Private funding valuations don't really mean anything anymore.

GordonS6y ago

> Did you consider the likely (and more charitable) explanation that they were aware their design was "bad"

I think marketing and VC valuations grew them into a multi-billion dollar company; whether they remain so, to a large part relies on how fast they burn through VC cash - so, not looking too good on that front...

No even half-way competent engineer would come up with such a complex, unperformant solution to a simple problem - I think a higher priority should be hiring engineers who actually have a clue what they're doing.

As for meeting business requirements... while this might have worked for a while, it was plainly not a good way to meet them, and given Plaid are in the banking sector, really doesn't bode well for the future (I'm having flashforwards already to security breaches, plaintext passwords etc...).

throwGuardian6y ago

> no way I'm going near Plaid with a very long bargepole after reading this

But you'd go to a competitor who hasn't published a blog post, whose internal code you haven't audited and simply presume is just fine?

In plaid's defense, lack of performance tuning isn't necessarily a lack of security focus.

GordonS6y ago

> In plaid's defense, lack of performance tuning isn't necessarily a lack of security focus

Come on, this is not about "performance tuning", where you're trying to eke out every last drop of performance - it's about a completely indefensible, complex, wasteful solution to a simple problem.

I'd say engineering insanity at this level is very worrisome for what they've done at the security side of things.

ubu77376y ago

ROFL "performance tuning" this is not tuning this is architecture.

Some people think you can just write software, sell it to customers, and it's "tuning" to make it work properly.

You should be fired from whatever job you have.

My guess is that you have no job, you are fronting USD.

In which case you have absolutely no place in this conversation and you should be ashamed of yourself for speaking up.

A fool and his money are easily parted.

spamizbad6y ago· 11 in thread

The only way this makes sense to me is if they have to contend with lots of expensive parsing, event sequencing, and throttling requirements. Payment APIs, bank websites, etc can be quite byzantine. I could understand how one might code yourself into a corner with a monolithic node app and basically just say "F-it, we're doing this synchronously!"

I don't even think it's a terribly bad thing to do assuming it favors feature velocity.... but at that point, I'd recommend moving away from Node towards something like Python. And if you wanted to dip your toes back into async plumbing land, explore Go or Elixir.

sho6y ago

> explore Go or Elixir

I have never seen a good argument for using golang for business logic. If you are writing the actual server then sure, use golang. If you are writing some high-speed network interconnect, use golang. Some crazy caching system, sure use golang. The public WS endpoint, use golang.

But if you need to access a DB with golang for anything more than, like, a session token, then you made the wrong choice and you need to go back and re-assess.

Elixir is in the "germination phase" and I predict massive adoption in the next 5 years. It is a truly excellent platform, every fintech company I know at least has their toe in the water. Everyone I show this video to [1] just says "well, shit."

[1] https://www.youtube.com/watch?v=JvBT4XBdoUE

yurish6y ago

What is wrong with accessing DB from golang?

bjacokesOP6y ago

You hit the nail on the head here. When N different API requests simultaneously time out – all because a ramda.uniq call in one of them received an array of 100,000 nested objects – it's easy to make a spot code fix, but harder to systematically prevent it from happening in the future. There aren't really linters for "bad event loop blockage". Code reviews are the main tool we have, but you'd be surprised what sorts of logic can trickily block the event loop. For API reliability and development velocity in the short-term, by far the easiest approach was to throw more infrastructure at the problem.
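There may be no linter for it, but event loop stalls can at least be measured at runtime. A rough sketch using Node's built-in `perf_hooks.monitorEventLoopDelay`; the helper names `blockFor` and `measureMaxDelayMs` and all the timing values are hypothetical, and this is not the tooling described in the post:

```javascript
const { monitorEventLoopDelay } = require('perf_hooks');

// Busy-wait for roughly `ms` milliseconds to simulate CPU-bound work
// such as a deep comparison over a very large array.
function blockFor(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // spin on purpose
  }
}

// Schedule `task` on the event loop and report the worst delay observed
// while it ran, in milliseconds.
function measureMaxDelayMs(task, settleMs = 200) {
  return new Promise((resolve) => {
    const histogram = monitorEventLoopDelay({ resolution: 10 });
    histogram.enable();
    setTimeout(task, 20);
    setTimeout(() => {
      histogram.disable();
      resolve(histogram.max / 1e6); // the histogram reports nanoseconds
    }, settleMs);
  });
}
```

Something like `measureMaxDelayMs(() => blockFor(100))` should report a delay on the order of the blocked time. In a long-running service the same histogram could be sampled periodically and alerted on, which catches a stall after the fact but still doesn't prevent it.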

We do use Go for almost all of our other services, and there are an increasing number of integrations written in Python. But we're still using and investing in our Node integrations code for the foreseeable future, and this was an important step for simplifying our infrastructure.

We certainly hope the tooling and rollout process in the post were instructive for anyone using Node, even if their stacks were pristine from day 1 and never need this sort of complex migration :)

stickfigure6y ago

I'd recommend moving away from Node...

Taking a wild guess: Some of their bank integrations probably require browser automation. If you're doing browser automation, the best tool for the job is (currently) Puppeteer, which runs on Node. There are other third-party language bindings for the Chrome dev tools protocol, but Puppeteer is developed by Google as a first-class citizen alongside Chrome.

paulddraper6y ago

I think that overemphasizes Puppeteer itself.

It's really just bindings for the dev tools protocol.

Half the GitHub issues result in "well the protocol requires X and we can't change that".

Puppeteer is popular because it's web automation protocol bindings for a web language, not because it is a sophisticated layer or does very much.

There are literally dozens of language bindings for the protocol. [1] Some are quite good and widely used, for example chromedp (Go bindings). [2]

[1] https://github.com/ChromeDevTools/awesome-chrome-devtools#pr...

[2] https://github.com/chromedp/chromedp

rdsubhas6y ago

4000 chrome instances? Probably not. Here I am trying to run 4 chrome instances in parallel in CI without crashing.

duxup6y ago

That was my thought too. They've got a problem where they've got no idea what a given transaction costs and some unpredictable amount of transactions result in some serious work that holds up the event queue.

God knows they could be waiting for some reel to reel tape to spin up somewhere...

coddle-hark6y ago

The whole point of async I/O is to be able to do something useful while waiting for tape to spin up.

I don’t buy it.

kreetx6y ago

Haskell also has very nice concurrency IMO.

markstos6y ago

Their velocity might have been slowed by figuring out how to manage 4,000 containers effectively. If they had dealt with managing concurrency effectively sooner, they would need 30x fewer containers: about 133.

tracker16y ago

Not so much, they're using ECS which takes care of a lot of those headaches and sounds like they're coordinating with a load balancer / reverse proxy for distributing those requests... A 1-1 request model in that kind of system is really simple to setup. Setting up to orchestrate multiple requests per node was probably much more time intensive.

https://aws.amazon.com/ecs/

Scarbutt6y ago· 5 in thread

I don't want to be that guy, but why did they start with nodejs for something like this instead of using the JVM or Go?

caiobegotti6y ago

They probably already had some decent experience with Node and it solved their initial problem well enough. The cost of refactoring or rewriting usually makes engineering managers frown (wrongfully), so it becomes much harder to fix this in the long term.

hatch_q6y ago

I have experience with node, am frontend developer... but would never in wildest dreams use Node for any kind of production backend. And it's not even the question of programming language - the problem is NPM and whole package management which is inherently insecure.

meritt6y ago

My guess is because their system is primarily issuing HTTP requests and extracting data out of responses: html, xml, json, plaintext, etc. Web scraping is a messy business and using a language that allows you to be flexible with string manipulation and types goes a long way toward sanity.

calibas6y ago

How is Javascript better at string manipulation? I've never encountered anything special there that I can't do in just about every other language. Javascript just has more helper functions out of the box.

mnutt6y ago

Node is pretty good for managing HTTP requests as long as the responses aren't too large. But parsing data, especially html/xml, is CPU-intensive in node and probably not a great fit.

PixyMisa6y ago· 5 in thread

Nobody involved in this project should be allowed to ever be in the same room as a computer again.

dang6y ago

This comment breaks the site guidelines and is not cool. Would you please read https://news.ycombinator.com/newsguidelines.html and stick to the rules when posting here?

jrockway6y ago

Why? They had a 12 factor -ish app that scaled the normal way; run more copies. Eventually that got expensive. They had the observability to figure out what was making it expensive and whether or not their fixes had an effect. They then saved $300,000.

Seems like everything went right to me.

I would be worried if the blog post was "we randomly tweaked some stuff and we can't measure it but it's a little better" or "we rewrote it in go and in the rewrite introduced 87 new bugs while fixing 42 old bugs". They engineered a solution, built from good investment in infrastructure, rather than ninja-ing a hack. That, to me, is a very good thing.

A lot of people seem deeply upset that Node was involved, but I think that's a red herring. The problem they had -- allocate a large chunk of memory, keep a reference to it while it is slowly sent to another server, free memory -- is going to happen in any language. (I don't super agree with their solution of "make the server faster" because one day it's going to be slow for some other reason and this problem will crop up again. Instead they probably just need a fixed amount of memory to dedicate to this process and to drop the debug payload when the buffer is full. Or just put it in the request path if it's crucial that it be produced every time no matter what. At least that will apply backpressure to calling services, pop the circuit breaker, and redirect requests to a region where S3 isn't broken. But I don't think the debug information is THAT important ;)
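The "fixed amount of memory, drop the payload when the buffer is full" idea could look roughly like this. A sketch only: `BoundedDebugBuffer` and its byte budget are hypothetical names for illustration, not anything from Plaid's codebase, and payloads are assumed to be strings or Buffers.

```javascript
// Hypothetical bounded buffer for debug payloads that are waiting to be
// uploaded. It never holds more than `maxBytes` of payload data.
class BoundedDebugBuffer {
  constructor(maxBytes) {
    this.maxBytes = maxBytes;
    this.usedBytes = 0;
    this.queue = [];
    this.dropped = 0;
  }

  // Returns true if the payload was accepted, false if it was dropped
  // because accepting it would exceed the byte budget.
  offer(payload) {
    const size = Buffer.byteLength(payload);
    if (this.usedBytes + size > this.maxBytes) {
      this.dropped += 1;
      return false;
    }
    this.queue.push(payload);
    this.usedBytes += size;
    return true;
  }

  // Removes and returns the oldest payload (for example once its upload
  // has finished), releasing its share of the budget.
  take() {
    const payload = this.queue.shift();
    if (payload !== undefined) {
      this.usedBytes -= Buffer.byteLength(payload);
    }
    return payload;
  }
}
```

With this shape a slow upload target costs dropped debug payloads (counted in `dropped`) rather than unbounded heap growth, which is the trade-off being argued for here.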

GordonS6y ago

> Why? They had a 12 factor -ish app that scaled the normal way

So, yes, horizontal scaling is good, especially for stateless workloads - but that doesn't mean you run the most hopelessly under-performing code imaginable on each node, so you basically have to scale out like this! I mean, seriously, 4000 containers to serve 4000 concurrent requests? I mean, I can't even...

I honestly can't believe the attempts in this thread to justify such an utterly, horrendously bad architecture - there are 1001 better, simpler even, ways to approach this.

Yes, premature optimisation is bad, but optimisation here was nowhere near premature.

PixyMisa6y ago

To save $300,000 they first needed to waste $300,000 by reinventing a problem that was solved in 1967.

bdcravens6y ago

I would say the same thing about hiring people who make snide dismissals.

rauchp6y ago· 4 in thread

That was an interesting read, thanks for linking to it. It's hard finding articles online discussing Node and performance, most people just dismiss it as an unviable option due to scale and speed concerns. 30x really is quite the jump though.

> Each Node worker runs a gRPC server

Not going to lie, this kind of surprised me. When I think of a Node backend I think of ExpressJS. Not because I think Express is better, but because it's been pushed around in the past few years as the fastest, simplest way of running a backend.

Yet, if you're going to be running a gRPC server, why not use a more performant language with better multithreading support? I thought this article was about them optimizing a grandfathered-in solution (such as Express), but I can't tell why they built out a gRPC server in Node in the first place.

bjacokesOP6y ago

Our integrations are primarily written in Node, which was the original language used for everything at Plaid. Almost all of those original services (except for integrations) have been migrated to Go or Python at this point. We've standardized on gRPC as our wire format, so we stayed consistent and used gRPC in Node.

With perfect hindsight, it's a fair point that all the pros and cons could net out to another language being best for our integrations. Integrations are the largest and most quickly-changing codebase at Plaid, so such a migration would be a massive undertaking. We definitely didn't want to block scalability improvements on doing a language migration.

mnutt6y ago

I've been hoping that the Cloudflare folks will open source parts of their Workers; they seem to have figured out a secure, performant way to run untrusted javascript at scale.

jrockway6y ago

The Node gRPC implementation is fine. It uses the C++ implementation which is the gold standard. It has Prometheus and OpenTracing interceptors. You basically give nothing up by using it, if your team wants to write a language that runs on node.

tracker16y ago

The bigger issue to me, is (at least the last time I looked) you can't use the cluster module with node combined with gRPC, so the only real way to take advantage of extra CPU capacity, if available is workers or external processes that are self-managed vs. cluster integration.

tyingq6y ago· 4 in thread

"Only 10% of Plaid's data pulls involve a user who is present"

Since they provide an API, it seems like some of the calls where they think a user isn't present might actually have one present.

bjacokesOP6y ago

We thread knowledge of whether a data pull was initiated by the API or by our cron-style service into our load-balancing layer, so this ends up being pretty straight-forward.

tyingq6y ago

Ahh, got it. The "present and linking their account" part threw me off. Sounded like only the "linking" call was getting the fast lane.

CamouflagedKiwi6y ago

The other 90% are not triggered by the API, they are "periodic transaction updates" - presumably they refresh once a day or something.

tyingq6y ago

Yeah, I read that, but it's not clear exactly what those calls are. It sorta sounds like making assumptions on how their users are using the API.

In fact, it sounds like they think "linking an account" is the only "user present" API call:

"Only 10% of Plaid's data pulls involve a user who is present and linking their account to an app"

tyingq6y ago· 4 in thread

Does node have something similar to how apcu is used with PHP?

That is, an mmap based kv store so that if you choose to run more than one node process on a single server, it has a fast kv cache?

I'm aware you can use redis or similar, but a simple mmap kv store is simpler and faster for a single server use case.

godot6y ago

I totally see what you mean, coming from a PHP world myself a few years ago. The key thing to note is that node.js (like many other languages including Java) starts a server process that basically does not stop until you explicitly restart it (or it crashes); unlike PHP where every request starts a brand new process on a clean slate (hence needing APCu to store a local memory cache per server). Meaning, what you can accomplish with APCu in PHP can be trivially accomplished by a simple Object in node.js (i.e. a map/hash), by virtue of having a require cache (hence every time you require'd the lib it returns the same instance of the object).
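A minimal sketch of that pattern, assuming a hypothetical module called `cache.js` with made-up function names and TTL handling. Because `require` caches modules, every file in the same process that requires it shares the one `Map`; each separate Node process still gets its own copy.

```javascript
// cache.js (hypothetical module name)

// Module-level state: created once per process, on first require.
const store = new Map();

// Store a value that expires `ttlMs` milliseconds from now.
function cacheSet(key, value, ttlMs) {
  store.set(key, { value, expiresAt: Date.now() + ttlMs });
}

// Return the cached value, or undefined if it is missing or expired.
function cacheGet(key) {
  const entry = store.get(key);
  if (entry === undefined) return undefined;
  if (Date.now() >= entry.expiresAt) {
    store.delete(key);
    return undefined;
  }
  return entry.value;
}

module.exports = { cacheSet, cacheGet };
```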

If you want a simple open source lib to do exactly that for you and provide an easy to use API, you can use something like https://www.npmjs.com/package/tmp-cache .

tyingq6y ago

The context is multiple node processes running on a single box, so a shared cache across processes has value for some use cases. I don't think the cache module you suggested would work in that case.

I'm aware of the runtime model differences between node and PHP.

ddorian436y ago

You can use something like LMDB on every language.

tyingq6y ago

Ah, yeah. I suppose that would mean you need a fast node.js serializer. Apcu uses its own serializer that is fast-ish.

nosianu6y ago· 3 in thread

They write (somewhere in the middle)

> Since V8 implements a stop-the-world GC, new tasks will inevitably receive less CPU time, reducing the worker’s throughput

But there is this Google blog post from January 2019:

https://v8.dev/blog/trash-talk

> Over the past years the V8 garbage collector (GC) has changed a lot. The Orinoco project has taken a sequential, stop-the-world garbage collector and transformed it into a mostly parallel and concurrent collector with incremental fallback.

So I guess they used an older node.js version. The current LTS version is 12.x and it is from around the middle of this year.

---

PS: If the blog author reads this, there is an accessibility problem with the Google-hosted inline images. If I try - without ad blocker - in an anonymous window I see none of the inline images. Logged into Google with my own account I can see some but not all the images. Apparently which images I can see depends on being logged in to my Google account? I also tried IE Edge just to see if the browser makes a difference - no inline images visible there either.

_fbpt6y ago

When I try to view the image in a new tab, I get:

Your client does not have permission to get URL /Iw-RdHoPjbwuSAqJHK3C0Sy8m29NqzeHPtmJ7CVFuYqwr4CbwpGjwn9O4bcDNtCf_hLD4FGc75nkQYnJBgyA-CT2ikBDWQD-nAtqxXa4Lw2yDuh_-ywcsDaer6m4LyVtljwfrajO from this server. (Client IP address: [redacted])

Rate-limit exceeded That’s all we know.

bjacokesOP6y ago

Fixed the images about half an hour ago, sorry about this!

greentrust6y ago

Ditto, images weren't showing up for me

mceachen6y ago· 3 in thread

In case anyone else gets excited by JSONStream, know that the package hasn't been updated in over a year, and the GitHub repo was archived by the author with no link to a successor.

contrahax6y ago

I'm maintaining a fork here that incorporates all of the valid open PRs from the original repo + some more updates: https://github.com/contra/JSONStream

It isn't published on NPM (you can use it as a git dependency) but if people are interested I can.

mceachen6y ago

Thanks for sharing!

Why don't you publish releases?

fenwick676y ago

Oboe has a similar API, can't speak for performance though.

http://www.oboejs.com/examples

deedubaya6y ago· 3 in thread

A good example of avoiding premature optimization. I'd imagine delaying tackling this problem freed them up to tackle problems that impact users.

coddle-hark6y ago

This only holds if they didn’t pour hours into the original solution. Setting up and managing 4000 node services doesn’t sound like a quick hack.

bjacokesOP6y ago

While we were worried about event loop blockages causing outages, another more subtle problem would have been if event loop blockages doubled our user-facing latency. (If you read the section on latency ratios, you'll see that comparing parallel vs non-parallel workers was the most useful stat in figuring out how effectively we were using the event loop.) It definitely gave us peace-of-mind to know that event loop blockages wouldn't have an effect beyond the requests they're processing.

Honestly, the accounting for which would've been higher impact – investing in parallelism earlier, or adding infrastructure and having more resources to devote to other pressing needs – is difficult to do, even in retrospect. There was surprisingly little effort required to get to 4,000 node containers in an ECS cluster, other than deploy speed issues which we talked about in a previous post [1]. But it's possible this migration process would have been easier if we had done it sooner.

[1] https://blog.plaid.com/how-we-reduced-deployment-times-by-95...

deedubaya6y ago

Writing and maintaining concurrent code for greenfield projects is relatively hard compared to sync code.

Provisioning and deploying with ECS is usually just mouse clicks.

pdimitar6y ago· 2 in thread

...Or you could just use Erlang or Elixir, where concurrency and parallelism come pretty much out of the box, with very little effort required for you to fine-tune the desired policy / strategy.

The insistence on using Javascript is just beyond lunacy at this point.

timmy-turner6y ago

Well, if Elixir had a Typesystem like Javascript has, I'd instantly switch to it. But atm I'm staying with Node because of Typescript.

pdimitar6y ago

True, it doesn't have it. Between pattern matching and function guards however, it has a decent way to protect against common errors.

The true treasure is Erlang / Elixir's runtime though. The parallelism, the self-healing, the preemptive scheduling.

FanaHOVA6y ago· 2 in thread

$300k is $300k, but they just raised $250M last year, is this a really good use of time for their engineering team? That's a little above ~0.1% of capital.

dwild6y ago

Why wouldn't it be? You save 300k, that's an engineer salary... that's pretty much the meaning of a job, building value that's higher than your salary. This clearly took less than a year of engineer time. Seems like they got their value out of that employee.

bdcravens6y ago

That's just one of the benefits.

> our system is more robust to increases in external request latencies or spikes in API traffic from our customers

mnutt6y ago· 1 in thread

I’d be curious to hear more about the circumstances that ended up with a blocked runloop. Are there hundreds of junior engineers, or perhaps third parties writing code that you don’t control? I have seen people accidentally write blocking code, but not at such an egregious rate that we couldn’t catch it in code review, or at worst the runloop detector would alert on it in prod and we would roll back the deploy.

For instances where you actually know you need lots of CPU, there are now strategies for offloading that specific work, although they have taken a while to get nice and easy to use.
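One such strategy is Node's built-in `worker_threads` module. A minimal sketch, with a toy summation standing in for the real CPU-heavy work; the `sumInWorker` name is hypothetical, and the worker source is inlined as a string only to keep the example in one file:

```javascript
const { Worker } = require('worker_threads');

// Code that runs inside the worker thread. A real project would more
// likely put this in its own file and pass the path to Worker.
const workerSource = `
  const { parentPort, workerData } = require('worker_threads');
  let sum = 0;
  for (let i = 1; i <= workerData.n; i++) sum += i;
  parentPort.postMessage(sum);
`;

// Run the summation off the main thread and resolve with the result.
function sumInWorker(n) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, { eval: true, workerData: { n } });
    worker.once('message', resolve);
    worker.once('error', reject);
    worker.once('exit', (code) => {
      if (code !== 0) reject(new Error(`worker exited with code ${code}`));
    });
  });
}
```

The main thread stays free to serve other requests while the loop runs. The cost is that inputs and results are copied between threads, so this pays off for heavy computation on modest amounts of data rather than for shuffling huge payloads.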

bjacokesOP6y ago

Sure, one example I remember off the top of my head is a bank that sometimes returned duplicate transaction data, so an engineer had called ramda.uniq on the transaction array. Transactions are nested objects and slow to compare, so when you find an account with 100,000 transactions... kaboom. Some scenarios are more subtle, but a common theme is that the amount of data in an account can vary by many orders of magnitude.
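To illustrate why that goes kaboom, a sketch under stated assumptions: `uniqByDeepEquality` is a naive deep-equality dedupe written for this example (not Ramda's actual implementation), and `uniqById` assumes each transaction carries a stable `id` field, which real bank data may not.

```javascript
// Structural equality for plain JSON-like values (a sketch for this example).
function deepEqual(a, b) {
  if (a === b) return true;
  if (typeof a !== 'object' || typeof b !== 'object' || a === null || b === null) {
    return false;
  }
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const keysA = Object.keys(a);
  const keysB = Object.keys(b);
  if (keysA.length !== keysB.length) return false;
  return keysA.every(
    (k) => Object.prototype.hasOwnProperty.call(b, k) && deepEqual(a[k], b[k])
  );
}

// Naive dedupe: every item is deep-compared against every item kept so
// far, so the number of deep comparisons grows quadratically.
function uniqByDeepEquality(items) {
  const kept = [];
  for (const item of items) {
    if (!kept.some((k) => deepEqual(k, item))) kept.push(item);
  }
  return kept;
}

// Keyed dedupe: one Set lookup per item, so the work grows linearly.
// Assumes a stable `id` per transaction (hypothetical field).
function uniqById(items) {
  const seen = new Set();
  const kept = [];
  for (const item of items) {
    if (!seen.has(item.id)) {
      seen.add(item.id);
      kept.push(item);
    }
  }
  return kept;
}
```

With 100,000 mostly distinct transactions, the first version can need billions of deep comparisons in a single synchronous call, while the second does 100,000 set lookups.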

rynop6y ago· 1 in thread

I’d be curious to hear your reevaluation of moving this to Lambda after some of the major announcements during re:invent. My guess is some of the reasons you went ECS have been addressed with these announcements. Obviously some of the new features are still preview, but would be interested to hear your analysis none the less.

bdcravens6y ago

Oftentimes there's a several month delay from when stuff is announced at re:invent and when it's GA. I don't think anyone would ever make technical decisions based on announcements; they would wait until they could touch it and actually create a proof of concept. In other words, the "analysis" is nonexistent, since there's nothing to analyze.

jdc05896y ago

On a positive note: this was a good write up.

On a negative note: FOR THE LOVE OF ALL THAT IS HOLY, HOW DID THIS HAPPEN.

vmarchaud6y ago

I've encountered different issues with NodeJS services in the past (and still do), both with CPU bottlenecks and heap allocations. So I wrote openprofiling-node [0] during this summer to help me profile my apps directly in production and export the result to an S3 bucket. I believe it may help someone else here, so I'm posting it.

[0]: https://github.com/vmarchaud/openprofiling-node

awinter-py6y ago

Compared to a compiled language, node / JIT langs make it difficult to know what will be fast in prod.

V8 JIT means that things like order of keys in an object or number of different calls to a function might affect whether your function gets optimized.

And there's no easy way to find out if a JS function is falling back to slow mode or to tell the buildsystem 'this is a hot path, don't let me write code that deopts this call'.

bfrog6y ago

It's not clear from the article why they were only able to run one request per Node process, but that alone makes it questionable why they used Node at all. The entire point of the environment has been nixed. The article leaves it quite unclear how they arrived at that point in the first place.

supermatt6y ago

Ironic. Linked images failing to display due to "Rate limit exceeded"...

mirekrusin6y ago

4k containers? That's microservices going macro big time.

CyanLite26y ago

TLDR: How to spend millions of dollars of our investors' money because we hired junior devs who chose a framework that was trendy but couldn't scale.

Phil_Latio6y ago

> We were running 4,000 Node containers

LOL

