Lessons learned from running GraphQL

(blog.dream11engineering.com)

62 points · arbobmehmood · 4y ago · 63 comments

50 comments · 13 top-level

ripper1138 · 4y ago · 14 in thread

I wonder if they feel like GraphQL was worth it, vs. normal API servers. Maybe they saved some dev time on the front end, but did that outweigh the dev time spent on building, optimizing, etc? Somehow I doubt it.

doctor_eval · 4y ago

I’ve been using graphql for years. In my experience, it dramatically simplifies microservice API architecture vs “normal” API servers.

Graphql is super easy to understand, easy to deploy, easy to scale and easy to grow.

It’s not perfect - the lack of namespaces can be a pain, a few more standard types would be good, and mutations feel a bit underbaked - but there’s much to love, and very little to dislike.

zelphirkalt · 4y ago

Downsides afaik are: (1) No way to write queries that return recursive JSON objects of arbitrary depth. (2) Not using standard JSON as the format for writing your query, and instead unnecessarily making up a new query language, basically a design flaw. (3) More dependencies in the frontend as well as the backend. (4) It is more difficult to determine what exactly is going on when processing one query, and therefore more difficult to fix performance problems.

tshaddox · 4y ago

I think it would depend mostly on the diversity of combinations of data the frontend needs. Their GraphQL implementation is essentially automating the process of frontend teams asking the backend for new endpoints or configurations of existing endpoints to deliver new combinations of data for use by frontend clients. It’s pretty easy to see that for certain frontend requirements the backend GraphQL will be worth it, and for other frontend requirements the backend work would not be worth it. In this case I’m relatively confident that they’re coming out ahead.

veidelis · 4y ago

I assume that to prevent DoS attacks, the backend would need to know the list of allowed queries. This change would be easier than changing a rest api, for example.

nesarkvechnep · 4y ago

I bet REST API + HTTP caching is going to outperform the GraphQL APIs. And maybe most importantly, it’s going to be cheaper.

gonzo41 · 4y ago

I would place a bet along with you on that one. The question the GraphQL folk need to think about is who's paying for the value-add of large one-hit requests. Cycles add up! There's a price to pay for having your API also parse and understand a request beyond just fetching the result.

timsuchanek · 4y ago

You’ll still very likely end up with dozens of http REST calls that could be one GraphQL query, hence one http call. Bias alert (I’m the founder), but I believe you should have a look at GraphQL edge caching: https://graphcdn.io

jjtheblunt · 4y ago

I wonder if they feel like Node was worth it. (seriously)

brabel · 4y ago

Exactly, the problems they found were mostly Node problems, not GraphQL problems.

Zababa · 4y ago

I wonder why we haven't yet seen more optimisation for JS. The only thing I'm aware of is the Google Closure Compiler. With everyone using TypeScript, there must be a way to use all that type information to make JS run more efficiently, or compile parts of your TypeScript code.

nesarkvechnep · 4y ago

Seems to me Elixir + Absinthe would’ve been a better tech stack.

sibeliuss · 4y ago

> Somehow I doubt it

I don't know. They seem to be satisfied customers, and were simply optimizing an already working pipeline in anticipation of saving money.

As a side note, I haven't seen a single "We tried GraphQL and it failed us" story on HN. Not that they don't exist, of course. It's just that there doesn't seem to be much debate about its promise.

mirekrusin · 4y ago

If you read between the lines you'll see failures. This article is an example. Spawning 7,500 servers to handle this traffic is a facepalm failure.

FpUser · 4y ago

>"As a side note, I haven't seen a single "We tried GraphQL and it failed us" story on HN"

Why would it "fail"? It is just one of many possible protocols for querying data. It is like arguing about using this computer language vs. that computer language. Barring differences in performance, they would all work.

doctor_eval · 4y ago · 6 in thread

Do I read this right? 1,000,000 requests/sec across 7,500 instances is only about 133 requests/second per instance, and graphql wouldn’t typically represent the business logic or data layer.

I love me some graphql, but that seems to be a very low figure. I’m curious how complex the queries are and what else these servers are doing.
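
As a quick check of that arithmetic, here is a minimal sketch using only figures quoted in this thread: 1 million requests/sec, 7,500 instances, and the 8-core machine size mentioned further down. The per-core number assumes load is spread evenly across cores, which is a simplification.

```python
# Figures quoted from the article and from the Dream11 comment in this thread.
total_requests_per_sec = 1_000_000
instances = 7_500
cores_per_instance = 8  # "8 cores for the majority of our stack"

per_instance = total_requests_per_sec / instances
# Simplifying assumption: requests are spread evenly over all cores.
per_core = per_instance / cores_per_instance

print(f"{per_instance:.1f} requests/sec per instance")  # 133.3
print(f"{per_core:.1f} requests/sec per core")          # 16.7
```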

jayceedenton · 4y ago

I've seen similar situations where each host was achieving just 16 requests/sec.

Engineers are expensive, and growing more so every year. It's hard to justify time spent to optimise rather than throwing more instances at it. The cloud has made this worse in a way, since provisioning more hosts can be done so easily.

Not many engineers even have the skill to identify and resolve performance problems, so again, people just keep adding more machines. Long term, the problem slowly builds all over the system and the bill becomes mind-boggling.

I do think that we (the software folks) don't help ourselves here. We build frameworks and tools that are still far too hard to inspect. What to watch (and how to optimise) in production is often not considered deeply when building or documenting the hot new thing.

jopsen · 4y ago

Well, adding machines allows you to postpone fixing the problem.

It's perfectly reasonable, once the bill starts to catch up to your budget you spend time on optimizations :)

EDIT: the only solid argument against throwing machines at the problem is that scaling something across multiple servers is hard. If you spend energy on performance, maybe you don't have to.

lbriner · 4y ago

Not sure why this is downvoted; there are definitely some issues here. I find the same: many things, not just GraphQL but also SQL libraries or entire frameworks, are not easy to performance-test, or perhaps they are easy to test but the problems are hard to resolve.

I also agree that most of the devs I have ever worked with, in the UK, have little to no idea about how to actually test performance effectively.

Even though I am personally really interested in performance, even using a cool tool like Resharper profiler takes some time to get your head round.

doctor_eval · 4y ago

I get it, but at this scale I think we are talking in the order of a million bucks a year. I don’t know the situation in India, but I imagine that buys a lot of engineers.

chank · 4y ago

From my own experiences building/scaling graphql, this is embarrassingly low. The poor performance is definitely coming from poor code/lib selection.

dang · 4y ago

Ok, we've taken "at scale" out of the title above.

FpUser · 4y ago · 4 in thread

>"We provision approximately 7,500 instances for 1 million requests per second."

Looking at these numbers makes me think that a single instance of a properly written server running on a single dedicated piece of hardware can handle this without breaking a sweat. My servers, for example, handle thousands of requests per second. It looks to me like one giant waste of human and hardware resources. Not a very "green" approach, I would say.

lbriner · 4y ago

The benefit you are getting from GraphQL is massive flexibility of the query in return for probably sub-par performance.

The fact that you can get X 100K requests per second best-case is not really the point. The point is if I don't want to write hand-cranked code for every kind of possible query, I take a performance hit as a result.

Not sure how easy it would be for them to identify poorly performing queries and split them out into their own optimised code.

FpUser · 4y ago

>"The point is if I don't want to write hand-cranked code for every kind of possible query, I take a performance hit as a result."

This is how we end up with the architectures consuming orders of magnitude more computing resources and giant management overhead. Just because someone wants to be spared from a bit of thinking.

I can see how GraphQL would work for orgs with massive scale like FB/Google/insert your favorite. For most of the rest of the world it is nothing but unneeded overhead on resources, both human and computing.

And of course cloudy people like Amazon would love you to use all this tech. The more you slow down your application, the more resources you will be leasing from them, so they get more money.

quonn · 4y ago

The default implementation of GraphQL has a lot of overhead in query parsing and validation alone. You can try this yourself with complex queries and simulating some load.

But it's an issue that can be solved.
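
One common way to reduce this kind of overhead is to parse and validate each distinct query string once and reuse the result. The sketch below only illustrates that idea and is not necessarily what the commenter had in mind: `parse_and_validate` is a hypothetical stand-in for a real GraphQL parser, and the cache size of 1024 is an arbitrary assumption.

```python
from functools import lru_cache

parse_calls = 0  # counts how often the expensive step actually runs


def parse_and_validate(query: str) -> tuple[str, ...]:
    """Hypothetical stand-in for a real GraphQL parse + validate step."""
    global parse_calls
    parse_calls += 1
    # Toy "parser": tokenize on braces and whitespace. A real server would
    # build an AST and validate it against the schema here.
    return tuple(query.replace("{", " { ").replace("}", " } ").split())


@lru_cache(maxsize=1024)  # assumption: 1024 distinct hot queries is enough
def cached_document(query: str) -> tuple[str, ...]:
    """Return the parsed document, parsing only on a cache miss."""
    return parse_and_validate(query)


query = "{ user { id name } }"
for _ in range(1000):
    doc = cached_document(query)

print(parse_calls)  # 1: the other 999 requests reused the cached document
```

This only helps when the set of distinct query strings is small relative to traffic; a client that sends unique strings on every request would still miss the cache each time.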

FpUser · 4y ago

>"You can try this yourself with complex queries and simulating some load."

I do not need to try it. I know what it takes to parse/validate these kinds of queries and then manage to get and assemble the results from numerous sources.

>"But it's an issue that can be solved."

No. This issue will not be solved, as in general it is a problem of mapping one storage / functionality format to the end client format. It can be easily solved for particular situations by writing custom servers (this is, for example, one of the things I do), but doing it generically introduces overhead / costs that are very unhealthy for normal businesses.

And it is of course bad as it wastes energy.

cryptica · 4y ago · 4 in thread

It was obvious to me from the beginning that GraphQL would add overhead and complexity on the backend; especially related to caching all possible permutations/views of the data. In some cases I imagine it would consume a lot of memory; wouldn't it cause a memory leak vulnerability if you allow infinite permutations to be cached by the server? On the other hand, if you only cache responses to popular requests, doesn't that expose your servers to DDoS? An attacker could just generate a ton of unique GraphQL queries to make the servers bypass the cache and consume a ton of CPU. The fact that GraphQL allows all these permutations in the queries is the root of the problem. It's not something which can be solved or optimized within GraphQL.

I think it's regrettable that all the big money got behind GraphQL instead of aiming for solutions which provide resource granularity and shift decision-making to the client side. Who is better placed to know what resources they want than the client? A big advantage of HTTP/REST is that it either serves individual resources or a limited number of different collections of resources and it lets clients do the heavy lifting of figuring out which resources they need and how they want to combine them. Caching REST endpoints is straightforward and resilient to DDoS attacks because the variation in responses is strictly limited.

Also, it makes sense to move processing to clients when those processing costs are imperceptible to users.

void_mint · 4y ago

> It was obvious to me from the beginning that GraphQL would add overhead and complexity on the backend;

Did you read the article? Most of the issues weren't related to GraphQL, they were just Node issues/optimizations.

> I think it's regrettable that all the big money got behind GraphQL instead of aiming for solutions which provide resource granularity and shift decision-making to the client side.

This is the stated intent of GraphQL. Literally the reason it exists.

n_e · 4y ago

> The fact that GraphQL allows all these permutations in the queries is the root of the problem. It's not something which can be solved or optimized within GraphQL.

Common ways to solve that are to whitelist the allowed queries or to cache at the resolver level instead of the query level.
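
A minimal sketch of the first option, an allowlist of known queries keyed by hash, might look like the following. The query strings, the `query_id` helper and the `ALLOWED_QUERIES` registry are all hypothetical names for illustration; real deployments usually generate such a registry at build time from the client code.

```python
import hashlib


def query_id(query: str) -> str:
    """Stable identifier for a query: the SHA-256 hex digest of its text."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()


# Hypothetical registry of queries the clients are known to send.
ALLOWED_QUERIES = {
    query_id(q): q
    for q in (
        "{ matches { id startsAt } }",
        "{ me { id name } }",
    )
}


def resolve_query(qid: str) -> str:
    """Return the registered query text, or refuse anything unregistered."""
    try:
        return ALLOWED_QUERIES[qid]
    except KeyError:
        raise PermissionError("query is not on the allowlist") from None


known_id = query_id("{ me { id name } }")
print(resolve_query(known_id))  # { me { id name } }

try:
    resolve_query(query_id("{ users { passwordHash } }"))
except PermissionError as exc:
    print(f"rejected: {exc}")
```

Because the server only ever executes a fixed set of queries, an attacker cannot force cache misses by generating arbitrary permutations, which is the concern raised in the parent comment.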

jensneuse · 4y ago

I'd like to argue against that. Yes, whitelisting is a solution. But Caching at the Query level can be extremely efficient. I'm the founder of WunderGraph and we're doing it like this. We turn GraphQL Operations into REST/JSON-RPC Endpoints, allowing them to be cached by CDNs, Browsers, etc... https://wundergraph.com/docs/overview/features/edge_caching

quonn · 4y ago

> I think it's regrettable that all the big money got behind GraphQL instead of aiming for solutions which provide resource granularity and shift decision-making to the client side. Who is better placed to know what resources they want than the client?

With GraphQL the client specifies exactly what it needs. It's as granular as you can imagine, unlike REST.

DarthNebo · 4y ago · 3 in thread

Either optimising the main app or caching with Redis/Memcached can seriously reduce the number of instances & improve the 133 req/sec per server metric as well.

lbriner · 4y ago

It sounds like they are already caching, but the problem is what happens when the number of possible queries is so high that you cannot cache them all. Also, on "live" apps like the football ones, the result of a query might change relatively quickly, so it can't be cached.

What might be possible is double-level caching, so you cache underlying data and then query from that, the results of which are also cached.

timsuchanek · 4y ago

Depends how fast your data changes. SWR (stale-while-revalidate) can be very effective for that. I suggest having a look at GraphCDN (bias alert, I’m one of the founders, so take it with some salt ;)
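
For readers unfamiliar with stale-while-revalidate, here is a rough single-threaded sketch of the behaviour. The class name, the `max_age` / `stale_window` parameters and the injectable clock are assumptions made for illustration, and a real implementation would refresh in the background rather than inline as done here.

```python
import time
from typing import Any, Callable


class SWRCache:
    """Serve fresh entries directly; serve stale ones while refreshing them."""

    def __init__(self, max_age: float, stale_window: float,
                 clock: Callable[[], float] = time.monotonic):
        self._max_age = max_age
        self._stale_window = stale_window
        self._clock = clock
        self._entries: dict[str, tuple[float, Any]] = {}

    def get(self, key: str, load: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None:
            stored_at, value = entry
            age = now - stored_at
            if age <= self._max_age:
                return value  # fresh: no backend call
            if age <= self._max_age + self._stale_window:
                # Stale but usable: answer with the old value and refresh
                # so the next caller sees new data. (Inline here; a real
                # cache would do this refresh in the background.)
                self._entries[key] = (now, load())
                return value
        value = load()  # missing or too old: the caller has to wait
        self._entries[key] = (now, value)
        return value


now = [0.0]  # fake clock so the example is deterministic
versions = iter(["v1", "v2", "v3"])
cache = SWRCache(max_age=5, stale_window=30, clock=lambda: now[0])

first = cache.get("scores", lambda: next(versions))   # miss, loads v1
now[0] = 10.0
second = cache.get("scores", lambda: next(versions))  # stale: v1, refreshes to v2
now[0] = 11.0
third = cache.get("scores", lambda: next(versions))   # fresh v2
print(first, second, third)  # v1 v1 v2
```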

timsuchanek · 4y ago

You could do that or directly cache at the edge [0] - makes it much faster for the user and is cheaper. https://graphcdn.io does that.

fabian2k · 4y ago · 2 in thread

The big items in that list of performance issues don't seem to have anything to do with GraphQL. They seem to be related to a heavily functional style using lots of immutability and the Ramda library. I'd also suspect that these choices are responsible for the GC issues due to lots of allocations for the immutable objects.

I know, premature optimization and all that. But I think at the point where you're going for microservices because of scaling you really should also look into the lower level issues like that from the start. You should notice that the shiny library you're using is 100x slower than just writing plain code. And you should be aware of excessive allocations in hot paths.

Zababa · 4y ago

If you're using microservices, it may be a good opportunity to try a language that supports immutability and the heavy functional style without the loss of performance. Something like Elixir, Haskell, OCaml, Rust.

jeswin · 4y ago

At around one million requests per second, imperative programming becomes affordable; and more importantly, necessary.

orf · 4y ago · 2 in thread

Wouldn’t using a lambda for this be a good choice? You’re just parsing an input document into a set of backend requests and then executing them - there doesn’t have to be anything stateful here that would require an actual running instance.

If you combine this with API-gateway you’ve got caching (and potentially token auth) for free.

jensneuse · 4y ago

What you describe already exists. I'm the founder of https://wundergraph.com and we're doing exactly what you describe, combining GraphQL with Auth and Caching, plus some more extras...

orf · 4y ago

I was referring to deploying the graphql routers onto an AWS lambda within your own account, which looks entirely different from your product?

rawoke083600 · 4y ago · 1 in thread

medium.com is terrible with GQL... I can have 50 tabs open in Chrome. Then I open one tab on medium.com and the CPU spikes like crazy. Turns out it's their GraphQL queries at 100 miles a second. One of the API (GQL) queries fails (I'm guessing my adblocker?), then the retry seems to be in a tight never-ending loop!

vosper · 4y ago

This doesn’t sound like a problem due to GraphQL, though?

tusharmath · 4y ago · 1 in thread

At Dream11 we love GraphQL :) It has made the lives of both — service owners and clients (Android, iOS, Web) much easier!

Over the years we had packed the server with almost all the graphQL optimizations we could find on the internet. The blog outlines some of the key optimizations we had put in to improve the performance of our application code (which doesn't have a lot to do with GraphQL, as most people have already commented). I still want to give a bit of an "insider's perspective" as much as I can, so here it goes:

1. The graphQL team that did the optimizations had two engineers who were actively working on it. It seemed like a futile project at first. The goal was to find low-hanging fruit (if any) and prepare for our peak season (IPL 2021), but eventually find other long-term alternatives. Killing graphQL altogether and moving that logic to the clients was still on the table. Fortunately, the team did a fantastic job of optimizing it so much that we are now committed to supporting it long-term.

2. We try to keep our microservices as discrete, pointed, and unopinionated as possible. We also indulge the clients by letting them query huge amounts of data at once. All this makes our graphQL layer seriously complex. There is a huge amount of computation that happens on this layer. To give some perspective, our /health call to the server is 10x faster than the most requested graphQL query. Needless to say, it's not a fair comparison because, unlike the query, /health doesn't make any network calls or incur any practical CPU load.

3. We have caching implemented on our graphQL clients. However, the reason we get such a high request rate is that our concurrency is also very high. A typical user barely makes 10 requests in a minute, but overall we serve millions of requests in a second.

4. As a part of the long-term strategy, we did consider using Rust as our stack of choice. We had heard a lot of noise about how Rust was beating all the benchmarks. So we did some POCs internally and implemented a part of our graphQL service in Rust. What we learned was that the Rust implementation was ~2.5x faster than our node.js implementation and also consumed relatively less memory. This was fine but wasn't good enough for us to migrate our large node.js codebase and learn a completely new stack. Building a team with domain expertise in Rust in India is particularly hard.

5. It might seem like we are not pushing the production servers hard enough, and you'd be surprised to know that it's true! Because our traffic is very unpredictable, we like to maintain a comfortable CPU utilization for every possible extreme scenario that our Data Science team can predict. Our edge layer going down would seriously hit revenue. So even when our benchmarks say we can push the systems 5x more, the final call remains with the Site Reliability teams and the risk appetite we have for that particular game.

6. The blog briefly also talks about using multiple ELBs, to which we distribute traffic using DNS. The problem with DNS is that it doesn't guarantee a truly uniform distribution of the traffic. Even with a very low TTL, sometimes we observe a difference of more than 20% in requests/sec between two ELBs at an instant. This and other infrastructure-specific nuances have to be considered by the SRE teams to estimate capacity on production.

7. Lastly, the servers we use on production are small machines — 8 cores for the majority of our stack. This lies in the goldilocks area where we get the best cost to performance ratios. Scaling down or up the machine type has a significant impact on the cost.

It's been a journey of love and hate with graphQL and we continue to invest in making our edge robust and even faster. Feel free to connect with us on — https://twitter.com/D11Engg

tusharmath · 4y ago

https://twitter.com/Dream11Engg?s=09

This is the new link. Somehow the old link doesn't work if you have the app installed.

tshaddox · 4y ago

Since it wasn’t mentioned, I’m curious if they have ever investigated the performance of GraphQL server implementations in other programming languages.

plasma · 4y ago

Found the Dream11 CTO, Amit Sharma, giving a tech talk about scaling, https://youtu.be/WifL4SWGJQw

withinboredom · 4y ago

I was considering a similar architecture for something else. It’s nice to see that it basically works to a point because I was worried about exactly the costs they ran into.

nivertech · 4y ago

I've load-tested Absinthe on a single but beefy EC2 instance and got ~10K/s dummy GraphQL queries (not involving a database, just a resolver returning the value directly).
