It’s as much an example of how far world-class talent can go as it is about doing more with less.
I feel like it’s similar to how people point to Craigslist as evidence that you can still build sites in Perl - ignoring the fact that Craigslist has Larry Wall on a retainer.
Running highly scalable monoliths is easy! As long as you’re willing to hire some of the five to ten people in the world who are capable of advancing the state of the art of development on that technology stack…
Stack Overflow didn't need these optimizations. They could have just deployed 20 servers instead and still been profitable. People optimized just because they like to.
I truly believe that being able to design and run a modular monolith application effectively (not talking about the 'hyperscale' scenario here) should be a prerequisite for designing and running a set of interconnected microservices. The challenge is similar, but dealing with modular monoliths has the advantage of not having to deal with the uncertainty of network programming (i.e. remote calls, network error handling, distributed transactions).
Looks like it's expanded a little since then.
Which, to be clear, is not intended to be a negative statement about that "other stuff". It really depends. Some is. But I've also seen things just done poorly by applying tools wrong, e.g. ORM misuse leading to thousands of queries that should have been one OUTER JOIN.
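To make the ORM point concrete, here's a minimal sketch of the N+1 pattern versus the single JOIN, using Python's sqlite3 directly. The schema and data are made up for illustration; lazy-loaded ORM relations generate the first shape behind your back.

```python
import sqlite3

# In-memory demo schema: users and their posts.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT);
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO posts VALUES (1, 1, 'first'), (2, 1, 'second'), (3, 2, 'third');
""")

# N+1 style: one query for the users, then one more query per user --
# the shape a misused ORM often produces via lazy loading.
def titles_n_plus_one():
    result = {}
    for uid, name in db.execute("SELECT id, name FROM users ORDER BY id"):
        result[name] = [t for (t,) in db.execute(
            "SELECT title FROM posts WHERE user_id = ? ORDER BY id", (uid,))]
    return result

# The same data fetched in a single LEFT OUTER JOIN.
def titles_one_join():
    result = {}
    rows = db.execute("""
        SELECT u.name, p.title
        FROM users u LEFT OUTER JOIN posts p ON p.user_id = u.id
        ORDER BY u.id, p.id
    """)
    for name, title in rows:
        result.setdefault(name, [])
        if title is not None:
            result[name].append(title)
    return result
```

With thousands of users, the first version issues thousands of round trips to the database; the second issues one.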
But I don't think you need engineers of their unique calibre to get most of what they got. It's probably an exponential thing, if you have some merely good engineers you could maybe achieve 80% of their performance. The last 20% are just much more costly.
I know that in many cases simple != easy but I can't help feeling sad while reading this.
When I started my career, cloud wasn't yet mainstream, but as a beginner I was able to deploy and configure an nginx proxy and load-balance between 2-3 backend servers without too much effort. It wasn't some kind of rocket science.
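The whole exercise really does fit in a few lines of nginx config. A minimal sketch, with made-up upstream names, addresses, and ports:

```nginx
# Illustrative only: three hypothetical backend servers behind one proxy.
upstream backend {
    server 10.0.0.11:8080;
    server 10.0.0.12:8080;
    server 10.0.0.13:8080 backup;  # only used if the others are down
}

server {
    listen 80;
    location / {
        proxy_pass http://backend;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```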
I guess the current issue is that cloud has been marketed so much that nobody who's just starting out in the industry even has a second thought about using it by default. What can I say, great job from the cloud providers in capturing their customers as soon as they get in front of the store.
> EF Core 6.0 performance is now 70% faster on the industry-standard TechEmpower Fortunes benchmark, compared to 5.0.
> This is the full-stack perf improvement, including improvements in the benchmark code, the .NET runtime, etc. EF Core 6.0 itself is 31% faster executing queries.
> Heap allocations have been reduced by 43%.
> At the end of this iteration, the gap between Dapper and EF Core in the TechEmpower Fortunes benchmark narrowed from 55% to around a little under 5%.
https://devblogs.microsoft.com/dotnet/announcing-entity-fram...
Again, this isn't to take anything away from Dapper. It's a wonderful query library that lets you just write SQL and map your objects in such a simple manner. It's going to be something that a lot of people want. Historically, Entity Framework performance wasn't great and that may have motivated StackOverflow in the past. At this point, I don't think EF's performance is really an issue.
If you look at the TechEmpower Framework Benchmarks, you can see that Dapper and EF performance is basically identical now: https://www.techempower.com/benchmarks/#section=data-r21&l=z.... One fortunes test is 0.8% faster for Dapper and the other is 6.6% faster. For multiple queries, one is 5.6% faster and the other is 3.8% faster. For single queries, one is 12.2% faster and the other 12.9% faster. So yes, Dapper is faster, but there isn't a huge advantage anymore - not to the point that one would say StackOverflow has tuned their code to such an amazing point that they need substantially less hardware. If they swapped EF in, they probably wouldn't notice much of a difference in performance. In fact, in real-world apps, the gap between them will probably end up being even smaller.
If we look at some other benchmarks in the community, they tell a similar story: https://github.com/FransBouma/RawDataAccessBencher/blob/mast...
In some tests, EF actually edges past Dapper since it can compile queries in advance (which just means calling `EF.CompileQuery(myQuery)` and assigning the result to a static variable that gets reused).
Again, none of this is to take away from Dapper. Dapper is a wonderful, simple library. In a world where there's so many painful database libraries, Dapper is great. It shows wonderful care in its design. Entity Framework is great too and performance isn't really an interesting distinction. I love being able to use both EF and Dapper and having such amazing database access options.
No doubt EF has probably gotten to that level since MS has done a stellar job with .NET core of relentlessly slimming things down and improving performance.
Thread says SO allocates 1.5TB RAM to SQL Server. Sounds wise.
If the data is sitting in memory, and you've tuned extracting the data from memory as fast as possible, job done.
It's almost always relatively normal sized services split by functional area e.g. Auth, Cache etc.
At that point there is no 'cloud' design that can help. It's either one database (or maybe just shard everything onto thousands of distributed nodes).
But the point I am trying to make is that Kubernetes, microservices, etc. are based on the idea of winners - power laws. One tweet everyone wants to read. One search term, one viral video.
Then again, this is just a question of taste - the taste of the dev lead, what (s)he feels is the best approach. Take another company doing the same thing and a different approach might emerge.
This question does not appear to be about programming, Closed.
hyped-up technologies
subjective, Closed
problems caused by over-engineering
Opinion-based, Closed.
If you need that level of performance you need to go bare-metal, and this is where you'll hit a lot of roadblocks (yet they will be happy to spend 10-100x more money trying to make do with the cloud).
My current hobby is to try to run monolithic apps like these on serverless services like Cloud Run. There's still some pain related to attaching persistent storage to a container, but otherwise it feels like a great option.
So if you were to implement this same architecture using Kubernetes or Serverless, it would be just as simple as a bunch of Ansible or Puppet scripts.
If you want to run it on Kubernetes I hope you know how to install/maintain K8S on-prem, because there's no way you're going to get this level of performance from any cloud provider (not at a sane price anyway).
From my limited experience, many engineers fall into the trap of adding accidental complexity to an otherwise simple architecture just by trying to use the latest/coolest cloud architecture trend.
Monolith in the cloud on kubernetes? Speak no such abomination. Of course we have to do microservices, the more the better. How can we scale otherwise?
SQL DB? What is this, 2010? Of course we're going to use Cosmos DB, how else could we get "single-digit millisecond response times, automatic and instant scalability, along with guarantee speed at any scale".
Of course I'm exaggerating for dramatic effect but I rarely see teams disciplined enough to keep cloud architectures simple and clean.
It isn't clear to me this is a model that would work elsewhere, or should be held up as something to be replicated.
Did they save time? Did they save money? Did this help make SO a wildly successful company? Did it allow them to deliver features to customers faster?
If you're still growing and more interested in delivering tons of features quickly, and/or don't have the ability to attract world leading talent, then a more complicated architecture with clear boundaries is often a better call than delivering relatively few features with obsessive rigor in a monolithic codebase.
Servers:
SQL Servers (Stack Overflow Cluster)
2 Dell R720xd Servers
SQL Servers (Stack Exchange “…and everything else” Cluster)
2 Dell R730xd Servers
Web Servers
11 Dell R630 Servers
Service Servers (Workers)
2 Dell R630 Servers
1 Dell R620 Server
Elasticsearch Servers (Search)
3 Dell R620 Servers
HAProxy Servers (Load Balancers)
2 Dell R620 Servers
Redis Servers (Cache)
2 Dell R630 Servers
VM Servers (VMWare, Currently)
2 Dell FX2s Blade Chassis, each with 2 of 4 blades populated
4 Dell FC630 Blade Servers (2 per chassis)
2 Equalogic SAN PS6000-series
Machine Learning Servers (Providence)
2 Dell R620 Servers
Machine Learning Redis Servers (Still Providence)
3 Dell R720xd Servers
LogStash Servers
6 Dell R720xd Servers
HTTP Logging SQL Server
1 Dell R730xd
Development SQL Server
1 Dell R620
Network:
2x Cisco Nexus 5596UP core switches (96 SFP+ ports each)
10x Cisco Nexus 2232TM Fabric Extenders (2 per rack)
2x Fortinet 800C Firewalls
2x Cisco ASR-1001 Routers
2x Cisco ASR-1001-x Routers
6x Cisco 2960S-48TS-L Management network switches (1 Per Rack)
https://nickcraver.com/blog/2016/03/29/stack-overflow-the-ha...

Is there a particular reason to suggest a change to the architecture?
[1] https://twitter.com/sahnlam/status/1629713954225405952/photo...
It's easy to interpret that as "stackoverflow should change to be like this", but I think it was meant to be more like "If I had to guess how stackoverflow works, this is what I think it would look like".
It's amazing how much performance and scalability you can get out of computers, if you don't burden them with 100x overhead caused by shoveling data between microservices all the time :-)
> It's easy to interpret that as "stackoverflow should change to be like this", but I think it was meant to be more like "If I had to guess how stackoverflow works, this is what I think it would look like".
That's not a better interpretation. It says something (something not good) about the mindset of modern software engineers that the first thing they think of when they look at a website like StackOverflow is an n-layer microservice architecture with more moving components than a Swiss chronometer.

I have a subjective feeling that Stack Overflow is down a lot more than other websites. I don't see that ever mentioned in the discussion of cloud vs on-prem, which makes the discussion seem lacking.
Packets randomly time out on the internet. I would take this random dashboard with a grain of salt; we cannot be sure SO had an outage just because one request happened to fail.
I have personally seen Stack Overflow be "under maintenance" or straight up down a lot more than I have seen entire us-east-1 down.
A hidden takeaway is that NVMe-storage databases are so fast that they are comparable to in-memory (Redis) databases these days.
Yes, but maybe not as much as you’d think.
I've always heard (and it made sense to me) that to reduce latency of requests from across the globe, you might want to have read replicas or caches spread on global infrastructure. Then how is it that stack overflow is fast here when the db is on-prem, 7 seas across from me? Any amount of RAM should not account for the distance, right?
This is one advantage of server-rendered HTML (though that's not the only option you have).
It also helps that StackOverflow is light on interactivity. You load a page, read for a minute, then maybe click a vote button or open a textarea to discuss. As long as the text and styles load quickly, you won't notice if progressive enhancement scripts take a little more time to load.
One of the only well known sites to do so, I think?
At KotlinConf in April I'll be giving a talk on two-tier architecture, which is the StackOverflow simplicity concept pushed even further. Although not quite there yet for social "web scale" apps like StackOverflow, it can be useful for many other kinds of database backed services where the users are a bit more committed and you're less dependent on virality. For example apps where users sign a contract, internal apps, etc.
The gist is that you scrap the web stack entirely and have only two tiers: an app that acts as your frontend (desktop, mobile) and an RDBMS. The frontend connects directly to the DB using its native protocols and drivers, the user authentication system is that of the database. There is no REST, no JSON, no GraphQL, no OAuth, no CORS, none of that. If you want to do a query, you do it and connect the resulting result stream directly to your GUI toolkit's widgets or table view controls. If what you want can't be expressed as SQL you use a stored procedure to invoke a DB plugin e.g. implemented with PL/Java or PL/v8. This approach was once common - the thread on Delphi the other day had a few people commenting who still maintain this type of app - but it fell out of favor because Microsoft completely failed to provide good distribution systems, so people went to the web to get that. These days distributing apps outside the browser is a lot easier so it makes sense to start looking at this design again.
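As a toy sketch of the two tiers in Python - sqlite3 stands in here for the networked RDBMS (Postgres, SQL Server) the design actually calls for, and `render_table` is a hypothetical placeholder for binding rows to a GUI toolkit's table-view control:

```python
import sqlite3

# Tier 2: the RDBMS. An in-memory SQLite DB stands in for a networked
# database the desktop/mobile app would dial directly over its native protocol.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);
    INSERT INTO orders VALUES (1, 'acme', 120.0), (2, 'globex', 80.5);
""")

# Tier 1: the frontend. No REST, no JSON -- the query's result stream
# feeds the UI directly. `render_table` is a made-up stand-in for a
# GUI table view; here it just formats rows as text.
def render_table(rows):
    return [f"{customer:10} {total:8.2f}" for customer, total in rows]

lines = render_table(db.execute("SELECT customer, total FROM orders ORDER BY id"))
```

The point of the sketch is what's absent: no serialization layer, no ad-hoc API between frontend and storage, just a parameterized query wired to a widget.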
The disadvantages are that it requires a couple more clicks up front for end users, and if they have very restrictive IT departments it may be harder for them to get access to your app. In some contexts that doesn't matter much, in others it's fatal. The tech for blocking DoS attacks isn't as good, and you may require a better RDBMS (Postgres is great but just not as scalable as SQL Server/Oracle). There are some others I'll cover in my talk along with proposed solutions.
The big advantage is simplicity with consequent productivity. A lot of stuff devs spend time designing, arguing about, fighting holy wars over etc just disappears. E.g. one of the benefits of GraphQL over plain REST is that it supports batching, but SQL naturally supports even better forms of batching. Results streaming happens for free, there's no need to introduce new data formats and ad-hoc APIs between frontend and DB, stored procedures provide a typed RPC protocol that can integrate properly with the transaction manager. It can also be more secure as SQL injection is impossible by design, and if you don't use HTML as your UI then XSS and XSRF bugs also become impossible. Also because your UI is fully installed locally, it can provide very low latency and other productivity features for end users. In some cases it may even make sense to expose the ability to do direct SQL queries to the end user, e.g. if you have a UI for browsing records then you can allow business analysts to supply their own SQL query rather than flooding the dev's backlog with requests for different ways to slice the data.
Our main production "infra" was a load-balanced pair of medium CPU front-end servers and a high-memory back-end for the SQL server. Theirs was approximately 20x the size, and a more "traditional" cloud microservices, etc. infrastructure. Optimization makes all the difference. So many of the "extras" just add unnecessary complexity, just like avoiding those "extras" probably does when they actually are required.
But ultimately, a db like SQL Server or Oracle will just let you use lots of connections without breaking a sweat. They're both threaded and fully async, it's a much more efficient model.
That's a little bit arrogant, no?