For any job-hunters, it's important you forget this during interviews.
In the past I've made the mistake of trying to convey this in system design interviews.
Some hypothetical startup app:
> Interviewer: "Well what about backpressure?"
> "That's not really worth considering for this amount of QPS"
> Interviewer: "Why wouldn't you use a queue here instead of a cron job?"
> "I don't think it's necessary for what this app is, but here's the tradeoffs."
> Interviewer: "How would you choose between sql and nosql db?"
> "Doesn't matter much. Whatever the team has most expertise in"
These are not the answers they're looking for. You want to fill the whiteboard with boxes and arrows until it looks like you've got Kubernetes managing your Kubernetes.
I think three things about what you're saying:
1. The answers you're giving don't provide a lot of signal (the queue one being the exception). The question that's implicitly being asked is not just what you would choose, but why you would choose it. What factors would drive you to a particular decision? What are you thinking about when you provide an answer? You're not really verbalizing your considerations here.
A good interviewer will pry at you to get the signal they need to make a decision. So if you say that back-pressure isn't worth worrying about here, they'll ask you when it would be, and what you'd do in that situation. But not all interviewers are good interviewers, and sometimes they'll just say "I wasn't able to get much information out of the candidate" and the absence of a yes is a no. As an interviewee, you want to make the interviewer's job easy, not hard.
2. Even if the interviewer is good and does pry the information out of you, they're probably going to write down something like "the candidate was able to explain sensibly why they'd choose a particular technology, but it took a lot of prodding and prying to get the information out of them -- communications are a negative." As an interviewee, you want to communicate all the information your interviewer is looking for proactively, not grudgingly and reluctantly. (This is also true when you're not interviewing.)
3. I pretty much just disagree on that SQL/NoSQL answer. Team expertise is one factor, but those technologies have significant differences; depending on what you need to do, one of them might be way better than the other for a particular scenario. Your answer there is just going to get dinged for indicating that you don't have experience in enough scenarios to recognize this.
One of the things I tell people preparing for system design interviews is the more senior you are, the more you need to drive the interview yourself, knowing when to go deep, what to go deep on, and how to give the most signal to the interviewer.
But on the other side, that kind of interview process is itself a signal, and candidates might take it as a reason not to play the game: most (not all, of course) companies that probe for the wrong signals during interviews are showing you how they function as a whole.
(been in both types of companies, and in both sides of the table)
But to your point, many times when one interviews for a job, they don't really have the luxury of collecting rejections and need to land somewhere fast so they can keep paying the mortgage. So while yes, interviewing is a two-way street, there's still quite a bit of calibration to make sure you land on the other person's side of the street so to speak.
> "That's not really worth considering for this amount of QPS"
"What if Michael Jackson dies and your (search|news|celebrity gossip) service gets a spike in traffic way beyond the design parameters? How would you anticipate and mitigate such an event?"
(Extra points if the answer is not necessarily backpressure but they start talking about DDoS mitigation, outlier detection, caching or serving static results from extremely-common queries, spinning up new capacity to adjust to traffic spikes, blackholing traffic to protect the overall service, etc.)
> Interviewer: "Why wouldn't you use a queue here instead of a cron job?" "I don't think it's necessary for what this app is, but here's the tradeoffs."
"What if you have a subset of customers that demand faster responses than a cron job can provide?"
(And then that can become a discussion about splitting off traffic based on requirements, whether it's even worth adding the logic to split traffic vs. just using a queue for everyone, perhaps making direct API requests without either a queue or cron job for requests from just those customers, relying on the fact that they are not numerous or these requests are infrequent to trade capacity for latency, etc.)
> "How would you choose between sql and nosql db?"
I would've expected the candidate to at least be able to talk about indexing, tradeoffs of joining in the DB vs. in the application, schema migrations and upgrades, creating separation between data-at-rest vs. data-in-flight, etc. If they can't do that and just handwave away as "whatever the team is most comfortable with", that's a legit hole in their knowledge. Usually you ask system design interviews of senior candidates that will be deciding on architecture and, if not hiring out the team directly, providing input to senior managers who will be hiring, so you can swap out the team nearly as easily as swapping out the architecture.
If a really good "tech" engineer ruled out all the places that are bad at interviewing, they would probably be unemployed.
You have to look past bad interviewing practice, to some degree.
> there's still quite a bit of calibration to make sure you land on the other person's side of the street so to speak.
Exactly. But if they try to Leetcode you, you have to decide whether you have any self-respect at all, or you're all just playing house together.
Those questions are all prompts to have a discussion in lieu of tech trivia hour. Those responses do not demonstrate wisdom, they reveal a lack of maturity. It's not the interviewer's fault you refuse to be interviewed.
These ARE the answers we are looking for. As the system design interviewer (I’ve done hundreds) I want you to start with these answers, then we can layer on complexity if you’ve solved the problem and there’s time left to go into navel-gazing mode.
Seeing the panic slowly build in mid-level engineers’ eyes as it dawns on them that not every problem can be solved by caching is pretty fun too. “Ok cool you’ve cached it there, now how do you fill the cache without running into the same performance issue?”
Exactly. Part of the interview is explaining when and why these techniques are necessary as part of demonstrating your understanding.
If the candidate gives non-answers like “I don’t think it matters because you’re a startup” or “I’d just use whatever database I’m comfortable with” that’s not demonstrating knowledge at all. That’s dismissing the question in a way that leaves the interviewer thinking you don’t have that knowledge, or you don’t take their problems seriously enough to put thought into them. There is a type of candidate who applies to startups because they think nothing matters and they can YOLO anything together for a few years before moving on to the next job, and those are just as bad as the super over-engineering candidates.
The interview is your chance to show you know the topics and when to apply them, not the time to argue that the startup shouldn’t care about such matters.
Do you tell people this explicitly? If so, good on you; if not, please start! I think one of the biggest problems with interviews these days is misaligned expectations, particularly interviewees coming in assuming that what's desired is immediate evidence that they're so experienced in solving FAANG-scale problems that it's their default mode.
It's good not to over-engineer; over-engineering can cause unneeded complexity. But when complexity is warranted, the ability to solve for it simply is also needed.
More importantly though, you haven't explained or rationalized why.
It's not needed for this QPS? Oh ya? Why not? What's your magic threshold? When would it be needed? How do you plan for the team to know that time is approaching? If it's needed later how would you retrofit it? Is that going to be a simple addition? How do you know the max QPS won't be too high and that traffic won't be spiky? What if a surprise incident occurred that caused the system to overload, how would your design, without backpressure, handle that, how would you mitigate and recover?
In system design there's no real right answer, as an interviewer you're looking for the candidate to demonstrate their ability to identify the point of concerns, reason through the possibilities, explain their decisions and trade offs, and so on.
It’s like people crave complexity because it makes them, indispensable? Like if you’re the only one who knows how the billing reconciliation service works, they couldn’t possibly fire you?
They will.
Being pragmatic is something I look for in engineers. So long as they understand where to draw the line (and use a queue instead of cron). However that’s usually several years away at this point and them being able to say “You don’t need that, all you need is…” is welcome. Then again, that’s probably why I got fired. :shrug:
I first wrote code 50 years ago (I am 63yo) so yes, imo we are too old, but ...
It is worth noting that systems concepts/techniques often have analogues aka different names and histories in different fields and subfields.
If I were to "explain" back pressure to an ordinary person I might model my analogy to the logic of this ~classic joke:
Bob: Let's go to Trendio(TM) for dinner tonight!? Carol: Oh, nobody goes there anymore, it's too crowded!
Also, often a modern take-this-for-granted concept may be seen as an outgrowth of previous problems or solutions.
For example back pressure is conceptually adjacent to the clever~hack/design of random backoff in Ethernet.
Or if talking to a math geek or traffic planner you might relate it to ~modern understanding of congestion including oddities like possibly removing roads/routes to ~paradoxically improve traffic flow.
We are deep in the Information Age barreling towards Singularities, so none of us, young or old, see and understand but a tiny fraction of where we've been, are, or might be going.
Cue Calvin & Hobbes cartoon of us racing downhill in a fragile box.
Perhaps, as others have essentially suggested, merging your mind with an ~AI will help (albeit temporarily, imo). I prefer to think of us/greybeards as potentially Wise, yet, paradoxically, clueless.
Beginner's Mind, with likely no time/future for Mastery, is still potentially pleasant, and I would argue useful for Debugging.
Obviously this modern AI tsunami is phase shifting us all into debug~mode anyway, eh?
It's the opposite, as you get older you will feel this more and more.
But keep the concept in your mind in case you have to distribute some problem. It's a central one.
Though you might be familiar with other terms that effectively mean the same thing, like counter pressure
(but worry yea not, just like someone said of another term: "Dependency Injection" is a 25-dollar term for a 5-cent concept, something is similar for this term. )
> > "That's not really worth considering for this amount of QPS"
There is a good way and a bad way to communicate this in interviews.
If an interviewer is asking about back pressure, they’re prompting you to demonstrate your knowledge of back pressure and how and when it would be applied. Treating it as an opening to debate the validity of the question feels like dodging the question or attempting to be contrarian. Explaining when and where you would choose to add back pressure would be good, but then you should go on to answer the question.
This question hits close to home for me because I was once working at a small startup that was dealing with a unique problem where back pressure really was the correct way to manage one of our problems, but we had a number of candidates do exactly what you did: Scoff at the idea that such a topic would be relevant at a startup.
If we’ve been dealing with a problem for months and a candidate comes in and confidently tells us that problem isn’t something we would experience and dismisses our question, that’s not a positive signal.
> > Interviewer: "How would you choose between sql and nosql db?"
> > "Doesn't matter much. Whatever the team has most expertise in"
This is basically a softball question. Again, if you provide a non-answer or try to dismiss the question it feels like you’re either dodging the topic or trying to be contrarian. It’s also a warning sign to the interviewer that you might gravitate toward what’s easy for you instead of right for the project.
This one also resonates with me because I spent years of my life making MongoDB do things that would have been trivial if earlier developers had used something like SQLite instead. The reason they chose MongoDB? Because the team was familiar with it. It was hell to be locked into years of legacy code built around the wrong tool for the job because some early employees thought it didn’t matter “because startup”
As an interviewer, let me give some advice: If an interviewer asks a question, you should answer the question. Anything that feels like changing the subject, dodging the question, or arguing the merits of the question feels like the candidate either doesn’t understand the topic or wants to waste time by debating the question.
It can be very valuable to explain when and why a topic would become necessary, right before you explain it. Instead of “this application has low QPS and therefore I will not answer your question” (not literally what you said, but how it comes across) you could instead explain how the need for back pressure could be avoided first by scaling servers appropriately and then go on to answer the question that was asked.
One time I was working in a body leasing company and our team was hired by bigco for an internal project. Two months earlier an internal employee was tasked to research the project and develop a prototype. When we started, all major set pieces were set in stone. A month later said employee left. When we later checked the job listing he had likely applied to, our tech stack mirrored it to the letter. He got free training, a resume and a new job. We were stuck with these decisions for 3 years.
Another time a local branch of another bigco was trying to carve out a major piece of internal cake. Head-of was hired, team was quickly ramped up and they started cooking their foothold. Then a series of major power shifts happened a couple of levels above our pay grade and another branch came out with a competing strategy. We had a 2-day-long internal brainstorm involving 50 people to come up with arguments and strategies for how to defend our approach. We bet on blue, they were selling red. Lives were at stake. And many truly believed that blue was the way to go, and red was a recipe for disaster. Two days later we had a rock-solid presentation that was trashing the red approach. But of course most of these decisions are not made by nerds and middle mgmt, so eventually the company placed their bet on red and the whole dept became redundant. No one likes to lose their job, so our blue head-of quickly turned his cloak and the team became an outsourcing provider for the winning team. What makes this story particularly funny is the fact that the head-of immediately started a campaign of conference presentations where he swore that all his life he believed that red was the future that would eventually trump blue, and that any competition still using blue was destined to fail in the near future.
Unless the initial question requirements are insane (build Twitter at Twitter scale), I start with the smallest, dumbest thing that will work. Usually that's a single machine/VM talking to a database (or even just SQLite!). Compute and storage are so fast these days that you could comfortably run your fledgling service on a Raspberry Pi, even serving three or four-digit QPS depending on the workload.
Of course, you still have to "play the game" in the interview, so make sure to be clear about how you'd change this as requirements changed (higher QPS, more features, etc.).
"Backpressure? I don't think you'll have enough traffic to make backpressure necessary. The mode of failure here is that you run out of queue space and start dropping messages, and it's not a big deal if some messages get dropped here. But if we do decide that dropped messages are causing problems, and if it starts becoming a regular occurrence (we'll set up observability), here's how the producer can poll the queue size and return an error to the user under heavy load."
We’re at the start of another cycle of a lot of niche products followed by the rise of big Acme megacorps who conquer them all with economies of scale and compete on margin. It comes just as we’re at the tail end of this cycle with tech as we knew it for the last 50 or so years.
The point of an interview is to lay bare one’s thought process entirely so that the interviewer has full awareness of the person you are. And to likewise extract that from the interviewer. Getting or transmitting less information is just underutilizing the time. Interviewers are also flawed and may not be good enough at extracting the information from you.
If you’re an ideal decision maker, you will likely out-skill the majority of interviewers. You’re being hired to make their org succeed. So just do that.
I think people who describe system designs frequently fail to demarcate the space they’re operating in, so subsequent engineers cannot determine whether the original designer failed to consider something or whether the original designer considered and dismissed something. The point is to be able to express this concisely.
IMHO, doing it well means that not only do you get it right but you send the information down through time so that subsequent observers understand why and also get it right consequently.
Real companies do. The moment you deploy one line of code, it's legacy. It goes from there. Soon you have to build systems that interface with other systems you'd rather were better architected and designed, except you have to deal with them as they are. Then your product becomes one of these, and with no need to maintain or expand it for a long time, it rots a bit, and now someone has to pick it up or interface with it, and your product made things more complex, and the complexity can't be magic wand waved away.
If you can't, you might be getting interviewed by people you do not want to work with and you should want to know that.
That becomes obvious when you start bootstrapping an HA cluster with multiple control plane nodes.
K8s is not for the faint of heart…or rational system designers ;)
People ask for fizzbuzz in parallel not because it's practical.
Your answers are completely valid but you have to communicate to the interviewer that you considered the possibilities and the tradeoffs.
If the interviewer needs to "forcefully" extract from you the logic behind your design choices, then a lot of times that's enough to fail you.
Dismissive answers that assume they are needlessly overcomplicating things tell them exactly what they need to know.
> SQL and NoSQL don’t matter much
Database is literally the most important architectural decision possible, next to the application programming language.
(Prove me wrong)
Oh yes! Never do a join in the application code! But also: use views! (and stored procedures if you can). A view is an abstraction about the underlying data, it's functional by nature, unlikely to break for random reasons in the future, and if done well the underlying SQL code is surprisingly readable and easy to reason about.
Writing raw SQL views/queries per MVC view in SSR arrangements is one of the most elegant and performant ways to build complex web products. Let the RDBMS do the heavy lifting with the data. There are optimizations in play you can't even recall (because there's so many) if you're using something old and enterprisey like MSSQL or Oracle. The web server should be able to directly interpolate sql result sets into corresponding <table>s, etc. without having to round trip for each row or perform additional in memory join operations.
The typical ORM implementation is the exact opposite of this - one strict object model that must be used everywhere. It's about as inflexible as you can get.
The author's said nothing about ORMs. It feels like you're trying to post a personal beef about ORMs that's entirely against the "pragmatic" software design engineering the author's opining. Using ORMs to massively reduce your boiler-plate CRUD code, then using raw SQL (or raw SQL + ORM doing the column mapping) for everything else is a pragmatic design choice.
You might not like them, but using ORMs for CRUD saves a ton of boilerplate, error-prone, code. Yes, you can footgun yourself. But that's what being a senior developer is all about, using the tools you have pragmatically and not foot gunning yourself.
And it's just looking for the patterns, if you see a massive ORM query, you're probably seeing a code smell. A query that should be in raw SQL.
You can write reusable plain functions as abstractions, returning QuerySets that allow further filters being chained onto the query, before the actual SQL is materialized and sent to the database.
The result of this doesn’t have to match the original object models you defined, it’s still possible to be flexible with group bys resulting in dictionaries.
> Particularly if you’re using an ORM, beware accidentally making queries in an inner loop. That’s an easy way to turn a select id, name from table to a select id from table and a hundred select name from table where id = ?.
Rails makes this easy to avoid. Using `find_each` batches the queries (by 1,000 records at a time by default).
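For anyone who hasn't hit the inner-loop pattern from the quote, here is a small sketch of it using plain `sqlite3` as a stand-in for an ORM. The `users` table and the function names are made up for illustration, and the query counter is maintained by hand:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users (name) VALUES (?)",
    [(f"user{i}",) for i in range(100)],
)


def names_n_plus_one(conn):
    """One query for the ids, then one more query per id (1 + N round trips)."""
    queries = 1
    ids = [row[0] for row in conn.execute("SELECT id FROM users ORDER BY id")]
    names = []
    for user_id in ids:
        row = conn.execute(
            "SELECT name FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        queries += 1
        names.append(row[0])
    return names, queries


def names_single_query(conn):
    """The same data fetched in a single round trip."""
    rows = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()
    return [name for _, name in rows], 1


slow_names, slow_queries = names_n_plus_one(conn)
fast_names, fast_queries = names_single_query(conn)
print(slow_queries, fast_queries)
```

Both functions return the same names; the difference is only in how many statements reach the database, which is what hurts once each statement is a network round trip.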
Reading through the comment section on this has been interesting. Either lots of people using half baked ORMs, people who have little experience with an ORM, or both.
I can't respond to the "typical" part as most of my experience is using EF Core, but it's far from inflexible.
Most of my read-heavy, search queries are views I've hand written that integrate with EF core. This allows me to get the benefit of raw SQL, but also be able to use LINQ to do sorting/paging/filtering.
In particular, have you had to do testing, security (eg. row level security), manage migrations, change management (eg. for SOC2 or other security frameworks), cache offloads (Redis, and friends), support for microservices, etc.?
Comments like this give me a vibe of young developers trying out Supabase for the first time feeling like that approach can scale indefinitely.
T-SQL was not a good programming language last century when it was vaguely current, and so no I do not want to write any significant amount of code in T-SQL. For my sins I maintain a piece of software with huge T-SQL procedures (multi-page elaborations by somebody who really, really likes this stuff) and they're a nightmare. The tooling doesn't really believe in version control, the diagnostics when you make a mistake are either non-existent or C++ style useless spew.
We hire a lot of very junior developers. People who still need to be told not to comment out code in release, that variable names are for humans to read not machines, that sort of thing. We're not quite hiring physicists to write software (I have done that at a startup) but it's close. However, none of the poor "My first program" code I see in a merge request by a new hire is anywhere close to as unreadable as the T-SQL we already own and maintain.
Stored procedures also add another risk. You have to keep them in sync with code, making releases more error prone. So you have to add extra layers of complexity to manage versioning.
I can see the advantage of extreme performance/efficiency gains, but it should be really big to be justified.
There was _one guy_ who maintained it and understood how it worked. He was very smart but central to the company’s operations. So having messy stuff makes it brittle/hard to change in more ways than one and
The “backend” scales much easier than the database. Loading data by simple indexes, eg. user_id, and joining it on the backend, keeps the db fast. Spinning up another backend instance is easy - unlike db instance.
If you think your joins must happen in the db because the data is too big to be loaded into memory on the backend, restructure it so that it's possible.
Bonus points for moving joins to the frontend. This makes data highly cacheable - fast to load, as you need to load less data and frees up resources on server side.
If you want a KV store, use a KV store. If you want an RDBMS, then use its features. They haven’t changed much in the last 50 years for a reason.
Let's say you run a webshop and have two tables, one for orders with 5 fields, one for customers, with 20 fields.
Let's say you have 10k customers, and 1m orders.
A query performing a full join on this and getting all the data would result in 25 million fields transmitted, while 2 separate queries and a client side manual join would be just 5m for orders, and 200k for customers.
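The arithmetic above can be checked in a few lines. This assumes every order joins to exactly one customer, so the joined result has one row per order with all 25 columns; the variable names are just for illustration:

```python
customers = 10_000
orders = 1_000_000
order_fields = 5
customer_fields = 20

# One joined row per order, each carrying order + customer columns.
joined_fields = orders * (order_fields + customer_fields)

# Two separate result sets, stitched together on the client.
separate_fields = orders * order_fields + customers * customer_fields

print(joined_fields, separate_fields)
```

That works out to 25,000,000 fields for the join against 5,200,000 for the two separate queries, which is the gap the comment is pointing at.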
But usually you need some of the orders and you need the customer info associated with them. Often the set of orders you’re interested in might even be filtered by attributes of the customers they belong to.
The decision of whether to normalize our results of a database query into separate sets of orders and customers, or to return a single joined dataset of orders with customer data attached, is completely orthogonal to the decision of whether to join data in the database.
It's very natural to want customer information when querying an order, and if you have a view like orders_with_customer_info, you get that with zero effort when querying that view by order id.
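A rough sketch of what such a view could look like, using an in-memory SQLite database. Only the view name `orders_with_customer_info` comes from the comment; the table layouts, column names and sample rows are assumptions for the sake of a runnable example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'Ada', 'UK'), (2, 'Linus', 'FI');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 5.0), (12, 2, 40.0);

    -- The join is written once, inside the view definition.
    CREATE VIEW orders_with_customer_info AS
    SELECT o.id AS order_id,
           o.amount,
           c.id AS customer_id,
           c.name AS customer_name,
           c.country AS customer_country
    FROM orders o
    JOIN customers c ON c.id = o.customer_id;
    """
)

# Callers query the view by order id and never spell out the join.
row = conn.execute(
    "SELECT order_id, amount, customer_name "
    "FROM orders_with_customer_info WHERE order_id = ?",
    (11,),
).fetchone()
print(row)
```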
You also get consolidated data (orders by customer) by doing
select count(*), sum(amount) from orders_with_customer_info group by customer_id
which I think is pretty straightforward.
I worked on an application which joined across lots of tables, which made a few dozen records balloon to many thousands of result rows, with huge redundancy in the results. Think of something like a single conceptual result having details A, B, C from one table, X, Y from another table, and 1, 2, 3 from another table. Instead of having 8 result rows (or 9 if you include the top level one from the main table) you have 18 (AX1, AX2, AX3, AY1, ...). It gets exponentially worse with more tables.
We moved to separate queries for the different tables. Importantly, we were able to filter them all on the same condition, so we were not making multiple queries to child tables when there were lots of top-level results.
The result was much faster because the extra network overhead was overshadowed by the saving in query processing and quantity of data returned. And the application code was actually simpler, because it was a pain to pick out unique child results from the big JOIN. It was literally a win in every respect with no downsides.
(Later, we just stuffed all the data into a single JSONB in a single table, which was even better. But even that is an example of breaking the old normalisation rule.)
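The 8-versus-18 row count from this comment can be reproduced with a toy schema. The table names (`parent`, `letters`, `marks`, `numbers`) are invented for the sketch; the only thing taken from the comment is the shape: one top-level record with three independent child tables holding 3, 2 and 3 rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE parent (id INTEGER PRIMARY KEY);
    CREATE TABLE letters (parent_id INTEGER, val TEXT);
    CREATE TABLE marks (parent_id INTEGER, val TEXT);
    CREATE TABLE numbers (parent_id INTEGER, val INTEGER);
    INSERT INTO parent VALUES (1);
    INSERT INTO letters VALUES (1, 'A'), (1, 'B'), (1, 'C');
    INSERT INTO marks VALUES (1, 'X'), (1, 'Y');
    INSERT INTO numbers VALUES (1, 1), (1, 2), (1, 3);
    """
)

# One big join: every combination of child rows comes back (3 * 2 * 3).
joined_rows = conn.execute(
    """
    SELECT l.val, m.val, n.val
    FROM parent p
    JOIN letters l ON l.parent_id = p.id
    JOIN marks m ON m.parent_id = p.id
    JOIN numbers n ON n.parent_id = p.id
    WHERE p.id = ?
    """,
    (1,),
).fetchall()

# Separate queries, all filtered on the same condition (3 + 2 + 3 rows).
# Table names come from a fixed tuple, so the f-string is safe here.
separate_rows = {
    table: conn.execute(
        f"SELECT val FROM {table} WHERE parent_id = ? ORDER BY val", (1,)
    ).fetchall()
    for table in ("letters", "marks", "numbers")
}

print(len(joined_rows), sum(len(rows) for rows in separate_rows.values()))
```

The separate queries cost extra round trips, so whether this is a net win depends on how large the child sets are compared with the network overhead.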
That doesn't really sound like a place where data is actually conceptually joined. I expect, as it is something commonly attempted, that you were abusing joins to try and work around the n+1 problem. As a corollary to the above, you also shouldn't de-join in application code.
As a somewhat contrived example since I just got out of bed, if your software has a function that needs all the invoice items from this year's invoices whose invoice address country is a given value, use a join rather than loading all invoices, invoice addresses and invoice items and performing the filtering on the client side.
Though as you point out, if you just need to load a given record along with details, prefer fetching detail rows independently instead of making a Cartesian behemoth.
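Roughly what the invoice example looks like when the filtering stays in the database. The schema, column names and sample data are all hypothetical, and the year is passed in as a parameter rather than computed from the clock:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE invoice_addresses (id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE invoices (id INTEGER PRIMARY KEY, year INTEGER, address_id INTEGER);
    CREATE TABLE invoice_items (id INTEGER PRIMARY KEY, invoice_id INTEGER, description TEXT);
    INSERT INTO invoice_addresses VALUES (1, 'DE'), (2, 'FR');
    INSERT INTO invoices VALUES (100, 2024, 1), (101, 2024, 2), (102, 2023, 1);
    INSERT INTO invoice_items VALUES
        (1, 100, 'widget'), (2, 100, 'gadget'), (3, 101, 'gizmo'), (4, 102, 'doohickey');
    """
)


def items_for(conn, year, country):
    """Return only the matching items; the filtering happens in the database."""
    return conn.execute(
        """
        SELECT ii.id, ii.description
        FROM invoice_items ii
        JOIN invoices i ON i.id = ii.invoice_id
        JOIN invoice_addresses a ON a.id = i.address_id
        WHERE i.year = ? AND a.country = ?
        ORDER BY ii.id
        """,
        (year, country),
    ).fetchall()


items = items_for(conn, 2024, "DE")
print(items)
```

Only the two matching items cross the wire; the client-side alternative would have pulled all three tables first.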
When I used to interview to be a developer at a company, it was always an automatic no for me if a company kept business logic in stored procedures and had a separate team of “database developers”.
As far as not doing joins in code, I agree for the most part. But GitHub itself has a rule against using SQL to join tables that belong to different domains.
https://github.blog/engineering/infrastructure/partitioning-...
Then companies buy a solution to aggregate all the different databases in a single "data-lake" (or whatever buzzword is hot right now) so you can do OLAP queries. Without consistency guarantees of course.
And I am not saying this is never the _right_ solution, but it should almost never be the _first_ solution
Not saying this should always be the case, but sometimes it is the right call.
To quote Douglas Adams: "The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair."
Likewise, if you cache a piece of data in your application because you assume that it won't change, that just makes it likely that if and when it does change, you'll have bugs. Moving the cache to the database layer so that it can be properly invalidated fixes this.
It's true that an application-side join can still be more performant if the DB cache isn't good enough, but IMO you should only take that step after actually profiling your queries.
If you have some complex queries on every page load with a huge number of users, put it in the DB as much as possible.
If you need to iterate over a bunch of records and do something based on some combination of values, and it's for a weekly reporting thing, I'd much rather see 3 nested foreach loops with a lot of early exits to skip the things you don't care about than a multi-kb SQL statement that took two days to develop and nobody ever dares to touch again because it's hard to handle.
A great software design will separate all business logic into its own layer. That might be a distinct project, module, or namespace, depending on what your language supports. Keep business logic out of SQL and out of web server code (controllers, web helpers, middleware, etc.).
Then you're treating SQL as the data store it is designed to be. When you embed application logic in SQL, you're hiding core functionality in a place where most developers won't expect to find it. This approach also creates tight coupling between your application and your database provider, making it hard to switch as needs change/the application grows.
For example, you may want to (or have the option to) vertically partition your database, or use different data stores. The app layer is usually stateless and can scale perpetually, but the database might be a bottleneck.
Joining in the database over the application is a great default. But I wouldn't say "never join in the application code".
Depending on the ecosystem the code base adopts, a good ORM might be a better choice to do joins.
Also SQL is easy, but figuring out what's up with indexes and planner is not.
I have some remarks though. Taken from the article:
> Avoid having five different services all write to the same table. Instead, have four of them send API requests (or emit events) to the first service, and keep the writing logic in that one service.
This is not so cut-and-dry. The trade offs are far from obvious or acceptable.
If the five services access the database then you are designing a distributed system where the interface being consumed is the database, which you do not need to design or implement, and already supports authorization and access controls out of the box, and you have out-of-the-box support for transactions and custom queries. On the other hand, if you design one service as a high-level interface over a database then you need to implement and manage your own custom interface with your own custom access controls and constraints, and you need to design and implement yourself how to handle transactions and compensation strategies.
And what exactly do you buy yourself? More failure modes and a higher micro services tax?
Additionally, having five services accessing the same database is a code smell. Odds are that database fused together two or three separate databases. This happens a lot, as most services grow by accretion and adding one more table to a database gets far less resistance than proposing creating an entire new persistence service. And is it possible that those five separate services are actually just one or two services?
You absolutely should design and implement it, exactly because it is now your interface. In fact, it will add more constraints to your design, because now you have different consumers and potentially writers all competing for the same resource with potentially different access patterns. Plus the maintenance overhead that migrations of such shared tables come with. And eventually you might have data in this table that are only needed for some of the services, so you now need to implement views and access controls at the DB level.
Ideally, if you have a chance to implement it, an API is cleaner and more flexible. The problem in most cases is simply business pushing for faster features which often leads to quick hacks including just giving direct access to some DB table from another service, because the alternative would take more time, and we don't have time, we want features, now.
But I agree with your thoughts in the last paragraph. It happens very often that people don't want to undertake the effort of a whole new design or redesign to match the evolving requirements and just patch it by adding a new table to an existing DB, then another,...
PostgreSQL, to name one example, can handle every one of these challenges.
Moving your data types from SQL into another language solves exactly 0 migration problems.
Every migration you can hide with that abstraction language you can also hide in SQL. Databases can express exactly the same behaviors as your application code.
APIs can be evolved much more easily than shared database schemas. Having worked with many instances of each kind of system, I think this outweighs all of the other considerations, and I don't think I'll ever again design a system with multiple services accessing the same database schema.
It was maybe a good idea if you were a small company in the early 2000s, when databases were well-understood and services weren't. After that era, I haven't seen a single example of a system where it wasn't a mistake for multiple services to access the same database schema (not counting systems where the read and write path were architecturally distinct components of the same service.)
So this service was basically a "universal integration service". The company wanted to share some data, and they wanted to implement it in a universal way. So basically I implemented a SOAP web service which received a request with SQL text and responded with a list of rows. This service was surprisingly popular and used a lot.
I was smart enough to build a limited SQL syntax parser and UI, so an administrator could just set up the tables and columns they wanted to share for a specific client. The SQL query was limited in the sense that it worked only with one table, a simple set of columns and some limited conditions (the ones I bothered to implement).
The reason I heard about it a few months ago is that they told me they had caught a malicious guy who worked at some company integrating with this system and had tried an SQL attack. They noticed errors in the logs and caught him.
Their database is pretty much done and frozen as far as the schema goes. They hardly evolve it. So this service turned out to be pretty backwards-compatible. And simple changes, of course, could be supported with a view, if necessary.
When you need to alter the datastore, usually for product or scalability, you have to orchestrate all access to that datastore.
Ergo: only one thing using the datastore means less orchestration.
At work, we just updated a datastore. We had to move some tables to their own db. 3 years later, 40+ teams have updated their access. This was a product need. If this was a scale issue, the product would just have died sans some as of yet imagined solution.
Counterpoint (assuming by database you mean database cluster, not a schema): having a separate physical DB for each service means that for most places, your reliability has now gone from N to N^M.
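To put rough numbers on the N^M point, here is a minimal sketch. It assumes every database fails independently and that a request needs all M of them to be up; the 0.999 figure and the function name are made up for illustration.

```python
def combined_availability(n: float, m: int) -> float:
    """Availability of a request that needs all m independent stores,
    each of which is up a fraction n of the time."""
    return n ** m


single = combined_availability(0.999, 1)  # one shared database
five = combined_availability(0.999, 5)    # five per-service databases, all required

print(f"{single:.4f} -> {five:.4f}")
```

If a request only touches one or two of the databases, the exponent shrinks accordingly, so the real figure depends on how much the services call each other.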
Nice boxes in the architectural diagram. Each box is handed to a different team and then, when engineers from those teams don't talk to each other, the system doesn't suddenly fail in an unexpected way.
Seems overly negative of broad advice on a good pattern?
is_on => true
on_at => 1023030
Sure, that makes sense.
is_a_bear => true
a_bear_at => 12312231231
Not so much, as most bears do not become bears at some point after not being a bear.
In the is_a case almost always a type or kind is better as you’ll rarely just have bears even if you only start with bears, just as you rarely have just two states for a status field (say on or off), often these expand in use to include things like suspended, deleted and asleep.
So generally I’d avoid booleans, as they tend to multiply and increase complexity, particularly when they cover mutually exclusive states like live, deleted and suspended. I have seen is_visible, is_deleted and is_suspended all on the same table (without a status) and the resulting code and queries are not pretty.
I’d use an integer rather than a timestamp to replace them though.
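A rough illustration of why the flags get ugly, assuming the three mutually exclusive states named above. The `Status` enum and its integer values are hypothetical, not taken from any real schema.

```python
from enum import IntEnum
from itertools import product


class Status(IntEnum):
    # One integer column holds exactly one of these values.
    LIVE = 1
    SUSPENDED = 2
    DELETED = 3


# Three independent booleans (is_visible, is_deleted, is_suspended)
# admit every combination, including contradictory ones.
flag_combinations = list(product([False, True], repeat=3))

print(len(flag_combinations), "flag combinations vs", len(Status), "statuses")
```

Every query against the three-flag table has to decide what a row that is both deleted and visible means; the single status column cannot express that row in the first place.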
If your data was simple enough, you could have an integer hold the entire meaning of a table's row, if every client understood how it was interpreted. You could do bitwise manipulations, encodings, and so on.
Sometimes it is nice to understand what the data means in the schema alone. You can do that with enums, etc.
ate_an_apple_in_may_2024
saw_an_eclipse_before_30
These are more of the sort of things I don't see needing enums, timestamps, integers...
Although I’m not even sure it’s broadly a good principle, even in the on_at case; if you actually care about this kind of thing, you should be storing it properly in some kind of audit table. Switching bool to timestamp is more of a weird lazy hack that probably won’t be all that useful in practice because only a random subset of data is being tracked like that (Boolean data type definitely isn’t the deciding factor on whether it’s important enough to track update time on).
The main reason it’s even suggested is probably just that it’s “free” — you can smuggle the timestamp into your bool without an extra column — and it probably saved some effort accidentally; but not because it’s a broadly complete solution to the set of problems it tries to solve for
I’ve got the same suspicion with soft-deletes — I’m fairly positive it’s useless in practice, and is just a mentally lazy solution to avoid proper auditing. Like you definitely can’t just undelete it, and it doesn’t solve for update history, so all you’re really protecting against is accidental bulk delete caught immediately? Which is half the point of your backup
isDarkTheme: {timestamped} paginationItems: 50
I can see when dark theme was activated but not when pagination was set to 50.
also, i can’t see when dark theme is being deactivated either.
seems like a poor man's changelog. there may be use cases for it but i can’t think of anything tbh.
Additionally, there are situations where it is logical to store a boolean. For example, if the boolean denotes an outcome:
process_executed_at timestamp not null
process_succeeded boolean not null
I do try my best to pack my columns, but it's a fragile and likely premature optimization. Better to opt for something defensive at a cost of like, 7 bytes per row (in Postgres).
Good system design is designing a system that works best for the problem at hand.
I’m surprised that the drawbacks of EAV or just using JSON in your relational database don’t get called out more.
I’d very much rather have like 20 tables with clear purpose than seeing that colleagues have once more created a “classifier” mechanism and are using polymorphic links (without actual foreign keys, columns like “section” and “entity_id”) and are treating it as a grab bag of stuff. One that you also need to read the application code a bunch to even hope to understand.
Whenever I see that, I want to change careers. I get that EAV has its use cases, but in most other cases fuck EAV.
It’s right up there with N+1 issues, complex dynamically generated SQL when views would suffice and also storing audit data in the same DB and it inevitably having functionality written against it, your audit data becoming a part of the business logic. Oh and also shared database instances and not having the ability to easily bootstrap your own, oh and also working with Oracle in general. And also putting things that’d be better off in the app inside of the DB and vice versa.
There are so many ways to decrease your quality of life when it comes to storing and accessing data.
That said, sometimes when I realize there's no way for me to come up even with a rough schema (say, some settings object that is returned to the frontend), I use JSONB columns in Postgres. As a rule of thumb, however, if something can be normalized, it should be, since, after all, that's still a relational database despite all the JSON(B) conveniences and optimizations in Postgres.
What's the "proper" way to do this? Separate DB? Separate data store?
Whether that's a typical relational DB or something more specialized (like a log shipping solution) that's up to you, but usually it would be separate from the main DB.
If you need some functionality that depends on events that have taken place, you probably want to store information about those events in the main data store (but only what's needed for that functionality, not a list of all mutations done to a table like audit data might include).
In general, it's nice to have such a clear boundary of where the business domain ends and where the aux. stuff to help you keep it running goes - your logs and audit data, analytics and metrics, tracing spans and so on.
Edit: as a critique of my own arguments here, I will admit that doing the above can introduce some complexity and that in simpler systems it might be overkill. But I've seen what happens when everything is just in one huge DB instance, where about 90% of the overall schema size is literally due to records in those audit tables and everyone is surprised why opening the "History" tab for a record takes a while (and anything else that references said history, e.g. visibility of additional records), and it's not great either.
Audit data in the same DB is great, because it can be written transactionally for relatively cheap (multi table updates, triggers, actual transactions with multiple writes, etc).
After that, sure, ship it elsewhere and prune the audit tables if you like. But having the audit writes go directly to Kafka or whatnot is a pain because it requires your client logic to a) have a distributed publish-event transaction (which can work in this case more easily than distributed transactions in general with careful use of idempotency keys, read back, or transactional outboxes, but it’s complicated and requires everyone writing to auditable tables to play along), and b) reduces your reliability because now the audit store or its message queue needs to be online for every write as well as your database.
And there’s plenty of good reasons for business logic to use (only for reads) audit data. What else would business logic do if an audit table existed and there was a business need to e.g. show customers a change history for something? Build another redundant audit system instead?
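For what the "written transactionally" part looks like, here is a minimal sketch using Python's built-in sqlite3 module. The `accounts` and `accounts_audit` tables and the `change_email` helper are made-up names, and a real setup might use triggers instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, email TEXT NOT NULL);
    CREATE TABLE accounts_audit (
        id INTEGER PRIMARY KEY,
        account_id INTEGER NOT NULL,
        old_email TEXT,
        new_email TEXT,
        changed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    );
    INSERT INTO accounts (id, email) VALUES (1, 'old@example.com');
""")


def change_email(conn: sqlite3.Connection, account_id: int, new_email: str) -> None:
    # "with conn" commits if the block succeeds and rolls back if it raises,
    # so the update and its audit row land together or not at all.
    with conn:
        (old_email,) = conn.execute(
            "SELECT email FROM accounts WHERE id = ?", (account_id,)
        ).fetchone()
        conn.execute(
            "UPDATE accounts SET email = ? WHERE id = ?", (new_email, account_id)
        )
        conn.execute(
            "INSERT INTO accounts_audit (account_id, old_email, new_email) "
            "VALUES (?, ?, ?)",
            (account_id, old_email, new_email),
        )


change_email(conn, 1, "new@example.com")
audit_rows = conn.execute(
    "SELECT account_id, old_email, new_email FROM accounts_audit"
).fetchall()
```

Shipping `accounts_audit` elsewhere and pruning it afterwards can then happen asynchronously, without the audit store having to be online for every write.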
Common tools used for this "optimization" often raise the complexity and lower the performance of the system.
For example, a db with a single table with just a key and a value is very flexible and "optimized for change" but it offers lower performance (in most cases) and is harder to reason about.
I also frequently see people (me too) prematurely make abstractions (interfaces, extra tables, etc) because they're "optimizing for change". Then that part never changes OR it changes in a way that their abstraction doesn't abstract over OR they figure out a better abstraction later on when the app has matured a bit. Then that part of the code is at best wasted space (usually it needs to be rewritten yet no one gets time to do that).
Of course, it's also foolish to say "never abstract". I almost always find it worth it to abstract over I/O, just so I can easily add logging, dual writes, or mock it. And when a change is obviously coming down the line it makes sense to plan for it.
But usually I'm served best by trying to keep most of my computation pure functions (easy to test), doing as little as possible in the I/O path (it should just persist or print stuff so I can mock it) and otherwise write obvious "deletable" code that does one thing so I can debug it and, only if necessary, replace with a better abstraction if I need to.
When my service wants to store and retrieve as part of its behaviour, of course I'm going to back it with a hashmap first.
Once I know it fulfills its business logic I'll start fiddling with hard-to-change stuff like DB schemas and migrations.
And having finished and tested the logic, I'll have a much better idea of the actual access patterns so I can design good tables & indexes.
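A small sketch of that workflow, with all names (`NoteStore`, `InMemoryNoteStore`, `copy_note`) invented for the example. The logic is written against a tiny interface and backed by a dict first, on the assumption that a database-backed class with the same two methods can be swapped in later.

```python
from typing import Dict, Optional, Protocol


class NoteStore(Protocol):
    """The only storage behaviour the business logic depends on."""

    def put(self, key: str, value: str) -> None: ...

    def get(self, key: str) -> Optional[str]: ...


class InMemoryNoteStore:
    """Hashmap-backed store: good enough to build and test the logic."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)


def copy_note(store: NoteStore, source_key: str, target_key: str) -> bool:
    """Example business logic: copy a note, reporting whether the source existed."""
    value = store.get(source_key)
    if value is None:
        return False
    store.put(target_key, value)
    return True
```

Once `copy_note` and friends are tested, the calls they actually make on the store are a decent first draft of the access patterns the tables and indexes need to serve.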
We can usually have many more cheaper dedicated services for doing a thing that accounts for more good than a single service that grows to become more and more omnipotent. It also means you're much more likely to win contracts because you can price yourself competitively.
Rings very true. Engineers are rated based on the "complexity" of the work they do. This system seems to encourage over-engineered solutions to all problems.
I don't think there is enough appreciation for KISS - which I first learned about as an undergrad 20 years ago.
Sure, there are problems that are inherently complex and require complex solutions. But most likely yours isn't one of them, most likely you have a basic web app.
I had been nodding away about state and push/pull, but this section grabbed my attention, since I’ve never seen it so clearly articulated before.
I found myself truly confused by this one - does this actually need stating? Do people actually re-read immediately after a write? Provided you got confirmation that a write was successful and the data doesn’t have anything that an SQL trigger would change, what would be the point of an immediate read instead of just using the DB “successfully written” response as a go-ahead to just update the in-memory data?
A lot of the time the datastructure you pass into writeAPI(obj) is different from the datastructure that is returned from readAPI(obj) -- even if the information contained is the same!
No one wants to do that data structure transformation and potentially miss an edge case/break some implicit assumption about the data structure some fuckall downstream consumer has.
However it is already done in the readAPI() function. So latency and throughput be damned, let us do:
writeAPI(objects)
objects = readAPI(objects)
To be clear, I'm talking about the typical bloated data structure we all know and love: 20+ fields, different fields redundant for different services, sometimes they are empty in which case we have to fall back to calculate that field differently. And it is this hacky due to a quick bug fix during a sev1 last year that was never revisited to be fixed "correctly". There is a ticket hanging around somewhere to do this, but the assignee has left the company.
Simply reading after a write is a perfectly adequate solution for, I'd say, most systems.
If you have performance issues and a change like this may solve them, sure. Switch to app-generated IDs and add workarounds for whatever other issues arise, and skip the read. But if you don't need to I don't see why you'd go through this trouble.
I know it's a bit untrue, but you can't do that many things wrong with a stateless application running in a container. And often the answer is "kill it and deploy it again". As long as you don't shred your dataset with a bad migration or some bad database code, most bad things at this level can be fixed in a few minutes with a few redeployments.
I'm fine having a larger amount of people with a varying degree of experience, time for this, care and diligence working here.
With a persistence like a database or a file store, you need some degree of experience of what you have to do around the system so it doesn't become a business risk. Put plainly, a database could be a massive business risk even if it is working perfectly... because no one set backups up.
That's why our storages are run by dedicated people who have been doing this for years and years. A bad database loss easily sinks ships.
> As long as you don't shred your dataset with a bad migration or some bad database code, most bad things at this level can be fixed in a few minutes with a few redeployments.
At some point between these statements you switched from stateless to stateful and I can't follow the rest of the argument.
If you introduce a migration like "UPDATE billing SET prices = 0 ; WHERE something < 5", that's an entirely valid migration, but you mess up your state and then everyone is in a world of pain. This could, however, still be caught by various code review strategies, incremental rollouts and a large number of good development practices.
This is still easy, you can catch it before it hits prod so you don't have to fix prod.
And prod could still be fixed if your database layer manages backups, just with a day or two of downtime. If you don't have backups, you may have permanently lost information, which could kill the company.
For mostly political reasons, if you are onboarded to a team with a billion microservices and a lot of fanciness, it's unlikely that you will ever get approval or time to introduce simplicity. Or maybe I just got corrupted myself by the reality where I have to work now.
My former Ph.D. supervisor who moonlights as a consultant on this topic uses a nice acronym to capture this: BAPO. Business, Architecture, Process, and Organization. The idea is to end up with optimal business, an optimal architecture & design for that business, the minimum of manual processes that are necessitated by that architecture, and an organization that is efficiently executing those processes. So, you should design and engineer in that order.
Most companies do this in reverse and then end up limiting their business with an architecture that matches whatever processes their org chart necessitated years ago, in a way that doesn't make any logical sense whatsoever except in the historical context of the org chart. If you come in as a consultant to fix such a situation, it helps to understand that whatever you are going to find is probably wrong for this reason. I've been in the situation where I come in to fix a technical issue and immediately see that the only reason the problem exists is that the org chart is bullshit. That can be a bit awkward but lucrative if you deal with it correctly. It helps to ask the right questions before you get started.
Turning that around means you start from the business end (where's the money coming from?, what value can we create?, etc.), finding a solution that delivers that and then figure out processes and organizational needs. Many companies start out fairly optimal and then stuff around them changes and they forget to adapt to that.
Having micro services because you have a certain team structure is a classic mistake here. You just codified your organizational inefficiency. Before you even delivered any business value. And now your organizational latency has network latency to match that. Probably for no good reason other than that team A can't be trusted to work with team B. And even if it's optimal now, is it going to stay optimal?
If you are going to break stuff into (micro) services, do so for valid business/technical reasons. E.g. processing close to your data is cheaper, caching for efficiency means stuff is faster and cheaper, physically locating chunks of your system close to the customer means less latency, etc. But introducing network latency just because team A can't work with team B, is fundamentally stupid. Why do you even have those teams? What are those people doing? Why?
> What this means in practice is having one service that knows about the state - i.e. it talks to a database - and other services that do stateless things. Avoid having five different services all write to the same table. Instead, have four of them send API requests (or emit events) to the first service, and keep the writing logic in that one service.
> If a system has distributed-consensus mechanisms, many different forms of event-driven communication, CQRS, and other clever tricks, I wonder if there’s some fundamental bad decision that’s being compensated for (or if the system is just straightforwardly over-designed).
then later down in the article:
> Send as many read queries as you can to database replicas. A typical database setup will have one write node and a bunch of read-replicas. The more you can avoid reading from the write node, the better - that write node is already busy enough doing all the writes.
Isn't this the same as CQRS?
Whenever I read something like this I feel so confused. Who actually calls themselves an engineer when they have no idea what they're talking about. Ignorant confidence is such a useless personality trait.
As long as we keep measuring LOCs or features added, there will always be jobs for them.
As in:
- writing a constitution
- designing API for good DX
- improving corporate culture
I intuitively want to call all of those system design, because they're all systems in the literal sense. But it seems like everyone else uses "system design" to mean distributed computer service design.
Any ideas what word or phrase I could use to mean "applying systems thinking to systems that include humans"?
I could nitpick individual points in the article, but that misses the bigger issue: the premise is off.
Don’t chase generic advice about good or bad design. First understand your requirements, then design a system that meets them.
If the metaphor of a software circuit breaker is meant to emulate an electrical circuit breaker, then it seems to me that these two are inverted. Whenever a physical circuit breaker is open, it is not dangerous and not passing current.
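For reference, a minimal sketch of a software breaker that follows the electrical convention described above: closed means calls pass through, open means nothing passes. The class, its parameters and the simplified reset (no separate half-open state) are assumptions for illustration, not the article's implementation.

```python
import time


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to stay open before retrying
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after:
            # Cool-down elapsed: close the circuit and let calls through again.
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def call(self, fn, *args, **kwargs):
        if self.is_open():
            # Open circuit: no "current" flows, the dependency is not called.
            raise RuntimeError("circuit open: call not attempted")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0  # a success resets the consecutive-failure count
        return result
```

Under this naming, "open" is the safe, non-conducting state, which matches the physical breaker.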
Adding a separate table where the presence of a record means 'true' allows recording related state without complicating the main table.
And sometimes a boolean is exactly what you want.
It's hinted at a little bit in the OP, with:
> What does good system design look like? I’ve written before that it looks underwhelming
This is because there are humans in your system! Other developers! You in the future! You have to resort to heuristics like "simple == good" because you're only looking at a small part of the whole system.
And zoom out even more, you get to the actual users. How do they interact with the system? If you implement a rate limiter, how do the users respond when they hit it? Do they just spam-refresh the page? Open more tabs? Use their phone? Do they develop weird superstitions about it? Do they spam-call your phone support lines? Does your response to a thundering herd anticipate the second-order impact of your phone support lines being DDOSed?
On state, in my current project, it is not statefulness that causes trouble, but when you need to synchronize two stateful systems. Every time there's bidirectional information flow, it's gonna be a headache. The solution is of course to maintain a single source of truth, but with UI application this is sometimes quite tricky.
Monotonic state is also better than mutable state. If you must distribute state, think ownership. Who owns it? E.g. there's nothing necessarily wrong with having state owned by a mobile client that can be adjusted by the user. Then you can sync it to the backend if you want, but the backend is only a reader/listener, and should never try to control it directly.
> Even good system design advice can be kind of bad. I love Designing Data-Intensive Applications, but I don’t think it’s particularly useful for most system design problems engineers will run into.
But it continues to do the same throughout the rest of its advice. It also says:
> ... Drawing the line here is a judgment call and depends on specifics,
And immediately mentions:
> but in general I aim to have my tables be human-readable ...
Which to me reads as "I'm going to ignore the difference of the context everywhere and instead apply mine for everyone, and I'm going to assume most of the world faces the same problems as me". It's even worse than the book being criticized in the beginning, as the book at least has "Data-Intensive" in its title.
This is quite easily fixable. The author can describe the typical scenario they are working with on a day-to-day basis. Do they work with 10 users a day? 100? 10,000,000? What is the traffic? How many engineers? What's the situation of the team/company; do FIXMEs turn into fixes, or do they become "it's a feature"? And so on.
In the end, without setting a baseline, a lot of engineers will start pointing fingers at each other dismissing the opposite ideas because it doesn't fit their situation. The reasoning might be true, but before that, it is "irrelevant", hence any opposition to or defending of it.
I do wonder about why the author left out testing, documentation and qa tool design though. To my mind, writing a proper phpcs or whatever to ensure everyone on the team writes code in a consistent way is crucial. Without documentation we end up forgetting why we did certain things. And without tests refactors are a nightmare.
I am firmly in the former camp. In my opinion databases should be for storing and retrieving data as quickly and efficiently as possible. But the consensus in the database world seems to be that databases are primarily for enforcing business rules and domain models with foreign key constraints, triggers, views, transactions, type safety, domain modeling of relations, and on and on – some of which is at odds with storing and retrieving data efficiently.
Almost all the tools I've seen are either fully event-sourced or have nothing to do with event-sourcing. There aren't a ton of in-betweens.
I get it, but that sounds very finicky code to get right and a good source of hard-to-debug bugs.
1. Essentially every RDBMS except MySQL has a RETURNING clause, so you can read your write essentially for free (and even MySQL lets you read the auto-increment value it inserted).
2. Barring that, how is this finicky to get right? If you get an ack back from the DB, assuming you haven’t done silly things to fsync settings, it’s written. It’s durable. So, within the same try/except or similar, take what you just told it to write, and use it. If you don’t get an ack back, it did not write, so don’t use what you had stored.
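A minimal sketch of point 2 using Python's built-in sqlite3 module. The `todos` table and `create_todo` helper are made up; it uses the cursor's `lastrowid` rather than RETURNING, since RETURNING support depends on the database and version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE todos (id INTEGER PRIMARY KEY, title TEXT NOT NULL)")


def create_todo(conn: sqlite3.Connection, title: str) -> dict:
    # If the INSERT or the commit fails, an exception propagates and
    # nothing is returned, so the caller never uses unwritten data.
    with conn:
        cur = conn.execute("INSERT INTO todos (title) VALUES (?)", (title,))
    # The write was acknowledged: combine what we sent with the generated id
    # instead of issuing a second query to read it back.
    return {"id": cur.lastrowid, "title": title}


todo = create_todo(conn, "write tests")
```

This only holds when nothing on the database side (defaults, triggers) changes the row in ways the application cannot predict; in that case a RETURNING clause or a read-back is the simpler option.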
You could also contribute to an open source project like kubernetes or postgres to get your feet wet.
Like all things, the best way to get better is to do it.
* Fundamental books/courses on distributed systems that will help you understand the internals of most distributed systems and algorithms (DDIA is here, even though it's not even the most theoretical treatment)
* Hand-wavy cookbooks that tend to oversimplify things, and (I am intentionally exaggerating here) teach to reason like "I have assumed a billion users, let's use Cassandra"
I liked the article for its focus on real systems and the sensible rules of thumb instead of another reformulation of the gossip protocol that very few engineers will ever need to apply in practice themselves.
In all seriousness, this is an extraordinarily subtle and complex area, and there are few rules.
For example, "if you need data from multiple tables, JOIN them instead of making separate queries and stitching them together in-memory" may be useful in certain circumstances. For highly scalable consumer systems, the rule of "avoid joins as much as possible" can work a lot better.
There is also no mention of how important it is to understand the business - usage patterns, the customers, the data, the scale of data, the scale of usage, security, uptime and reliability requirements, reporting requirements, etc.
Wholly agree.
It would be an extremely long article if it went into detail on everything.
The ideal solution: Avoid having five different services all write to the same table.
If five different services have to write to the same table, there is a major overlap of logic too. Are the five services really different or one would suffice?
Taking practical realities into consideration, we can do what the author says. However, we risk implementing a lot of orchestration logic. We introduce a whole new layer of problems. Is that time not better spent refactoring the services: either give them their own DB tables or merge them into one service?
A software system trades problems for different problems. e.g. We will manage your TODO list, provide consistency, durability, security, better than you could do yourself. But in order to get these benefits you have to understand our model, we have TODOs, users, lists, permissions, etc.
Decisions about the interface (what problems the system presents to the users) are the most consequential, and the most costly to get wrong. If you aren't spending most of your time arguing about the interface, then you are wasting your time arguing about things that are comparatively easier to change later. Literally everything else about the system can be changed without bothering the users.
Very recently discussed here a week ago: https://news.ycombinator.com/item?id=44840693
I've seen engineers have servers spin up lambdas to do async jobs that are just database calls.
So the server essentially waits for lambda which waits for a database. Why? Why can't you just have the server wait for the database?
It's like I'm going to pay a person to wait in line for me while I wait for him. Why? You're waiting anyway!? And you just paid to involve an additional person to unnecessarily wait with you for what?
When I told the engineer that you can just spin up a coroutine, or maybe allocate some more cores before you spin up a new server... he looked at me like I was crazy. He said I was doing things so low level it was like assembly language programming: going too low level, when lambdas were so cheap it was inconsequential.
If you're reading this and you're thinking, wow that other engineer is right, well this quote from the article refers to you:
"I’m often alone on this. Engineers look at complex systems with many interesting parts and think “wow, a lot of system design is happening here!” In fact, a complex system usually reflects an absence of good design."
> Sometimes you want to roll your own queue system.
I have never wanted to do this.
> For instance, if you want to enqueue a job to run in a month, you probably shouldn’t put an item on the Redis queue.
This specific requirement sounds more like a cron job use case, not a queue case.
> In this case, I typically create a database table for the pending operation with columns for each param plus a scheduled_at column. I then use a daily job to check for these items with scheduled_at <= today, and either delete them or mark them as complete once the job has finished.
At this point I've decided the author doesn't understand when or why to use a queue. For a strictly scheduled event which must occur on day = (today + N), the proposed approach seems fine. However if you use this for a typical queue use-case you will end up reimplementing a queue in your database (poorly). This is typically far more complex than just using a managed queue service if it's available to you. By more complex, I mean more lines of code and more oncall burden. Furthermore, if you grow, a managed queue is often something that needs very little hand holding. Queues are great because they are simple- both at low and high volume. They are a technology that "just works", and often provide a lot of helpful features out of the box.
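For the strictly scheduled case, the quoted table-plus-daily-job approach might look roughly like this. It is a sketch with Python's built-in sqlite3 module; the table layout, `enqueue` and `run_due` are assumptions based on the quote, and it marks items complete rather than deleting them.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pending_operations (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,
        action TEXT NOT NULL,
        scheduled_at TEXT NOT NULL,  -- ISO date, so string comparison sorts correctly
        completed_at TEXT
    )
""")


def enqueue(conn, user_id: int, action: str, scheduled_at: date) -> None:
    with conn:
        conn.execute(
            "INSERT INTO pending_operations (user_id, action, scheduled_at) "
            "VALUES (?, ?, ?)",
            (user_id, action, scheduled_at.isoformat()),
        )


def run_due(conn, today: date, handler) -> int:
    """Daily job: run every incomplete item with scheduled_at <= today."""
    due = conn.execute(
        "SELECT id, user_id, action FROM pending_operations "
        "WHERE scheduled_at <= ? AND completed_at IS NULL "
        "ORDER BY scheduled_at, id",
        (today.isoformat(),),
    ).fetchall()
    for op_id, user_id, action in due:
        handler(user_id, action)
        with conn:
            conn.execute(
                "UPDATE pending_operations SET completed_at = ? WHERE id = ?",
                (today.isoformat(), op_id),
            )
    return len(due)
```

Note what is missing: locking for concurrent workers, retries, and handling of a handler that crashes mid-run. Adding those is where this starts turning into the homegrown queue the comment above warns about.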
I don't know the author, but I've encountered this engineering philosophy before. It's a perspective often held by engineers whose ideas haven't been stress-tested by the long-term realities of a successful business.
It's the kind of advice you can sell to the 99% of startups that fail, and because they fail for other reasons, you never have to be proven wrong.
Think really carefully about breaking transactionality unless absolutely needed. It has been the single biggest source of problems I have seen over the past few years.
Keeping two different systems in sync is really hard. Do not do it if you don't have a real need to.
Monoliths are really good. There is absolutely no need to run microservices, or any services for that matter, other than a single monolith. The place where I work is generating billions of dollars with a single monolith. Having said that, there will come a time when some logic has to go to a different service. If you get there, your company is really, really successful :) .
Relational databases can do a lot more than what you think and they can absolutely scale well.
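On the transactionality point, a small sketch of what you get for free while both writes live in one database. The `accounts` table and the `transfer` function are made up for illustration, using SQLite from the Python standard library:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> bool:
    """Move money between two rows; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # This violates the CHECK constraint when src lacks the funds.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 150)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

The credit to bob is rolled back when the debit fails. Once the two rows live in different systems, that rollback is something you have to build and operate yourself.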
A good system needs to be as easy to understand and interpret as possible. A good system design is so mind-numbingly simple that a nincompoop can understand it. The only deviations from this policy should stem from other requirements like storage, performance, etc.
One thing that I often add is the people interacting with the system. They’re a part of it too. Most people don’t operate in an atomically consistent world; a lot of business processes are eventually consistent. But you do need to know where you have to have atomic operations! It depends on where the user expects it.
Systems thinking is very useful. From how your software is deployed to how the people using it in their work. Always be thinking about these things.
Maybe it’s just harder to design with something simple that is possible to extend and build on. Maybe I am missing something. That said I agree with the author from my decade of experience.
If the author meant “dictionary” in a sense of a hash map, that’s not quite correct. In relational databases, indexes are usually B-trees, which are ordered, unlike hash maps. A B-tree can help with range-searches, ORDER BY and even merge joins, not just equality-searches.
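A rough illustration of the difference, with a sorted list plus binary search standing in for a B-tree's key order (a real B-tree is a paged tree structure, so this is an analogy only, and the sample data is made up):

```python
import bisect

ages = {"ann": 31, "bob": 25, "cy": 42, "dee": 25, "eve": 37}

# Hash-map style index: good for equality lookups, but it has no useful order.
by_age_hash: dict[int, list[str]] = {}
for name, age in ages.items():
    by_age_hash.setdefault(age, []).append(name)

# Ordered index: (key, row) pairs kept sorted by key.
by_age_sorted = sorted((age, name) for name, age in ages.items())
keys = [age for age, _ in by_age_sorted]


def range_search(lo: int, hi: int) -> list[str]:
    """Names with lo <= age <= hi, found by two binary searches and a slice."""
    start = bisect.bisect_left(keys, lo)
    end = bisect.bisect_right(keys, hi)
    return [name for _, name in by_age_sorted[start:end]]


equal_25 = by_age_hash[25]
between_25_and_31 = range_search(25, 31)
ordered_by_age = [name for _, name in by_age_sorted]  # ORDER BY age needs no extra sort
```

The hash version can answer `age = 25` but has nothing to offer for `age BETWEEN 25 AND 31` short of scanning every key. The ordered version handles equality, ranges, and sorted output from the same structure.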
And I’m wondering why investors tolerate the expense: it’s surprising how much you can get done with simplicity and a small focused team.
It must be something about perception when they’re ready to sell the company.
Anything like this is trivially dismissible as absolute hogwash. It’s a shame that titles like this actually get the clicks needed to encourage more bullshite in the same vein.
Leadership can't tell the difference. If anything, the worse designs seem more impressive. Engineers often enjoy a bad design because it creates more work and job security. When there's a lot of work, it seems like you're getting things done. Managers enjoy a bad design because it helps empire building: now that there is more work, we need to hire more people and do more manager-y things. There are also a lot of inexperienced engineers in the workforce who have never seen a well-designed system.
In an organization running these badly designed systems, it's political suicide to argue the design is bad. If the business is successful, even more so, because everyone will assume that a successful business means well-designed software. A successful business will directly reward a bad design.