https://twitter.com/sarahmei/status/1348477224467394560
My understanding was always that for relational data (e.g., social networks) you should use a relational database. Is the person in this tweet correct? If so what is a better option?
Once you reach a large scale, relational databases start to become a problem for availability and for replicating data across different availability zones. Operations get complicated (replication chains, master/slave setups, etc.).
If your data is relatively simple and doesn't require a lot of relations and foreign keys, then something like Cassandra can save a lot of headaches.
Btw, a common trick to make a relational database perform at scale by limiting joins is to "flatten data", i.e. replicate data across different tables to avoid joining them.
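A minimal sketch of that flattening (denormalization) trick, using SQLite via Python for concreteness; the table and column names here are invented for illustration. The author's display name is copied onto each post at write time, so rendering a feed never needs a join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    -- Denormalized: author_name is copied from users at write time,
    -- so reading a feed never joins against users.
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        author_id INTEGER,
        author_name TEXT,
        body TEXT
    );
""")
conn.execute("INSERT INTO users VALUES (1, 'sarah')")
conn.execute("""
    INSERT INTO posts (author_id, author_name, body)
    SELECT id, name, 'hello' FROM users WHERE id = 1
""")

# Feed query: a single-table read, no join needed.
row = conn.execute("SELECT author_name, body FROM posts").fetchone()
print(row)  # ('sarah', 'hello')
```

The trade-off, of course, is that a rename now has to touch every copied row instead of one.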
Finally, don't let yourself be fooled by anyone who claims they know "the better option." There is no better option. There is only a better option for the particular use case you're looking at, given the specific constraints at hand. That's what engineering is about, including software engineering.
If you want to learn more about designing storage systems by constraints, I recommend that you read the 2007 Dynamo paper from Amazon, and in particular section 2.3 "Design Considerations". Below is a link, you can easily find a PDF online if you need.
https://www.allthingsdistributed.com/2007/10/amazons_dynamo....
Basically all meaningful data in an application context has relationships. There's no real such thing as "non-relational data".
Instead, the real question is whether you want a planned, enforced schema or an unplanned, freeform one.
Use a SQL database for the former.
Take a long look in a mirror and question the decisions that made you the way you are if the latter.
It’s a fantastic way to build an exit strategy for situations where the original developers picked a NoSQL database and the current developers want to replace it with a more structured SQL database.
Mixing paradigms in one database is probably going to happen more often. Just as Postgres offers a 'document' style, some document databases are offering documents with relations. It wouldn't surprise me to see document databases offer optional schemas. I think the future is a mix of options and tools in one database (JSONB columns are a first step toward that). Depending on the situation, we'll just model differently. The best database might become the one that lets us use these different tools most elegantly together. The difference between a document and a table is only a 'flatten' and a 'schema' away.
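That mix can be sketched in a few lines. Here SQLite's `json_extract` stands in for Postgres JSONB operators, and the table and field names are invented for illustration: one column carries an enforced relational schema, the other a freeform document.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A fixed relational column (kind) next to a freeform JSON column (payload)
# in the same table -- both paradigms in one place.
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, kind TEXT, payload TEXT)")
conn.execute(
    "INSERT INTO events (kind, payload) VALUES (?, ?)",
    ("signup", '{"plan": "pro", "referrer": "newsletter"}'),
)

# One query mixes the enforced column and a path into the schemaless part.
plan = conn.execute(
    "SELECT json_extract(payload, '$.plan') FROM events WHERE kind = 'signup'"
).fetchone()[0]
print(plan)  # pro
```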
Graph databases are considered 'NoSQL', yet they have relations and transactions. Schemalessness is often also given as a property of NoSQL, but it's a bit strange to consider that a NoSQL attribute: some traditional databases offer schemaless options, and databases like Cassandra have a schema yet are considered NoSQL. I work at Fauna, which has relations and stronger consistency than many traditional databases. It is schemaless at this point, but that might change in the future. Since it doesn't offer SQL, it's thrown into the NoSQL bucket with all the assumptions that come along with it.
None of these one-liners in computer science make sense IMHO and we listen way too often to colleagues who use them. Similarly "Use SQL for enforced schema" might be accurate in many cases but in essence it depends on your situation, and we need to do research about what we use instead of following one-liners ;)
Storing more context in a document obviously helps because you don't have to fetch the data many times; it's also done in relational databases whenever needed. But you can't store a lot in one document; that neither scales nor works.
For example, if someone changes their avatar or wants to delete their account, do you want to scan all your social-network documents to update an avatar, or to remove that user's comments from the subset of documents that contain them? If a post is popular, are you going to update its document thousands of times per second?
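This is exactly what a normalized schema buys you. A sketch with SQLite via Python (table names invented for illustration): the avatar lives in one place, so changing it is a single UPDATE, and no comment rows are touched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, avatar_url TEXT);
    CREATE TABLE comments (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT);
""")
conn.execute("INSERT INTO users VALUES (1, 'old.png')")
conn.executemany("INSERT INTO comments (user_id, body) VALUES (1, ?)",
                 [("first",), ("second",), ("third",)])

# One UPDATE in one place; no comment documents to rewrite.
conn.execute("UPDATE users SET avatar_url = 'new.png' WHERE id = 1")

# Every comment sees the new avatar through the join.
rows = conn.execute("""
    SELECT c.body, u.avatar_url
    FROM comments c JOIN users u ON u.id = c.user_id
""").fetchall()
print(rows)  # all three rows carry 'new.png'
```

With the avatar embedded in every comment document instead, the same change would mean rewriting each of those documents.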
In practice you will most likely find a mix of everything: relational databases, in-memory data stores, cache layers, perhaps a few NoSQL document databases, some big-data stuff, and probably some Excel sheets.
The limitation that Sarah Mei identifies as clownpants is using a 32-bit primary key as the identifier for an ephemeral thing. That, again, has nothing to do with SQL vs NoSQL; it would affect both the same way.
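For scale, the ceiling in question: a signed 32-bit key tops out at 2^31 - 1, which a busy platform can burn through quickly (the write rate below is a made-up figure, purely for arithmetic).

```python
# Ceilings for signed 32-bit and 64-bit integer primary keys.
INT32_MAX = 2**31 - 1   # 2147483647
INT64_MAX = 2**63 - 1   # 9223372036854775807

# At a hypothetical 500 new rows per second, the 32-bit key space
# is exhausted in roughly 50 days:
days = INT32_MAX / 500 / 86400
print(INT32_MAX, round(days))  # 2147483647 50
```

A 64-bit key pushes the same exhaustion point out by nine orders of magnitude, which is why it's an implementation choice rather than a SQL-vs-NoSQL issue.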
So far I have never used NoSQL.
NoSQL is for incompetent people who can't figure out how to convert a JSON request to a table structure. They just put the entire JSON as it is in a DB and call it NOSQL.
Anyone using NoSQL for anything is either lying or clueless.
If you’re dealing with enough data, you want to start having an index here or there to speed up the queries you’re developing for whatever reason. I’ve done some data-exploration work against things like the Python Package Index, where the size of the dataset had put off more significantly resourced groups, who abandoned similar projects in the past. For me to get useful data out of it in a reasonable timespan required repeatedly prototyping the queries, what data I kept, and how I was loading it. UnQLite (sort of a NoSQL version of SQLite) was my secret weapon. I could build things in iterations extremely easily, with each generation of the code wrapped up in simple scripts, which made it easy to measure the timing of each stage and optimise things. I eventually got the answers I was looking for with a job that took less than an entire weekend to run, and an optimised data set that only ended up keeping a few GB of the much larger data I started with. This would have been WAY more work for me with a SQL database.
Why would easily measuring the timing of each stage be way more work with an SQL database but way easier with a NoSQL database?
Besides being a good read overall, the book discusses topics like this one in detail and with a healthy attitude (people tend to have strong opinions on this)
I thought you were supposed to use a graph database for that, like dgraph. Do I remember incorrectly?
> Dgraph is a horizontally scalable and distributed GraphQL database with a graph backend.
---
Edit: found the source... According to https://www.infoworld.com/article/3251829/why-you-should-use...:
"However, as with any popular technology, there can be a tendency to apply graph databases to every problem. It’s important to make sure that you have a use case that is a good fit. For example, graphs are often applied to problem domains like:
- Social networks
- Recommendation and personalization
- Customer 360, including entity resolution (correlating user data from multiple sources)
- Fraud detection
- Asset management"
Realistically, it just seems like a low-effort attempt to dunk on Parler for likes and retweets.
- do you need relational data, or something simpler, or something more flexible?
- do you need transaction integrity? Transaction integrity is a nice feature, but you can also design all your code so that if something blows up "in the middle", it is somehow repaired automatically by a later event.
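What transaction integrity buys you, in one sketch (SQLite via Python; the table and the "transfer" scenario are invented for illustration): when something blows up "in the middle", the rollback undoes the partial work, so there is nothing to repair with a later event.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()

try:
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 100 WHERE name = 'a'")
        # Simulated crash between the debit and the credit:
        raise RuntimeError("crash in the middle of the transfer")
        conn.execute("UPDATE accounts SET balance = balance + 100 WHERE name = 'b'")
except RuntimeError:
    pass

# The debit was rolled back automatically; no repair event needed.
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'a': 100, 'b': 0}
```

Without the transaction, the debit would persist and you would indeed need the compensating-event machinery the comment describes.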
Maybe a third point: most of our relational/transactional database technology is quite old. Could we do better than the SQL query language, the common database types, and the actual database code, which was heavily optimized for spinning magnetic disks but maybe is not optimized for SSDs? Maybe we would need something like SQLV2.
And my god how much hype bullshit is inserted in those technical discussions.
SQL will (maybe sadly?.. maybe not?) not go away. Many so-called 'NoSQL' databases are looking into providing SQL, or already provide SQL (with or without limitations), to their users, because users just want what they already know. I would be stoked for an SQLV2 standard!
Of course, you will use materialized views for even better performance.
> ...the worst are full of passionate intensity.
that said, it's difficult to feel sympathy for people supporting a platform that encourages terrorism, murder, etc.
The type of join shouldn't be a problem; SQL engines should in most cases be able to determine the best join. In the cases where they can't, you can start tweaking (tricky to get right, especially if your data evolves, but possible; you probably want to pin your query plan). B is, however, tricky and a performance loss, since it's a bit silly that data is flattened into a set each time, only to then (probably) be put back into a nested (object-oriented or JSON) format to hand the data to the client.

This is closely related to C: in a social graph you might have nodes (popular people or tweets) with a much higher number of links than others. That means that if you do a regular join on tweets and comments and sort it, you might not get beyond the first tweet. Instead, you probably only want the first x comments per tweet, which means the query has to produce a number of nested groups. So it might look more like the following SQL (written from memory, so possibly not exactly right):
SELECT tweet.*, c.comments FROM tweet
JOIN LATERAL (SELECT jsonb_agg(to_jsonb(comment)) AS comments
              FROM (SELECT * FROM comment WHERE comment.tweet_id = tweet.id LIMIT 64) comment) c ON TRUE
LIMIT 64;
That obviously becomes increasingly complex if you want a feed with comments, likes, retweets, people, etc., all in one. There are reasons why two engineers who helped scale Twitter created a new database (https://fauna.com/), where I work. Although relational, the relations are done very differently: instead of flattening sets, you essentially walk the tree and join at each level. I made an attempt to explain that for the GraphQL case here: https://www.infoworld.com/article/3575530/understanding-grap...
TLDR: in my opinion you can definitely use a traditional relational database, but it might not be the most efficient choice due to the impedance mismatch. Relational applies to more than traditional SQL databases, though; a graph database or something like Fauna is also relational and would be a better match (Fauna is similar in the sense that its joins work much like a graph database's). Obviously I'm biased, though, since I work for Fauna.