Think about it for a sec: SQL is literally just a language to query and manipulate data. There's no reason that schema changes and data changes have to happen only through the one language, and only through one interface on one piece of software.
For whatever reason, this has just been how the most popular products have done it, and they largely just never changed their designs in 40 years. I like the language, and the general organization of the data is handy. But everything else about it is archaic.
Why fumble around with synchronization? 99% of the data in big datasets doesn't change. This doesn't even have to be "log-based", we just need to be able to ship the old, stable data and treat it almost like "cold storage".
Why is there a single point of entry into the data? You have to use the one database cluster to access the one database and the one set of tables. Why can't we expose that same data in multiple ways, using multiple pieces of software, on multiple endpoints?
Other protocols and languages have ways of dealing with these kinds of things. LDAP can refer you to a different endpoint to process what you need. Web servers can store, process, and retrieve the same content across many different endpoints in a variety of ways. Lots of technology exists that can easily replicate, snapshot, version-control, etc arbitrary pieces of data and expose them to any application using standard interfaces.
Why haven't we created a database yet which works more like the Unix operating system?
There are practically many ways of talking to database systems, if it isn't too troubling that some SQL is often happening somewhere. Like, there's Hasura, postgrest, etc.; or Mongo has a variety of drivers that support different inputs.
One might consider the most unix'y database to be Berkeley DB/Sleepycat, but that is probably not what you wanted. ;)
Not to be overly snarky, but have you tried? Database design is full of trade-offs.
I have a pet project I'm working on, which is a generi distributed system where each component is a microservice. It turns out there's lots of these things built already, mainly by systems engineers for obscure things (Airflow, Rundeck, Stackstorm being some examples). I'll probably think about how I can redesign my project with composeable databases in mind. I don't expect I'll ever have a working product, but it'll be useful to think about this problem.
There are many query languages and having one common one as a base is useful to transfer skills. Think of it as an on ramp to more specific dialects or technologies.
Do you need a distributed database to read a .txt file with cat, Emacs, and Firefox? No. Why not? Because there's an I/O standard they all use. Does that .txt file have to live on a single filesystem, or disk? No. Why not? Because the storage mediums all have a standard virtual filesystem interface.
There is no reason databases cannot do exactly the same thing. It's just that nobody has made them do it yet. They've stuck with the exact same paradigm, and that then drives how all of us build our systems, with this archaic 40 year old model that requires heavy-lifting maintenance windows and a lot of praying.
> Why fumble around with synchronization? 99% of the data in big datasets doesn't change. This doesn't even have to be "log-based", we just need to be able to ship the old, stable data and treat it almost like "cold storage".
This is not a feature of SQL, this is a feature of the database. Also, this sounds exactly like doing full-table replication to get the "old" data and then turning on log-based replication. You can do key-based replication if you really want to avoid log-based, but it's generally just a less efficient version of log-based replication.
> Why is there a single point of entry into the data? You have to use the one database cluster to access the one database and the one set of tables. Why can't we expose that same data in multiple ways, using multiple pieces of software, on multiple endpoints?
You can. Postgres supports both Perl and Python extensions that run in the RDBMS process, iirc. Very few people use them because running in the RDBMS process means that you can break the RDBMS process in really bad ways, and it is very difficult to gain any benefits over just running a separate process that communicates over SQL.
So if you consider other processes that communicate with the database and then show views of that over other protocols, that describes most of the backend apps in the world.
There's also stuff like Presto[1] that allows you to run queries distributed over multiple databases, multiple types of databases, etc, etc, etc. In that case, conceptually, Presto is "the database" and all the records you refer to are remote.
Yet they always seem tied together eh? Somehow the conventions are stuck together, and that then affects how our systems work.
> Postgres supports both Perl and Python extensions that run in the RDBMS process
But I'm talking about not having to use the RDBMS process. If I have a text file on the disk, I can use a million different programs to access it. I don't have to run one program through another program just to open, read, write, and close the file with any program. Why don't we design our databases to work this way?
> Very few people use them because running in the RDBMS process means that you can break the RDBMS process in really bad ways
Yes, it does sound bad. That's why I'd prefer an indirect method rather than having to wedge access through the RDBMS
> So if you consider other processes that communicate with the database and then show views of that over other protocols, that describes most of the backend apps in the world.
Yep! We architect entire systems-of-systems just because the model for our data management in an RDBMS is too rigid. We're building spaceships to get to the grocery store because we haven't yet figured out roads and cars.
People get very comfortable very quickly, even tech savvy folks. Having to learn another language will scare many away, even though the effort might be the same - it's perceived harder.
(I assume data for analytical purposes is not stored in their primary DB, which is fair to assume I believe)
I don't know if it would be worth the engineering effort to try and archive old data in a way that still makes it transparently accessible to users that go looking for it--especially when modern databases ought to be able to scale up and out without manual archiving.
5.7 is enormous by database standard , there is no way you can get good query performance on a 5.7 tb database without solid physical partitioning and heavily optimized queries, and most normal companies even with 200-500 GB database use datamarts to have good performance without a super complex architectures and geniuses working fro you in db admin department
the more i think about it, the more i think that 5.7 TB would be unusably huge, and if you have this much data, most wont even bother to partition, the db will be broken into several (hundreds) smaller databases
The only thing I can imagine generating 600k per user is putting something like "hit tracking" in the database. Which I've done - yes it adds up - but it's also relatively easy to move to some other kind of store.
I worked for an e-commerce site, with a few million customers, even more orders, data-duplication all over the place, and still we where using a perhaps a 200GB of database storage.
Now whether several smaller DBs is easier to manage compared to one big one, it would be debatable. However with that huge DB, I would prefer having several smaller one.
Wait a second. You run your production database on ephemeral storage? Wow.
I see the replication setup and the S3 WAL archiving and whatnot but still... that's brave.
May not be as durable as EBS, but it's enough for me to sleep soundly at night. And with a highly concurrent WAL-G download, it takes like an hour to catch up a new replica from scratch.
Netflix went full ephemeral storage for their Cassandra clusters since the beginning, at the time when they were just spinning disks. Years later, they still insist on doing this, and had to come up with creative solution to fix the uptime issue: https://netflixtechblog.medium.com/datastore-flash-upgrades-...
I wonder how big the delta was for CMB between EBS and ephemeral?
My experience was that it needs to rewrite all the data for some tables/indexes under some circumstances, and the db won't be available while that happens. So, unless your db can be down for the time it takes to rewrite all that data it isn't really an option. After having done the upgrade through streaming logical replication I'm not sure I would try pg_upgrade again.
I did the pg_upgrade style update a long time ago, so most details are fuzzy, but I remember setting up a string of replicas something like:
primary -> [read_replica, backups_replica]
read_replica -> [upgrade_replica]
upgrade_replica -> [read_replica_upgraded, backups_replica_upgraded]
This allowed us to do multiple practice rounds without putting any unnecessary load on the primary. I think we needed to re-initiate replication off 'upgrade_replica' after the upgrade, but we did the live update during low-load so the extra read load wasn't an issue.
The new cluster can then be pg_upgraded and rsynced all at once.
To be fair, our use case was probably close to pathological for pg_upgrade. We had lots of TOAST data and dozens to hundreds of indexes per table.
Once solved though horizontal is nice, if more involved to maintain.
You'll have to accept that transactions can fail due to conflicts, so if they are interactive, you'll have to retry manually.
Edit: I'd like hear criticism, instead of just seeing disapproval.
We thought about migrating to Citus, but I don't have a good idea of how to shard our dataset efficiently.
If we were to shard by user id, then creating a match between two people would require cross-shard transactions and joins. Sharding by geography is also tough because people move around pretty frequently.
My best guesses are
- either it is SAAS in which case shard it should make sense to shard by customer
- or it is something-to-consumer (social networking?) on which case I guess you'll have to take a step back and see if you can sacrifice one of your current assumptions
... but I feel I'm missing something since what I am saying feels a bit trivial.
That does seem like a fun exercise :).
The general rules of the game are: You can only scale up throughput of queries/transactions that only access 1 shard (some percentage going to 2 shards can be ok). You can only scale down response time of large operations that span across shard since they are parallelized. You should only join distributed tables on their distribution column. You can join reference tables on any column.
The thing that comes to mind is to use a reference table for any user data that is used to find/score matches. Reference tables are replicated to every node and can be joined with distributed tables and each other using arbitrary join clauses, so joining by score or distance is not a problem, but you need to store the data multiple times.
One of the immediate benefits of reference tables is that reads can be load-balanced across the nodes, either by using a setting (citus.task_assignment_policy = 'round-robin') or using a distributed table as a routing/parallelization scheme.
CREATE TABLE profiles (
user_id bigint primary key,
location geometry,
profile jsonb
);
SELECT create_reference_table('profiles');
CREATE TABLE users (
user_id bigint primary key references profiles (user_id),
name text,
email text
);
SELECT create_distributed_table('users', 'user_id');
-- replicate match_score function to all the nodes
SELECT create_distributed_function('match_score(jsonb,jsonb)');
-- look up profile of user 350, goes to 1 shard
SELECT * FROM users u, profiles p WHERE u.user_id = p.user_id AND u.user_id = 350;
-- find matches for user #240 within 5km, goes to 1 shard
SELECT b.user_id, match_score(a.profile, b.profile) AS score
FROM users u, profiles a, profiles b
WHERE u.user_id = 240 AND u.user_id = a.user_id
AND match_score(a.profile,b.profile) > 0.9 AND st_distance(a.location,b.location) < 5000
ORDER BY score DESC LIMIT 10;
The advantage of having the distributed users table in the join is mainly that you divide the work in a way that keeps each worker node's cache relatively hot for a specific subset of users, though you'll still be scanning most of the data to find matches.Where it gets a bit more interesting is if your dating site is opinionated / does not let you search, since you can then generate matches upfront in batches in parallel.
CREATE TABLE match_candidates (
user_id_a bigint references profiles (user_id),
user_id_b bigint references profiles (user_id),
score float,
primary key (user_id_a, user_id_b)
);
SELECT create_distributed_table('match_candidates', 'user_id_a', colocate_with :='users');
-- generate match candidates for all users in a distributed, parallel fashion
-- will generate a match candidate in both directions, assuming score is commutative
INSERT INTO match_candidates
SELECT a.user_id, b.user_id, match_score(a.profile,b.profile) AS score
FROM users u, profiles a, profiles b
WHERE u.user_id = a.user_id
AND match_score(a.profile,b.profile) > 0.9 AND st_distance(a.location,b.location) < 5000
ORDER BY score DESC LIMIT 10;
For interests/matches, it might make sense to have some redundancy in order to achieve reads that go to 1 shard as much possible. CREATE TABLE interests (
user_id_a bigint references profiles (user_id),
user_id_b bigint references profiles (user_id),
initiated_by_a bool,
mutual bool,
primary key (user_id_a, user_id_b)
);
SELECT create_distributed_table('interests', 'user_id_a', colocate_with :='users');
-- 240 is interested in 350, insert into 2 shards (uses 2PC)
BEGIN;
INSERT INTO interests VALUES (240, 350, true, false);
INSERT INTO interests VALUES (350, 240, false, false);
END;
-- people interested in #350, goes to 1 shard
SELECT * FROM interests JOIN profiles ON (user_id_b = user_id) WHERE user_id_a = 350 AND NOT initiated_by_a;
-- it's a match! update 2 shards (uses 2PC)
BEGIN;
UPDATE interests SET mutual = true WHERE user_id_a = 240 AND user_id_b = 350;
UPDATE interests SET mutual = true WHERE user_id_a = 350 AND user_id_b = 240;
END;
-- people #240 is matched with, goes to 1 shard
SELECT * FROM interests JOIN profiles ON (user_id_b = user_id) WHERE user_id_a = 240 AND mutual;
For data related to a specific match, you can perhaps use the smallest user ID as the distribution column to avoid the redundancy. CREATE TABLE messages (
user_id_a bigint,
user_id_b bigint,
from_a bool,
message_text text,
message_time timestamptz default now(),
message_id bigserial,
primary key (user_id_a, user_id_b, message_id),
foreign key (user_id_a, user_id_b) references interests (user_id_a, user_id_b) on delete cascade
);
SELECT create_distributed_table('messages', 'user_id_a', colocate_with :='interests');
-- user 350 sends a message to 240, goes to 1 shard
INSERT INTO messages VALUES (240, 350, false, 'hi #240!');
-- user 240 sends a message to 350, goes to 1 shard
INSERT INTO messages VALUES (240, 350, true, 'hi!');
-- user 240 looks at chat with user 350, goes to 1 shard
SELECT from_a, message_text, message_time
FROM messages
WHERE user_id_a = 240 AND user_id_b = 350
ORDER BY message_time DESC LIMIT 100;
This exercise goes on for a while. You still get the benefits of PostgreSQL and ability to scale up throughput of common operations or scale down response time of batch operations, but it does require careful data model choices.(Citus engineer who enjoys distributed systems puzzles)
The RDS update process is a single button and you have no way of knowing how long it will take. There are some tricks like turning off Multi AZ and taking a snapshot manually before starting the process but still - for large instances you could be waiting anywhere from 30 minutes to 3 hours for RDS to finish. With large instance types I have seen RDS take a full hour just to provision an instance, in the meantime you'll sit there hitting F5 not knowing if it will ever finish.
Apparently Aurora behaves differently, but I wasn't aware of that when we specced out the project.
Did they migrate into Amazon RDS while doing this? For smaller projects I've stopped doing the self managed postgresql thing. The pricing is higher (75%?) for RDS for some use cases but can be worth it.
Going to try RDS Proxy next.
I'm curious how much risk of data loss this added.
I guess the baseline is "we need to migrate before we run out of disk" I.e. you're either going to have data loss or a long period of unavailability if the migration cannot be carried out fast enough.
Locally attached disks are not ephemeral to instance reboots/power failures. However, the disks are wiped after instance terminations. On the official EC2 product pages this is called "instance storage" not "ephemeral storage."
Also liked the couple of gotchas which go to show no matter how smooth a data migration is, there'll be some bumps.
Just a little side note.
They were jumping from 9.6 to 12, not from 9.0 to 12.
Before Postgres 10 was released, the first two digits defined a "major" version). So from 9.6 to 12 it's three major releases (9.6 -> 10, 10 -> 11, 11 -> 12)
i wish the process for upgrading postgres were easier/more dynamic. i'm sure plenty of people are still using versions 9.6 or earlier.
If something goes wrong with their relationship with AWS, that could be business ending. :(
But this post wasn't about backups, so there might be a whole lot excluded from the diagram.
edit: For more on the nuances of Postgres tradeoffs for replication and transaction isolation: https://www.postgresql.org/docs/9.1/high-availability.html