The big names started using NoSQL-type stuff because their instances got 2-3 orders of magnitude larger and a single relational database no longer worked. All the denormalization adds a lot of other overhead and problems, but if you literally have multi-PB metadata stores, it's not like you have a choice.
Then everyone started copying them without knowing why... and then everyone forgot how much you can actually do with a normal database.
And hardware has been getting better and cheaper, which only makes that more true.
Still not a good idea to store multi-PB metadata stores in a single DB though.
People tend to have a very bad sense of what constitutes large scale. It usually maps to "larger than the largest thing I've personally seen". So they hear "Use X instead of Y when operating at scale", and all of a sudden we have people implementing a distributed datastore for a few MB of data.
Having gone downward in scale over the last few years of my career, it has been eye-opening how many people tell me X won't work due to "our scale", when I can point out that I've already used X in prior jobs at a scale much larger than what we have.
There’s nothing inherently wrong with a big DB instance. The cloud providers have fantastic automation around multi-AZ masters, read replicas, and failover. They even do cross-region or cross-account replication.
That being said, multi-PB would raise an eyebrow.
Of course these systems have warm standbys, dedicated backup infrastructure and so it’s not really a “single machine”; but I’ve seen 80TiB Postgres instances back in 2011.
The biggest issue we have with these giant DBs is that they require pretty massive amounts of RAM. That's currently our main bottleneck.
But I agree. While our design is pretty bad in a few ways, the amount of data we're able to serve from these big DBs is impressive. We have something like 6 dedicated servers for a company with something like 300 apps. A handful of them hit dedicated DBs.
Were I to redesign the system, I'd have more tiny dedicated DBs per app to avoid a lot of the noisy-neighbor/scaling problems we've had. But at the same time, it's impressive how far this design has gotten us, and it appears to have a lot more legs in it.
When I come back to web 1.0 apps, I’m often surprised that a round-trip to the server takes less than 200ms and reloads the page seamlessly, including a full 5ms SQL query returning 5k rows in the page (a full 1MB of data, with basically no JS).
The other consideration that isn't initially obvious, is how you may hit an upper bound for resources in most modern environments. If your database is sitting on top of a virtual or containerized environment, your single instance database will be limited in resources (CPU/memory/network) to a single node of the cluster. You could also eventually hit the same problem on bare metal.
That said there are some very high density systems available. You may also not need the ability to scale as large as I am talking, or choose to shard and scale your database horizontally at later time.
If your project gets big enough you might also start wanting to replicate your data to localize it closer to the user. Another strategy might be to cache the data locally to the user.
There are positives and negatives to a single node versus a cluster. If Retool's database had been clustered, though, they would have been able to do a rolling upgrade.
You can scale quite far vertically and avoid all the clustering headaches for a long time these days. With EPYCs you can get 128C/256T, 128PCIe lanes (= 32 4x NVMes = ~half a petabyte of low-latency storage, minus whatever you need for your network cards), 4TB of RAM in a single machine. Of course that'll cost you an arm and a leg and maybe a kidney too, but so would renting the equivalent in the cloud.
Not saying a single giant server won't work, but it does come with its own set of very difficult-to-solve-once-you-build-it problems.
Sure, that's not the same thing as pure zero-downtime but for many applications it's OK to put the entire thing into read-only mode for a few minutes at a well selected time of day.
While it's in read-only mode (so no writes are being accepted) you can spin up a brand new DB server, upgrade it, finish copying data across - do all kinds of big changes. Then you switch read-only mode back off again when you're finished.
I've even worked with a team that used this trick to migrate between two data centers without visible end-user downtime.
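The read-only trick above can be implemented at the application layer. Here's a minimal sketch as a stdlib-only WSGI middleware; the `READ_ONLY` flag and the middleware itself are illustrative names, not anything from the original post.

```python
import threading

# Hypothetical application-level read-only switch: while the flag is set,
# mutating HTTP methods get a 503 and reads pass through untouched.
READ_ONLY = threading.Event()
MUTATING = {"POST", "PUT", "PATCH", "DELETE"}

def read_only_middleware(app):
    def wrapped(environ, start_response):
        if READ_ONLY.is_set() and environ["REQUEST_METHOD"] in MUTATING:
            start_response(
                "503 Service Unavailable",
                [("Content-Type", "text/plain"), ("Retry-After", "300")],
            )
            return [b"Read-only maintenance window in progress"]
        return app(environ, start_response)
    return wrapped
```

While the flag is set, you can spin up the new DB server, finish copying data across, and then clear the flag once the cutover is done.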
A trick I've always wanted to try for smaller changes is the ability to "pause" traffic at a load balancer - effectively to have a 5 second period where each incoming HTTP request appears to take 5 seconds longer to return, but actually it's being held by the load balancer until some underlying upgrade has completed.
Depends how much you can get done in 5 seconds though!
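The "pause at the load balancer" idea boils down to a gate that requests must pass before being forwarded; holding the gate makes in-flight requests wait instead of fail. A toy sketch, with all names illustrative:

```python
import threading

class TrafficGate:
    """Requests block at the gate while it's paused, then flow again."""

    def __init__(self):
        self._open = threading.Event()
        self._open.set()  # traffic flows by default

    def pause(self):
        self._open.clear()

    def resume(self):
        self._open.set()

    def forward(self, request, backend, timeout=30.0):
        # Block until the gate reopens (or give up after `timeout`),
        # then hand the request to the backend as usual.
        if not self._open.wait(timeout):
            raise TimeoutError("gate held too long")
        return backend(request)
```

In a real deployment this role is played by the load balancer itself (haproxy and others can hold connections); the sketch just shows the sequencing.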
For PostgreSQL or MySQL/MariaDB, your options are more limited. Here are two that come to mind; there may be more:
# The "Dual Writer" approach
1. Spin up a new database cluster on the new version.
2. Get all your data into it (including dual writes to both the old and new version).
3. Once you're confident that the new version is 100% up to date, switch to using it as your primary database.
4. Shut down the old cluster.
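The dual-write phase of step 2 can be sketched as follows. The "databases" here are stand-in dicts; in reality they would be two connection pools, and the old DB stays authoritative until cutover.

```python
class DualWriter:
    """Write to both databases; read only from the old (authoritative) one."""

    def __init__(self, old_db, new_db):
        self.old_db = old_db
        self.new_db = new_db

    def write(self, key, value):
        # The old DB write must succeed; if the new DB write fails we log
        # and continue, since the backfill/verification step catches gaps.
        self.old_db[key] = value
        try:
            self.new_db[key] = value
        except Exception as exc:
            print(f"new-db write failed for {key}: {exc}")

    def read(self, key):
        return self.old_db[key]  # reads stay on the primary until cutover
```

Cutover in step 3 then amounts to swapping which side `read()` points at once the new cluster is verified.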
# The eventually consistent approach
1. Put a queue in front of each service for writes, where each service of your system has its own database.
2. When you need to upgrade the database, stop consuming from the queue, upgrade in place (bringing the DB down temporarily), and resume consumption once things are back online.
3. No service can directly read from another service's database. Eventually consistent caches/projections serve reads during normal operation and during the upgrade.
A system like this is more flexible, but suffers from stale reads or temporary service degradation.
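Step 2 of this approach hinges on a consumer you can pause: writes buffer in the queue while the DB is down, and drain once you resume. A minimal sketch with `queue.Queue`; a real system would use Kafka, SQS, or similar:

```python
import queue
import threading

class PausableConsumer:
    """Drains a queue into a handler, but only while not paused."""

    def __init__(self, q, handler):
        self.q = q
        self.handler = handler
        self.running = threading.Event()
        self.running.set()

    def pause(self):
        self.running.clear()

    def resume(self):
        self.running.set()

    def drain_once(self):
        # Process everything currently queued, respecting the pause flag.
        while self.running.is_set():
            try:
                item = self.q.get_nowait()
            except queue.Empty:
                break
            self.handler(item)
```

While paused, producers keep enqueueing; nothing is lost, it's just delayed, which is exactly the stale-reads trade-off described above.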
1. Route all traffic through pgbouncer in transaction pooling mode.
2. Logically replicate from old to new.
For failover:
1. Ensure replication is not far behind.
2. Issue a PAUSE on pgbouncer.
3. Wait for replication to be fully caught up.
4. Update pgbouncer config to point to the new database.
5. Issue a RELOAD on pgbouncer.
6. Issue a RESUME on pgbouncer.
Zero downtime; < 2s additional latency for in-flight queries at time of op is possible (and I've done it at scale).
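The cutover sequence above can be expressed as a small driver, with the pgbouncer admin commands and replication-lag check stubbed out so the ordering is visible. `send_admin()` and `replication_lag_bytes()` are hypothetical stand-ins; in practice you'd run PAUSE/RELOAD/RESUME against pgbouncer's admin console (e.g. via psql) and read lag from `pg_stat_replication`.

```python
import time

def cutover(send_admin, replication_lag_bytes, rewrite_config,
            lag_threshold=1024, timeout=30.0):
    # 1. Don't pause until replication is nearly caught up.
    deadline = time.monotonic() + timeout
    while replication_lag_bytes() > lag_threshold:
        if time.monotonic() > deadline:
            raise RuntimeError("replication never caught up; aborting")
        time.sleep(0.01)
    send_admin("PAUSE")                   # 2. hold new queries at pgbouncer
    while replication_lag_bytes() > 0:    # 3. wait for full catch-up
        time.sleep(0.01)
    rewrite_config()                      # 4. point pgbouncer at new primary
    send_admin("RELOAD")                  # 5. pick up the new config
    send_admin("RESUME")                  # 6. release the held queries
```

The pre-check in step 1 is what keeps the PAUSE window (and thus the added query latency) down to a couple of seconds.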
Doing an active no-downtime database migration basically involves having a coherent row-level merge policy (assuming you AT LEAST have a per-row last-updated column), or other tricks. Or maybe you temporarily write cell-level timestamps and then drop them later.
Or if you have data that expires on a window, you just do double-writes for that period and then switch over.
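Under the per-row last-updated assumption, the merge policy is just last-write-wins. A minimal sketch, with rows as dicts keyed by ID (purely illustrative; `updated_at` is the assumed column):

```python
def merge_rows(old_rows, new_rows):
    """Merge two snapshots of the same table; newest updated_at wins."""
    merged = dict(old_rows)
    for row_id, row in new_rows.items():
        existing = merged.get(row_id)
        if existing is None or row["updated_at"] >= existing["updated_at"]:
            merged[row_id] = row
    return merged
```

Cell-level timestamps generalize this to a per-column merge at the cost of temporarily fattening every row.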
I'm pretty sure that was the whole idea of RDBMS: to separate application from data. You lose badly the moment some of your data is in a different place (transactions, query planning, security, etc. all suffer), so Codd thought "what if even different applications could use a single company-wide database?" Hence the "have everything in a single database" part should be the last compromise you're forced to make, not the first one.
The attitude “our company = scale and scale = nosql” is prevalent enough that even if you know better, it’s probably in your interest to play the game. It’s the one “scalability fact” everyone knows, and a shortcut to sounding smart in front of management when you can’t grasp or haven’t taken the time to dig in on the details.
NewSQL technology promises to make this more automated, which is definitely a good thing, but unless you are Google or have a use case that needs it, it probably isn't worth adopting until it's more mature.
Stuff like “CRUD enterprise app. 1 large-ish Postgres node. 10k tenants. 100 tables with lots of foreign-key lookups, 100 GB on disk. DB is… kinda slow, and final web requests take ~1 sec.”
The toughest thing is knowing what's normal for multi-tenant data with lots of relational info (compared to larger, more popular companies, which tend to have relatively simple data models).
a lot of "kids these days" don't seem to learn that
by that I mean young folks born into this new world with endless cloud services and scaling-means-Google propaganda
a single modern server-class machine is essentially a supercomputer by 80s standards and too many folks are confused about just how much it can achieve if the software is written correctly
> We reasoned this was likely safe, as Retool’s product logic performs its own consistency checks, and also doesn’t delete from the referenced tables, meaning it was unlikely we’d be left with a dangling reference.
I was holding my breath here and I'm glad these were eventually turned back on.
Nobody should ever rely on their own product logic to ensure consistency of the database.
The database has features (constraints, transactions, etc) for this purpose which are guaranteed to work correctly and atomically in all situations such as database initiated rollbacks that your application will never have control over.
I've built some very high-throughput Postgres-backed systems over the years, and doing application-side foreign key constraints (FKCs) does have its benefits. Doing this client side results in constraints that are usually, but not always, in sync with the data. However, this kind of almost-consistency lets you achieve much higher throughput. An FKC is a read on every write, for example, and does limit write throughput. Of course, this isn't OK for some workloads, and then you do proper FKCs in the DB, but if you don't need absolute consistency, you can make writes far cheaper.
Ideally, everything that needs to change when a row is deleted would be changed automatically and atomically using database-side constraints and triggers. In practice, applications often need to sync state with external services that the database knows nothing about, so I understand your concerns.
ON DELETE RESTRICT, on the other hand, will result in errors just like any other query error that you can handle in your application. Nothing happened, so there's nothing to be notified of.
I’ve seen certain large-table queries improve by 10x when foreign keys were added.
Happy to stick around and answer any questions you have.
As mentioned downthread, we're doing some work soon to remove the audit table from the database, which will cut storage usage by over 50%.
It's written in Python, spins up a queue.Queue object, populates it with ranges of rows that need to be copied (based on min < ID < max ranges), starts up a bunch of Python threads and then each of those threads uses os.system() to run this:
psql "{source_url}" -c "COPY (SELECT * FROM ...) TO STDOUT" \
| psql "{dest_url}" -c "COPY {table_name} FROM STDIN"
This feels really smart to me. The Python GIL won't be a factor here. A similar pipeline works for compressed exports: psql -c "..." | pigz -c > file.tsv.gz

Our database is less than 1TB. One master (for writes + some reads) + multiple slaves (read-only).
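A reconstruction of the threaded copy script described upthread might look like this: chunk the table by ID range, queue the chunks, and let worker threads shell out to psql so the GIL never matters. `source_url`, `dest_url`, and `table_name` are placeholders, and the exact chunking is an assumption.

```python
import os
import queue
import threading

def make_ranges(min_id, max_id, chunk):
    """Split [min_id, max_id] into half-open (lo, hi) ranges of size `chunk`."""
    return [(lo, min(lo + chunk, max_id + 1))
            for lo in range(min_id, max_id + 1, chunk)]

def copy_table(source_url, dest_url, table_name, min_id, max_id,
               chunk=100_000, workers=8):
    jobs = queue.Queue()
    for lo, hi in make_ranges(min_id, max_id, chunk):
        jobs.put((lo, hi))

    def worker():
        while True:
            try:
                lo, hi = jobs.get_nowait()
            except queue.Empty:
                return
            # The actual copying happens in psql subprocesses, so the
            # Python threads spend their time blocked on I/O, not the GIL.
            os.system(
                f'psql "{source_url}" -c "COPY (SELECT * FROM {table_name} '
                f'WHERE id >= {lo} AND id < {hi}) TO STDOUT" '
                f'| psql "{dest_url}" -c "COPY {table_name} FROM STDIN"'
            )

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Each worker streams one ID range at a time, so restartability is as simple as re-queueing the ranges that didn't finish.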
Here's what we did -
1. Launched a new read-only DB with pg9.6; let's call it DB_A.
2. Stopped all offline tasks, and only maintained a minimal fleet of online servers (e.g., web, api...).
3. Changed all DB hosts (whether master or slave) in /etc/hosts on that minimal fleet of online servers to point at the old read-only DB with pg9.6; let's call it DB_B. From this point on, all write ops should fail.
4. Ran pg_upgrade (with --link) on DB_A to upgrade to pg11, and promoted it to be a master DB.
5. Changed /etc/hosts on the minimal fleet of online servers to use DB_A for all DB hosts. By this point, DB_A is a master DB, and write ops should be good again.
6. Changed /etc/hosts for all other servers and brought back all services.
Step 4 is the most critical. If it had failed or run too long (e.g., more than 10 minutes), we would have rolled back by changing /etc/hosts on those online servers.
We carefully rehearsed these steps for an entire week, and timed each step. By the time we did it on production, we knew how many seconds/minutes each step would take. And we tried to automate as many things as possible in bash scripts.
We missed one or two pieces of reconnecting things afterwards, and some of that seems to be limitations of Heroku Postgres that we couldn’t change. Hopefully those keep improving.
https://cloud.google.com/blog/products/databases/cloud-sql-l...
The biggest problem with all the cloud providers is that you won't know exactly when this 10-minute downtime window will start.
I guess the only advantage here is that you don't have to do 9->10->11->12->13->14 like in the past, and maybe that was one of the blockers Azure had. AWS allows skipping some major versions, but 9->14 is not possible.
The benefit of an in-place upgrade for Postgres is that you don't have to spin up another server, restore from backup, and run pg_upgrade yourself.
On top of that, Azure Postgres (limited to pg11) has essentially been deprecated in favor of their v2 Flexible tier, with no official migration path.
If you want to pay multi-thousands dollars a month for a database server, it's WAY cheaper just to slap a server with a ton of drives in colocation.
Yeah, this migration might have been somewhat painful, but it looks like it's a situation the company experiences every few years.
An expense can be applied in full that year, whereas capital expenses vest over years, so only a portion can be applied each year.
=> select pg_size_pretty(pg_database_size ('**'));
pg_size_pretty
----------------
99 TB
(1 row)
(The re-analyze of tables took a day or so though)
It's really just a handful of core people who did most of the work, crafting it so thoughtfully over the years and it has such a huge impact on the world.
Doing a huge part of it before PostgreSQL was as popular as it is today, spending countless hours on making some great design choices and implementing them carefully.
It seems unlikely any of them will read that but I'm so deeply grateful to these people. It allowed so many things to flourish on top thanks to it being open source and free.
Does Postgres still make sense for append-only tables, or would Elasticsearch or another kind of database be more suitable?
We actually migrated from vanilla Postgres to Aurora that way with minimal risk / downtime, it was a really smooth process.
On the other hand, self-built databases maintained by professionals were A LOT cheaper.
We are talking about bills of millions of dollars per month just for databases.
The self-hosted solution cost around 300k per month.
This included precompiled kernel patches, tweaks to the Postgres engine, etc.
Overall, the investment of a year of DBA work will probably return itself ten times over, if not more.
I only wish my Oracle updates were so simple and bug-free.
The idea that you don't get a seamless upgrade of the database itself with PostgreSQL is absurd to me. The part about maximizing the amount of time this upgrade buys is only necessary because of how difficult PostgreSQL makes upgrades. We upgrade to every new version of SQL Server. It's not that big of a deal.
With every PostgreSQL blog article I read, I become more and more of an SQL Server fanboy. At this point it's full-blown. So many "serious business" PostgreSQL ops posts are just nothingburgers in the SQL Server world.
- physical restore to new cluster
- pg_upgrade the new cluster
- catch up on logical wal logs from old cluster
- failover to new cluster
- STONITH
I think the above was not open to them because of the limitations of their managed PG instance. I haven't used Azure, but GCP managed SQL has loads of limitations. It seems very common, and I think it's a major (and undiscussed) drawback of these managed instances.
But the truth is that very few of the people who use PG want to hear that things are better in the MS SQL community for reasons of prejudice and as a result you're being downvoted unfairly for pointing out PGs relative backwardness here.
Here's a nice thing about PostgreSQL over SQL Server to satiate the fans: SQL Server is absurdly expensive to run in the cloud. I can't believe anyone uses RDS for SQL Server. Even in EC2 it's horrifically expensive. That's the main reason we have an on-prem cluster.
Keep in mind that they’re upgrading from a database version that’s almost 6 years old. Postgres has improved a lot in the last 5 major versions since then.
Another thing here is that I’m pretty sure they could have just done the in-place upgrade and it would have been fine. I’ve run pg_upgrade myself for a bunch of major versions now and it’s easy and doesn’t require dumping / restoring anything. Maybe there’s something else going on that I’m missing though.
What setup are you running with sql server to have it automatically failover? Is it a multi master configuration or are the additional nodes just read replicas?
These days Postgres actually allows logical replication so your servers can be running different versions at the same time, which allows for much smoother upgrades (haven’t tried that myself yet, don’t quote me on it!)
SQL Server is designed to run in a Windows Server Failover Cluster; the SQL Server-side feature is called "Always On availability groups" in an HA configuration. It's a single-master arrangement, you can either have a fully passive secondary (that's what we do) or read-only replicas. The WSFC handles managing quorum, that's what causes the automatic failover as soon as >50% of the nodes are running the new version.
EDIT: I realized you're looking for the other PostgreSQL blog posts. Here's an example of two recent HN posts about PostgreSQL issues that I pulled out of my comment history. Both of these blog posts exist because PostgreSQL doesn't have query hints. SQL Server has them; I've dealt with issues like these blog posts describe but they have trivial fixes in the SQL Server world. Nothing to write a blog post about. I don't have a link handy regarding PostgreSQL's txn id wraparound problem, but SQL Server doesn't have that problem, either.
Very small companies might be able to do this on a 'no load' day, but from a pure business perspective, running your DB twice is easier and way less risky.
You could have done this even without downtime by letting your connection proxy handle the switch.
A non-distributed RDBMS is a great (yet underrated) choice. Thank you for the good writeup.
I was thinking you could have a much less delicate migration experience next time (some years from now), so you could go for a quicker parallel "dump and restore" migration.
For example:
- Client side sharding in the application layer: you could shard your customers' data across N smaller DB instances (consistent hashing on customer ID)
- Moving the append-only data somewhere else than postgres prior to the upgrade. You don't need RDBMS capabilities for that stuff anyway. Look at Elasticsearch, Clickhouse, or any DB oriented to time series data.
WDYT?
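The first bullet could be sketched with a classic hash ring: each customer ID maps to a point on the ring, and the next shard clockwise owns it, so adding a shard only moves roughly 1/N of the customers. All names here are illustrative.

```python
import bisect
import hashlib

class HashRing:
    """Consistent hashing of customer IDs onto DB shards, with vnodes."""

    def __init__(self, shards, vnodes=64):
        self._ring = []  # sorted list of (point, shard)
        for shard in shards:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{shard}:{i}"), shard))
        self._ring.sort()
        self._points = [p for p, _ in self._ring]

    @staticmethod
    def _hash(key):
        # Any stable hash works; md5 is just convenient and deterministic.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def shard_for(self, customer_id):
        point = self._hash(str(customer_id))
        idx = bisect.bisect(self._points, point) % len(self._ring)
        return self._ring[idx][1]
```

Since the mapping is deterministic, the application layer can route every query without a lookup table, and geo-placement (the point made in the reply below about non-US customers) is just a matter of which physical instance backs each logical shard.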
The first bullet point is on our radar for the near term. We have a very natural shard key in our schema (the customer ID), with AFAIK no relationships across that shard key. And once we start horizontally sharding, we can do cool things like putting your data in a shard geographically close to you, which will greatly increase app performance for our non US customers. Exciting stuff coming down the pike!
There could be all kinds of reasons why that is tho (if it is in fact the case).
I thought it was interesting that they upgraded 4 major version numbers in one go. I kept expecting to read something about version compatibility and configuration but was surprised there was none. Are major upgrades like this just less of an issue with Postgres in general?
I've run into version incompatibilities before, but it was my fault – they were expertly documented in the release notes and I just hadn't read them (or sufficiently tested the upgrade before the live performance of it).
Anyway, thanks for food for thought.
If you're using a hosted DB service, you're (probably) stuck in needing/wanting to rehearse using the hosted service (which is what the blog post describes).
If they were running the DB on 'regular server' cloud instances, it seems just as good to me to rehearse with other cloud server instances versus "standalone" servers.
audit events -> queue -> elastic -> blob storage
is so easy to maintain, and we save TBs from living in the DB.

When your audit trail is in the DB, you can pretty easily surface audit events to your customers. Who changed what when is just another feature. Capturing audit events is also usually pretty smooth.
As for the folks doing the blob storage route: you would not BELIEVE the complexity they have to spin up to expose even very simple histories. This matters a LOT in some spaces (financial etc), less so in others.
In my RDBMS model, "who changed this field, when, from what to what" is a basic select. You can even shard by record ID or similar if you want to reduce table scans, and a good selection of indexes can be a huge help as well. In most cases users don't mind a bit of latency on these queries.
We won't have billions of rows though, so once uploaded to secondary storage we'll just clear the blob field and set a "processed" flag.
This way we can find all the relevant keys for a given order, invoice etc quickly based on a partial key search in the database, and transparently fetch from either db directly or secondary storage as needed.
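The pattern described above can be sketched in a few lines: keep the row (and its key metadata) in the DB, but once the blob is uploaded to secondary storage, clear the blob column and set the processed flag; reads check the flag and fetch from whichever side holds the data. Dicts stand in for the DB row and the blob store.

```python
def archive_blob(row, blob_store):
    """Upload the inline blob to secondary storage, then clear it in the DB."""
    blob_store[row["key"]] = row["blob"]
    row["blob"] = None
    row["processed"] = True

def fetch_blob(row, blob_store):
    """Transparently fetch from the DB or secondary storage, per the flag."""
    if row["processed"]:
        return blob_store[row["key"]]  # uploaded and cleared from the DB
    return row["blob"]                 # still inline in the DB
```

The key search stays a cheap DB query either way; only the payload moves to cheaper storage.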
e.g. https://docs.microsoft.com/en-us/azure/azure-sql/database/le...
Then the cloud and DBaaS was invented.
One invaluable thing, though: our application was from the beginning designed to do 100% of all the reads from a read-only slave _if the slave was up to sync_ (which it was 95% of the time). We could also identify testers/developers in the application itself, so we had them using the upgraded slave for two weeks before the actual upgrade.
This made it possible for us to filter out problems in the application/DB layer, which were few, so the major upgrade ended up feeling more like a minor version upgrade.
But upgrading by replication is something I can recommend.
Generally you just upgrade the replicas; then promote a replica to be the new primary; then upgrade the old primary and turn it into a replica.
The actual "upgrade" step is quite fast, since it doesn't actually need to iterate over your tables' row data.
At large scale, the painful part of major-version MySQL upgrades tends to be performance testing, but that's performed separately and prior to the actual upgrade process. Third-party tools (pt-upgrade, proxysql mirroring, etc) help a lot with this.
THERE IS NO CLOUD: It’s just someone else’s computer