The big names started using NoSQL-type stuff because their instances got 2-3 orders of magnitude larger and a single relational database no longer worked. All the denormalization adds a lot of other overhead and problems, but if you literally have multi-PB metadata stores, it's not like you have a choice.
Then everyone started copying them without knowing why... and then everyone forgot how much you can actually do with a normal database.
And hardware has been getting better and cheaper, which only makes that more true.
Still not a good idea to store multi-PB metadata stores in a single DB though.
People tend to have a very bad sense of what constitutes large scale. It usually maps to "larger than the largest thing I've personally seen". So they hear "Use X instead of Y when operating at scale", and all of a sudden we have people implementing a distributed datastore for a few MB of data.
Having gone downward in scale over the last few years of my career, it has been eye-opening how many people tell me X won't work due to "our scale", when I can point out that I've already used X in prior jobs at a scale much larger than what we have.
There’s nothing inherently wrong with a big DB instance. The cloud providers have fantastic automation around multi-AZ masters, read replicas, and failover. They even do cross-region or cross-account replication.
That being said, multi-PB would raise an eyebrow.
Of course these systems have warm standbys, dedicated backup infrastructure and so it’s not really a “single machine”; but I’ve seen 80TiB Postgres instances back in 2011.
The biggest issue we have with these giant DBs is that they require pretty massive amounts of RAM. That's currently our main bottleneck.
But I agree. While our design is pretty bad in a few ways, the amount of data we're able to serve from these big DBs is impressive. We have something like 6 dedicated servers for a company with something like 300 apps. A handful of them hit dedicated DBs.
Were I to redesign the system, I'd have more tiny dedicated DBs per app to avoid a lot of the noisy-neighbor/scaling problems we've had. But at the same time, it's impressive how far this design has gotten us, and it appears to have a lot more legs in it.
When I come back to web 1.0 apps, I’m often surprised that a round-trip to the server takes less than 200ms and reloads the page seamlessly, including a full 5ms SQL query returning 5k rows in the page (a full 1MB of data, with basically no JS).
The other consideration that isn't initially obvious, is how you may hit an upper bound for resources in most modern environments. If your database is sitting on top of a virtual or containerized environment, your single instance database will be limited in resources (CPU/memory/network) to a single node of the cluster. You could also eventually hit the same problem on bare metal.
That said there are some very high density systems available. You may also not need the ability to scale as large as I am talking, or choose to shard and scale your database horizontally at later time.
If your project gets big enough you might also start wanting to replicate your data to localize it closer to the user. Another strategy might be to cache the data locally to the user.
There are positives and negatives to a single node versus a cluster. If Retool's database had been clustered, though, they would have been able to do a rolling upgrade.
You can scale quite far vertically and avoid all the clustering headaches for a long time these days. With EPYCs you can get 128C/256T, 128PCIe lanes (= 32 4x NVMes = ~half a petabyte of low-latency storage, minus whatever you need for your network cards), 4TB of RAM in a single machine. Of course that'll cost you an arm and a leg and maybe a kidney too, but so would renting the equivalent in the cloud.
Not saying a single giant server won't work, but it does come with its own set of very difficult-to-solve-once-you-build-it problems.
Sure, that's not the same thing as pure zero-downtime but for many applications it's OK to put the entire thing into read-only mode for a few minutes at a well selected time of day.
While it's in read-only mode (so no writes are being accepted) you can spin up a brand new DB server, upgrade it, finish copying data across - do all kinds of big changes. Then you switch read-only mode back off again when you're finished.
I've even worked with a team that used this trick to migrate between two data centers without visible end-user downtime.
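The read-only trick above can be implemented at the application layer. Here's a minimal sketch as a stdlib-only WSGI middleware; the `READ_ONLY` flag and the middleware itself are illustrative names, not anything from the original post.

```python
import threading

# Hypothetical application-level read-only switch: while the flag is set,
# mutating HTTP methods get a 503 and reads pass through untouched.
READ_ONLY = threading.Event()
MUTATING = {"POST", "PUT", "PATCH", "DELETE"}

def read_only_middleware(app):
    def wrapped(environ, start_response):
        if READ_ONLY.is_set() and environ["REQUEST_METHOD"] in MUTATING:
            start_response(
                "503 Service Unavailable",
                [("Content-Type", "text/plain"), ("Retry-After", "300")],
            )
            return [b"Read-only maintenance window in progress"]
        return app(environ, start_response)
    return wrapped
```

While the flag is set, you can spin up the new DB server, finish copying data across, and then clear the flag once the cutover is done.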
A trick I've always wanted to try for smaller changes is the ability to "pause" traffic at a load balancer - effectively to have a 5 second period where each incoming HTTP request appears to take 5 seconds longer to return, but actually it's being held by the load balancer until some underlying upgrade has completed.
Depends how much you can get done in 5 seconds though!
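The "pause at the load balancer" idea boils down to a gate that requests must pass before being forwarded; holding the gate makes in-flight requests wait instead of fail. A toy sketch, with all names illustrative:

```python
import threading

class TrafficGate:
    """Requests block at the gate while it's paused, then flow again."""

    def __init__(self):
        self._open = threading.Event()
        self._open.set()  # traffic flows by default

    def pause(self):
        self._open.clear()

    def resume(self):
        self._open.set()

    def forward(self, request, backend, timeout=30.0):
        # Block until the gate reopens (or give up after `timeout`),
        # then hand the request to the backend as usual.
        if not self._open.wait(timeout):
            raise TimeoutError("gate held too long")
        return backend(request)
```

In a real deployment this role is played by the load balancer itself (haproxy and others can hold connections); the sketch just shows the sequencing.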
For PostgreSQL or MySQL/MariaDB, your options are more limited. Here are two that come to mind; there may be more:
# The "Dual Writer" approach
1. Spin up a new database cluster on the new version.
2. Get all your data into it (including dual writes to both the old and new version).
3. Once you're confident that the new version is 100% up to date, switch to using it as your primary database.
4. Shut down the old cluster.
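The dual-write phase of step 2 can be sketched as follows. The "databases" here are stand-in dicts; in reality they would be two connection pools, and the old DB stays authoritative until cutover.

```python
class DualWriter:
    """Write to both databases; read only from the old (authoritative) one."""

    def __init__(self, old_db, new_db):
        self.old_db = old_db
        self.new_db = new_db

    def write(self, key, value):
        # The old DB write must succeed; if the new DB write fails we log
        # and continue, since the backfill/verification step catches gaps.
        self.old_db[key] = value
        try:
            self.new_db[key] = value
        except Exception as exc:
            print(f"new-db write failed for {key}: {exc}")

    def read(self, key):
        return self.old_db[key]  # reads stay on the primary until cutover
```

Cutover in step 3 then amounts to swapping which side `read()` points at once the new cluster is verified.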
# The eventually consistent approach
1. Put a queue in front of each service for writes, where each service of your system has its own database.
2. When you need to upgrade the database, stop consuming from the queue, upgrade in place (bringing the DB down temporarily), and resume consumption once things are back online.
3. No service can directly read from another service's database. Eventually consistent caches/projections serve reads during normal operation and during the upgrade.
A system like this is more flexible, but suffers from stale reads or temporary service degradation.
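Step 2 of this approach hinges on a consumer you can pause: writes buffer in the queue while the DB is down, and drain once you resume. A minimal sketch with `queue.Queue`; a real system would use Kafka, SQS, or similar:

```python
import queue
import threading

class PausableConsumer:
    """Drains a queue into a handler, but only while not paused."""

    def __init__(self, q, handler):
        self.q = q
        self.handler = handler
        self.running = threading.Event()
        self.running.set()

    def pause(self):
        self.running.clear()

    def resume(self):
        self.running.set()

    def drain_once(self):
        # Process everything currently queued, respecting the pause flag.
        while self.running.is_set():
            try:
                item = self.q.get_nowait()
            except queue.Empty:
                break
            self.handler(item)
```

While paused, producers keep enqueueing; nothing is lost, it's just delayed, which is exactly the stale-reads trade-off described above.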
1. Route all traffic through pgbouncer in transaction pooling mode.
2. Logically replicate from old to new.
For failover:
1. Ensure replication is not far behind.
2. Issue a PAUSE on pgbouncer.
3. Wait for replication to be fully caught up.
4. Update pgbouncer config to point to the new database.
5. Issue a RELOAD on pgbouncer.
6. Issue a RESUME on pgbouncer.
Zero downtime; < 2s additional latency for in-flight queries at time of op is possible (and I've done it at scale).
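The cutover sequence above can be expressed as a small driver, with the pgbouncer admin commands and replication-lag check stubbed out so the ordering is visible. `send_admin()` and `replication_lag_bytes()` are hypothetical stand-ins; in practice you'd run PAUSE/RELOAD/RESUME against pgbouncer's admin console (e.g. via psql) and read lag from `pg_stat_replication`.

```python
import time

def cutover(send_admin, replication_lag_bytes, rewrite_config,
            lag_threshold=1024, timeout=30.0):
    # 1. Don't pause until replication is nearly caught up.
    deadline = time.monotonic() + timeout
    while replication_lag_bytes() > lag_threshold:
        if time.monotonic() > deadline:
            raise RuntimeError("replication never caught up; aborting")
        time.sleep(0.01)
    send_admin("PAUSE")                   # 2. hold new queries at pgbouncer
    while replication_lag_bytes() > 0:    # 3. wait for full catch-up
        time.sleep(0.01)
    rewrite_config()                      # 4. point pgbouncer at new primary
    send_admin("RELOAD")                  # 5. pick up the new config
    send_admin("RESUME")                  # 6. release the held queries
```

The pre-check in step 1 is what keeps the PAUSE window (and thus the added query latency) down to a couple of seconds.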
Doing an active no-downtime database migration basically involves having a coherent row-level merge policy (assuming you AT LEAST have a per-row last-updated column), or other tricks. Or maybe you temporarily write cell-level timestamps and then drop them later.
Or if you have data that expires on a window, you just do double-writes for that period and then switch over.
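Under the per-row last-updated assumption, the merge policy is just last-write-wins. A minimal sketch, with rows as dicts keyed by ID (purely illustrative; `updated_at` is the assumed column):

```python
def merge_rows(old_rows, new_rows):
    """Merge two snapshots of the same table; newest updated_at wins."""
    merged = dict(old_rows)
    for row_id, row in new_rows.items():
        existing = merged.get(row_id)
        if existing is None or row["updated_at"] >= existing["updated_at"]:
            merged[row_id] = row
    return merged
```

Cell-level timestamps generalize this to a per-column merge at the cost of temporarily fattening every row.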
I'm pretty sure that was the whole idea of RDBMS: to separate application from data. You lose badly the moment some of your data is in a different place (transactions, query planning, security, etc. all suffer), so Codd thought "what if even different applications could use a single company-wide database?" Hence the "have everything in a single database" part should be the last compromise you're forced to make, not the first one.
The attitude “our company = scale and scale = nosql” is prevalent enough that even if you know better, it’s probably in your interest to play the game. It’s the one “scalability fact” everyone knows, and a shortcut to sounding smart in front of management when you can’t grasp or haven’t taken the time to dig in on the details.
NewSQL technology promises to make this more automated, which is definitely a good thing, but unless you are Google or have a use case that needs it, it probably isn't worth adopting until it's more mature.
Stuff like “CRUD enterprise app. 1 large-ish Postgres node. 10k tenants. 100 tables with lots of foreign-key lookups, 100 GB on disk. DB is… kinda slow, and final web requests take ~1 sec.”
The toughest thing is knowing what's normal for multi-tenant data with lots of relational info (compared to larger, more popular companies, which tend to have relatively simple data models).
a lot of "kids these days" don't seem to learn that
by that I mean young folks born into this new world with endless cloud services and scaling-means-Google propaganda
a single modern server-class machine is essentially a supercomputer by 80s standards and too many folks are confused about just how much it can achieve if the software is written correctly
> We reasoned this was likely safe, as Retool’s product logic performs its own consistency checks, and also doesn’t delete from the referenced tables, meaning it was unlikely we’d be left with a dangling reference.
I was holding my breath here and I'm glad these were eventually turned back on.
Nobody should ever rely on their own product logic to ensure consistency of the database.
The database has features (constraints, transactions, etc) for this purpose which are guaranteed to work correctly and atomically in all situations such as database initiated rollbacks that your application will never have control over.
I've built some very high-throughput Postgres-backed systems over the years, and doing application-side foreign key constraints (FKCs) does have its benefits. Doing this client side results in constraints that are usually, but not always, in sync with the data. However, this kind of almost-consistency lets you achieve much higher throughput. An FKC is a read on every write, for example, and does limit write throughput. Of course, this isn't OK for some workloads, and then you do proper FKCs in the DB, but if you don't need absolute consistency, you can make writes far cheaper.
Ideally, everything that needs to change when a row is deleted would be changed automatically and atomically using database-side constraints and triggers. In practice, applications often need to sync state with external services that the database knows nothing about, so I understand your concerns.
ON DELETE RESTRICT, on the other hand, will result in errors just like any other query error that you can handle in your application. Nothing happened, so there's nothing to be notified of.
I’ve seen certain large-table queries improve by 10x when foreign keys were added.
Happy to stick around and answer any questions you have.
As mentioned downthread, we're doing some work soon to remove the audit table from the database, which will cut storage usage by over 50%.
It's written in Python, spins up a queue.Queue object, populates it with ranges of rows that need to be copied (based on min < ID < max ranges), starts up a bunch of Python threads and then each of those threads uses os.system() to run this:
psql "{source_url}" -c "COPY (SELECT * FROM ...) TO STDOUT" \
| psql "{dest_url}" -c "COPY {table_name} FROM STDIN"
This feels really smart to me. The Python GIL won't be a factor here. A similar pipeline works for compressed exports: psql -c "..." | pigz -c > file.tsv.gz

Our database is less than 1TB. One master (for writes + some reads) + multiple slaves (read-only).
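A reconstruction of the threaded copy script described upthread might look like this: chunk the table by ID range, queue the chunks, and let worker threads shell out to psql so the GIL never matters. `source_url`, `dest_url`, and `table_name` are placeholders, and the exact chunking is an assumption.

```python
import os
import queue
import threading

def make_ranges(min_id, max_id, chunk):
    """Split [min_id, max_id] into half-open (lo, hi) ranges of size `chunk`."""
    return [(lo, min(lo + chunk, max_id + 1))
            for lo in range(min_id, max_id + 1, chunk)]

def copy_table(source_url, dest_url, table_name, min_id, max_id,
               chunk=100_000, workers=8):
    jobs = queue.Queue()
    for lo, hi in make_ranges(min_id, max_id, chunk):
        jobs.put((lo, hi))

    def worker():
        while True:
            try:
                lo, hi = jobs.get_nowait()
            except queue.Empty:
                return
            # The actual copying happens in psql subprocesses, so the
            # Python threads spend their time blocked on I/O, not the GIL.
            os.system(
                f'psql "{source_url}" -c "COPY (SELECT * FROM {table_name} '
                f'WHERE id >= {lo} AND id < {hi}) TO STDOUT" '
                f'| psql "{dest_url}" -c "COPY {table_name} FROM STDIN"'
            )

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Each worker streams one ID range at a time, so restartability is as simple as re-queueing the ranges that didn't finish.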
Here's what we did -
1. Launched a new read-only DB with pg9.6; let's call it DB_A.
2. Stopped all offline tasks, and only maintained a minimal fleet of online servers (e.g., web, api...).
3. Changed all DB hosts (whether master or slave) in /etc/hosts on that minimal fleet of online servers to point at the old read-only DB with pg9.6; let's call it DB_B. From this point on, all write ops should fail.
4. Ran pg_upgrade (with --link) on DB_A to upgrade to pg11, and promoted it to be a master DB.
5. Changed /etc/hosts on the minimal fleet of online servers to use DB_A for all DB hosts. By this point, DB_A is a master DB, and write ops should be good again.
6. Changed /etc/hosts for all other servers and brought back all services.
Step 4 is the most critical. If it had failed or run too long (e.g., more than 10 minutes), we would have rolled back by changing /etc/hosts on those online servers.
We carefully rehearsed these steps for an entire week, and timed each step. By the time we did it on production, we knew how many seconds/minutes each step would take. And we tried to automate as many things as possible in bash scripts.
We missed one or two pieces of reconnecting things afterwards, and some of that seems to be limitations of Heroku Postgres that we couldn’t change. Hopefully those keep improving.
https://cloud.google.com/blog/products/databases/cloud-sql-l...
The biggest problem with all the cloud providers is that you won't know exactly when this 10-minute downtime window will start.
I guess the only advantage here is that you don't have to do 9->10->11->12->13->14 like in the past, and maybe that was one of the blockers Azure had. AWS allows skipping some major versions, but 9->14 is not possible.
The benefit of an in-place upgrade for Postgres is that you don't have to spin up another server, restore from backup, and run pg_upgrade yourself.
On top of that, Azure Postgres (limited to pg11) has essentially been deprecated in favor of their v2 Flexible tier, with no official migration path.
If you want to pay multi-thousands dollars a month for a database server, it's WAY cheaper just to slap a server with a ton of drives in colocation.
Yeah, this migration might have been somewhat painful, but it looks like it's a situation the company experiences every few years.
An expense can be applied in full that year, whereas capital expenses vest over years, so only a portion can be applied each year.
=> select pg_size_pretty(pg_database_size ('**'));
pg_size_pretty
----------------
99 TB
(1 row)
(The re-analyze of tables took a day or so though)
It's really just a handful of core people who did most of the work, crafting it so thoughtfully over the years and it has such a huge impact on the world.
Doing a huge part of it before PostgreSQL was as popular as it is today, spending countless hours on making some great design choices and implementing them carefully.
It seems unlikely any of them will read that but I'm so deeply grateful to these people. It allowed so many things to flourish on top thanks to it being open source and free.
Does Postgres still make sense for append-only tables, or would Elasticsearch or another kind of database be more suitable?
We actually migrated from vanilla Postgres to Aurora that way with minimal risk / downtime, it was a really smooth process.
On the other hand, self-built databases maintained by professionals were A LOT cheaper.
We are talking about bills of millions of dollars per month just for databases.
The self-hosted solution cost around 300k per month.
This included precompiled kernel patches, tweaks to the Postgres engine, etc.
Overall, the investment of a year of DBA work will probably return itself ten times over, if not more.
I only wish my Oracle updates were so simple and bug-free.
The idea that you don't get a seamless upgrade of the database itself with PostgreSQL is absurd to me. The part about maximizing the amount of time this upgrade buys is only necessary because of how difficult PostgreSQL makes upgrades. We upgrade to every new version of SQL Server. It's not that big of a deal.
With every PostgreSQL blog article I read, I become more and more of an SQL Server fanboy. At this point it's full-blown. So many "serious business" PostgreSQL ops posts are just nothingburgers in the SQL Server world.
- physical restore to new cluster
- pg_upgrade the new cluster
- catch up on logical wal logs from old cluster
- failover to new cluster
- STONITH
I think the above was not open to them because of the limitations of their managed PG instance. I haven't used Azure, but GCP managed SQL has loads of limitations. It seems very common, and I think it's a major (and undiscussed) drawback of these managed instances.
But the truth is that very few of the people who use PG want to hear that things are better in the MS SQL community for reasons of prejudice and as a result you're being downvoted unfairly for pointing out PGs relative backwardness here.
Here's a nice thing about PostgreSQL over SQL Server to satiate the fans: SQL Server is absurdly expensive to run in the cloud. I can't believe anyone uses RDS for SQL Server. Even in EC2 it's horrifically expensive. That's the main reason we have an on-prem cluster.
Keep in mind that they’re upgrading from a database version that’s almost 6 years old. Postgres has improved a lot in the last 5 major versions since then.
Another thing here is that I’m pretty sure they could have just done the in-place upgrade and it would have been fine. I’ve run pg_upgrade myself for a bunch of major versions now and it’s easy and doesn’t require dumping / restoring anything. Maybe there’s something else going on that I’m missing though.
What setup are you running with sql server to have it automatically failover? Is it a multi master configuration or are the additional nodes just read replicas?
These days Postgres actually allows logical replication so your servers can be running different versions at the same time, which allows for much smoother upgrades (haven’t tried that myself yet, don’t quote me on it!)
SQL Server is designed to run in a Windows Server Failover Cluster; the SQL Server-side feature is called "Always On availability groups" in an HA configuration. It's a single-master arrangement, you can either have a fully passive secondary (that's what we do) or read-only replicas. The WSFC handles managing quorum, that's what causes the automatic failover as soon as >50% of the nodes are running the new version.
EDIT: I realized you're looking for the other PostgreSQL blog posts. Here's an example of two recent HN posts about PostgreSQL issues that I pulled out of my comment history. Both of these blog posts exist because PostgreSQL doesn't have query hints. SQL Server has them; I've dealt with issues like these blog posts describe but they have trivial fixes in the SQL Server world. Nothing to write a blog post about. I don't have a link handy regarding PostgreSQL's txn id wraparound problem, but SQL Server doesn't have that problem, either.
Very small companies might be able to do this on a 'no load' day, but from a pure business perspective, running your DB twice is easier and way less risky.
You could have done this even without downtime by letting your connection proxy handle the switch.
A non-distributed RDBMS is a great (yet underrated) choice. Thank you for the good writeup.
I was thinking you could have a much less delicate migration experience next time (some years from now), so you could go for a quicker parallel "dump and restore" migration.
For example:
- Client side sharding in the application layer: you could shard your customers' data across N smaller DB instances (consistent hashing on customer ID)
- Moving the append-only data somewhere else than postgres prior to the upgrade. You don't need RDBMS capabilities for that stuff anyway. Look at Elasticsearch, Clickhouse, or any DB oriented to time series data.
WDYT?
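The first bullet could be sketched with a classic hash ring: each customer ID maps to a point on the ring, and the next shard clockwise owns it, so adding a shard only moves roughly 1/N of the customers. All names here are illustrative.

```python
import bisect
import hashlib

class HashRing:
    """Consistent hashing of customer IDs onto DB shards, with vnodes."""

    def __init__(self, shards, vnodes=64):
        self._ring = []  # sorted list of (point, shard)
        for shard in shards:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{shard}:{i}"), shard))
        self._ring.sort()
        self._points = [p for p, _ in self._ring]

    @staticmethod
    def _hash(key):
        # Any stable hash works; md5 is just convenient and deterministic.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def shard_for(self, customer_id):
        point = self._hash(str(customer_id))
        idx = bisect.bisect(self._points, point) % len(self._ring)
        return self._ring[idx][1]
```

Since the mapping is deterministic, the application layer can route every query without a lookup table, and geo-placement (the point made in the reply below about non-US customers) is just a matter of which physical instance backs each logical shard.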
The first bullet point is on our radar for the near term. We have a very natural shard key in our schema (the customer ID), with AFAIK no relationships across that shard key. And once we start horizontally sharding, we can do cool things like putting your data in a shard geographically close to you, which will greatly increase app performance for our non US customers. Exciting stuff coming down the pike!
There could be all kinds of reasons why that is tho (if it is in fact the case).
I thought it was interesting that they upgraded 4 major version numbers in one go. I kept expecting to read something about version compatibility and configuration but was surprised there was none. Are major upgrades like this just less of an issue with Postgres in general?
I've run into version incompatibilities before, but it was my fault – they were expertly documented in the release notes and I just hadn't read them (or sufficiently tested the upgrade before the live performance of it).
Anyway, thanks for food for thought.
If you're using a hosted DB service, you're (probably) stuck in needing/wanting to rehearse using the hosted service (which is what the blog post describes).
If they were running the DB on 'regular server' cloud instances, it seems just as good to me to rehearse with other cloud server instances versus "standalone" servers.
audit events -> queue -> elastic -> blob storage
is so easy to maintain, and we save TBs from living in the DB.

When your audit trail is in the DB, you can pretty easily surface audit events to your customers. Who changed what when is just another feature. Capturing audit events is also usually pretty smooth.
As for the folks doing the blob storage route: you would not BELIEVE the complexity they have to spin up to expose even very simple histories. This matters a LOT in some spaces (financial etc), less so in others.
In my RDBMS model, "who changed this field, when, from what to what" is a basic select. You can even shard by record ID or similar if you want to reduce table scans, and a good selection of indexes can be a huge help as well. In most cases users don't mind a bit of latency on these queries.
We won't have billions of rows though, so once uploaded to secondary storage we'll just clear the blob field and set a "processed" flag.
This way we can find all the relevant keys for a given order, invoice etc quickly based on a partial key search in the database, and transparently fetch from either db directly or secondary storage as needed.
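The pattern described above can be sketched in a few lines: keep the row (and its key metadata) in the DB, but once the blob is uploaded to secondary storage, clear the blob column and set the processed flag; reads check the flag and fetch from whichever side holds the data. Dicts stand in for the DB row and the blob store.

```python
def archive_blob(row, blob_store):
    """Upload the inline blob to secondary storage, then clear it in the DB."""
    blob_store[row["key"]] = row["blob"]
    row["blob"] = None
    row["processed"] = True

def fetch_blob(row, blob_store):
    """Transparently fetch from the DB or secondary storage, per the flag."""
    if row["processed"]:
        return blob_store[row["key"]]  # uploaded and cleared from the DB
    return row["blob"]                 # still inline in the DB
```

The key search stays a cheap DB query either way; only the payload moves to cheaper storage.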
e.g. https://docs.microsoft.com/en-us/azure/azure-sql/database/le...
Then the cloud and DBaaS was invented.
One invaluable thing, though: our application was from the beginning designed to do 100% of all the reads from a read-only slave _if the slave was up to sync_ (which it was 95% of the time). We could also identify testers/developers in the application itself, so we had them using the upgraded slave for two weeks before the actual upgrade.
This made it possible for us to filter out problems in the application/DB layer, which were few, so the major upgrade ended up feeling more like a minor version upgrade.
But upgrading by replication is something I can recommend.
Generally you just upgrade the replicas; then promote a replica to be the new primary; then upgrade the old primary and turn it into a replica.
The actual "upgrade" step is quite fast, since it doesn't actually need to iterate over your tables' row data.
At large scale, the painful part of major-version MySQL upgrades tends to be performance testing, but that's performed separately and prior to the actual upgrade process. Third-party tools (pt-upgrade, proxysql mirroring, etc) help a lot with this.
THERE IS NO CLOUD: It’s just someone else’s computer