You can create a replication slot, take a snapshot, restore the snapshot to a new instance, advance the LSN and replicate from there - boom, you have a logical replica with all the data. Then you upgrade your logical replica.
This article from Instacart shows how to do it: https://archive.ph/K5ZuJ
If I remember correctly, the article has some small errors, but I haven't done this in a while and don't remember exactly what was wrong. In general, though, the process works; I have upgraded TB-sized instances this way several times.
This is a great recipe, but it needs a small but important correction. We need to be careful when plugging pg_upgrade into this physical-to-logical replica conversion process: if we first start logical replication and then run pg_upgrade, there are risks of corruption – see the discussion on pgsql-hackers https://www.postgresql.org/message-id/flat/20230217075433.u5.... To avoid this, we first need to create the logical slot, advance the new cluster to the slot's LSN position (without starting logical replication yet), then run pg_upgrade, and only then start logical replication – when the new cluster is already running on the new PG version.
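A minimal sketch of that ordering, with all slot, publication, and connection names purely illustrative (not from this thread):

```sql
-- On the source (old-version) primary: create the logical slot FIRST.
-- Its LSN marks the point the restored copy must reach before pg_upgrade.
SELECT pg_create_logical_replication_slot('upgrade_slot', 'pgoutput');
CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

-- On the physical copy restored from a snapshot: let recovery advance
-- exactly to the slot's LSN (e.g. via recovery_target_lsn), promote it,
-- and run pg_upgrade -- all WITHOUT starting logical replication yet.

-- Only once the copy is running on the new major version, subscribe using
-- the pre-created slot; the data is already there, so skip the initial copy:
CREATE SUBSCRIPTION upgrade_sub
    CONNECTION 'host=old-primary dbname=app'
    PUBLICATION upgrade_pub
    WITH (create_slot = false, slot_name = 'upgrade_slot', copy_data = false);
```

The key point is that `CREATE SUBSCRIPTION` only runs after pg_upgrade has completed, which is what avoids the corruption risk discussed on pgsql-hackers.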
This is exactly how we (Postgres.ai) recently helped GitLab upgrade multiple multi-TiB clusters under heavy load without any downtime at all (also involving PgBouncer's PAUSE/RESUME). There will be a talk by Alexander Sosna later this week https://www.postgresql.eu/events/pgconfeu2023/schedule/sessi... and there are plans to publish details about it.
I am not sure why I never ran into this problem, unfortunately I don't have access to my notes anymore because I no longer work on this.
This approach has solved so many problems for me. I can do full vacuum, I can change integer columns to bigint, I can do major version upgrades, I can even move instances across AWS regions all with minimal downtime.
It's really great to see that people continue to tinker with it and that there are active discussions on the mailing list to keep improving logical replication. It's come a long way since the first implementation. Thanks for your contribution!
If you resume replication with an incorrect LSN, replication will break immediately. I spent way too much time trying to do this on my own before the blog post was written, and I saw it fail over and over again.
If you want to convince yourself, try the LSN from the "redo starts at" log message. It looks close, but it will always fail.
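For anyone wondering what "resuming with an LSN" looks like mechanically, here is a hedged sketch; the subscription and origin names are illustrative (the origin name is whatever pg_replication_origin shows for your subscription):

```sql
-- Disable the subscription, set its origin to the snapshot's true LSN,
-- then re-enable. A "close enough" LSN (like the one from "redo starts at")
-- will break replication, as noted above.
ALTER SUBSCRIPTION upgrade_sub DISABLE;
SELECT pg_replication_origin_advance(
    'pg_16400',              -- origin name, from pg_replication_origin
    '0/1A2B3C40'::pg_lsn     -- the exact LSN the restored copy corresponds to
);
ALTER SUBSCRIPTION upgrade_sub ENABLE;
```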
https://www.instacart.com/company/how-its-made/zero-downtime...
> Modern customers expect 100% availability.
This is not my preference as a customer, nor has it been my experience as a vendor. For many workloads consistency is much more important than availability. I’m often relieved when I see a vendor announce a downtime window because it suggests they’re being sensible with my data.
Indeed, expecting periodic maintenance windows up-front probably leads to more robust architectures overall: customers build in the failsafes they need to tolerate downtime, which makes the whole system more resilient. Teams that can trust their customers in that way can, in turn, take the time they need to invest in building a better product.
Perhaps this will be the blog post we write after our next major version upgrade: expectation setting around downtime _is_ the way to very high uptime.
In order to make sure that (internal) consumers of the service could handle the downtime, they introduced some downtime artificially.
I have a potentially weird experience path here — I worked with Galera a bunch early on because when we asked customers if they wanted HA they said "yes, absolutely", so we sank a ton of time into absolutely never, ever going down.
When we finally presented the trade-off space (basically that accepting an occasional 10-minute downtime window could all but guarantee we wouldn't have data loss), we ended up building a very different product.
AWS neither provides nor promises 100% availability. AWS will have SLAs on various services with the penalty only being a discount on your bill.
It's _your_ job to make your service resilient to a point where you are comfortable with your mitigations.
Well, you aren't going to get it. It's a myth, like "five nines" and such, based on the idea that businesses can foresee the unforeseen and plan ahead.
Whether a service is distributed or not, at some point some issue will come up and availability is going to stop for a while.
doubt.jpeg
If you have a complex system, you have incidents, and you have downtime. A 15-minute downtime window announced in advance is fine for approximately 100% of SaaS businesses. You're not a hospital and you're not a power station. So much fake work gets done because people think their services are more important than they are. The engineering time invested in this, if invested in the product instead, or in making the rest of your dev team faster, would likely have made your users much happier. Especially if you can queue your notifications up and catch up after the downtime window.
If you have enterprise contracts with SLAs defining payouts for 15-minute downtime windows, then I guess you could justify it, but most people don't. And like I mentioned, you likely already have a handful of incidents of the same or longer duration in practice anyway.
This is especially relevant with database migrations, where the difference in work between a migration with "little downtime" and one with "zero downtime" is usually significant. In this case, seeing as this was a one-time thing (newer versions of PostgreSQL on RDS allow it out of the box), it is especially hard to justify in my opinion, as opposed to something that would be reused across many versions or many databases powering the service.
A key learning for me from this migration was how nice it can be to track and mitigate all of the risks you can think of for a project like this. The risk of an in-place upgrade in the end seemed higher than the risks associated with the route we chose, outage windows notwithstanding.
As a bonus, if we need this approach in the future, this blog post should give us a head start, saving us many weeks of work. We hope it helps other teams in similar situations do the same.
1. You snapshot your RDS database (or use one of the existing ones I hope you have)
2. You restore that snapshot into a database running in parallel without live traffic.
3. You run the test upgrade there and check how long it takes.
4. You destroy the test database and announce a maintenance window for the same duration the test took + buffer.
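The four steps above might look roughly like this with the AWS CLI (instance and snapshot identifiers are made up; treat this as a sketch, not a runbook):

```shell
# 1-2. Restore an existing snapshot into a throwaway instance (identifiers illustrative).
aws rds restore-db-instance-from-db-snapshot \
    --db-instance-identifier upgrade-rehearsal \
    --db-snapshot-identifier prod-snapshot-2023-11-01

# 3. Run the major version upgrade there and time how long it takes.
aws rds modify-db-instance \
    --db-instance-identifier upgrade-rehearsal \
    --engine-version 15.5 \
    --allow-major-version-upgrade \
    --apply-immediately

# 4. Tear the rehearsal instance down afterwards.
aws rds delete-db-instance \
    --db-instance-identifier upgrade-rehearsal \
    --skip-final-snapshot
```

The measured duration plus a buffer becomes the maintenance window you announce.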
I agree it's a good project to exercise some "migration" muscle; it just doesn't seem like the payoff is there when, like I mentioned above, AWS now supports this out of the box since you upgraded to a version compatible with their native zero-downtime approach.
I think the only way this makes sense is if you do it for the blog post and use that to hire and for marketing, signaling your engineering practices and that you care about reliability.
By the way, I realize how I come across, and let me tell you I say this having myself done projects like this where looking back I think we did them more because they were cool than because they made sense. Live and learn.
We had three windows at 1 am where any new critical patients would be diverted to a different hospital. The first we used for major maintenance to the breakers in the switchgear, the second we used for modifications to the bus work, and the last outage was to test the operation of the new control system.
They do a transfer to diesel every month and the whole hospital is aware of it in case it results in a blackout.
Doesn’t epic cover everything from patient admission to medical imaging?
If you have replicas they'll upgrade in parallel and will reboot at random times for even more fun.
So unless you can afford random unavailability in a timeframe which can last several hours (depending on DB size) the logical replication approach is the only way to do upgrades on RDS.
The bigger the instance, the harder the problem.
If Jira is down fifteen minutes a day that rarely affects me. I have other tasks in my work queue that I can substitute. Worst case with multiple outages there’s always documentation I promised someone. But when the entire Atlassian suite goes tits up at the same time, it gets harder for me to keep a buffer of work going. Getting every app in your enterprise using the same storage array is a good way to go from 5% productivity loss to 95%.
Except that there will be competitors who don't have a downtime every month.
And who are thus placing my needs ahead of their own.
Because your outage is my outage as well.
Who said anything about downtime every month? Most companies I know do major DB version upgrades once every 2 years max, often less frequently.
A service with some short and pre-announced downtimes is better than one that fails randomly every once in a while. It's also better than one that runs extremely old versions of their software, with old bugs and vulnerabilities.
You are right that when you 'sell' the downtime to customers you have to tell them what they are getting in return.
What? As a customer, this would piss me off to no end and honestly be a dealbreaker for something like payments or general hosting.
It's pushing dysfunction onto your customers, and if your customers are technically experienced, they'd know it's a completely avoidable problem.
If they're technically experienced, they know every 9 costs exponentially more money, and probably agree that it's a good tradeoff.
AWS RDS Postgres 11.13 -> 15.5
We ended up going with a relatively straightforward approach of unidirectional replication using pglogical. We had some experience doing the same kind of migration from Google Cloud SQL to AWS RDS with zero downtime as well, which made us pretty confident it would work without impacting customers in any visible way.
pglogical makes this kind of migration relatively straightforward. It's not always fast, but that's fine if you're happy to wait a few days while it gradually replicates the full database across to the new instances.
For us it also gave a bit more freedom in changing the storage type and size, which was more difficult with some of the alternative approaches. We had oversized our storage to get more IOPS, so we wanted to change the storage type as well as reduce its size, which meant we couldn't do a simple restore from a snapshot.
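For reference, a bare-bones pglogical setup looks roughly like this (node names and DSNs are illustrative, not from the comment):

```sql
-- On the source instance:
SELECT pglogical.create_node(
    node_name := 'provider',
    dsn := 'host=old-db dbname=app');
SELECT pglogical.replication_set_add_all_tables('default', ARRAY['public']);

-- On the target instance (new storage type/size):
SELECT pglogical.create_node(
    node_name := 'subscriber',
    dsn := 'host=new-db dbname=app');
SELECT pglogical.create_subscription(
    subscription_name := 'migration_sub',
    provider_dsn := 'host=old-db dbname=app');
```

Once the initial copy and catch-up are done, cutover is just repointing the application and dropping the subscription.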
Also, this isn't "zero downtime" -- there's a few seconds down time while service cuts over to the new server.
The article omits details on how consistency was preserved -- you can't just point your application at both servers for some period of time, for example. Possibly you can serve reads from both (but not really), but writes absolutely have to be directed to only one server. The article doesn't mention this.
Lastly, there was no mention of a rollback option -- in my experience performing this kind of one-off forklift on a large amount of data, things sometimes go off the rails late at night. Therefore you always need a plan for how to revert to the previous step, so you can go to bed assured that the service will still be up in the morning. That is specifically hard if you've already sent write transactions to the new server but for some reason need to cut back to the old one: the data is now inconsistent.
> Can't initialize a replica from a backup
You could, but you're not going to get any of the constant writes happening during the backup. You will have missing writes on the restored system without some kind of replication involved unless you move up to the application layer.
For example, you could update your app to apply dual writes. I'm aware of teams that have replatformed entire applications on to completely different DBs that way (e.g. going from an RDBMS to something completely different like Apache Cassandra).
For our situation, dual-writes seemed more risky than just doing the dirty work of setting up streaming replication using out of the box Postgres features. But, for some teams it could be a better move.
> This isn't "zero downtime"
and
> The article omits details on how consistency was preserved
In the post we go into detail about how we preserved consistency & avoided API downtime, but the gist is that the app was connected to both databases, while not using the new one by default. We then sent a signal to all instances of our app to cut over using LaunchDarkly, which maintains a low-latency connection to all instances of our app.
For the first second after that signal, the servers queued up database requests to allow for replication to catch up. This caused a brief spike in latency that was within intentionally calculated tolerances. After that pause, requests flowed as usual but against the new database and the cut over was complete.
We also included a force-disconnect for any pending traffic against the old database, with a 500 ms timeout. This timeout was much higher than our p99 query times, so no running queries were force-terminated. This ensured that the old database's traffic had ceased, and gave replication plenty of time to catch up.
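One way to verify the "replication has caught up" condition during such a pause, assuming out-of-the-box streaming replication (the view and functions are standard Postgres; the tolerance is yours to pick):

```sql
-- Run on the old primary during the cutover pause: how many bytes of WAL
-- has each replica not yet replayed?
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
-- Complete the cutover only once replay_lag_bytes reaches 0
-- (or an acceptable threshold).
```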
> No mention of a rollback option
Although it didn't make the cut for the blog post, we considered setting up a fallback database on PG 11.9 and replicating the 15.3 database into that third database. If we needed to abort, we could roll forward to this database on the same version.
We opted to not do this after practicing our upgrade procedure multiple times in staging to ensure we could do this successfully. Having practiced the procedure multiple times gave us confidence when it came to performing the cut over. We also used canary deployments in production to verify certain read-only workloads against the database, treating the 15.3 instance as a read replica.
To your point about it being late at night, we intentionally did this in the early evening on a weekend to avoid "fat finger" type mistakes. The cut over was carefully scripted and rehearsed to reduce the risk of human error as well.
If we needed to roll back, the system was also prepared to flip back to the old database in the event of a catastrophic failure. This would have led to some data loss against the new database, and we were prepared to reconcile any key pieces of the system in that scenario. To minimize the risk of data loss, we briefly paused certain background tasks during the cutover to reduce the number of writes applied against the system. These details didn't make the blog post, as we were going for specifics about Postgres rather than Knock-specific considerations. Teams trying to apply this playbook will always need to build their own inventory of risks and mitigate them in a context-dependent way.
Edit: More detail about rollback procedure
-- IETF Draft Spec: https://www.ietf.org/archive/id/draft-peabody-dispatch-new-uuid-format-01.html
CREATE SEQUENCE uuidv7_seq MAXVALUE 4095; -- A 12-bit sequence
CREATE OR REPLACE FUNCTION generate_uuidv7()
RETURNS uuid AS $$
DECLARE
unixts bigint;
msec bigint;
seq bigint;
rand bigint;
uuid_hex varchar;
BEGIN
-- Get current UNIX epoch in milliseconds
unixts := (EXTRACT(EPOCH FROM clock_timestamp()) * 1000)::bigint;
-- Extract milliseconds
msec := unixts % 1000; -- Milliseconds
-- Get next value from the sequence for the "monotonic clock sequence counter" value
seq := NEXTVAL('uuidv7_seq');
-- Generate a random 62-bit number
rand := (RANDOM() * 4611686018427387903)::bigint; -- 62-bit random number
-- Construct the UUID
-- First 64 bits: 36-bit seconds | 12-bit msec | version 7 | 12-bit seq.
-- Note unixts holds milliseconds, so divide by 1000 before shifting,
-- otherwise the shift overflows bigint and scrambles the timestamp.
uuid_hex := LPAD(TO_HEX(((unixts / 1000) << 28) + (msec << 16) + (7 << 12) + seq), 16, '0') ||
LPAD(TO_HEX(x'8000000000000000'::bigint | rand), 16, '0'); -- variant '10' in the top two bits
-- Return the UUID
RETURN uuid_hex::uuid;
END;
$$ LANGUAGE plpgsql VOLATILE;
SELECT generate_uuidv7();

It is a great read, but I can't shake the feeling that it's about a bunch of sailors who, instead of going around a huge storm, decided to go through it knowing full well that it could end in tragedy.
Are small upgrades out of the question in this case? As in, "each small one costs us as much downtime as a big one, so we put it off for as long as we could" (they hint at that in the intro, but I might be reading too much into it).
I've relied on this as the minor upgrade method since it was available and it has worked as advertised, with no perceivable issues. This may be traffic and operation dependent obviously but worth having a look at.
Worth saying we do the minor upgrades incrementally, intra-day and a few weeks to a month after they are available, as a matter of routine, with a well documented process. Overhead is minimal to practically zero.
But anyway, good job. Postgres is quite a DBA-unfriendly system (although better than it used to be, it's still pretty bad).
I'd look into it more next time if it weren't for the fact that AWS now supports Blue/Green upgrades on Aurora for our version of Postgres. But, it's an interesting approach for sure.
If anyone here knows how to get LSN numbers after an upgrade/cluster replacement, I would love to hear about it, since it's always painful to get Debezium reconnected when a cluster dies.
Also, it's super tedious work, and mistakes could happen during any step. Lastly, this upgrade is deeply coupled with application logic. I already feel the pain.
Why don't you just use Aurora? Then it's zero-downtime going forward.
1. There was zero downtime - no dropped requests, no 5xx errors. There _was_ a latency spike that was carefully tuned to be within timeout limits for our customers, but we dropped zero requests from the cut over.
2. Yes, it's very tedious, and in its own way painful. We also did a MongoDB upgrade recently and, while we still took the time to verify our workloads on the more recent versions, because Mongo is an AP system, it's trivial to failover to the new version and move on.
That said, the application-level logic changes were not particularly complicated. The script to orchestrate the cutover was application-specific, and I think for migrations like this you have to do the work to get it done right.
I'd also add that the tedium of doing it right, while ideally avoidable, is precisely why customers pay us to handle this complexity on their behalf. Sometimes you've just got to do the work. They want a service that's up all the time. While no one can guarantee that, we strive for it within reason, and even then going to "unreasonable" lengths for a better customer experience is exactly what makes many products unreasonably good.
Stretching the work out and taking each step carefully did avoid critical mistakes. We had a few missteps along the way, and we were able to roll back without critically affecting the service. Doing an in-place upgrade, trying to minimize the time spent on this problem, would have been far riskier than spreading that risk out over the whole process we took. Of course, each team needs to figure out what works for their situation & constraints.
3. We do use Aurora, but our instance was old enough to not be supported for zero-downtime patch upgrades (ZDP) which does not handle major version upgrades. They also recently released blue/green deployments for Aurora Postgres clusters, which may be a way to do what we did without having to resort to as many changes.
(The BEAM is the virtual machine for the Erlang ecosystem, analogous to the JVM for Java. Knock runs on Elixir, which is built on Erlang & the BEAM.)
The strategies in the post should work for any size database. The limit becomes more a matter of individual table sizes, since we propose using an incremental approach to synchronizing one table at a time.
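With built-in logical replication, that table-at-a-time synchronization can be done by growing the publication incrementally (publication, subscription, and table names here are illustrative):

```sql
-- On the source: start with a single large table.
CREATE PUBLICATION upgrade_pub FOR TABLE big_table_1;

-- Once it has copied and caught up, add the next one:
ALTER PUBLICATION upgrade_pub ADD TABLE big_table_2;

-- On the target: pick up newly added tables (this triggers their initial copy):
ALTER SUBSCRIPTION upgrade_sub REFRESH PUBLICATION;
```

This keeps the initial-copy load bounded by the largest single table rather than the whole database.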
https://devcenter.heroku.com/articles/heroku-postgres-follow...
Did the people making these decisions never take Computer Science classes? Even a student taking a data structures module would realize this is a bad idea. There's actually more like two dozen different reasons it's a bad idea.
Using a datastore with a black box query planner that explicitly doesn't allow you to force particular indices (using hints or similar) is a more subtle mistake but will inevitably bite you eventually. Likewise a datastore that uses black-box MVCC and doesn't let you separate e.g. writing data from updating indices.