You can create a replication slot, take a snapshot, restore the snapshot to a new instance, advance the LSN and replicate from there - boom, you have a logical replica with all the data. Then you upgrade your logical replica.
This article from Instacart shows how to do it: https://archive.ph/K5ZuJ
If I remember correctly, the article has some small errors, but I haven't done this in a while and don't remember exactly what was wrong. In general, though, the process works; I have upgraded TB-sized instances this way several times.
This is a great recipe, but it needs a small but important correction. We need to be careful when plugging pg_upgrade into this physical-to-logical replica conversion process: if we first start logical replication and then run pg_upgrade, there are risks of corruption – see the discussion on pgsql-hackers https://www.postgresql.org/message-id/flat/20230217075433.u5.... To avoid this, we first need to create the logical slot, advance the new cluster to the slot's LSN position (without starting logical replication yet), then run pg_upgrade, and only then start logical replication – when the new cluster is already running on the new PG version.
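A minimal sketch of that ordering, with all slot, publication, and connection names purely illustrative (not from this thread):

```sql
-- On the source (old-version) primary: create the logical slot FIRST.
-- Its LSN marks the point the restored copy must reach before pg_upgrade.
SELECT pg_create_logical_replication_slot('upgrade_slot', 'pgoutput');
CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

-- On the physical copy restored from a snapshot: let recovery advance
-- exactly to the slot's LSN (e.g. via recovery_target_lsn), promote it,
-- and run pg_upgrade -- all WITHOUT starting logical replication yet.

-- Only once the copy is running on the new major version, subscribe using
-- the pre-created slot; the data is already there, so skip the initial copy:
CREATE SUBSCRIPTION upgrade_sub
    CONNECTION 'host=old-primary dbname=app'
    PUBLICATION upgrade_pub
    WITH (create_slot = false, slot_name = 'upgrade_slot', copy_data = false);
```

The key point is that `CREATE SUBSCRIPTION` only runs after pg_upgrade has completed, which is what avoids the corruption risk discussed on pgsql-hackers.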
This is exactly how we (Postgres.ai) recently helped GitLab upgrade multiple multi-TiB clusters under heavy load without any downtime at all (also involving PgBouncer's PAUSE/RESUME). There will be a talk by Alexander Sosna later this week https://www.postgresql.eu/events/pgconfeu2023/schedule/sessi... and there are plans to publish details about it.
I am not sure why I never ran into this problem, unfortunately I don't have access to my notes anymore because I no longer work on this.
This approach has solved so many problems for me. I can do full vacuum, I can change integer columns to bigint, I can do major version upgrades, I can even move instances across AWS regions all with minimal downtime.
It's really great to see that people continue to tinker with it and that there are active discussions on the mailing list to keep improving logical replication. It's come a long way since the first implementation. Thanks for your contribution!
If you resume replication with an incorrect LSN, replication will break immediately. I spent way too much time trying to do this on my own before the blog post was written, and I saw it fail over and over again.
If you want to convince yourself, try the LSN from the "redo starts at" log message. It looks close, but it will always fail.
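For anyone wondering what "resuming with an LSN" looks like mechanically, here is a hedged sketch; the subscription and origin names are illustrative (the origin name is whatever pg_replication_origin shows for your subscription):

```sql
-- Disable the subscription, set its origin to the snapshot's true LSN,
-- then re-enable. A "close enough" LSN (like the one from "redo starts at")
-- will break replication, as noted above.
ALTER SUBSCRIPTION upgrade_sub DISABLE;
SELECT pg_replication_origin_advance(
    'pg_16400',              -- origin name, from pg_replication_origin
    '0/1A2B3C40'::pg_lsn     -- the exact LSN the restored copy corresponds to
);
ALTER SUBSCRIPTION upgrade_sub ENABLE;
```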
https://www.instacart.com/company/how-its-made/zero-downtime...
> Modern customers expect 100% availability.
This is not my preference as a customer, nor has it been my experience as a vendor. For many workloads consistency is much more important than availability. I’m often relieved when I see a vendor announce a downtime window because it suggests they’re being sensible with my data.
Indeed, expecting periodic maintenance windows up-front probably leads to more robust architectures overall: customers build in the failsafes they need to tolerate downtime, which makes the whole system more resilient. Teams that can trust their customers in that way can, in turn, take the time they need to invest in building a better product.
Perhaps this will be the blog post we write after our next major version upgrade: expectation setting around downtime _is_ the way to very high uptime.
In order to make sure that (internal) consumers of the service could handle the downtime, they introduced some downtime artificially.
I have a potentially weird experience path here — I worked with Galera a bunch early on because when we asked customers if they wanted HA they said "yes, absolutely", so we sank a ton of time into absolutely never, ever going down.
When we finally presented the trade-off space (basically that accepting an occasional 10-minute downtime window could all but guarantee we wouldn't have data loss), we ended up building a very different product.
AWS neither provides nor promises 100% availability. AWS will have SLAs on various services with the penalty only being a discount on your bill.
It's _your_ job to make your service resilient to a point where you are comfortable with your mitigations.
Well, you aren't going to get it. It's a myth, like "five nines" and such, based on the idea that businesses can foresee the unforeseen and plan ahead.
Whether a service is distributed or not, at some point some issue will come up and availability is going to stop for a while.
doubt.jpeg
If you have a complex system, you have incidents, and you have downtime. A 15-minute downtime window announced in advance is fine for approximately 100% of SaaS businesses. You're not a hospital and you're not a power station. So much fake work gets done because people think their services are more important than they are. The engineering time invested in this, if invested in the product instead, or in making the rest of your dev team faster, would likely have made your users much happier. Especially if you can queue your notifications up and catch up after the downtime window.
If you have enterprise contracts with SLAs defining payouts for 15-minute downtime windows, then I guess you could justify it, but most people don't. And like I mentioned, you likely already have a handful of incidents of the same or longer duration in practice anyway.
This is especially relevant with database migrations, where the difference in work between a migration with "little downtime" and one with "zero downtime" is usually significant. In this case, seeing as this was a one-time thing (newer versions of PostgreSQL on RDS allow it out of the box), it is especially hard to justify in my opinion, as opposed to something that would be reused across many versions or many databases powering the service.
A key learning for me from this migration was how nice it can be to track and mitigate all of the risks you can think of for a project like this. The risk of an in-place upgrade in the end seemed higher than the risks associated with the route we chose, outage windows notwithstanding.
As a bonus, if we need this approach in the future, this blog post should give us a head start, saving us many weeks of work. We hope it helps other teams in similar situations do the same.
1. You snapshot your RDS database (or use one of the existing ones I hope you have)
2. You restore that snapshot into a database running in parallel without live traffic.
3. You run the test upgrade there and check how long it takes.
4. You destroy the test database and announce a maintenance window for the same duration the test took + buffer.
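The four steps above might look roughly like this with the AWS CLI (instance and snapshot identifiers are made up; treat this as a sketch, not a runbook):

```shell
# 1-2. Restore an existing snapshot into a throwaway instance (identifiers illustrative).
aws rds restore-db-instance-from-db-snapshot \
    --db-instance-identifier upgrade-rehearsal \
    --db-snapshot-identifier prod-snapshot-2023-11-01

# 3. Run the major version upgrade there and time how long it takes.
aws rds modify-db-instance \
    --db-instance-identifier upgrade-rehearsal \
    --engine-version 15.5 \
    --allow-major-version-upgrade \
    --apply-immediately

# 4. Tear the rehearsal instance down afterwards.
aws rds delete-db-instance \
    --db-instance-identifier upgrade-rehearsal \
    --skip-final-snapshot
```

The measured duration plus a buffer becomes the maintenance window you announce.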
I agree it's a good project to exercise some "migration" muscle; it just doesn't seem like the payoff is there when, like I mentioned above, AWS now supports this out of the box since you upgraded to a version compatible with their native zero-downtime approach.
I think the only way this makes sense is if you do it for the blog post and use that to hire and for marketing, signaling your engineering practices and that you care about reliability.
By the way, I realize how I come across, and let me tell you I say this having myself done projects like this where looking back I think we did them more because they were cool than because they made sense. Live and learn.
We had three windows at 1 am where any new critical patients would be diverted to a different hospital. The first we used for major maintenance to the breakers in the switchgear, the second we used for modifications to the bus work, and the last outage was to test the operation of the new control system.
They do a transfer to diesel every month and the whole hospital is aware of it in case it results in a blackout.
Doesn’t epic cover everything from patient admission to medical imaging?
If you have replicas they'll upgrade in parallel and will reboot at random times for even more fun.
So unless you can afford random unavailability in a timeframe which can last several hours (depending on DB size) the logical replication approach is the only way to do upgrades on RDS.
The bigger the instance, the harder the problem.
If Jira is down fifteen minutes a day that rarely affects me. I have other tasks in my work queue that I can substitute. Worst case with multiple outages there’s always documentation I promised someone. But when the entire Atlassian suite goes tits up at the same time, it gets harder for me to keep a buffer of work going. Getting every app in your enterprise using the same storage array is a good way to go from 5% productivity loss to 95%.
Except that there will be competitors who don't have a downtime every month.
And who are thus placing my needs ahead of their own.
Because your outage is my outage as well.
Who said anything about downtime every month? Most companies I know do major DB version upgrades once every 2 years max, often less frequently.
A service with some short and pre-announced downtimes is better than one that fails randomly every once in a while. It's also better than one that runs extremely old versions of their software, with old bugs and vulnerabilities.
You are right that when you 'sell' the downtime to customers you have to tell them what they are getting in return.
What? As a customer, this would piss me off to no end and honestly be a dealbreaker for something like payments or general hosting.
It's pushing dysfunction onto your customers, and if your customers are technically experienced, they'd know it's a completely avoidable problem.
If they're technically experienced, they know every 9 costs exponentially more money, and probably agree that it's a good tradeoff.
AWS RDS Postgres 11.13 -> 15.5
We ended up going with a relatively straightforward approach of unidirectional replication using pglogical. We had some experience doing the same kind of migration from Google Cloud SQL to AWS RDS with zero downtime as well, which made us pretty confident it would work without impacting customers in any visible way.
pglogical makes this kind of migration relatively straightforward. It's not always fast, but that's fine if you're happy to wait a few days while it gradually replicates the full database across to the new instances.
For us it also gave a bit more freedom in changing the storage type and size, which was more difficult with some of the alternative approaches. We had oversized our storage to get more IOPS, so we wanted to change the storage type as well as reduce its size, which meant we couldn't do a simple restore from a snapshot.
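For reference, a bare-bones pglogical setup looks roughly like this (node names and DSNs are illustrative, not from the comment):

```sql
-- On the source instance:
SELECT pglogical.create_node(
    node_name := 'provider',
    dsn := 'host=old-db dbname=app');
SELECT pglogical.replication_set_add_all_tables('default', ARRAY['public']);

-- On the target instance (new storage type/size):
SELECT pglogical.create_node(
    node_name := 'subscriber',
    dsn := 'host=new-db dbname=app');
SELECT pglogical.create_subscription(
    subscription_name := 'migration_sub',
    provider_dsn := 'host=old-db dbname=app');
```

Once the initial copy and catch-up are done, cutover is just repointing the application and dropping the subscription.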
Also, this isn't "zero downtime" -- there's a few seconds down time while service cuts over to the new server.
The article omits details on how consistency was preserved -- you can't just point your application at both servers for some period of time, for example. Possibly you can serve reads from both (but not really), but writes absolutely have to be directed to only one server. The article doesn't mention this.
Lastly, there was no mention of a rollback option -- in my experience performing this kind of one-off forklift on a large amount of data, things sometimes go off the rails late at night. Therefore you always need a plan for how to revert to the previous step, so you can go to bed assured that the service will still be up in the morning. That is specifically hard if you've already sent write transactions to the new server but for some reason need to cut back to the old one: the data is now inconsistent.
> Can't initialize a replica from a backup
You could, but you're not going to get any of the constant writes happening during the backup. You will have missing writes on the restored system without some kind of replication involved unless you move up to the application layer.
For example, you could update your app to apply dual writes. I'm aware of teams that have replatformed entire applications on to completely different DBs that way (e.g. going from an RDBMS to something completely different like Apache Cassandra).
For our situation, dual-writes seemed more risky than just doing the dirty work of setting up streaming replication using out of the box Postgres features. But, for some teams it could be a better move.
> This isn't "zero downtime"
and
> The article omits details on how consistency was preserved
In the post we go into detail about how we preserved consistency & avoided API downtime, but the gist is that the app was connected to both databases, while not using the new one by default. We then sent a signal to all instances of our app to cut over using LaunchDarkly, which maintains a low-latency connection to all instances of our app.
For the first second after that signal, the servers queued up database requests to allow for replication to catch up. This caused a brief spike in latency that was within intentionally calculated tolerances. After that pause, requests flowed as usual but against the new database and the cut over was complete.
We also included a force-disconnect for any pending traffic against the old database, with a 500 ms timeout. This timeout was much higher than our p99 query times, so no running queries were force-terminated. This ensured that the old database's traffic had ceased, and gave replication plenty of time to catch up.
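One way to verify the "replication has caught up" condition during such a pause, assuming out-of-the-box streaming replication (the view and functions are standard Postgres; the tolerance is yours to pick):

```sql
-- Run on the old primary during the cutover pause: how many bytes of WAL
-- has each replica not yet replayed?
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
-- Complete the cutover only once replay_lag_bytes reaches 0
-- (or an acceptable threshold).
```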
> No mention of a rollback option
Although it didn't make the cut for the blog post, we considered setting up a fallback database on PG 11.9 and replicating the 15.3 database into that third database. If we needed to abort, we could roll forward to this database on the same version.
We opted to not do this after practicing our upgrade procedure multiple times in staging to ensure we could do this successfully. Having practiced the procedure multiple times gave us confidence when it came to performing the cut over. We also used canary deployments in production to verify certain read-only workloads against the database, treating the 15.3 instance as a read replica.
To your point about it being late at night, we intentionally did this in the early evening on a weekend to avoid "fat finger" type mistakes. The cut over was carefully scripted and rehearsed to reduce the risk of human error as well.
If we needed to roll back, the system was also prepared to flip back to the old database in the event of a catastrophic failure. This would have led to some data loss against the new database, and we were prepared to reconcile any key pieces of the system in that scenario. To minimize the risk of data loss, we briefly paused certain background tasks during the cutover to reduce the number of writes applied against the system. These details didn't make the blog post, as we were going for specifics about Postgres rather than Knock-specific considerations. Teams trying to apply this playbook will always need to build their own inventory of risks and mitigate them in a context-dependent way.
Edit: More detail about rollback procedure
-- IETF Draft Spec: https://www.ietf.org/archive/id/draft-peabody-dispatch-new-uuid-format-01.html
CREATE SEQUENCE uuidv7_seq MAXVALUE 4095; -- A 12-bit sequence
CREATE OR REPLACE FUNCTION generate_uuidv7()
RETURNS uuid AS $$
DECLARE
unixts bigint;
msec bigint;
seq bigint;
rand bigint;
uuid_hex varchar;
BEGIN
-- Get current UNIX epoch in milliseconds
unixts := (EXTRACT(EPOCH FROM clock_timestamp()) * 1000)::bigint;
-- Extract milliseconds
msec := unixts % 1000; -- Milliseconds
-- Get next value from the sequence for the "monotonic clock sequence counter" value
seq := NEXTVAL('uuidv7_seq');
-- Generate a random 62-bit number
rand := (RANDOM() * 4611686018427387903)::bigint; -- 62-bit random number
-- Construct the UUID
-- First 64 bits: 36-bit seconds | 12-bit msec | version 7 | 12-bit seq.
-- Note unixts holds milliseconds, so divide by 1000 before shifting,
-- otherwise the shift overflows bigint and scrambles the timestamp.
uuid_hex := LPAD(TO_HEX(((unixts / 1000) << 28) + (msec << 16) + (7 << 12) + seq), 16, '0') ||
LPAD(TO_HEX(x'8000000000000000'::bigint | rand), 16, '0'); -- variant '10' in the top two bits
-- Return the UUID
RETURN uuid_hex::uuid;
END;
$$ LANGUAGE plpgsql VOLATILE;
SELECT generate_uuidv7();

It is a great read, but I can't shake the feeling that it's about a bunch of sailors who, instead of going around a huge storm, decided to go through it knowing full well that it could end in tragedy.
Are small upgrades out of the question in this case? As in, "each small one costs us as much downtime as a big one, so we put it off for as long as we could" (they hint at that in the intro, but I might be reading too much into it).
I've relied on this as the minor upgrade method since it was available and it has worked as advertised, with no perceivable issues. This may be traffic and operation dependent obviously but worth having a look at.
Worth saying we do the minor upgrades incrementally, intra-day and a few weeks to a month after they are available, as a matter of routine, with a well documented process. Overhead is minimal to practically zero.
But anyway, good job. Postgres is quite a DBA-unfriendly system (although better than it used to be, it's still pretty bad).
I'd look into it more next time if it weren't for the fact that AWS now supports Blue/Green upgrades on Aurora for our version of Postgres. But, it's an interesting approach for sure.
If anyone here knows how to get LSN numbers after an upgrade/cluster replacement, I would love to hear about it, since it's always painful to get Debezium reconnected when a cluster dies.
Also, it's super tedious work, and mistakes could happen during any step. Lastly, this upgrade is deeply coupled with application logic. I already feel the pain.
Why don't you just use Aurora? Then it's zero-downtime going forward.
1. There was zero downtime - no dropped requests, no 5xx errors. There _was_ a latency spike that was carefully tuned to be within timeout limits for our customers, but we dropped zero requests from the cut over.
2. Yes, it's very tedious, and in its own way painful. We also did a MongoDB upgrade recently and, while we still took the time to verify our workloads on the more recent versions, because Mongo is an AP system, it's trivial to failover to the new version and move on.
That said, the application-level logic changes were not particularly complicated. The script to orchestrate the cutover was application-specific, and I think for migrations like this you have to do the work to get it done right.
I'd also add that the tedium of doing it right, while ideally avoidable, is precisely why customers pay us to handle this complexity on their behalf. Sometimes you've just got to do the work. They want a service that's up all the time. While no one can guarantee that, we strive for it within reason, and even then going to "unreasonable" lengths for a better customer experience is exactly what makes many products unreasonably good.
Stretching the work out and taking each step carefully did avoid critical mistakes. We had a few missteps along the way, and we were able to roll back without critically affecting the service. Doing an in-place upgrade, trying to minimize the time spent on this problem, would have been far riskier than spreading that risk out over the whole process we took. Of course, each team needs to figure out what works for their situation & constraints.
3. We do use Aurora, but our instance was old enough to not be supported for zero-downtime patch upgrades (ZDP) which does not handle major version upgrades. They also recently released blue/green deployments for Aurora Postgres clusters, which may be a way to do what we did without having to resort to as many changes.
(The BEAM is the virtual machine for the Erlang ecosystem, analogous to the JVM for Java. Knock runs on Elixir, which is built on Erlang & the BEAM.)
The strategies in the post should work for any size database. The limit becomes more a matter of individual table sizes, since we propose using an incremental approach to synchronizing one table at a time.
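With built-in logical replication, that table-at-a-time synchronization can be done by growing the publication incrementally (publication, subscription, and table names here are illustrative):

```sql
-- On the source: start with a single large table.
CREATE PUBLICATION upgrade_pub FOR TABLE big_table_1;

-- Once it has copied and caught up, add the next one:
ALTER PUBLICATION upgrade_pub ADD TABLE big_table_2;

-- On the target: pick up newly added tables (this triggers their initial copy):
ALTER SUBSCRIPTION upgrade_sub REFRESH PUBLICATION;
```

This keeps the initial-copy load bounded by the largest single table rather than the whole database.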
https://devcenter.heroku.com/articles/heroku-postgres-follow...
Did the people making these decisions never take Computer Science classes? Even a student taking a data structures module would realize this is a bad idea. There's actually more like two dozen different reasons it's a bad idea.
Using a datastore with a black box query planner that explicitly doesn't allow you to force particular indices (using hints or similar) is a more subtle mistake but will inevitably bite you eventually. Likewise a datastore that uses black-box MVCC and doesn't let you separate e.g. writing data from updating indices.