PostgreSQL HA cluster failure: a post-mortem (opens in new tab)

(gocardless.com)

322 pointsstevedomin8y ago117 comments

117 comments

80 comments · 18 top-level

foobarbazetc8y ago· 16 in thread

I’ve been involved in various Postgres roles (developer on it, consultant for it for BigCo, using it at my startup now) for around 18 years.

I’ve never in that time seen a pacemaker/corosync/etc/etc configuration go well. Ever. I have seen corrupted DBs, fail overs for no reason, etc. The worst things always happen when the failover doesn’t go according to plan and someone accidentally nukes the DB at 2am.

The lesson I’ve taken from this is it’s better to have 15-20 minutes of downtime in the unlikely event a primary goes down and run a manual failover/takeover script then it is to rely on automation. pgbouncer makes this easy enough.

That said, there was a lot of bad luck involved in this incident.

scurvy8y ago

Our PostgreSQL failover plan is very pedestrian, but it works well (even across datacenters). We run streaming replication off of the primary to a pair of replicas, with one replica in another datacenter. The primary write DB advertises a loopback IP out OSPF into a top-of-rack switch, where it's aggregated by BGP and distributed throughout our network. There's a health check script [0] running every 3 seconds that makes sure PG is happy and that it is still writeable.

If we want to failover (nothing automatic), we stop the primary (or it's already dead) and the route is withdrawn. An operator touches the recovery file on new primary, the health checker sees that, and the IP is announced back into the network. Yes, it's a "VIP", but it's one controlled by our operations team, not automation software. One nice things about this is that you can failover across datacenters (remember it's advertised into our network over BGP) without reconfiguring DNS or messing with application servers.

While the mechanisms are different, we do something very similar for MySQL with MHA. It's still an operator running scripts intentionally though (which is what we want).

I will definitely agree with you that manual operator intervention is better than automated failover.

[0] https://github.com/unixsurfer/anycast_healthchecker

bringtheaction8y ago

I'd love to read more about this.

If you have the time and motivation to write a detailed post that would be much welcome by me and many others as well I am sure.

1 more reply

scurvy8y ago

Pacemaker/Corosync is a good tool if you know exactly how it works. However, very few people know exactly how it works. This is a very old problem for that team if you check the mailing lists. If they offered a few "blessed" configurations or even an online config generator, I'm sure that people would have nicer things to say about it.

I've used it heartbeat/pacemaker/corosync with DRBD for over a decade and I'm pretty pleased with it...now. But it took a bit of trial by fire to get it right. Luckily I never lost any data and found the issues in testing. Which gets at the heart of any failover mechanism -- it's all useless unless you test it on a regular basis (just like your backups).

thijsvandien8y ago

I'd generalize this to many kinds of failover setups, DB or non-DB. Practically each and every one I've seen is in itself less reliable than the thing it's covering for.

pmuk8y ago

It amazes me that in 2017 HA for RDBs is still so brittle and complicated. I had hoped this problem would have been solved by now!

1 more reply

gaius8y ago

I’ve never in that time seen a pacemaker/corosync/etc/etc configuration go well. Ever.

That mirrors my experience - PG, DRDB, anything else, this form of clustering is a complete joke. If it's important, it's worth spending the money on a grown-solution. Veritas has its flaws sure but in 20+ years on half a dozen different OS's it's never let me down when the balloon went up.

liveoneggs8y ago

your findings are supported by Baron Schwartz as well (https://www.xaprb.com/blog/2012/09/17/is-automated-failover-... and others)

fipar8y ago

I think it's relevant that Baron's post mentions a Github incident, and Github has been doing automatic failover of MySQL masters for a while now: https://githubengineering.com/orchestrator-github/

This post is also very relevant to the topic: http://code.openark.org/blog/mysql/mysql-high-availability-t...

The post you link is still relevant to this discussion, but we can't ignore that 5 years have passed. It would be good to see 2017's Baron revisit the topic :)

morrbo8y ago

How does pgbouncer make this process easy? Just because there are less connections to go to the final DB? I've also got a random question you might be able to answer (having a hard time googling)... when using pgsql in streaming replication mode, are created functions replicated/updated to everything else as well?

(just learning about postgres and saw an opportunity to ask someone in the know) Cheers!

jeltz8y ago

One reason is that some applications do not properly support reconnecting to the database if the connection is lost. With pgbouncer that is not an issue. Another is to avoid having floating IPs or updating the DNS.

1 more reply

fake-name8y ago

I'd assume it's because you can just reconfigure to point pgbouncer to redirect everything to the secondary, rather then having to update all the applications using pgbouncer. It centralizes the configuration of which database is active.

1 more reply

snuxoll8y ago

PgBouncer makes it easy to do a controlled failover without having to deal with floating IP's on your postgresql cluster.

pjungwir8y ago

> are created functions replicated/updated to everything else as well?

Yes.

Note if those functions have an implementation compiled from C, you do need to install the .so on the standbys though.

2 more replies

merb8y ago

why not just use https://github.com/zalando/patroni

takeda8y ago

My understanding is that the problem is not really with pacemaker/corosync. Those tools also are always consistent as ZK/etcd/Consul. There is also SONITH to make sure the node that goes down can't cause damage once it is back.

The problem is not these tools, but implementing what is the right thing to do during an outage or even properly detecting one (what happened with github). Your solution might work 99 cases out of 100 but that remaining 1 case might cause your data loss.

When there is a human required to do the switch it typically he/she can investigate what happened and make the right decision.

It's theoretically possible to have a foolproof solution that always works right, but that's extremely hard to implement, because you need to know in advance what kind of issues you will have, and if you miss something, that's one case where your tool might make a wrong decision.

1 more reply

cromantin8y ago

Patrons is great. Running dockerised Postgres with consul backend for years without a hitch. Haproxy as lb. What that? A replica need reboot. Just reboot. Primary? Just failover to replica and reboot. Undesired reboots recovers in under 10 seconds. During which just primary is not available but replicas are.

echelon8y ago· 8 in thread

I'm told that MySQL replication blows Postgres out of the water by my company's data team, but they could just be biased since that is their area of expertise. I work on server code and don't really have much familiarity with the operations of running replica chains.

Postgres seems like a better choice for personal projects since it has a lot of nifty features. I'm also wary of Oracle, but that's my own attitude talking. For a startup eventually wanting to scale, would the better choice be to use MySQL out of the gates? Am I being mislead about Postgres clusters and availability?

Serious (naive) question; not wanting to start a flame war.

sitharus8y ago

MySQL has had replication for _longer_, but I would the MySQL team has a history of releasing half-baked functionality with a long list of gotchas. PostgreSQL has only had replication in the core product for the past couple of versions, previously you had to rely on third party tools which had some interesting behaviours, or that were entirely commercial.

Since PostgreSQL 10 WAL and logical replication strategies can support just about any sort of replication you desire, except multi-master.

YorickPeterse8y ago

PostgreSQL's replication itself is perfectly fine, albeit a bit bare-bones (e.g. it's just replication plus the ability to trigger a failover). For something like cluster management and automated failovers you'll need e.g. https://repmgr.org/.

benmmurphy8y ago

I think for a while MySQL had a much better replication story but that has changed now with logical replication now in postgresql. What most concerns me about MySql is how the configuration can shoot you in the foot. At my previous company the slave replica for a database was writable and I'm not even sure how this is a valid configuration :/ and of course someone ended up running a query on the slave and corrupting its data.

kakwa_8y ago

I found configuring an HA setup easier to do with MySQL. The ability to configure a master <-> master setup simply is really helpful. Of course, having writes on both sides is generally a bad idea, but it's quite simple to have master - hot standby setup, and not go through the slave promotion step.

An HA setup can be a master <-> master setup with a VIP using VRRP and a check script ( keepalived). Of course, you have to remain cautious about network partition. Another thing that might be interesting in some use cases is the ability to skip some replication errors. This is specially interesting in cases where consistency is not critical.

I actually did some setup like that with a replication ring (4 full masters), and an additional daemon re-configuring the ring dynamically when a node was down. It also monitored the transaction log, trimming them if they were about to fill up the disk and setting the GTIDs to the new values. I added some skip error to not block replication. However it was for a very simple DB (session DB containing just one table, but an SQL db was required). Basically I switched from a CA DB to an AP DB, and it's nice to be able to do these kind of things.

I know those are too simplistic setups not taking into account all the failure modes. But it also make them easier to understand and to debug.

dijit8y ago

Having ran MySQL in prod for a decade and PostgreSQL in prod for half a decade I can say without doubt that your data team is telling fibs.

Firstly we consider that there are multiple replication possibilities of both technologies- however I'm going to assume the defaults because that's pretty much what everyone uses except if there's an actual case for using something else. It's the exception.

But by default MySQL uses statement based replication (in a weird binary format with log positions and stuff) and postgresql does logical replication (as in, you transmit the binary differences of what you'll be doing to the replica's database files directly and the replica just follows along)

Both of these approaches have trade-offs depending on what you want.

Statement based replication is great if you want to have _different_ datasets on each side, You can transform the data or remove huge chunks of it on a slave and use it for a dedicated purpose. However that applies the other way, you can never really be 100% sure that your replica looks exactly like your master.

this bit me a few times with MySQL when I assumed that because the replica was 'up to date' with the master and it was set to read only, that the data had integrity- it absolutely did not.

antoncohen8y ago

I don't think the claim of MySQL replication being better is related to statement vs. row vs. binary diff. I think it is about the tooling and community knowledge about replication, and about running MySQL in large scale production environments in general.

MySQL is run more often at extremely large scale (Facebook, YouTube, Twitter, Dropbox, etc.) than Postgres. That results in very battle tested and/or featureful tooling like orchestrator (https://github.com/github/orchestrator), MHA (https://github.com/yoshinorim/mha4mysql-manager), ProxySQL (http://www.proxysql.com/), and gh-ost (https://github.com/github/gh-ost), along knowledge and best practices shared by those organization.

1 more reply

deberon8y ago

Having also supported hundreds of production MySQL databases, statement based replication is absolutely inferior. But it should also be noted that row based replication (similar to streaming replication in that the actual data changes are synced) has been supported in MySQL since v5.1.5 (current is v5.7). And row based replication is the default since v5.7.7 https://dev.mysql.com/doc/refman/5.7/en/replication-formats....

1 more reply

chousuke8y ago

Did you mean "physical replication"? Logical replication corresponds to MySQL's model, I think, whereas WAL streaming is just copying over the changed bytes as they get written to the WAL on the master database.

I like Postgres's physical replication for its straightforwardness. It's pretty easy to tell if your replica is up to date unless something really weird is going on. (undetected data corruption?).

That said, PostgreSQL doesn't really make replication appear easy, so I can understand people thinking that even a basic master-slave setup is difficult (In my experience its behaviour is much easier to understand than with MySQL). However, MySQL is ahead in multi-master user friendliness, and setting up eg. a simple galera cluster is pretty easy.

Whether an "easy" multi-master galera set up is actually production-quality is another matter entirely, but it is not difficult to get up and running.

1 more reply

tomc19858y ago· 7 in thread

Makes me sad that running your own instances is now an "elephant in the room." No pride in old-school do-it-yourself nerditry these days :/

rosser8y ago

Getting HA right is hard. DIY-ing it incurs risk, possibly deliberately, out of Not-Invented-Here-ism.

Source: PostgreSQL DBA for over a decade; have built multiple HA environments; have seen many ways of "doing it wrong", and how those can end up biting their creators.

nh28y ago

On the other hand:

With hosted Postgres, when a failure does happen, isn't it much harder to get at the log files? They seem extremely useful to diagnose the problem and make sure it doesn't happen again, as the article shows. What's your experiene here, can you get at logs easily with hosted Posgres offerings?

And it seems the only way to get reliable Postgres HA for everyone, and to weed out the bugs, is if more people run Posgres HA themselves. For example, I find Stolon and Patroni great, but I would be more relaxed about them if they had 100x more users.

1 more reply

majidazimi8y ago

It's not hard. The problem is operation team rarely exercise failures. Configure a test HA cluster in lab, and test it. If it works, push it to production. The production system, should be continuously tested with real failures to see whether fail over mechanisms actually taking over the responsibility or not.

OF COURSE every thing is going to work in the lab, BUT MAY BE there is some other corner case in the production that you haven't considered yet. --- Louis C.K. after a sleepless night of switching primary DB to secondaries.

derefr8y ago

It's still "respectable" to run your own n=1 Postgres instance, maybe with WAL-E backup. It's sensible, as well, to create your own read-replicas to scale OLAP; and even to do your own shared-nothing sharding across regions. These are all "set and forget" enough that they can be the responsibility of your regular devops.

But, when you get to the point where you need multi-master replication, you're making a mistake if you aren't dedicating an ops person (i.e. a DBA) to managing your growing cluster. If you can't afford that ops person, much better to just pay a DBaaS provider to handle the ops for you, than to get hosed when your cluster falls apart and your week gets shot putting it back together.

Sinjo8y ago

Right!

A thing that scares me is anyone saying they're running their own HA cluster (not single instance) for cost reasons. Infra people are not cheaper than the hosted solutions (Amazon RDS, Google Cloud SQL, Heroku Postgres).

2 more replies

illumin88y ago

I love nerding out over this personally, but if you're a startup, given the plethora of well managed offerings, you're frankly foolish to invest resources on this. Even if you eventually reach the point where it makes financial sense to hire a full time ops person or DBA, the opportunity cost of having a smart engineer (and it does take a smart engineer to manage a multi-master database) work on infrastructure instead of your actual product, is just stupid.

How many startups have failed because they spent too much money building "cool, nerdy, infrastructure" instead of just building a product?

cookiecaper8y ago

The danger here is differentiating between "the infrastructure" and "the product". Useful database and/or infrastructure work _is_ "building the product".

There's nothing necessarily wrong with using pre-baked or hosted components when they fit the bill, but pretending like they're unrelated concerns is going down a bad road. A lot of recent fads are based on this self-centered, lazy fantasy from devs that $LANGUAGE_OF_THE_MONTH is the only thing that matters and it's a dark, sad situation.

There's a pretty consistent inverse relationship between technical quality and popularity because time and money spent on technical/engineering resources is time and money not spent on marketing and sales resources that bring cash in the door.

Ever wonder why, with a few exceptions, it never seems that the products everyone knows about are comparable to what you can find after a little bit of research online? This is why. The people who are building good stuff are spending the time and resources on building good stuff, whereas the people who aren't are spending the time and resources on making sure they're the path of least resistance.

So in that sense, yes, you are right. It is dumb to spend any time or money on anything other than the bare minimum skeleton needed to allow your sales people to start pimping your stuff.

Whether or not people recognize your product's superiority is more or less irrelevant, because first, they won't, and because second, the extra effort it takes to swim upstream and use your product instead of the mainstream solution won't really be worth the gains for most people no matter how much better it is. You can probably rattle 15 examples of software off the top of your head that is just like this. PostgreSQL is actually a great example of it.

Amazon has run amok feeding people who don't really deserve the title "developer" a load of crap about how you can click buttons in their wizards and be like a super-real grown-up coder-hero without any having to learn any of that outdated command line mumbo-jumbo. It's 2017 after all! Don't worry about that gobbligook hocus-pocus that the smelly old man in the network closet keeps muttering under his breath. That's for smelly old people and third-party Amazon contractors. You have Very Important JavaScript to write, just as soon as you finish dragging Legos--err--"Mega-Elastic Dynamo-tastic Sumerian-Beanstalkinator Units" around on AWS.

How have other professions handled this issue? After all, most people wouldn't know the difference between a safe bridge and an unsafe one, and most people wouldn't know the difference between safely-prepared food and unsafely-prepared food (until they've already eaten it). The profit incentive is to put the bare minimum in place and then sell sell sell.

We may not like the heavy hand of regulation that will clamp down on the software industry, permanently and officially gate it behind the blood-sucking ivory tower of the academic priesthood, and strip it of all vitality and creativity, but with the attitudes that have become prevalent over the last few years, we have no one to blame but ourselves.

2 more replies

dboreham8y ago· 6 in thread

This is why you should be extremely wary of anything that is only run once in a a blue moon. And very wary of such things that when run, are being run to save your bacon.

Klathmon8y ago

I always advocate for monthly "pull the plug on the box" tests.

If you don't need "high availability", then it will test your backups and restore process, and if you do need "high availability", it will ensure your failover processes are running smoothly.

Not to mention it trains everyone involved what to do in an emergency since it should be second nature by the time it really happens.

If you can't go "Full netflix" and unleash a chaos monkey on your servers, at least setup a maintenance period where downtime is somewhat expected, and do it then.

seanwilson8y ago

> This is why you should be extremely wary of anything that is only run once in a a blue moon.

I find this similar to when you launch a project that hasn't been used in production yet. Bugs should be expected because it hasn't been battle tested.

rosser8y ago

"Your backups are only as good as their last successful restore."

CodeWriter238y ago

https://feedbin.com/blog/2017/05/05/testing-huge-postgresql-...

quicklyfrozen8y ago

And wary of using options that might be outside typical usage (such as -INF for the backup VIP).

king_phil8y ago

I don't see how using infinity in a pacemaker config would be outside typical usage

kodablah8y ago· 6 in thread

I too am coming up on a need for no-downtime HA failover for Postgres. I too am not allowed to use a hosted PaaS-ish solution like RDS. I was considering Citus's multi master impl (I don't need to spread the load, just need HA). I had not considered Pacemaker. Has GoCardless investigated this option and have any insight to give? HA has traditionally been a real pain point for traditional RDBMS's in my experience.

craigkerstiens8y ago

Craig from Citus here. Unfortunately, at this time Citus isn't really focused solely on solving the HA implementation for single node Postgres. Rather, we're focused on when you need to scale for performance issues. Our multi-master setup is targeting use cases that need higher throughput of 500,000+ single wrote inserts per second or say higher than 5 million writes per second when using ingestion with Postgres \copy.

Sinjo8y ago

To be honest we've not looked into Citus in any depth.

My early impression of it (can't speak for the rest of the team) was that it was mostly aimed at sharding analytics workloads, but parts of the docs (e.g. https://docs.citusdata.com/en/v7.1/admin_guide/cluster_manag...) make it sound like it handles OLTP workloads too.

Maybe I've been ignoring it for bad reasons!

EDIT: Managing Postgres clusters is something that a lot of people are working on. Thought I'd mention two projects that have me excited right now:

  - Patroni https://github.com/zalando/patroni
  - Stolon https://github.com/sorintlab/stolon

Stolon's client proxy approach in particular looks interesting, and reminds me of how people are using Envoy (https://github.com/envoyproxy/envoy), albeit as a TCP proxy rather than one that understands and can do fun stuff with the database's protocol. I wonder if we'll start to see more Envoy filters for different databases!

craigkerstiens8y ago

Craig from Citus here. Since we grew transactional support a couple of years ago and a number of the features we've supported since then much of our traction has come from those outgrowing single node Postgres and needing more performance. So in short we're very much focused on handling and supporting OLTP workloads.

We do also support some analytics workloads, less so data warehousing, when there is a need for end user facing analytics where higher concurrency and real-time responsiveness is key.

clarkdave8y ago

We’ve been using Patroni in production and it has been great. We use it with consul & pgbouncer and it can failover in under a minute with a small number of dropped requests (mostly bound by how many clients your pgbouncer can hold at once while the new master gets going). Controlled failover for upgrades or maintenance can be as quick as 10 seconds.

mbrynard8y ago

When deciding to go with Patroni, did you have a look at CruncyDB? We're deciding between the two and kubernetes support on CrunchyDB and documentation seems to be more comprehensive.

1 more reply

mdekkers8y ago

Look at HAProxy, Patroni and Zookeeper

ahoka8y ago· 5 in thread

"The RAID controller logged the simultaneous loss of 3 disks from the array. All subsequent read and write operations against it failed."

People seem to forget that adding a RAID controller creates a single point of failure instead of removing one. :-)

luke00168y ago

> People seem to forget that adding a RAID controller creates a single point of failure instead of removing one. :-)

At worst it does both. In most cases it really does just remove a single point of failure (a disk). Other non-RAID configurations likely use a shared controller too. Moving that single point of failure to a different controller doesn't make it any worse.

tensor8y ago

I suppose it depends on which is more reliable, HDDs or the controller. If the likelihood of a HHD failure is much higher than a controller failure, then it still makes sense to go with RAID.

majidazimi8y ago

Always use software RAID. I'd rather loose 30% on efficiency than creating a single point of failure with a random hardware. Software RAID on modern OSes (Linux, FreeBSD, ... ) are pretty darn reliable and fast.

fb038y ago

And if your OS breaks you can always boot from another one and recover your cluster.

Now try that with an obscure Adaptec card which has been EOL for the last 4 years :-)

king_phil8y ago

And the SATA/SAS controller of your Mainboard is invincible? Or the Mainboard itself? Or the CPU? What's your backup solution if they are faulty?

1 more reply

jskrablin8y ago· 3 in thread

Pacemaker is known to wreak havoc if it gets angry. The usual path to quick recovery when the cluster goes crazy like this is to make really sure what's the most up to date replica, shut down Pacemaker completely, assign VIP manually to a healthy replica and promote it manually. Then once you're up and back in the business figure out how to rebuild the cluster.

barkingcat8y ago

If this is indeed true, doesn't this negate the purpose of pacemaker to begin with? It's like anti-software. When you run with it in your environment, to recover from a failure (which seems to me what HA software should be about) you have to turn it off first or else it will destroy your recovery attempts.

It's like a perverse version of chaos-monkey, except you want it to destroy you when you are most vulnerable.

jskrablin8y ago

It's great when it works as expected. When it doesn't... then the fun begins. I've found it quite fragile, components versions sensitive, configuration sensitive, etc. Most of the time I've seen Pacemaker gone crazy Pg itself was happy to cooperate once the Pacemaker was out of the way. The unknown/weird Pacemaker failure modes were a real (and scary) problem.

I guess the lesson here is not to rely entirely on some HA black magic and always have procedures in place for the 'HA black magic failed us' moments. And team trained to deal with situation like this. It's only software so it will break sooner or later.

hinkley8y ago

Unfortunately appropriate project name? (A maulfunctioning pacemaker can kill you)

sandGorgon8y ago· 2 in thread

>Fortunately, as part of some unrelated work we'd done recently, we had a version of the cluster that we could run inside Docker containers. We used it to help us build a script that mimicked the failures we saw in production. Being able to rapidly turn clusters up and down let us iterate on that script quickly, until we found a combination of events that broke the cluster in just the right way.

this is the coolest part of this story. Any chance these scripts are opensource ?

Sinjo8y ago

We plan to open source it as soon as we can. Tiny bit more work to do, then review from a couple of people in the team, then we make it public. :)

sandGorgon8y ago

Thanks for this !

devit8y ago· 2 in thread

I think the root issue is that PostgreSQL does not offer an HA solution that works out of the box with minimal configuration, resulting in people using broken third-party ones and/or configuring them incorrectly.

They should either provide one or "bless" an external solution as the official one (after making sure it works correctly).

The other problem is that GoCardless setup an asynchronous and a synchronous replica instead of 2 synchronous replicas (or preferably 4+), resulting in only two points of failure, which is not enough.

Sinjo8y ago

Pretty much number 1 on my Postgres wishlist.

tracker18y ago

I cannot upvote this enough... it's the single most glaring feature missing from PostgreSQL at this point. I know it's where all the support companies are making their money, but it really keeps the majority of SMBs from adopting postgres. As much as I hate mySQL.

notyourday8y ago· 2 in thread

Stop pretending that there's a magic bullet called "multi-master" and "transparent promotion". Your apps are super simple. Their DB interactions are super simple. Learn how to do federations and all these problems will go away.

rosser8y ago

It's curious that you decry some kinds of "magic bullet" in favor of another.

How about "there are no magic bullets, full stop"?

vemv8y ago

Interesting. Recommended reads?

linker30008y ago· 1 in thread

Since I am investigating HA with PostgreSQL right now and have bitter experience of Pacemaker 'HA' instances that have been anything but, I am looking at Amazon Aurora and Microsoft's (in preview) Azure database for PostgreSQL offerings. I would really appreciate any insight from others who are already using them (we intend to do some PoC work shortly).

Our dev team also came up with some pertinent questions, which we have put to both companies, but if anyone else can comment from experience that would be fantastic:

* Is the product a fork of PostgreSQL or a wrapper round the current version?

* Will the DB engine keep in lock-step with new PostgreSQL releases or might they diverge?

* If the DB engine keeps in lock-step, what’s the period between a new version of PostgreSQL being released before its incorporated in the live product?

* When new versions of Amazon Aurora/Azure DB for PostgreSQL are released will our live instance get automatically updated or will we be able to choose a version?

lattepiu8y ago

I can comment on Amazon Aurora:

> Is the product a fork of PostgreSQL or a wrapper round the current version?

Aurora is a fork: they've re-written a significant chunk of the engine. Note that Amazon also offers RDS PostgreSQL, which is a managed version of the "regular" PostgreSQL engine. RDS PostgreSQL also offers a HA setup (no version upgrade without downtime, however). It works quite well.

> Will the DB engine keep in lock-step with new PostgreSQL releases or might they diverge?

Amazon promises to keep it in lock-step. How soon they will release an upgrade to a major version remains to be seen.

> When new versions of Amazon Aurora/Azure DB for PostgreSQL are released will our live instance get automatically updated or will we be able to choose a version?

Minor version upgrades are applied automatically. For major version upgrades, it's unclear at this time (there hasn't been one yet for Aurora PostgreSQL), but I think it's unlikely they will be applied automatically.

testplzignore8y ago· 1 in thread

Good write-up. I'm curious about two more things:

1. What caused the crash on the synchronous replica? Was it just a coincidence and completely unrelated to the primary failure?

2. Given the three conditions necessary for the cluster to break, was the behavior of the Pacemaker software expected? I.e., was this a gotcha that should be in the Pacemaker documentation, or a bug?

Sinjo8y ago

1. Unfortunately the logs don't give any detail there. Most likely something arrived down the replication connection that the process couldn't handle, and it crashed.

2. Our understanding now is that INF is the strongest preference, whereas -INF is a veto. It would be very cool to have this confirmed 100% by someone who works on Pacemaker!

qaq8y ago· 1 in thread

There are specialized tools like Patroni (https://github.com/zalando/patroni)

emXdem8y ago

I like Stolon: https://github.com/sorintlab/stolon

Erwin8y ago· 1 in thread

If you are running HA in AWS RDS, how would you compare your experience with the above? What are the types of RDS failures modes that you have experienced?

So far I've discovered that TCP keepalives are quite important, otherwise your queries may hang forever after failover (or at least for the default timeout which is like 30 minutes). The connection does not get broken otherwise by the failover.

pjungwir8y ago

Here are a couple gotchas I've seen on RDS:

- If you are running a MultiAZ instance, it is supposed to fail over automatically, but if the problem is in the networking, then you can still effectively lose service. One way around that is to run a read replica in another AZ, and use a Route53 entry with a health check to send traffic to the read replica if the primary isn't reachable. You'll still need to promote the read replica to a master though.

- If you restore from a snapshot, the new EBS volume only pulls blocks of data from S3 as they are requested. So these reads are a lot slower than normal. If you have a large database you could have degraded performance for days. Here is some more info about this: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-resto...

derekmhewitt8y ago· 1 in thread

I'm seeing the term "HA" being used a lot in the comments here, what does it mean?

inimino8y ago

high availability

cavisne8y ago

The end of this post mortem was a bit handwavy TBH. I feel like they didnt dig deep enough, and the problem was the backup VIP, not the two processes crashing at once and the backup VIP.

I think by still allowing the backup VIP to run on the sync replica the same mistake is being repeated, there will always be the possibility of a situation where the VIP cannot be moved when promotion is required. That replica should be doing nothing but sitting there waiting to save the day, and if they want the backup VIP to be highly available they should provision 2 async replicas.

ncmncm8y ago

I am astonished that, in the two years, you had not already handled 100+ scheduled failovers. If your HA is good, customers don't notice, and if not, you find out when there are fewer of them (and in daytime!), and fix it.

Probably by now Pacemaker would have been abandoned. A hundred drills would have been enough to flush out these behaviors. If you are afraid to run drills on production equipment, you should be running them on a full-scale production testbed, ideally with mirrored production traffic. With a production-scale testbed, two years is enough to run thousands of risk-free failovers.

Not doing frequent production failure drills is just irresponsible.

emXdem8y ago

Stolon with etcd or consul is a far superior solution for HA postgres.

j / k navigate · click thread line to collapse

117 comments

80 comments · 18 top-level

foobarbazetc8y ago· 16 in thread

I’ve been involved in various Postgres roles (developer on it, consultant for it for BigCo, using it at my startup now) for around 18 years.

That said, there was a lot of bad luck involved in this incident.

scurvy8y ago

While the mechanisms are different, we do something very similar for MySQL with MHA. It's still an operator running scripts intentionally though (which is what we want).

I will definitely agree with you that manual operator intervention is better than automated failover.

[0] https://github.com/unixsurfer/anycast_healthchecker

bringtheaction8y ago

I'd love to read more about this.

If you have the time and motivation to write a detailed post that would be much welcome by me and many others as well I am sure.

1 more reply

scurvy8y ago

thijsvandien8y ago

I'd generalize this to many kinds of failover setups, DB or non-DB. Practically each and every one I've seen is in itself less reliable than the thing it's covering for.

pmuk8y ago

It amazes me that in 2017 HA for RDBs is still so brittle and complicated. I had hoped this problem would have been solved by now!

1 more reply

gaius8y ago

I’ve never in that time seen a pacemaker/corosync/etc/etc configuration go well. Ever.

liveoneggs8y ago

your findings are supported by Baron Schwartz as well (https://www.xaprb.com/blog/2012/09/17/is-automated-failover-... and others)

fipar8y ago

I think it's relevant that Baron's post mentions a Github incident, and Github has been doing automatic failover of MySQL masters for a while now: https://githubengineering.com/orchestrator-github/

This post is also very relevant to the topic: http://code.openark.org/blog/mysql/mysql-high-availability-t...

The post you link is still relevant to this discussion, but we can't ignore that 5 years have passed. It would be good to see 2017's Baron revisit the topic :)

morrbo8y ago

(just learning about postgres and saw an opportunity to ask someone in the know) Cheers!

jeltz8y ago

1 more reply

fake-name8y ago

1 more reply

snuxoll8y ago

PgBouncer makes it easy to do a controlled failover without having to deal with floating IP's on your postgresql cluster.

pjungwir8y ago

> are created functions replicated/updated to everything else as well?

Yes.

Note if those functions have an implementation compiled from C, you do need to install the .so on the standbys though.

2 more replies

merb8y ago

why not just use https://github.com/zalando/patroni

takeda8y ago

When there is a human required to do the switch it typically he/she can investigate what happened and make the right decision.

1 more reply

cromantin8y ago

echelon8y ago· 8 in thread

Serious (naive) question; not wanting to start a flame war.

sitharus8y ago

Since PostgreSQL 10 WAL and logical replication strategies can support just about any sort of replication you desire, except multi-master.

YorickPeterse8y ago

benmmurphy8y ago

kakwa_8y ago

I know those are too simplistic setups not taking into account all the failure modes. But it also make them easier to understand and to debug.

dijit8y ago

Having ran MySQL in prod for a decade and PostgreSQL in prod for half a decade I can say without doubt that your data team is telling fibs.

Both of these approaches have trade-offs depending on what you want.

this bit me a few times with MySQL when I assumed that because the replica was 'up to date' with the master and it was set to read only, that the data had integrity- it absolutely did not.

antoncohen8y ago

1 more reply

deberon8y ago

1 more reply

chousuke8y ago

I like Postgres's physical replication for its straightforwardness. It's pretty easy to tell if your replica is up to date unless something really weird is going on. (undetected data corruption?).

Whether an "easy" multi-master galera set up is actually production-quality is another matter entirely, but it is not difficult to get up and running.

1 more reply

tomc19858y ago· 7 in thread

Makes me sad that running your own instances is now an "elephant in the room." No pride in old-school do-it-yourself nerditry these days :/

rosser8y ago

Getting HA right is hard. DIY-ing it incurs risk, possibly deliberately, out of Not-Invented-Here-ism.

Source: PostgreSQL DBA for over a decade; have built multiple HA environments; have seen many ways of "doing it wrong", and how those can end up biting their creators.

nh28y ago

On the other hand:

1 more reply

majidazimi8y ago

derefr8y ago

Sinjo8y ago

Right!

2 more replies

illumin88y ago

How many startups have failed because they spent too much money building "cool, nerdy, infrastructure" instead of just building a product?

cookiecaper8y ago

The danger here is differentiating between "the infrastructure" and "the product". Useful database and/or infrastructure work _is_ "building the product".

So in that sense, yes, you are right. It is dumb to spend any time or money on anything other than the bare minimum skeleton needed to allow your sales people to start pimping your stuff.

2 more replies

dboreham8y ago· 6 in thread

This is why you should be extremely wary of anything that is only run once in a a blue moon. And very wary of such things that when run, are being run to save your bacon.

Klathmon8y ago

I always advocate for monthly "pull the plug on the box" tests.

If you don't need "high availability", then it will test your backups and restore process, and if you do need "high availability", it will ensure your failover processes are running smoothly.

Not to mention it trains everyone involved what to do in an emergency since it should be second nature by the time it really happens.

If you can't go "Full netflix" and unleash a chaos monkey on your servers, at least setup a maintenance period where downtime is somewhat expected, and do it then.

seanwilson8y ago

> This is why you should be extremely wary of anything that is only run once in a a blue moon.

I find this similar to when you launch a project that hasn't been used in production yet. Bugs should be expected because it hasn't been battle tested.

rosser8y ago

"Your backups are only as good as their last successful restore."

CodeWriter238y ago

https://feedbin.com/blog/2017/05/05/testing-huge-postgresql-...

quicklyfrozen8y ago

And wary of using options that might be outside typical usage (such as -INF for the backup VIP).

king_phil8y ago

I don't see how using infinity in a pacemaker config would be outside typical usage

kodablah8y ago· 6 in thread

craigkerstiens8y ago

Sinjo8y ago

To be honest we've not looked into Citus in any depth.

Maybe I've been ignoring it for bad reasons!

EDIT: Managing Postgres clusters is something that a lot of people are working on. Thought I'd mention two projects that have me excited right now:

  - Patroni https://github.com/zalando/patroni
  - Stolon https://github.com/sorintlab/stolon

craigkerstiens8y ago

We do also support some analytics workloads, less so data warehousing, when there is a need for end user facing analytics where higher concurrency and real-time responsiveness is key.

clarkdave8y ago

mbrynard8y ago

When deciding to go with Patroni, did you have a look at CruncyDB? We're deciding between the two and kubernetes support on CrunchyDB and documentation seems to be more comprehensive.

1 more reply

mdekkers8y ago

Look at HAProxy, Patroni and Zookeeper

ahoka8y ago· 5 in thread

"The RAID controller logged the simultaneous loss of 3 disks from the array. All subsequent read and write operations against it failed."

People seem to forget that adding a RAID controller creates a single point of failure instead of removing one. :-)

luke00168y ago

> People seem to forget that adding a RAID controller creates a single point of failure instead of removing one. :-)

tensor8y ago

I suppose it depends on which is more reliable, HDDs or the controller. If the likelihood of a HHD failure is much higher than a controller failure, then it still makes sense to go with RAID.

majidazimi8y ago

fb038y ago

And if your OS breaks you can always boot from another one and recover your cluster.

Now try that with an obscure Adaptec card which has been EOL for the last 4 years :-)

king_phil8y ago

And the SATA/SAS controller of your Mainboard is invincible? Or the Mainboard itself? Or the CPU? What's your backup solution if they are faulty?

1 more reply

jskrablin8y ago· 3 in thread

barkingcat8y ago

It's like a perverse version of chaos-monkey, except you want it to destroy you when you are most vulnerable.

jskrablin8y ago

hinkley8y ago

Unfortunately appropriate project name? (A maulfunctioning pacemaker can kill you)

sandGorgon8y ago· 2 in thread

this is the coolest part of this story. Any chance these scripts are opensource ?

Sinjo8y ago

We plan to open source it as soon as we can. Tiny bit more work to do, then review from a couple of people in the team, then we make it public. :)

sandGorgon8y ago

Thanks for this !

devit8y ago· 2 in thread

They should either provide one or "bless" an external solution as the official one (after making sure it works correctly).

Sinjo8y ago

Pretty much number 1 on my Postgres wishlist.

tracker18y ago

notyourday8y ago· 2 in thread

rosser8y ago

It's curious that you decry some kinds of "magic bullet" in favor of another.

How about "there are no magic bullets, full stop"?

vemv8y ago

Interesting. Recommended reads?

linker30008y ago· 1 in thread

Our dev team also came up with some pertinent questions, which we have put to both companies, but if anyone else can comment from experience that would be fantastic:

* Is the product a fork of PostgreSQL or a wrapper round the current version?

* Will the DB engine keep in lock-step with new PostgreSQL releases or might they diverge?

* If the DB engine keeps in lock-step, what’s the period between a new version of PostgreSQL being released before its incorporated in the live product?

* When new versions of Amazon Aurora/Azure DB for PostgreSQL are released will our live instance get automatically updated or will we be able to choose a version?

lattepiu8y ago

I can comment on Amazon Aurora:

> Is the product a fork of PostgreSQL or a wrapper round the current version?

> Will the DB engine keep in lock-step with new PostgreSQL releases or might they diverge?

Amazon promises to keep it in lock-step. How soon they will release an upgrade to a major version remains to be seen.

> When new versions of Amazon Aurora/Azure DB for PostgreSQL are released will our live instance get automatically updated or will we be able to choose a version?

testplzignore8y ago· 1 in thread

Good write-up. I'm curious about two more things:

1. What caused the crash on the synchronous replica? Was it just a coincidence and completely unrelated to the primary failure?

2. Given the three conditions necessary for the cluster to break, was the behavior of the Pacemaker software expected? I.e., was this a gotcha that should be in the Pacemaker documentation, or a bug?

Sinjo8y ago

1. Unfortunately the logs don't give any detail there. Most likely something arrived down the replication connection that the process couldn't handle, and it crashed.

2. Our understanding now is that INF is the strongest preference, whereas -INF is a veto. It would be very cool to have this confirmed 100% by someone who works on Pacemaker!

qaq8y ago· 1 in thread

There are specialized tools like Patroni (https://github.com/zalando/patroni)

emXdem8y ago

I like Stolon: https://github.com/sorintlab/stolon

Erwin8y ago· 1 in thread

If you are running HA in AWS RDS, how would you compare your experience with the above? What are the types of RDS failures modes that you have experienced?

pjungwir8y ago

Here are a couple gotchas I've seen on RDS:

derekmhewitt8y ago· 1 in thread

I'm seeing the term "HA" being used a lot in the comments here, what does it mean?

inimino8y ago

high availability

cavisne8y ago

The end of this post mortem was a bit handwavy TBH. I feel like they didnt dig deep enough, and the problem was the backup VIP, not the two processes crashing at once and the backup VIP.

ncmncm8y ago

Not doing frequent production failure drills is just irresponsible.

emXdem8y ago

Stolon with etcd or consul is a far superior solution for HA postgres.

j / k navigate · click thread line to collapse