I’ve never in that time seen a pacemaker/corosync/etc/etc configuration go well. Ever. I have seen corrupted DBs, fail overs for no reason, etc. The worst things always happen when the failover doesn’t go according to plan and someone accidentally nukes the DB at 2am.
The lesson I’ve taken from this is it’s better to have 15-20 minutes of downtime in the unlikely event a primary goes down and run a manual failover/takeover script then it is to rely on automation. pgbouncer makes this easy enough.
That said, there was a lot of bad luck involved in this incident.
If we want to failover (nothing automatic), we stop the primary (or it's already dead) and the route is withdrawn. An operator touches the recovery file on new primary, the health checker sees that, and the IP is announced back into the network. Yes, it's a "VIP", but it's one controlled by our operations team, not automation software. One nice things about this is that you can failover across datacenters (remember it's advertised into our network over BGP) without reconfiguring DNS or messing with application servers.
While the mechanisms are different, we do something very similar for MySQL with MHA. It's still an operator running scripts intentionally though (which is what we want).
I will definitely agree with you that manual operator intervention is better than automated failover.
If you have the time and motivation to write a detailed post that would be much welcome by me and many others as well I am sure.
I've used it heartbeat/pacemaker/corosync with DRBD for over a decade and I'm pretty pleased with it...now. But it took a bit of trial by fire to get it right. Luckily I never lost any data and found the issues in testing. Which gets at the heart of any failover mechanism -- it's all useless unless you test it on a regular basis (just like your backups).
That mirrors my experience - PG, DRDB, anything else, this form of clustering is a complete joke. If it's important, it's worth spending the money on a grown-solution. Veritas has its flaws sure but in 20+ years on half a dozen different OS's it's never let me down when the balloon went up.
This post is also very relevant to the topic: http://code.openark.org/blog/mysql/mysql-high-availability-t...
The post you link is still relevant to this discussion, but we can't ignore that 5 years have passed. It would be good to see 2017's Baron revisit the topic :)
(just learning about postgres and saw an opportunity to ask someone in the know) Cheers!
Yes.
Note if those functions have an implementation compiled from C, you do need to install the .so on the standbys though.
The problem is not these tools, but implementing what is the right thing to do during an outage or even properly detecting one (what happened with github). Your solution might work 99 cases out of 100 but that remaining 1 case might cause your data loss.
When there is a human required to do the switch it typically he/she can investigate what happened and make the right decision.
It's theoretically possible to have a foolproof solution that always works right, but that's extremely hard to implement, because you need to know in advance what kind of issues you will have, and if you miss something, that's one case where your tool might make a wrong decision.
Postgres seems like a better choice for personal projects since it has a lot of nifty features. I'm also wary of Oracle, but that's my own attitude talking. For a startup eventually wanting to scale, would the better choice be to use MySQL out of the gates? Am I being mislead about Postgres clusters and availability?
Serious (naive) question; not wanting to start a flame war.
Since PostgreSQL 10 WAL and logical replication strategies can support just about any sort of replication you desire, except multi-master.
An HA setup can be a master <-> master setup with a VIP using VRRP and a check script ( keepalived). Of course, you have to remain cautious about network partition. Another thing that might be interesting in some use cases is the ability to skip some replication errors. This is specially interesting in cases where consistency is not critical.
I actually did some setup like that with a replication ring (4 full masters), and an additional daemon re-configuring the ring dynamically when a node was down. It also monitored the transaction log, trimming them if they were about to fill up the disk and setting the GTIDs to the new values. I added some skip error to not block replication. However it was for a very simple DB (session DB containing just one table, but an SQL db was required). Basically I switched from a CA DB to an AP DB, and it's nice to be able to do these kind of things.
I know those are too simplistic setups not taking into account all the failure modes. But it also make them easier to understand and to debug.
Firstly we consider that there are multiple replication possibilities of both technologies- however I'm going to assume the defaults because that's pretty much what everyone uses except if there's an actual case for using something else. It's the exception.
But by default MySQL uses statement based replication (in a weird binary format with log positions and stuff) and postgresql does logical replication (as in, you transmit the binary differences of what you'll be doing to the replica's database files directly and the replica just follows along)
Both of these approaches have trade-offs depending on what you want.
Statement based replication is great if you want to have _different_ datasets on each side, You can transform the data or remove huge chunks of it on a slave and use it for a dedicated purpose. However that applies the other way, you can never really be 100% sure that your replica looks exactly like your master.
this bit me a few times with MySQL when I assumed that because the replica was 'up to date' with the master and it was set to read only, that the data had integrity- it absolutely did not.
MySQL is run more often at extremely large scale (Facebook, YouTube, Twitter, Dropbox, etc.) than Postgres. That results in very battle tested and/or featureful tooling like orchestrator (https://github.com/github/orchestrator), MHA (https://github.com/yoshinorim/mha4mysql-manager), ProxySQL (http://www.proxysql.com/), and gh-ost (https://github.com/github/gh-ost), along knowledge and best practices shared by those organization.
I like Postgres's physical replication for its straightforwardness. It's pretty easy to tell if your replica is up to date unless something really weird is going on. (undetected data corruption?).
That said, PostgreSQL doesn't really make replication appear easy, so I can understand people thinking that even a basic master-slave setup is difficult (In my experience its behaviour is much easier to understand than with MySQL). However, MySQL is ahead in multi-master user friendliness, and setting up eg. a simple galera cluster is pretty easy.
Whether an "easy" multi-master galera set up is actually production-quality is another matter entirely, but it is not difficult to get up and running.
Source: PostgreSQL DBA for over a decade; have built multiple HA environments; have seen many ways of "doing it wrong", and how those can end up biting their creators.
With hosted Postgres, when a failure does happen, isn't it much harder to get at the log files? They seem extremely useful to diagnose the problem and make sure it doesn't happen again, as the article shows. What's your experiene here, can you get at logs easily with hosted Posgres offerings?
And it seems the only way to get reliable Postgres HA for everyone, and to weed out the bugs, is if more people run Posgres HA themselves. For example, I find Stolon and Patroni great, but I would be more relaxed about them if they had 100x more users.
OF COURSE every thing is going to work in the lab, BUT MAY BE there is some other corner case in the production that you haven't considered yet. --- Louis C.K. after a sleepless night of switching primary DB to secondaries.
But, when you get to the point where you need multi-master replication, you're making a mistake if you aren't dedicating an ops person (i.e. a DBA) to managing your growing cluster. If you can't afford that ops person, much better to just pay a DBaaS provider to handle the ops for you, than to get hosed when your cluster falls apart and your week gets shot putting it back together.
A thing that scares me is anyone saying they're running their own HA cluster (not single instance) for cost reasons. Infra people are not cheaper than the hosted solutions (Amazon RDS, Google Cloud SQL, Heroku Postgres).
How many startups have failed because they spent too much money building "cool, nerdy, infrastructure" instead of just building a product?
There's nothing necessarily wrong with using pre-baked or hosted components when they fit the bill, but pretending like they're unrelated concerns is going down a bad road. A lot of recent fads are based on this self-centered, lazy fantasy from devs that $LANGUAGE_OF_THE_MONTH is the only thing that matters and it's a dark, sad situation.
There's a pretty consistent inverse relationship between technical quality and popularity because time and money spent on technical/engineering resources is time and money not spent on marketing and sales resources that bring cash in the door.
Ever wonder why, with a few exceptions, it never seems that the products everyone knows about are comparable to what you can find after a little bit of research online? This is why. The people who are building good stuff are spending the time and resources on building good stuff, whereas the people who aren't are spending the time and resources on making sure they're the path of least resistance.
So in that sense, yes, you are right. It is dumb to spend any time or money on anything other than the bare minimum skeleton needed to allow your sales people to start pimping your stuff.
Whether or not people recognize your product's superiority is more or less irrelevant, because first, they won't, and because second, the extra effort it takes to swim upstream and use your product instead of the mainstream solution won't really be worth the gains for most people no matter how much better it is. You can probably rattle 15 examples of software off the top of your head that is just like this. PostgreSQL is actually a great example of it.
Amazon has run amok feeding people who don't really deserve the title "developer" a load of crap about how you can click buttons in their wizards and be like a super-real grown-up coder-hero without any having to learn any of that outdated command line mumbo-jumbo. It's 2017 after all! Don't worry about that gobbligook hocus-pocus that the smelly old man in the network closet keeps muttering under his breath. That's for smelly old people and third-party Amazon contractors. You have Very Important JavaScript to write, just as soon as you finish dragging Legos--err--"Mega-Elastic Dynamo-tastic Sumerian-Beanstalkinator Units" around on AWS.
How have other professions handled this issue? After all, most people wouldn't know the difference between a safe bridge and an unsafe one, and most people wouldn't know the difference between safely-prepared food and unsafely-prepared food (until they've already eaten it). The profit incentive is to put the bare minimum in place and then sell sell sell.
We may not like the heavy hand of regulation that will clamp down on the software industry, permanently and officially gate it behind the blood-sucking ivory tower of the academic priesthood, and strip it of all vitality and creativity, but with the attitudes that have become prevalent over the last few years, we have no one to blame but ourselves.
If you don't need "high availability", then it will test your backups and restore process, and if you do need "high availability", it will ensure your failover processes are running smoothly.
Not to mention it trains everyone involved what to do in an emergency since it should be second nature by the time it really happens.
If you can't go "Full netflix" and unleash a chaos monkey on your servers, at least setup a maintenance period where downtime is somewhat expected, and do it then.
I find this similar to when you launch a project that hasn't been used in production yet. Bugs should be expected because it hasn't been battle tested.
My early impression of it (can't speak for the rest of the team) was that it was mostly aimed at sharding analytics workloads, but parts of the docs (e.g. https://docs.citusdata.com/en/v7.1/admin_guide/cluster_manag...) make it sound like it handles OLTP workloads too.
Maybe I've been ignoring it for bad reasons!
EDIT: Managing Postgres clusters is something that a lot of people are working on. Thought I'd mention two projects that have me excited right now:
- Patroni https://github.com/zalando/patroni
- Stolon https://github.com/sorintlab/stolon
Stolon's client proxy approach in particular looks interesting, and reminds me of how people are using Envoy (https://github.com/envoyproxy/envoy), albeit as a TCP proxy rather than one that understands and can do fun stuff with the database's protocol. I wonder if we'll start to see more Envoy filters for different databases!We do also support some analytics workloads, less so data warehousing, when there is a need for end user facing analytics where higher concurrency and real-time responsiveness is key.
People seem to forget that adding a RAID controller creates a single point of failure instead of removing one. :-)
At worst it does both. In most cases it really does just remove a single point of failure (a disk). Other non-RAID configurations likely use a shared controller too. Moving that single point of failure to a different controller doesn't make it any worse.
Now try that with an obscure Adaptec card which has been EOL for the last 4 years :-)
It's like a perverse version of chaos-monkey, except you want it to destroy you when you are most vulnerable.
I guess the lesson here is not to rely entirely on some HA black magic and always have procedures in place for the 'HA black magic failed us' moments. And team trained to deal with situation like this. It's only software so it will break sooner or later.
this is the coolest part of this story. Any chance these scripts are opensource ?
They should either provide one or "bless" an external solution as the official one (after making sure it works correctly).
The other problem is that GoCardless setup an asynchronous and a synchronous replica instead of 2 synchronous replicas (or preferably 4+), resulting in only two points of failure, which is not enough.
Our dev team also came up with some pertinent questions, which we have put to both companies, but if anyone else can comment from experience that would be fantastic:
* Is the product a fork of PostgreSQL or a wrapper round the current version?
* Will the DB engine keep in lock-step with new PostgreSQL releases or might they diverge?
* If the DB engine keeps in lock-step, what’s the period between a new version of PostgreSQL being released before its incorporated in the live product?
* When new versions of Amazon Aurora/Azure DB for PostgreSQL are released will our live instance get automatically updated or will we be able to choose a version?
> Is the product a fork of PostgreSQL or a wrapper round the current version?
Aurora is a fork: they've re-written a significant chunk of the engine. Note that Amazon also offers RDS PostgreSQL, which is a managed version of the "regular" PostgreSQL engine. RDS PostgreSQL also offers a HA setup (no version upgrade without downtime, however). It works quite well.
> Will the DB engine keep in lock-step with new PostgreSQL releases or might they diverge?
Amazon promises to keep it in lock-step. How soon they will release an upgrade to a major version remains to be seen.
> When new versions of Amazon Aurora/Azure DB for PostgreSQL are released will our live instance get automatically updated or will we be able to choose a version?
Minor version upgrades are applied automatically. For major version upgrades, it's unclear at this time (there hasn't been one yet for Aurora PostgreSQL), but I think it's unlikely they will be applied automatically.
1. What caused the crash on the synchronous replica? Was it just a coincidence and completely unrelated to the primary failure?
2. Given the three conditions necessary for the cluster to break, was the behavior of the Pacemaker software expected? I.e., was this a gotcha that should be in the Pacemaker documentation, or a bug?
2. Our understanding now is that INF is the strongest preference, whereas -INF is a veto. It would be very cool to have this confirmed 100% by someone who works on Pacemaker!
So far I've discovered that TCP keepalives are quite important, otherwise your queries may hang forever after failover (or at least for the default timeout which is like 30 minutes). The connection does not get broken otherwise by the failover.
- If you are running a MultiAZ instance, it is supposed to fail over automatically, but if the problem is in the networking, then you can still effectively lose service. One way around that is to run a read replica in another AZ, and use a Route53 entry with a health check to send traffic to the read replica if the primary isn't reachable. You'll still need to promote the read replica to a master though.
- If you restore from a snapshot, the new EBS volume only pulls blocks of data from S3 as they are requested. So these reads are a lot slower than normal. If you have a large database you could have degraded performance for days. Here is some more info about this: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-resto...
I think by still allowing the backup VIP to run on the sync replica the same mistake is being repeated, there will always be the possibility of a situation where the VIP cannot be moved when promotion is required. That replica should be doing nothing but sitting there waiting to save the day, and if they want the backup VIP to be highly available they should provision 2 async replicas.
Probably by now Pacemaker would have been abandoned. A hundred drills would have been enough to flush out these behaviors. If you are afraid to run drills on production equipment, you should be running them on a full-scale production testbed, ideally with mirrored production traffic. With a production-scale testbed, two years is enough to run thousands of risk-free failovers.
Not doing frequent production failure drills is just irresponsible.