I’m trying to remember. I think it was nomad.
Essentially, about once a week the raft pings between the nodes would result in no response from the primary, so then the cluster would assume it lost the node and try to hold an election to pick the new leader and get stuck in a loop cuz it kept indefinitely trying to ask the leader for it’s vote.
I thought, surely this software isn’t that stupid. But it was.
The recovery documentation was to hand-generate a peers.json file. Seriously? In 2013 when you have a million ways to do auto discovery? Including in your own software? it couldn’t just auto heal?
I managed a MongoDB cluster ten years ago and never had a single issue like this. I could routinely take nodes down and bring them up and the cluster healed perfectly.