Don’t make the mistake of overromanticizing the simple solutions. They have nice, well-understood failure conditions, but those failures come up relatively frequently.
When you start playing the HA game, the easy failures go off the table, and things break less often because “failures happen constantly and are auto-healed”. But when your virtual IP failover goes sideways or your cluster scheduler starts reaping systems because the metadata service is giving it useless data, you’re well into an infrequent, complex failure, and I hope you have a good ops team.
It’s always a trade-off.