Still love using Fly, please add static assets hosting/CDN.
Also, "self-healing" isn't really one thing. There are hundreds of different problems that can take out such a cluster, and every single one of them needs its own "self-healing" mechanism. These systems are literally the most complicated kinds of systems.
I stayed away from the so-called "stacked" control plane (etcd inside Kubernetes) because it can turn a tiny fire into a sharkfirenado. Recently, though, I've heard discussions of k3s (which uses dqlite) managing the etcd members, with "formal" Kubernetes managing the workloads pointed at that k3s-stacked-etcd. I haven't tried it yet, so I don't know how theory and practice differ.
https://developer.hashicorp.com/consul/tutorials/datacenter-...
Paired with cloud discovery, it makes for a tolerable operational experience when instances are expected to occasionally disappear.
Generally there's a master node, or multiple nodes in agreement. If the cluster cannot agree on its current state, the entire system may run multiple versions, be completely unavailable, or provide inconsistent responses, bringing down other systems that rely on it.
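To make the "multiple nodes in agreement" part concrete, here's a rough sketch of the majority-quorum arithmetic that Raft-style systems such as etcd and Consul rely on. The function names are made up for illustration and don't come from any library; real systems layer terms, logs and timeouts on top of this.

```python
def quorum_size(voters: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    if voters < 1:
        raise ValueError("a cluster needs at least one voting member")
    return voters // 2 + 1


def failures_tolerated(voters: int) -> int:
    """How many voters can be lost while a majority can still agree."""
    return voters - quorum_size(voters)


def can_agree(voters: int, reachable: int) -> bool:
    """True if the reachable voters are enough to form a majority."""
    return reachable >= quorum_size(voters)


if __name__ == "__main__":
    # Print voters, quorum size and tolerated failures for common cluster sizes.
    for n in (1, 3, 4, 5):
        print(n, quorum_size(n), failures_tolerated(n))
```

Note that going from 3 to 4 voters raises the quorum from 2 to 3 without tolerating any extra failures, which is why odd cluster sizes are the usual recommendation.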
Inspection itself is hampered by elections, state syncing, and other process, race-related or caching problems, or by the cluster DDoSing itself or other services.
Meanwhile, hundreds of thousands of Consul, Nomad and Vault clusters, used appropriately, work perfectly well…