
Mizza 3y ago

From my experience with the Hashi stack, I don't think it's a coincidence that Fly has a lot of downtime and is a major Hashi user. Terraform makes excellent bait though.

Still love using Fly, please add static asset hosting/CDN.


atonse 3y ago

How is this possible? How is Consul not self-healing? It just seems so brittle in a way even database clusters aren’t.

throwawaaarrgh 3y ago

All distributed decentralized systems are brittle. The only people who don't think this are people who haven't run them at scale.

Also, "self-healing" isn't really one thing. There are hundreds of different problems that can take out such a cluster, and every single one of them needs its own "self-healing" mechanism. These systems are literally the most complicated kinds of systems.

namaria 3y ago

"Should have self-healing" can be expanded into "should have systems to address the underlying system's failure modes", which starts to shed some light on why distributed systems will always run into failure modes.

mdaniel 3y ago

I haven't been responsible for babysitting Consul, but I have been responsible for etcd for years. If Consul's problem is anything like etcd's, it's because members have identity: if one of them goes toes up, etcd will wait forever for the snowflake to come back to life, and if that's not how the underlying infra is configured, that's very, very bad. Mix mTLS into this story and it gets worse.

I stayed away from the so-called "stacked" control plane of etcd inside Kubernetes because it can turn a tiny fire into a sharkfirenado. Recently, though, I've heard discussions of k3s (which uses dqlite) managing the etcd members, with "formal" Kubernetes managing the workloads pointed at that k3s-stacked-etcd. I haven't tried it yet, so I don't know how theory and practice differ.
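
The member-identity point above can be illustrated with quorum arithmetic. This is only a toy sketch, assuming a Raft-style majority rule in which a dead member still counts toward the registered cluster size until it is explicitly removed. The function names `quorum` and `can_commit` are made up for illustration and are not part of etcd or Consul.

```python
def quorum(registered_members: int) -> int:
    """Smallest majority of the registered voting set."""
    return registered_members // 2 + 1


def can_commit(registered: int, alive: int) -> bool:
    """Writes need a majority of *registered* members, dead ones included."""
    return alive >= quorum(registered)


# Three registered members, one dead: 2 alive meets the quorum of 2.
print(can_commit(registered=3, alive=2))  # True

# A second failure leaves 1 alive against a quorum of 2: writes stall,
# even though the survivor is perfectly healthy.
print(can_commit(registered=3, alive=1))  # False
```

Under these assumptions, a dead member that is never removed or replaced keeps the quorum requirement where it was, so the cluster runs with less failure margin than its live node count suggests.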

cheeseprocedure 3y ago

Consul’s autopilot feature makes life a little easier by automatically reaping failed instances:

https://developer.hashicorp.com/consul/tutorials/datacenter-...

Paired with cloud discovery, it makes for a tolerable operational experience when instances are expected to occasionally disappear.

grrdotcloud 3y ago

I've worked with Vault and clusters, though not Consul specifically. Generally, there's a healthy cluster state until something happens that puts the cluster into an inconsistent state.

Generally there's a master node or multiple nodes in agreement. If the cluster cannot agree on its current state, the entire system may run multiple versions, be completely unavailable, or provide inconsistent responses, bringing down other systems that rely on it.

Inspection itself is hampered by elections, state syncing, or other process, race-related, or caching problems, or by the cluster DDoSing itself or other services.

candiddevmike 3y ago

Consul is a lot more than just a database cluster, and that may be part of its problem.

jen20 3y ago

The simpler explanation is that running products designed for LAN usage on a WAN is a fundamentally bad plan, as the folks over at Fly acknowledge, even in this thread.

Meanwhile, hundreds of thousands of Consul, Nomad and Vault clusters used appropriately work perfectly well…
