Not large by absolute standards sure, but large enough to cause issues.
I’m sure there’s some kind of solution that involves re-architecting the ES cluster and indices and re-architecting the data flows and stuff. But if our options are go through all that, or seriously slim down our architecture and costs by just running Sonic + our data warehouse, I’m definitely going to give it a go. After all, worst comes to worst we can go down the re-architecting ES route if Sonic doesn’t work out.
¯\_(ツ)_/¯
You can absolutely run 2 data/master-eligible nodes plus a single master-eligible tie-breaker node as a safe configuration. The only constraint is that you should have an uneven number of master-eligibile node to avoid a split brain. You also need to understand what the resilience guarantees are for any given number of replica (roughly: each replica allows for the loss of a single random node) and how many replica you can allocate on a given cluster (at most one per data node). That would allow you to run a 2-datanode cluster in a configuration that survives the loss of one node.
I pretty much specifically avoid 4 node clusters. You’d have to either designate 3 of the four nodes as master-eligibile with a quorum of 2 or have all of them master-eligible with a quorum of 3. Both options allow for failure of a single node before the cluster becomes unavailable. Any other configuration would either fail immediately on a node loss (quorum 4) or be unsafe (quorum of 2, allows for split-brain)
I’d much rather opt for 4 data/master eligible nodes plus a dedicated master eligible node with a quorum of 3.
You also need to pick the number of replica suitably: each replica allows for the loss of a single random(!) data node while retaining all your data. Note that if losses are not random but you want to safeguard against loss of a rack or an availability zone or such, configurations are possible that distribute primary and replica suitably (“keep a full copy on either side”)