Any feedback is super welcome. If folks like this we may consider working on upstreaming the approach into nodetool.
RethinkDB seems to have everything from Cassandra, without the complexity.
Replacing Cassandra at Yelp would be a lot of effort, so we'd have to be sure that it's worth it. That being said, RethinkDB definitely looks interesting and I'll make sure it's on my list of datastores to evaluate.
We might try it out someday, but for now we're fairly happy with stock Cassandra.
1. Noisy. We had a lot of large deployments with high replication factors (e.g. RF=5 or 7) whose owners very much didn't care if 2 nodes failed, or 3, or even 4. They had the high replication factor for resilience to multiple rack failures and didn't want to get paged by a few racks failing.
2. Hard to generalize, especially with multi-tenant clusters. Size of cluster != replication of keyspaces. For example, if we had a 50 node cluster but had a keyspace with RF=1, a single node failure should be a pageable event. Why is there an RF=1 keyspace ... because devops means that developers sometimes do things like that.
3. Had poor attribution. If you have a large cluster with many keyspaces, some of which have a lower RF than the rest or a higher consistency level, then only the owners of those keyspaces care if we lose a node or two. When we're dealing with an incident we can rope in specifically the teams that own the under-replicated keyspaces so they can take appropriate action.
To be totally honest, mostly it just helps us find keyspaces that have low RF ... The number of times we found out the new Cassandra version we just deployed added another system table that used SimpleStrategy with a default replication of 2 ...
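To make the per-keyspace idea concrete, here is a rough sketch of the kind of check described above. It is not the actual tool: the `Keyspace` model, the `owner` field, and the function names are all made up for illustration, and it assumes you already know which nodes hold replicas for each token range. It only handles three consistency levels and ignores datacenter-aware ones like LOCAL_QUORUM.

```python
from dataclasses import dataclass


# Hypothetical model, not a Cassandra API: one frozenset of node names
# per token range, listing the nodes that hold a replica of that range.
@dataclass(frozen=True)
class Keyspace:
    name: str
    owner: str            # team to page (illustrative)
    rf: int               # replication factor
    consistency: str      # "ONE", "QUORUM", or "ALL"
    replica_sets: tuple   # tuple of frozensets, one per token range


def required_replicas(consistency: str, rf: int) -> int:
    """Number of live replicas a request at this consistency level needs."""
    if consistency == "ONE":
        return 1
    if consistency == "QUORUM":
        return rf // 2 + 1
    if consistency == "ALL":
        return rf
    raise ValueError(f"unsupported consistency level: {consistency}")


def unavailable_keyspaces(keyspaces, down_nodes):
    """Return keyspaces with at least one token range that cannot
    meet its consistency level given the set of down nodes."""
    down = set(down_nodes)
    affected = []
    for ks in keyspaces:
        need = required_replicas(ks.consistency, ks.rf)
        if any(len(replicas - down) < need for replicas in ks.replica_sets):
            affected.append(ks)
    return affected


keyspaces = [
    # RF=5 at QUORUM tolerates two down replicas per range.
    Keyspace("resilient", "team-a", 5, "QUORUM",
             (frozenset({"n1", "n2", "n3", "n4", "n5"}),)),
    # RF=1 loses a range as soon as its single replica is down.
    Keyspace("fragile", "team-b", 1, "ONE",
             (frozenset({"n1"}), frozenset({"n2"}))),
]

for ks in unavailable_keyspaces(keyspaces, {"n1", "n2"}):
    print(f"page {ks.owner}: keyspace {ks.name} cannot serve {ks.consistency}")
# prints: page team-b: keyspace fragile cannot serve ONE
```

With two nodes down, the RF=5 keyspace stays quiet and only the owner of the RF=1 keyspace gets paged, which addresses both the noise and the attribution problems in the list above.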