At extremely high scale you start to run into very strange problems. We used to say that all of your "Unix friends" — the familiar everyday tools — fail at scale and act differently.
I once had 3,000 machines running NTP-synced cron jobs that fired on the exact same second, pounding the upstream server and causing outages. (Whoops, add random offsets to cron!)
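The fix generalizes beyond cron: any periodic work on a fleet should be splayed with a per-host random offset. A minimal sketch in JavaScript (the names here are illustrative, not from any particular library):

```javascript
// Splay periodic work: each process picks a random jitter on top of the
// base delay, so NTP-synced hosts don't all fire in the same second.
function jitteredDelay(baseMs, maxJitterMs) {
  return baseMs + Math.floor(Math.random() * maxJitterMs);
}

// Hypothetical usage: a "cron-like" job that reschedules itself with
// fresh jitter each run, spreading the fleet across a 5-minute window.
function scheduleJob(runJob, baseMs = 60 * 60 * 1000, maxJitterMs = 5 * 60 * 1000) {
  setTimeout(() => {
    runJob();
    scheduleJob(runJob, baseMs, maxJitterMs);
  }, jitteredDelay(baseMs, maxJitterMs)).unref(); // don't hold the process open
}
```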
This sort of "dogpile effect" exists when fetching keys as well. A key drops out of cache and 30 machines (or worker threads) all try to load the same key at the same time, because the cache is empty.
One of the solutions to this problem is Facebook's DataLoader (https://github.com/graphql/dataloader), which intercepts the request pipeline, batches the requests together, and coalesces many requests into one.
Essentially DataLoader will coalesce all individual loads which occur within a single frame of execution (a single tick of the event loop) and then call your batch function with all requested keys.
It helps by reducing request volume and offering something resembling backpressure, since every load is funneled through one code path.
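The core mechanism — queue loads during one tick, then fire a single batch — can be sketched in a few lines. This is a simplified illustration of the idea, not DataLoader's actual implementation; `batchFn` is assumed to return a promise of values in the same order as the keys it receives:

```javascript
// Sketch of DataLoader-style coalescing: all load() calls within one
// tick of the event loop are queued, then batchFn is called once with
// the distinct keys. Duplicate keys share a single promise.
class TinyLoader {
  constructor(batchFn) {
    this.batchFn = batchFn;
    this.queue = new Map(); // key -> { promise, resolve, reject }
  }
  load(key) {
    const pending = this.queue.get(key);
    if (pending) return pending.promise; // coalesce duplicate keys
    if (this.queue.size === 0) {
      // First load this tick: flush after the current synchronous code.
      queueMicrotask(() => this.flush());
    }
    let resolve, reject;
    const promise = new Promise((res, rej) => { resolve = res; reject = rej; });
    this.queue.set(key, { promise, resolve, reject });
    return promise;
  }
  flush() {
    const entries = [...this.queue.entries()];
    this.queue.clear();
    const keys = entries.map(([k]) => k);
    this.batchFn(keys).then(
      values => entries.forEach(([, e], i) => e.resolve(values[i])),
      err => entries.forEach(([, e]) => e.reject(err)),
    );
  }
}
```

So thirty concurrent `load("user:42")` calls in the same tick become one backend request instead of thirty.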
I would expect you'd have the same sort of problem at scale with this system, given the number of requests across many processes on many machines.
We had a lot of small tricks like this (they add up!). In some cases we'd insert a message queue in between the requestor and the service so that we could increase latency or reduce the request rate while systems were degraded. Those "knobs" were generally implemented by "Decider" code which read keys from memcache to figure out what to do.
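A Decider-style knob is simple to sketch. This assumes a cache client with a synchronous `get(key)` — a hypothetical interface standing in for memcache — and gates a code path on a percentage stored in the cache, so you can dial traffic down without a deploy:

```javascript
// Sketch of a "Decider" knob: the cache key holds a percentage
// ("0".."100") controlling how much traffic a code path admits.
// `cache` is a hypothetical client with a synchronous get(key).
function makeDecider(cache, key, defaultPercent = 100) {
  return function allow() {
    const raw = cache.get(key); // e.g. "25" means let 25% of calls through
    const percent = raw == null ? defaultPercent : Number(raw);
    return Math.random() * 100 < percent;
  };
}

// Hypothetical usage: skip the expensive path when operators dial it off.
// const allowExpensive = makeDecider(memcache, 'decider:expensive_path');
// if (allowExpensive()) { /* normal path */ } else { /* degraded path */ }
```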
By "pushes to connected SDKs": I assume you're holding a connection open per process. How do you reconcile this when you're running something like Node with PM2, where you've got 30-60 processes on a single host? They won't be sharing memory, so that's a lot of duplicate updates.
It seems better to have these updates pushed to one local process that other processes can read from via socket or shared memory.
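The shape of that pattern: one broker process per host holds the single upstream connection and keeps the latest snapshot; workers do cheap local reads. The sketch below models the broker in-process for illustration — in a real deployment the worker-to-broker hop would go over a Unix socket or shared memory, which is elided here:

```javascript
// Sketch of the "one local subscriber" pattern: a single broker holds
// the upstream push connection; workers read the latest snapshot
// locally instead of each holding an upstream connection.
class LocalBroker {
  constructor() {
    this.snapshot = {}; // latest merged config/state from upstream
    this.version = 0;   // bumped on every push so workers can diff cheaply
  }
  // Called by the one upstream connection when a push arrives.
  onPush(update) {
    Object.assign(this.snapshot, update);
    this.version++;
  }
  // Called by each worker: a local read, no upstream traffic.
  read() {
    return { version: this.version, data: { ...this.snapshot } };
  }
}
```

With 60 PM2 workers this turns 60 upstream connections into 1, and workers that only care whether anything changed can compare versions before copying the data.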
I'd also consider the many failure modes of services. Sometimes services go catatonic upon connect and don't respond, sometimes they time out, sometimes they throw exceptions, etc...
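The catatonic case is the nastiest, because no error ever arrives. A common defense is a hard timeout around every remote call, so "connected but silent" degrades into an ordinary error you can handle; a minimal sketch (names are illustrative):

```javascript
// Sketch: race a remote call against a timer so a service that accepts
// the connection but never responds still fails fast.
function withTimeout(promise, ms, label = 'call') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms,
    );
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```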
There's a lot to think about here, but as I said, what you've got is a great start.