Odds are pretty good it's left at the default of 4 worker threads... so on a 16 vcpu instance that's not going to reach great heights. Since it's a 1.4.x version (years old), it's missing some newer features that help with both average latency and memory efficiency. Or rather, a lot of them are there but disabled by default.
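For example, something like this (a sketch; `-o modern` only exists on newer builds, and the memory figure here is made up):

```shell
# -t: worker threads (default is 4); match it to the instance's vcpu count
# -m: memory limit in megabytes (4096 is just an example)
# -o modern: opt into the newer defaults (LRU crawler/maintainer, etc.)
#            on versions that support it
memcached -m 4096 -t 16 -o modern
```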
Memcached has allowed pipelining since it was created. For the ASCII protocol, packing multiple responses into single packets is done via a straight multiget. You can send multiple requests in a single packet for any protocol and any command.
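As a sketch of what that looks like on the wire (hypothetical keys, ASCII protocol):

```python
def multiget(keys):
    # One ASCII request covering many keys; responses for all hits come
    # back packed into the same stream, terminated by "END\r\n".
    return b"get " + b" ".join(keys) + b"\r\n"

def pipeline(requests):
    # Any commands can be concatenated and sent in a single write/packet;
    # the server processes and answers them in order.
    return b"".join(requests)

# e.g. a multiget plus a separate get, all in one packet
payload = pipeline([multiget([b"key1", b"key2"]), b"get key3\r\n"])
```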
My stress utility (https://github.com/memcached/mc-crusher) has options for pipelining requests and for using ASCII multigets to pack get responses. I test to the limit of lock scaling for each individual subsystem.
The 55M test required running mc-crusher via localhost; there's no network that can go that fast. My point is you're limited by the network throughput, not the CPU. In that particular 55M test, all cores were used, but ~7-8 of them were used by mc-crusher itself... so the real limit for the machine is even higher. It did have a lot of cores. 48ish?
You can still do apples/apples with instance sizes... but given everything I know about this thing, unless those cores are extremely slow, hitting 11m ops/sec shouldn't be an issue. Or at least, with minimal fiddling it should hit 6-8m, which doesn't give you a crazy 9x figure.
You do need to stop doing a 1:1 get/set ratio though. Sets don't scale as well, largely because I've never had complaints about their speed, so they haven't been optimized as hard. I'd say a highly conservative test would be 5:1 get/set; production workloads are typically even higher than that. (That said, I do intend to speed sets up more, it's just lowish priority. The LRU locks are highly granular, so spreading sets across different slab classes can help mutation perf a lot.)
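A hypothetical driver loop for that ratio (this is not mc-crusher's config syntax, just the shape of the workload; the key space and value sizes are made up):

```python
def op_stream(n, get_ratio=5, sizes=(64, 256, 1024, 4096)):
    """Generate n (command, key, value_size) ops at get_ratio gets per set.

    Cycling the set value sizes lands them in different slab classes,
    which spreads LRU lock contention across classes.
    """
    ops = []
    for i in range(n):
        key = f"key{i % 1000}"
        if i % (get_ratio + 1) == get_ratio:  # every 6th op is a set at 5:1
            ops.append(("set", key, sizes[i % len(sizes)]))
        else:
            ops.append(("get", key, 0))
    return ops

ops = op_stream(600)  # 500 gets, 100 sets
```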