I can share mine. It's an ads retrieval system that is very latency-sensitive, so it has to be efficient. To avoid memory allocations, special hash tables with a fixed number of buckets (using open addressing) are used in multiple places in query processing. The default is 1000 buckets. However, there are cases where the number of elements is only a handful. In those cases the few entries are scattered across a table far larger than needed, so lookups make poor use of the CPU cache and run slower.
The solution is to tune the number of buckets using information derived from the pprof call graph.
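To make the idea concrete, here is a minimal sketch of a fixed-capacity open-addressing table whose bucket count is chosen per use site instead of a global default. It is not the production code: Go is assumed only because pprof is involved, and `FixedTable`, `NewFixedTable`, the linear probing, and the multiplicative hash are all hypothetical choices made for illustration.

```go
package main

// FixedTable is a hypothetical fixed-capacity hash table with open
// addressing (linear probing). All storage is allocated once, up front,
// so Put and Get never allocate.
type FixedTable struct {
	keys []uint64
	vals []int32
	used []bool
	mask uint64
}

// NewFixedTable sizes the table from the expected element count rather
// than a global default. The bucket count is the smallest power of two
// that is at least 2*expected (minimum 4), an assumed sizing policy.
func NewFixedTable(expected int) *FixedTable {
	size := 4
	for size < 2*expected {
		size *= 2
	}
	return &FixedTable{
		keys: make([]uint64, size),
		vals: make([]int32, size),
		used: make([]bool, size),
		mask: uint64(size - 1),
	}
}

// Buckets reports the number of slots in the table.
func (t *FixedTable) Buckets() int { return len(t.keys) }

// slot maps a key to its starting bucket with a multiplicative hash.
func (t *FixedTable) slot(key uint64) uint64 {
	return ((key * 0x9E3779B97F4A7C15) >> 32) & t.mask
}

// Put inserts or updates a key. It returns false if the table is full.
func (t *FixedTable) Put(key uint64, val int32) bool {
	i := t.slot(key)
	for probes := 0; probes < len(t.keys); probes++ {
		if !t.used[i] {
			t.keys[i], t.vals[i], t.used[i] = key, val, true
			return true
		}
		if t.keys[i] == key {
			t.vals[i] = val
			return true
		}
		i = (i + 1) & t.mask // linear probe to the next bucket
	}
	return false
}

// Get looks up a key, stopping at the first empty bucket.
func (t *FixedTable) Get(key uint64) (int32, bool) {
	i := t.slot(key)
	for probes := 0; probes < len(t.keys); probes++ {
		if !t.used[i] {
			return 0, false
		}
		if t.keys[i] == key {
			return t.vals[i], true
		}
		i = (i + 1) & t.mask
	}
	return 0, false
}

// Reset clears the table for reuse. Its cost grows with the bucket
// count, not with the number of stored elements.
func (t *FixedTable) Reset() {
	for i := range t.used {
		t.used[i] = false
	}
}
```

With this sizing policy, a site that holds about five elements would call `NewFixedTable(5)` and get 16 buckets, while `NewFixedTable(1000)` would get 2048. The small table fits in a few cache lines, and `Reset` between queries touches far less memory.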
There were others too, such as redundant serialization, but this one was the most interesting.