I'm an application developer with decent "high-level" performance tuning skills. I can profile my application code and fix bottlenecks, but eventually I hit a wall. Once I've addressed the low-hanging fruit, I know there are probably still 10-100x+ performance improvements available, but they're out of reach with my current skills.
I don't know how to find and fix things like excessive page faults, L1/L2 cache misses, branch mispredictions, context switches, etc. I lack what you might call "mechanical sympathy."
For those with these skills, how did you learn? How would you recommend someone develop this skillset today?