> software developers identify critical regions of their applications and evaluate design choices to select the best performing implementation.
Which is what I'd use metrics like cache misses and branch prediction misses for: really fine-tuning some section of the code that needs to execute lightning fast.
htop, on the other hand, gives a more high-level overview. Like, "who's eating all the RAM?". (Or, perhaps more often, "who do I need to kill?")
(*) What should N be? Depends on how frequently the counter is getting hit. N between 1000 and 1000000 is pretty typical. Choosing prime N is a good idea.
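To illustrate the prime-N advice, here is a toy simulation, not how any real profiler is implemented. My assumption is that the point of a prime N is to avoid sampling in lockstep with a loop that has a regular period. The loop period of 1000 events and the helper `sample_hits` are made up for the example.

```python
from collections import Counter

def sample_hits(period, n, total_events):
    """Count how many samples land on each position of a repeating loop.

    Event number i is attributed to loop position i % period, and a sample
    is taken on every n-th event (a counter that overflows every n events).
    """
    hits = Counter()
    for event in range(n, total_events + 1, n):
        hits[event % period] += 1
    return hits

LOOP_PERIOD = 1000       # hypothetical: one loop iteration generates 1000 events
TOTAL_EVENTS = 10_000_000

aliased = sample_hits(LOOP_PERIOD, 1000, TOTAL_EVENTS)  # N equals the loop period
spread = sample_hits(LOOP_PERIOD, 1009, TOTAL_EVENTS)   # 1009 is prime

print(len(aliased))  # distinct loop positions sampled: 1
print(len(spread))   # distinct loop positions sampled: 1000
```

With N = 1000 every sample lands on the same spot in the loop, so the profile blames one instruction for everything. With N = 1009 the samples walk through all 1000 positions.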
An approach I sometimes use is to throw a generic profiler at the program, make the program do something that is not fast enough (and that needs to be optimized), look at the profile to identify the function(s) that are too slow, extract them from the big code base, get a good set of input data and run that with {call,cache}grind.
Then you can use the awesome kcachegrind to look at the data (where you can look at different cache misses, branch mispredictions, etc.).
Of course, most of the time, simply running the program in the profiler shows a non-optimal algorithm or terrible allocation patterns, so you don't have to do all that. But I found this approach useful when writing inner loops for numeric computations (and of course, extracting the code is rather easy for this kind of stuff).
And also, this is osx/linux only, sadly.
So perhaps you could use 'tiptop' to get a general view of what might be slow, and then drill down using 'perf top'.