The problem with Callgrind is, quoting the Cachegrind manual (http://valgrind.org/docs/manual/cg-manual.html#branch-sim):
> Cachegrind simulates branch predictors intended to be typical of mainstream desktop/server processors of around 2004.
In other words, the data produced by Callgrind may be suitable for finding obvious regressions, but it can still miss regressions that only show up on more modern CPUs.