I'd use ASan over Valgrind only for memory leaks. It's faster.
In general, I tend to use ASan for nearly everything I used Valgrind for back in the day; it's faster and usually more precise (Valgrind cannot reliably detect small overflows between stack variables). I reach for Valgrind only if I cannot recompile, or if ASan doesn't find the issue. Callgrind and Cachegrind never; perf does a much better job, much faster. DHAT never; Heaptrack gives me what I want.
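To make the stack-variable point concrete, here's a hypothetical minimal repro (assumes a compiler with ASan support, e.g. a recent GCC or Clang; the file name tiny.c is made up):

```shell
# tiny.c writes one element past a stack array -- an overflow between
# stack variables that Valgrind's memcheck typically misses, because
# it does not place redzones around individual stack variables.
cat > tiny.c <<'EOF'
int main(void) {
    int a[4];
    a[4] = 1;   /* one past the end */
    return 0;
}
EOF

cc -g -O0 -fsanitize=address -o tiny tiny.c
./tiny    # ASan aborts with a stack-buffer-overflow report
```

Running the same binary (built without -fsanitize=address) under Valgrind usually reports nothing, since the write stays inside the valid stack region.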
Valgrind was and is a fantastic tool; it became part of my standard toolkit together with the editor, compiler, debugger and build system. But technology has moved on for me.
But when it was the only option it was fantastically useful.
On Linux, you can easily instrument real cache events using the very powerful perf suite. There is an overwhelming number of events you can instrument (use perf-list(1) to show them), but a simple example could look like this:
    $ perf stat -d -- sh -c 'find ~ -type f -print | wc -l'
    ^Csh: Interrupt

     Performance counter stats for 'sh -c find ~ -type f -print | wc -l':

             47,91 msec task-clock                #    0,020 CPUs utilized
               599      context-switches          #   12,502 K/sec
                81      cpu-migrations            #    1,691 K/sec
               569      page-faults               #   11,876 K/sec
       185.814.947      cycles                    #    3,878 GHz                        (28,71%)
       105.650.405      instructions              #    0,57  insn per cycle             (46,15%)
        22.991.322      branches                  #  479,863 M/sec                      (46,72%)
           643.767      branch-misses             #    2,80% of all branches            (46,14%)
        26.010.223      L1-dcache-loads           #  542,871 M/sec                      (36,80%)
         2.449.173      L1-dcache-load-misses     #    9,42% of all L1-dcache accesses  (29,62%)
           517.052      LLC-loads                 #   10,792 M/sec                      (22,53%)
           133.152      LLC-load-misses           #   25,75% of all LL-cache accesses   (16,02%)

       2,403975646 seconds time elapsed

       0,005972000 seconds user
       0,046268000 seconds sys
Ignore the command; it's just a placeholder to get meaningful values. The -d flag adds basic cache events; adding a second -d also gets you load and load-miss events for the dTLB, iTLB and L1i cache. But as mentioned, you can instrument any event supported by your system, including very obscure events such as uops_executed.cycles_ge_2_uops_exec (cycles where at least 2 uops were executed per thread) or frontend_retired.latency_ge_2_bubbles_ge_2 (retired instructions that are fetched after an interval where the front-end had at least 2 bubble-slots for a period of 2 cycles which was not interrupted by a back-end stall).
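If you only care about a handful of events, you can name them explicitly instead of using the -d presets. A rough sketch (exact event names vary by CPU and kernel version, so check perf-list(1) on your machine; ./my_program is a placeholder):

```shell
# count only the selected cache events for one run of the program
perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses \
    -- ./my_program
```

Selecting fewer events also reduces multiplexing, so the percentages in parentheses (the fraction of time each counter was actually running) get closer to 100%.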
You can also record data using perf-record(1) and inspect it using perf-report(1) or, my personal favorite, the Hotspot tool (https://github.com/KDAB/hotspot).
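A typical record/inspect session looks roughly like this (a sketch; ./my_program is a placeholder, and --call-graph dwarf is one way to get usable call stacks when the target wasn't built with frame pointers):

```shell
perf record --call-graph dwarf -- ./my_program   # writes samples to perf.data
perf report                                      # interactive TUI over perf.data
hotspot perf.data                                # or open the same file in Hotspot
```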
Sorry for hijacking the discussion a little, but I think perf is an awesome little tool and not as widely known as it should be. IMO, when using it as a profiler (perf-record), it is vastly superior to any language-specific built-in profiler. Unfortunately, some languages (such as Python or Haskell) are not a good fit for profiling with perf instrumentation, as their stack frame model does not quite map to the C model.