I'd use ASan over Valgrind only for memory leaks. It's faster.
In general, I tend to use ASan for nearly everything I used Valgrind for back in the day; it's faster and usually more precise (Valgrind cannot reliably detect small overflows between stack variables). I reach for Valgrind only if I cannot recompile, or if ASan doesn't find the issue. Callgrind and Cachegrind never; perf does a much better job, much faster. DHAT never; Heaptrack gives me what I want.
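To make the stack-variable point concrete, here's a hypothetical minimal repro (assumes a compiler with ASan support, e.g. a recent GCC or Clang; the file name tiny.c is made up):

```shell
# tiny.c writes one element past a stack array -- an overflow between
# stack variables that Valgrind's memcheck typically misses, because
# it does not place redzones around individual stack variables.
cat > tiny.c <<'EOF'
int main(void) {
    int a[4];
    a[4] = 1;   /* one past the end */
    return 0;
}
EOF

cc -g -O0 -fsanitize=address -o tiny tiny.c
./tiny    # ASan aborts with a stack-buffer-overflow report
```

Running the same binary (built without -fsanitize=address) under Valgrind usually reports nothing, since the write stays inside the valid stack region.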
Valgrind was and is a fantastic tool; it became part of my standard toolkit together with the editor, compiler, debugger and build system. But technology has moved on for me.
But when it was the only option it was fantastically useful.
On Linux, you can easily instrument real cache events using the very powerful perf suite. There is an overwhelming number of events you can instrument (use perf-list(1) to show them), but a simple example could look like this:
    $ perf stat -d -- sh -c 'find ~ -type f -print | wc -l'
    ^Csh: Interrupt

     Performance counter stats for 'sh -c find ~ -type f -print | wc -l':

             47,91 msec task-clock                #    0,020 CPUs utilized
               599      context-switches          #   12,502 K/sec
                81      cpu-migrations            #    1,691 K/sec
               569      page-faults               #   11,876 K/sec
       185.814.947      cycles                    #    3,878 GHz                        (28,71%)
       105.650.405      instructions              #    0,57  insn per cycle             (46,15%)
        22.991.322      branches                  #  479,863 M/sec                      (46,72%)
           643.767      branch-misses             #    2,80% of all branches            (46,14%)
        26.010.223      L1-dcache-loads           #  542,871 M/sec                      (36,80%)
         2.449.173      L1-dcache-load-misses     #    9,42% of all L1-dcache accesses  (29,62%)
           517.052      LLC-loads                 #   10,792 M/sec                      (22,53%)
           133.152      LLC-load-misses           #   25,75% of all LL-cache accesses   (16,02%)

       2,403975646 seconds time elapsed

       0,005972000 seconds user
       0,046268000 seconds sys
Ignore the command; it's just a placeholder to get meaningful values. The -d flag adds basic cache events; adding a second -d also gets you load and load-miss events for the dTLB, iTLB and L1i cache. But as mentioned, you can instrument any event supported by your system, including very obscure events such as uops_executed.cycles_ge_2_uops_exec (cycles where at least 2 uops were executed per thread) or frontend_retired.latency_ge_2_bubbles_ge_2 (retired instructions that are fetched after an interval where the front-end had at least 2 bubble-slots for a period of 2 cycles which was not interrupted by a back-end stall).
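If you only care about a handful of events, you can name them explicitly instead of using the -d presets. A rough sketch (exact event names vary by CPU and kernel version, so check perf-list(1) on your machine; ./my_program is a placeholder):

```shell
# count only the selected cache events for one run of the program
perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses \
    -- ./my_program
```

Selecting fewer events also reduces multiplexing, so the percentages in parentheses (the fraction of time each counter was actually running) get closer to 100%.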
You can also record data using perf-record(1) and inspect it using perf-report(1) or, my personal favorite, the Hotspot tool (https://github.com/KDAB/hotspot).
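A typical record/inspect session looks roughly like this (a sketch; ./my_program is a placeholder, and --call-graph dwarf is one way to get usable call stacks when the target wasn't built with frame pointers):

```shell
perf record --call-graph dwarf -- ./my_program   # writes samples to perf.data
perf report                                      # interactive TUI over perf.data
hotspot perf.data                                # or open the same file in Hotspot
```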
Sorry for hijacking the discussion a little, but I think perf is an awesome little tool and not as widely known as it should be. IMO, when using it as a profiler (perf-record), it is vastly superior to any language-specific built-in profiler. Unfortunately, some languages (such as Python or Haskell) are not a good fit for profiling with perf instrumentation, as their stack frame model does not quite map to the C model.