Modern processors will look hundreds of instructions into the future and try to start executing them as soon as possible. Branches are predicted far in advance of when they can actually be evaluated. Many instructions can be executing simultaneously. A clean, tidy flame graph showing 1-3ns slices (~5 cycles) cannot help but be a vast simplification of what the CPU is really doing.
The linked page about Processor Trace says this:
> instruction data (control flow) is perfectly accurate but timing information is less accurate
The article mentions using magic-trace to detect changes in inlining decisions made by the compiler. This is a case where it will shine, since PT can perfectly capture the control flow, and it doesn't necessarily rely on having perfect timestamps for everything.
Anyway, I wanted to say how much I appreciate your comment from 10 years ago. I'm also a parser nerd and a performance nerd, and I feel strongly that programmers have a professional responsibility to write code in a way that expresses our intent with a logical minimum of instructions/work. I strongly suspect that this will become important again in the future, not because the ratio of software efficiency to hardware power decreases again, but because climate concerns will drive us to measure our code in performance-per-watt rather than performance-per-dollar (depending on what action is taken on carbon pricing, it may be a distinction without a difference).
I look forward to the day when grossly inefficient software is rightly considered to be as unacceptable as grossly inefficient SUVs, and people in our profession are forced to take responsibility for the damage that their obscenely inefficient crap is doing. I hope Python 4 comes with a snorkel.
Even if we imagine there existed some visualization that could more accurately represent the complexity of a core, I don't know how it would be possible to get the data, because AFAIK there are no methods to trace execution on modern processors at higher fidelity than this.
Even sampling profilers have similar issues with being limited to the model of sequential instruction streams, since each sample gives a single program counter, not the full view of everything the core has in flight.
I also agree that sampling profilers have the same issue: instruction-level views of sampling profiles should be taken with a grain of salt.
My concern is that flame graphs with 1-3ns of resolution are presented as a selling point of the tool, without any mention of the caveats around how this model really breaks down at that time scale. I would like to know more about how the PT data actually relates to out-of-order execution. Does a branch's timestamp correspond to when that branch was retired? Do we actually know what the timestamp corresponds to, or is it not well specified? Are there cases where the timestamp is known to be misleading about the true bottleneck?
I don't know the answers to these questions, but when I see a tool like this, I really want more information about the strengths and limitations of the data.
Sure, the thinnest slices at the highest zoom are going to be misleading. They're also not what you generally want to be looking at (though they may provide context to help you identify which parts of higher-level functions are taking a long time, or hints about cache contention, etc.).
Well, we all know. We use systems like that daily.
From the link: "it needs a post-Skylake Intel processor" https://en.wikipedia.org/wiki/Skylake_(microarchitecture)
The man page has a description: https://www.man7.org/linux/man-pages/man1/perf-intel-pt.1.ht...
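Per that man page, a minimal perf-based PT session looks roughly like this (a sketch, not verified here; it assumes a PT-capable Intel CPU, a recent perf build, and `./a.out` standing in for your program):

```shell
# Record a userspace-only Processor Trace; the cyc=1 config term
# requests cycle-accurate (CYC) timing packets
perf record -e intel_pt/cyc=1/u -- ./a.out

# Decode the trace into a per-instruction listing
# (disassembly requires the XED library to be available to perf)
perf script --insn-trace --xed

# Or synthesize only branch events with timestamps
perf script --itrace=b
```

Note that even with cyc=1, the decoder is reconstructing timing from periodic packets, which is part of why the control flow is exact while the timestamps are approximate.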
Don't know who Lauterbach in Germany are, but they have a training manual that goes back to 1989.
https://www2.lauterbach.com/pdf/training_ipt_trace.pdf
https://www2.lauterbach.com/pdf/trace_intel_pt.pdf
https://www2.lauterbach.com/pdf/debugger_x86.pdf
From their manuals: "The Intel® Processor Trace (IPT) works similar to the LBR and BTS feature of Ix86 based cores (see "CPU specific Onchip Trace Commands" (debugger_x86.pdf))."
I knew about the debug trace on ARM CPUs like the RPi, but didn't know Intel had one. It's been suggested AMD doesn't have one, so there might be some security reason for that. It depends on whether the trace is output-only, or whether it's possible to use things like SOIC clips to alter bits and bytes in real time (albeit at slowed-down, not normal, CPU clock speeds).
They make a lot of debuggers for embedded targets, often with tracing capability and such. Really good tools.