All my favorite tracing tools

(thume.ca)

355 points | trishume | 2y ago | 40 comments

crdavidson 2y ago

I wrote Spall, one of the lightweight profilers mentioned in the post. I loved the author's blog post on implicit in-order forests. It was neat to see someone else's take on trees for big traces, and it pushed me to go way bigger than I was originally planning!

Thankfully, Eytzinger-ordered 4-ary trees work totally fine at 165+ fps, even at 3+ billion functions, but I like to read back through that post once in a while just in case I hit that perf wall someday.
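
For readers unfamiliar with the term: an Eytzinger layout stores a tree breadth-first in one flat array, so child and parent positions come from index arithmetic rather than pointers. A minimal sketch for a 4-ary tree follows. It is not Spall's code; the function names and the choice to store a per-node maximum (say, the longest duration under that node) are assumptions made for illustration.

```python
K = 4  # branching factor


def children(i):
    """Indices of the K children of node i in breadth-first (Eytzinger) order."""
    return range(K * i + 1, K * i + K + 1)


def parent(i):
    """Index of the parent of node i (i must be > 0)."""
    return (i - 1) // K


def build_max_tree(leaves):
    """Pack a K-ary tree into one flat list.

    Leaves sit at the end of the list, padded with zeros up to a power of K
    (so values are assumed non-negative). Each internal node holds the
    maximum of its children.
    """
    n_leaves = 1
    while n_leaves < len(leaves):
        n_leaves *= K
    n_internal = (n_leaves - 1) // (K - 1)
    tree = [0] * n_internal + list(leaves) + [0] * (n_leaves - len(leaves))
    # Fill internal nodes bottom-up so children are ready before their parent.
    for i in range(n_internal - 1, -1, -1):
        tree[i] = max(tree[c] for c in children(i))
    return tree
```

With a layout like this, a viewer can answer "what is the longest event in this range?" by reading a handful of internal nodes instead of scanning every leaf.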

Working on timestamp delta-compression at the moment to pack events into much smaller spaces, and hopefully get to 10 billion in 128 GB RAM sometime soon (at least for native builds of Spall).
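
One common way to do timestamp delta-compression is to store the difference from the previous timestamp as a variable-length integer. The sketch below assumes sorted, non-negative integer timestamps and uses unsigned LEB128 varints; it is a generic illustration, not Spall's actual format, and all function names are made up here.

```python
def encode_varint(n):
    """Unsigned LEB128: 7 payload bits per byte, high bit set on all but the last byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # final byte
            return bytes(out)


def compress_timestamps(timestamps):
    """Encode sorted, non-negative integer timestamps as varint deltas."""
    out = bytearray()
    prev = 0
    for ts in timestamps:
        out += encode_varint(ts - prev)
        prev = ts
    return bytes(out)


def decompress_timestamps(data):
    """Inverse of compress_timestamps."""
    timestamps, prev, value, shift = [], 0, 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += value
            timestamps.append(prev)
            value, shift = 0, 0
    return timestamps
```

Because consecutive events in a trace tend to be close together in time, most deltas fit in one or two bytes instead of a fixed eight.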

Thanks for the kick to keep on pushing!

criddell 2y ago

If you work on Windows applications, check out Event Tracing for Windows (ETW). The best place to start is Bruce Dawson’s blog:

https://randomascii.wordpress.com/2015/09/24/etw-central/

yakubin 2y ago

In my opinion, the best way to interact with ETW is through DTrace. Microsoft’s GUIs like WPA-Xperf are so buggy and unreliable that using them feels utterly futile. DTrace on Windows on the other hand is very usable.

maccard 2y ago

If you're working with ETW traces, SuperLuminal [0] (no affiliation just a happy customer) is leaps and bounds ahead of the built-in ETW viewer.

[0] https://superluminal.eu/

RicoElectrico 2y ago

Isn't ETW a total trainwreck from a developer usability standpoint? Or so my colleagues (and the interwebs) tell me.

criddell 2y ago

I don’t think so. The ability to keep drilling down deeper and deeper and the amazing sort and grouping functionality make it supremely useful.

The other tool I didn’t mention is WinDbg. In my opinion, it’s the greatest debugger on any platform.

RicoElectrico 2y ago

Finally found it: https://caseymuratori.com/blog_0025

Veserv 2y ago

A pretty good overview of open source solutions in the space.

It misses one of the most useful areas for tracing, though: time travel debugging. There are a number of interesting solutions there taking advantage of hardware trace, instrumentation, and deterministic replay. Even better is full visualization integration: you can zoom in from a multi-minute trace onto a suspicious 200 ns function, double-click it, and step back to that exact point in your program with memory fully reconstructed as of that time, so you can debug from there.

trishume (OP) 2y ago

Do you know of anyone who's built that kind of time travel debugging with a trace visualization in the open outside of Javascript? I know about rr and Pernosco but don't know of trace visualization integration for either of them, that would indeed be very cool. I definitely dream of having systems like this.

mark_undoio 2y ago

At undo.io we're interested in using our time travel capability beyond conventional time travel debugging - a recording file contains everything the program did, without any advance knowledge of what you need to sample, so there's a lot of potential to get other data out of it.

I just read your post and don't think it would take much to integrate with some of the visualisations you posted about, as a first step.

We've played around in the past with a sampling profiler (code here, requires a copy of our product to be useful though it could easily be ported to rr): https://github.com/undoio/addons/tree/master/sample_function... which can output in a format understood by Brendan Gregg's flame graphs (https://www.brendangregg.com/flamegraphs.html)
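
For context, the text format that Brendan Gregg's flamegraph.pl reads is "folded stacks": one line per unique stack, frames joined by semicolons from outermost to innermost, followed by a sample count. A minimal sketch of producing it from raw samples is below; `fold_stacks` and the sample data are hypothetical and this is not the Undo addon's code.

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse stack samples (outermost frame first) into 'a;b;c count' lines."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]


samples = [
    ["main", "parse", "read"],
    ["main", "parse", "read"],
    ["main", "render"],
]
for line in fold_stacks(samples):
    print(line)
```

The printed lines can be saved to a file and passed to flamegraph.pl to render an SVG.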

But that's not quite the kind of tracing you're talking about. We also built a printf-style interface to our recording files, which seems closer: https://docs.undo.io/PostFailureLogging.html

Something like that but outputting trace events that can be consumed by Perfetto (say) would not be so hard to add. If we considered modifying the core record/replay engine then even more powerful things become possible.
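
As a rough idea of what "trace events that can be consumed by Perfetto" could look like: the Perfetto UI can open the Chrome Trace Event JSON format, where a complete span is an event with `"ph": "X"`, a start `ts`, and a `dur`, both in microseconds. The sketch below is an illustration only; `to_trace_json` and the span tuples are assumptions, not anything Undo ships.

```python
import json


def to_trace_json(spans, pid=1, tid=1):
    """Convert (name, start_us, duration_us) tuples into Chrome Trace Event JSON."""
    events = [
        {"name": name, "ph": "X", "ts": start_us, "dur": dur_us, "pid": pid, "tid": tid}
        for name, start_us, dur_us in spans
    ]
    return json.dumps({"traceEvents": events})


# Two hypothetical function spans reconstructed from a recording.
print(to_trace_json([("parse", 0, 120), ("render", 150, 40)]))
```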

inetknght 2y ago

I've seen undo.io several times at CppCon. I've been thoroughly impressed with the demonstrations at the conference and came to this thread specifically to recommend undo.io. I was particularly impressed this year by a demonstration of debugging stack smashing: I recently worked around stack smashing in protobuf which happens before `main()` even starts. It seems perfect for undo.io to help debug :)

I'm still waiting on the keyserver to be able to run in Kubernetes though

Veserv 2y ago

Green Hills Software TimeMachine + History for C and C++: https://www.ghs.com/products/MULTI_IDE.html

No particularly good publicly visible documentation of the functionality, but it does that and is a publicly purchasable product.

They also had TimeMachine + PathAnalyzer from the early 2000s, which was a time travel debugging solution with visualization, but the two were only about as integrated as most of the solutions you see floating around today.

deutschepost 2y ago

Tomorrow Corp does something like this in a variant of C++. But I am not sure it’s very open.

https://youtu.be/72y2EC5fkcE

hibbelig 2y ago

Is there a time traveling debugging solution for Java?

byefruit 2y ago

I think undo have one: https://undo.io/products/java

Veserv 2y ago

Not that I am aware of. They phase in and out of existence every so often because developing the technology is expensive and requires constant maintenance, but nobody wants to pay for tools so they never catch on with enough resources to stay maintained.

mark_undoio 2y ago

As byefruit says above - we (undo.io) sell a Java Time Travel Debugger.

If anybody wants to try it, they should get in touch with us.

Our Java tech is based on an underlying record/replay engine that works at the level of machine instructions / syscalls to record the entire process. On top of that we've added the necessary cleverness to show what that means at Java level (so normal source-level debugging works).

That's different to e.g. Chronon, which I think was a pure Java solution: https://blog.jetbrains.com/idea/2014/03/try-chronon-debugger... It had some flexibility (e.g. only record certain classes) but at the cost of quite considerable slowdown and very large storage requirements.

rubyissimo 2y ago

how long can you time-travel?

is this something like https://www.reddit.com/r/ruby/comments/15o9hc1/timetraveling... ?

mark_undoio 2y ago

Conceptually similar in that you can decide after-the-fact what state you want to see.

But Time Travel Debugging applies that to everything in the program, not just log statements - all function calls, variables, memory locations, etc can be reconstructed after the fact without having to log them explicitly.

mark_undoio 2y ago

Oh, and regarding how long - it depends how long it takes to fill the circular buffer of non-deterministic behaviour.

Serious compute-bound workloads can run for days with a gigabyte of non-deterministic event log. Serious IO-bound workloads burn through it much faster.

For a rule of thumb, think of it consuming a few MB per second, so the length of the time travel is limited by how much of that you can store.
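
The rule of thumb is just buffer size divided by logging rate. A small worked sketch, where the 3 MB/s rate and the two-day window are assumed stand-ins for "a few MB per second" and "days", not measured figures:

```python
GB = 1024 ** 3
MB = 1024 ** 2


def replay_window_seconds(buffer_bytes, log_rate_bytes_per_s):
    """Seconds of history a circular event log holds before it wraps."""
    return buffer_bytes / log_rate_bytes_per_s


def implied_rate_bytes_per_s(buffer_bytes, window_seconds):
    """Average logging rate that fills the buffer in exactly window_seconds."""
    return buffer_bytes / window_seconds


# Assumed rate of 3 MB/s: a 1 GB log covers about 341 s, under six minutes.
io_heavy = replay_window_seconds(1 * GB, 3 * MB)

# A 1 GB log lasting two days implies an average of roughly 6 KB/s.
compute_bound = implied_rate_bytes_per_s(1 * GB, 2 * 24 * 3600)
```

The gap between those two numbers (about three orders of magnitude) is why the answer to "how long" depends so heavily on the workload.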

jeffrallen 2y ago

The author mentions dtrace in passing. If you're into "load bearing rants", check out bcantrill's recent rant on bpftrace silently losing events and why dtrace won't do that.

trishume (OP) 2y ago

I haven't actually used bpftrace myself, only BCC. I can totally imagine it being more janky than DTrace; BCC is pretty janky even if I also think it's cool. In my eBPF tracing framework I had to add special handling counters to alert you if it ever lost any events, so it's plausible bpftrace didn't do that.
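
The general idea of a lost-event counter can be sketched in plain Python: when a bounded buffer is full, count the drop and report the count to the consumer instead of discarding silently. This is a user-space illustration only, not the author's eBPF framework; `EventBuffer` and its methods are hypothetical names.

```python
from collections import deque


class EventBuffer:
    """Fixed-capacity buffer that counts the events it had to drop."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()
        self.dropped = 0

    def push(self, event):
        """Store the event, or count it as dropped if the buffer is full."""
        if len(self.events) >= self.capacity:
            self.dropped += 1
            return False
        self.events.append(event)
        return True

    def drain(self):
        """Return buffered events plus the number dropped since the last drain."""
        events, dropped = list(self.events), self.dropped
        self.events.clear()
        self.dropped = 0
        return events, dropped
```

A consumer that sees a non-zero drop count knows the trace has gaps and can warn the user rather than present incomplete data as complete.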

kqr 2y ago

I think if you're working mostly with tracing/sampling specific applications you'll be more of a BCC person, while if you're hired to diagnose problems in a wide variety of applications then you might learn to like bpftrace more.

danobi 2y ago

What kind of events were being lost, and under what conditions? I'd like to see if it can be fixed.

kristjansson 2y ago

The "you can feel like lights flickering on" one?

Always_Anon 2y ago

DTrace is a generation behind eBPF. There's a reason why the tracing community has moved on to eBPF and is no longer interested in DTrace.

bcantrill 2y ago

That's an absurd comment: eBPF and DTrace exist on orthogonal systems, and most using eBPF have never even used DTrace, let alone "moved on" from it. The systems are really quite different, and have different design centers; for the use case of instrumenting the system for purposes of understanding it, there are many regards in which eBPF remains behind DTrace -- one of which I elaborated on in the rant to which the parent is referring.[0]

[0] https://www.youtube.com/watch?v=mqvVmYhclAg#t=12m25s

Always_Anon 2y ago

Just because it stings doesn't make it absurd

>eBPF and DTrace exist on orthogonal systems

That was true 15 years ago. eBPF and DTrace exist on some of the same systems now, Linux and Windows.

>and most using eBPF have never even used DTrace, let alone "moved on" from it

The performance and tracing groups at Microsoft certainly have. Same with Oracle, Netflix, among others.

>The systems are really quite different, and have different design centers; for the use case of instrumenting the system for purposes of understanding it

True, but unfortunately for DTrace, it is too late. Oracle should have done this years ago. Now Linux has a more powerful tracer builtin, eBPF, and it would be a backwards step to switch the kernel code to DTrace. [0]

[0] From the man that wrote the books: https://news.ycombinator.com/item?id=16377141

kqr 2y ago

> I wanted to correlate packets with userspace events from a Python program, so I used a fun trick: Find a syscall which has an early-exit error path and bindings in most languages, and then trace calls to that which have specific arguments which produce an error.

Wow. This is some great engineering. Obviously that's what you'd do, but I'd never think of it in a thousand years!
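
A minimal sketch of the quoted trick from the Python side, assuming `lseek` on an invalid file descriptor as the marker syscall. That choice is an assumption for illustration and may not be the syscall the post used; `emit_marker` and `MARKER_FD` are hypothetical names.

```python
import errno
import os

MARKER_FD = -1  # invalid descriptor: the kernel rejects the call almost immediately


def emit_marker(event_id):
    """Issue a deliberately failing lseek whose offset argument carries event_id.

    A syscall tracer can filter for lseek calls with fd == -1 and read the
    offset argument to recover event_id, alongside a kernel timestamp.
    """
    try:
        os.lseek(MARKER_FD, event_id, os.SEEK_SET)
    except OSError as exc:
        return exc.errno  # expected: EBADF
    return None
```

Because the call fails on the early-exit path, it costs little, and the same pattern works from any language that exposes the syscall.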

ElijahLynn 2y ago

What a great way to recruit! The ending pitch to join Tristan at Anthropic, if I were competent enough in this area, is very alluring! Tristan does a great job covering the content about the types of things one would be working on.

p.s. I think the blog post could use more screengrabs of the traces. Great first pass at it though, and screengrabs can be added over time!

felixrieseberg 2y ago

I wish the industry had a better answer for deterministically profiling the execution cost of JavaScript. Attempts were made in Chromium by hooking into Linux perf, but that change has since been removed.

If anyone has any tips on how to trace JavaScript (not just profile by time, but deterministically measure the cost of it in CI), I'd love to hear them!

zubairq 2y ago

Some great tools in here, thanks!
