Over time, though, it got less and less effective, as I got better and better at avoiding writing such bugs in the first place.
Most bugs I create these days are due to misunderstanding the problem I'm solving, rather than memory misuse.
also there should be an HN comments award show or something
[0] - https://en.wikipedia.org/wiki/Valgrind
[1] - https://en.wikipedia.org/wiki/Julian_Seward
This bug actually happened because I mis-merged upstream into our fork and missed an important call to rd_kafka_poll_set_consumer: https://github.com/zendesk/confluent-kafka-go/commit/6e2d889...
Incidentally, I'm always surprised at how little Confluent supports librdkafka considering that their Go, Python and C# clients (that I'm aware of) are just wrappers around it.
You should not have to read the documentation of features you do not need to discover they’ll take down your system if you forget to configure them.
Either the feature should default to a safe innocuous state, or you should not be able to skip its configuration.
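To make the "safe default" idea concrete, here is a minimal sketch of an event queue that is bounded unless you say otherwise. EventQueue, max_events, publish and poll are hypothetical names made up for illustration; this is not the API of librdkafka or any of its wrappers, and dropping the oldest event is just one possible policy.

```python
from collections import deque


class EventQueue:
    """Hypothetical event queue, bounded by default (not a real client API)."""

    def __init__(self, max_events=1000):
        if max_events <= 0:
            raise ValueError("max_events must be positive")
        # Safe default: a caller who never polls cannot grow this without limit.
        self._events = deque(maxlen=max_events)
        self.dropped = 0  # visible counter, so silent loss can be noticed

    def publish(self, event):
        """Add an event; when the queue is full the oldest event is discarded."""
        if len(self._events) == self._events.maxlen:
            self.dropped += 1
        self._events.append(event)

    def poll(self):
        """Return the oldest pending event, or None if nothing is queued."""
        return self._events.popleft() if self._events else None
```

With this shape, forgetting to call poll() costs you old events and a growing dropped counter rather than all of the machine's memory. Whether dropping is acceptable depends on what the events mean, which is exactly the kind of decision an API can force the caller to make up front.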
Yes, you should totally read the documentation before using a library and then blaming it for a memory leak bug.
At the very least you could read it before launching such an expensive troubleshooting operation.
https://github.com/tialaramex/leakdice (or there's a Rust rewrite https://github.com/tialaramex/leakdice-rust because I was learning Rust)
leakdice is not a clever, sophisticated tool like valgrind, or eBPF programming, but that's fine because this isn't a subtle problem - it's very blatant - and running leakdice takes seconds so if it wasn't helpful you've lost very little time.
Here's what leakdice does: It picks a random heap page of a running process, which you suspect is leaking, and it displays that page as ASCII + hex.
That's all, and that might seem completely useless, unless you either read Raymond Chen's "The Old New Thing" or you paid attention in statistics class.
Because your program is leaking so badly, the vast majority of heap pages (leakdice counts any pages which are writable and anonymous) are leaked. Any random heap page, therefore, is probably leaked. Now, if that page is full of zero bytes you don't learn very much: it's just leaking blank pages, which is hard to diagnose. But most often you're leaking (as was happening here) something with structure, and very often the sort of engineer assigned to investigate a leak can look at a 4 KiB page of structure and go "Oh, I know what that is" from staring at the output in hex + ASCII.
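For anyone who wants to see the idea rather than the tool, here is a rough Python sketch of the same approach. It is not how leakdice itself is written. It assumes Linux, 4 KiB pages, and permission to read /proc/<pid>/mem (same user and a permissive ptrace setting, or root). Treating "inode 0 and no path, or the [heap] label" as "anonymous" is my simplification, and all the function names are made up.

```python
import random

PAGE_SIZE = 4096  # assumption: 4 KiB pages


def candidate_ranges(maps_text):
    """Return (start, end) for writable, anonymous mappings in /proc/<pid>/maps text."""
    ranges = []
    for line in maps_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        perms, inode = fields[1], fields[4]
        path = fields[5] if len(fields) > 5 else ""
        # Simplified test for "anonymous": no backing file, or the [heap] label.
        if "w" in perms and inode == "0" and path in ("", "[heap]"):
            start, end = (int(x, 16) for x in fields[0].split("-"))
            ranges.append((start, end))
    return ranges


def pick_page(ranges, rng=random):
    """Pick one page address uniformly at random across all candidate pages."""
    total = sum((end - start) // PAGE_SIZE for start, end in ranges)
    index = rng.randrange(total)  # raises ValueError if there are no pages
    for start, end in ranges:
        count = (end - start) // PAGE_SIZE
        if index < count:
            return start + index * PAGE_SIZE
        index -= count


def hexdump(data, base=0):
    """Format bytes as address, hex and printable ASCII, 16 bytes per row."""
    rows = []
    for offset in range(0, len(data), 16):
        chunk = data[offset:offset + 16]
        hexpart = " ".join(f"{b:02x}" for b in chunk)
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        rows.append(f"{base + offset:016x}  {hexpart:<47}  {text}")
    return "\n".join(rows)


def dump_random_page(pid):
    """Read one random candidate page from a running process and hexdump it."""
    with open(f"/proc/{pid}/maps") as f:
        ranges = candidate_ranges(f.read())
    addr = pick_page(ranges)
    # Unbuffered, so exactly one page-sized read is issued at that address.
    with open(f"/proc/{pid}/mem", "rb", buffering=0) as mem:
        mem.seek(addr)
        return hexdump(mem.read(PAGE_SIZE), base=addr)
```

Calling print(dump_random_page(pid)) a handful of times against the suspect process is the whole workflow. If most of the heap is leaked, most of the dumps should show the leaked thing.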
This isn't a silver bullet, but it's very easy and you can try it in like an hour (not days, or a week) including writing up something like "Alas the leaked pages are empty" which isn't a solution but certainly clarifies future results.
Note that you can play with it on programs that aren't leaking, it's just that obviously it isn't diagnosing a leak, you're just seeing whatever happened to be in a random page of that process with no reason to expect that page to be representative of anything.
It would be great if one could configure the uprobes for malloc/free to trigger one in N times but when I last looked they were unconditional. It didn’t help to have the BPF probe just return early, either — the cost is in getting into the kernel to start with.
However, jemalloc itself has great support for producing heap profiles with low overhead. Allocations are sampled and the stacks leading to them are recorded in much the same way as the linked BPF approach:
https://github.com/jemalloc/jemalloc/wiki/Use-Case:-Heap-Pro...
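To illustrate why sampling keeps the overhead low, here is a toy model of byte-based allocation sampling in Python. It is only a sketch of the general idea, not jemalloc's implementation. The class, its method names and the default interval are assumptions for illustration, and the calling function's name stands in for a full stack trace.

```python
import random
import traceback
from collections import Counter


class SampledHeapProfile:
    """Toy model of sampled heap profiling (not jemalloc's implementation)."""

    def __init__(self, sample_interval_bytes=512 * 1024, rng=None):
        self.interval = sample_interval_bytes
        self.rng = rng or random.Random()
        self.live = {}  # allocation id -> name of the function that allocated it
        self._bytes_until_sample = self._draw()

    def _draw(self):
        # Random gap that averages one sample per `interval` bytes allocated.
        return self.rng.expovariate(1.0 / self.interval)

    def on_alloc(self, alloc_id, size):
        self._bytes_until_sample -= size
        if self._bytes_until_sample > 0:
            return  # common, cheap path: nothing is recorded
        self._bytes_until_sample = self._draw()
        # Expensive path, taken rarely: record who called on_alloc.
        caller = traceback.extract_stack(limit=2)[0]
        self.live[alloc_id] = caller.name

    def on_free(self, alloc_id):
        # Freed allocations leave the profile, so only live memory remains.
        self.live.pop(alloc_id, None)

    def report(self):
        """Call sites of sampled allocations still live, most frequent first."""
        return Counter(self.live.values()).most_common()
```

Most allocations only pay for a subtraction and a comparison. A site that leaks keeps accumulating entries in the report, while sites that free what they allocate drop out again.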
Shameless plug in case you (or anyone else) are interested: I wrote a memory profiler for exactly this use case:
https://github.com/koute/bytehound
It's definitely not perfect, but it's relatively fast, has an okay-ish GUI, and it's even scriptable: https://koute.github.io/bytehound/memory_leak_analysis.html
While new tools are great and I appreciate this nice write-up of how you can use BPF to find memory leaks, I wonder if they could have just guessed the above issue the minute after realizing that Valgrind did not report relevant issues. Actually the program just kept creating objects that were never used. With more context about the offending program, such design issues could be found by the responsible programmers by means of "thinking it through". What I mean is: sometimes complicated tooling distracts you so much from the actual problem that you miss the obvious.
Some other signal whose default action is to terminate the program, and which our app isn't handling, probably would've worked though.
I’ve often made similar mistakes, where data from step 1 already rules out a hypothesis for step 2 - but I’m too sleep-deprived and desperate to realize it. Debugging production issues is the worst.
It's also crazy how the bug could be tied back to an unbounded queue that was backing up. It seems like the wrapper library should be designed in a way where not handling the queue events is hard to do, meanwhile the experts walked right into that.
https://github.com/edenhill/librdkafka/blob/master/STATISTIC...