Hunting down a C memory leak in a Go program

(medium.com)

129 points · xerxes901 · 4y ago · 46 comments

WalterBright 4y ago · 6 in thread

I once had to intercept every call to malloc/free/realloc and log it to find a leak. I wound up turning that into an immensely useful tool.
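
As a rough illustration of the logging approach (not a reconstruction of the tool described above), the sketch below post-processes a log of allocation events and reports what was never freed, grouped by call site. The log format and the call-site strings are hypothetical, invented for this example.

```python
from collections import defaultdict


def find_leaks(log_lines):
    """Return {call_site: (block_count, total_bytes)} for blocks never freed.

    Assumed (hypothetical) log format, one event per line:
        malloc  <addr> <size> <call_site>
        realloc <old_addr> <new_addr> <size> <call_site>
        free    <addr>
    """
    live = {}  # addr -> (size, call_site) for blocks currently allocated
    for line in log_lines:
        parts = line.split()
        if not parts:
            continue
        op = parts[0]
        if op == "malloc":
            _, addr, size, site = parts
            live[addr] = (int(size), site)
        elif op == "realloc":
            _, old, new, size, site = parts
            live.pop(old, None)  # the old block was resized or moved
            live[new] = (int(size), site)
        elif op == "free":
            live.pop(parts[1], None)

    # Whatever is still live at the end of the log was never freed.
    leaks = defaultdict(lambda: [0, 0])
    for size, site in live.values():
        leaks[site][0] += 1
        leaks[site][1] += size
    return {site: tuple(totals) for site, totals in leaks.items()}
```

Grouping by call site is a choice made for this sketch; it turns a long list of addresses into a short list of places to look.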

WalterBright 4y ago

It would also stomp on all memory returned by malloc, and all memory that free free'd. This uncovered amazing numbers of bugs.

Over time, though, it got less and less effective, as I got better and better at avoiding writing such bugs in the first place.

Most bugs I create these days are due to misunderstanding the problem I'm solving, rather than memory misuse.

marivilla 4y ago

If you are the same Walter Bright who invented the D programming language then this comment should win a fun comment award

also there should be an HN comments award show or something

vips7L 4y ago

The one and only Walter Bright.

lostmsu 4y ago

valgrind?

WalterBright 4y ago

I wish. Mine instrumented the source code, not the executable.

linkdd 4y ago

This was my first guess, but according to Wikipedia[0], the author of Valgrind is Julian Seward[1].

  [0] - https://en.wikipedia.org/wiki/Valgrind
  [1] - https://en.wikipedia.org/wiki/Julian_Seward

otterley 4y ago · 3 in thread

Segment learned quite some time ago that confluent-kafka-go has problems like these (and doesn’t support Contexts either), so they wrote a pure Go replacement instead. https://github.com/segmentio/kafka-go

xerxes901 (OP) 4y ago

So, in the interests of full transparency - we at Zendesk are actually running a fork of confluent-kafka-go, which I forked to add, amongst other things, context support: https://github.com/confluentinc/confluent-kafka-go/pull/626

This bug actually happened because I mis-merged upstream into our fork and missed an important call to rd_kafka_poll_set_consumer: https://github.com/zendesk/confluent-kafka-go/commit/6e2d889...

otterley 4y ago

Did you consider Segment’s pure Go library instead? Curious as to why you might have rejected it. (I should also note that Go C bindings impose performance penalties in many cases due to the need to copy memory back and forth.)

EdwardDiego 4y ago

I strongly believe that kafka-go is going to eat Sarama's lunch in a few years. In terms of GitHub stars, it's Sarama, kafka-go, then Confluent's lib.

Incidentally, I'm always surprised at how little Confluent supports librdkafka considering that their Go, Python and C# clients (that I'm aware of) are just wrappers around it.

richardfey 4y ago · 3 in thread

So the root cause was...not reading librdkafka documentation?

masklinn 4y ago

The root cause was bad library design, either incorrect defaults or a bad interface.

You should not have to read the documentation of features you do not need to discover they’ll take down your system if you forget to configure them.

Either the feature should default to a safe innocuous state, or you should not be able to skip its configuration.
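
One way to picture the two options named above is a wrapper that refuses to be constructed without an event handler and keeps its pending queue bounded. This is only a sketch: `EventSource`, `publish`, `poll`, and `max_pending` are hypothetical names, not part of librdkafka or confluent-kafka-go, and dropping events when the queue is full is just one possible policy (blocking or raising are others).

```python
import queue


class EventSource:
    """Hypothetical wrapper; NOT the librdkafka or confluent-kafka-go API."""

    def __init__(self, on_event, max_pending=1000):
        # Configuration cannot be skipped: the caller must say what to do
        # with events, even if that is "discard them".
        if on_event is None:
            raise ValueError("an event handler is required (pass a no-op to discard)")
        self._on_event = on_event
        # Safe default: the queue is bounded, so it cannot grow forever.
        self._pending = queue.Queue(maxsize=max_pending)
        self.dropped = 0

    def publish(self, event):
        """Called by the producer side of the library."""
        try:
            self._pending.put_nowait(event)
        except queue.Full:
            self.dropped += 1  # drop and count instead of leaking memory

    def poll(self):
        """Drain pending events into the handler; return how many ran."""
        handled = 0
        while True:
            try:
                event = self._pending.get_nowait()
            except queue.Empty:
                return handled
            self._on_event(event)
            handled += 1
```

With this shape, forgetting to poll costs at most `max_pending` events' worth of memory, and the `dropped` counter makes the mistake visible.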

richardfey 4y ago

> You should not have to read the documentation of features you do not need to discover they’ll take down your system if you forget to configure them.

Yes, you should totally read the documentation before using a library and then blaming it for a memory leak bug.

At the very least you could read it before launching such an expensive troubleshooting operation.

Taniwha 4y ago

Yeah, it's also a little unfair blaming C for a bug in the Go program.

tialaramex 4y ago · 2 in thread

This goes on a very exciting journey. But, the leak has a notable property that should cause you to reach for a particular tool quite early just in case. The leak is enormous. The program's leak is much larger than the program itself and is in fact triggering the OOM killer. So my first thought (on Linux) would be to reach for my:

https://github.com/tialaramex/leakdice (or there's a Rust rewrite https://github.com/tialaramex/leakdice-rust because I was learning Rust)

leakdice is not a clever, sophisticated tool like valgrind, or eBPF programming, but that's fine because this isn't a subtle problem - it's very blatant - and running leakdice takes seconds so if it wasn't helpful you've lost very little time.

Here's what leakdice does: It picks a random heap page of a running process, which you suspect is leaking, and it displays that page as ASCII + hex.

That's all, and that might seem completely useless, unless you either read Raymond Chen's "The Old New Thing" or you paid attention in statistics class.

Because your program is leaking so badly, the vast majority of heap pages (leakdice counts any pages which are writable and anonymous) are leaked. Any random heap page, therefore, is probably leaked. Now, if that page is full of zero bytes you don't learn very much: it's just leaking blank pages, which is hard to diagnose. But most often you're leaking (as was happening here) something with structure, and very often the sort of engineer assigned to investigate a leak can look at a 4 KB page of structure and go "Oh, I know what that is" from staring at the output in hex + ASCII.
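
The statistical argument can be made concrete with a few lines of arithmetic. The sketch below assumes pages are sampled uniformly and independently; the 90% figure in the usage note is an illustrative assumption, not a measurement from the article.

```python
def p_hit_leak(leaked_fraction, samples=1):
    """Probability that at least one of `samples` random pages is leaked.

    Assumes each sample is drawn uniformly and independently from the
    candidate pages, of which `leaked_fraction` are leaked.
    """
    return 1.0 - (1.0 - leaked_fraction) ** samples
```

If, say, 90% of candidate pages are leaked, a single sample lands on a leaked page 90% of the time, and `p_hit_leak(0.9, 3)` shows that three samples all miss only about once in a thousand tries.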

This isn't a silver bullet, but it's very easy and you can try it in like an hour (not days, or a week) including writing up something like "Alas the leaked pages are empty" which isn't a solution but certainly clarifies future results.
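
For readers who want to see the idea in code, here is a minimal sketch of the same technique. It is not leakdice's actual implementation: it assumes Linux's `/proc/<pid>/maps` text format, a 4096-byte page, and my own reading of "writable and anonymous" (write permission, inode 0, and either no path or `[heap]`). All function names are hypothetical.

```python
import random

PAGE_SIZE = 4096  # assumption; matches the 4 KB pages mentioned above


def candidate_regions(maps_text):
    """Return (start, end) pairs for writable anonymous mappings."""
    regions = []
    for line in maps_text.splitlines():
        fields = line.split(None, 5)  # range perms offset dev inode [path]
        if len(fields) < 5:
            continue
        addr_range, perms, _offset, _dev, inode = fields[:5]
        path = fields[5].strip() if len(fields) > 5 else ""
        if "w" in perms and inode == "0" and path in ("", "[heap]"):
            start, end = (int(x, 16) for x in addr_range.split("-"))
            regions.append((start, end))
    return regions


def pick_random_page(regions, rng=random):
    """Pick one page address uniformly across all pages in all regions."""
    counts = [(end - start) // PAGE_SIZE for start, end in regions]
    index = rng.randrange(sum(counts))  # raises ValueError if no pages
    for (start, _end), count in zip(regions, counts):
        if index < count:
            return start + index * PAGE_SIZE
        index -= count


def read_page(pid, address):
    """Read one page from a live process (needs ptrace-level permission)."""
    with open(f"/proc/{pid}/mem", "rb") as mem:
        mem.seek(address)
        return mem.read(PAGE_SIZE)


def hexdump(data, base=0):
    """Format bytes as address, hex columns, and printable ASCII."""
    lines = []
    for off in range(0, len(data), 16):
        chunk = data[off:off + 16]
        hexpart = " ".join(f"{b:02x}" for b in chunk)
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{base + off:012x}  {hexpart:<47}  {text}")
    return "\n".join(lines)
```

Usage would be along the lines of reading `/proc/<pid>/maps`, calling `candidate_regions`, then `pick_random_page`, `read_page`, and `hexdump`. Picking uniformly over pages rather than over regions matters, because one huge leaking region should be sampled in proportion to its size.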

avinassh 4y ago

This looks quite interesting and I learned something new! Is there any sample output of it? Adding it to the README would also be great.

tialaramex 4y ago

I didn't provide sample output, but I think I should; this is a tool many people would use only once, so it can't hurt to set expectations properly. I will think about how best to do that. There's a README for the original C version, and I should add one to the Rust code.

Note that you can play with it on programs that aren't leaking, it's just that obviously it isn't diagnosing a leak, you're just seeing whatever happened to be in a random page of that process with no reason to expect that page to be representative of anything.

cranekam 4y ago · 2 in thread

Nice write-up! Using BPF to trace malloc/free is a good example of the tool’s power. Unfortunately, IME, this approach doesn’t scale to very high load services. Once you’re calling malloc/free hundreds of thousands of times a second, the overhead of jumping into the kernel every time cripples performance.

It would be great if one could configure the uprobes for malloc/free to trigger one in N times but when I last looked they were unconditional. It didn’t help to have the BPF probe just return early, either — the cost is in getting into the kernel to start with.

However, jemalloc itself has great support for producing heap profiles with low overhead. Allocations are sampled and the stacks leading to them are recorded in much the same way as the linked BPF approach:

https://github.com/jemalloc/jemalloc/wiki/Use-Case:-Heap-Pro...
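
To show why sampling keeps the overhead low while still pointing at the leaking stack, here is a deliberately simplified model. It samples one allocation in N by call count and scales the result back up; jemalloc's real sampler is driven by bytes allocated rather than call count, so treat this as an illustration of the idea, not of jemalloc's algorithm. The class and method names are hypothetical.

```python
import random
from collections import Counter


class SampledHeapProfile:
    """Toy 1-in-N allocation sampler (count-based; a simplification)."""

    def __init__(self, sample_every=100, rng=None):
        self.sample_every = sample_every
        self._rng = rng or random.Random()
        self._sampled = {}  # addr -> (size, stack) for sampled live blocks

    def on_malloc(self, addr, size, stack):
        # Only about 1 in `sample_every` allocations pays the tracking cost.
        if self._rng.randrange(self.sample_every) == 0:
            self._sampled[addr] = (size, stack)

    def on_free(self, addr):
        # Frees of unsampled blocks are a cheap no-op.
        self._sampled.pop(addr, None)

    def estimated_live_bytes(self):
        """Scale sampled sizes by N to estimate live bytes per stack."""
        totals = Counter()
        for size, stack in self._sampled.values():
            totals[stack] += size * self.sample_every
        return dict(totals)
```

A stack that leaks steadily keeps accumulating sampled blocks that are never removed, so it rises to the top of the estimate even though most of its allocations were never recorded.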

kouteiheika 4y ago

> Once you’re calling malloc/free hundreds of thousands of times a second, the overhead of jumping into the kernel every time cripples performance.

Shameless plug in case you (or anyone else) are interested: I wrote a memory profiler for exactly this use case:

https://github.com/koute/bytehound

It's definitely not perfect, but it's relatively fast, has an okay-ish GUI, and it's even scriptable: https://koute.github.io/bytehound/memory_leak_analysis.html

cranekam 4y ago

Interesting! What is the overhead of this? We found that jemalloc's heap profiling had a small (perhaps 1-2%? It's been a while) CPU penalty and, depending on the complexity of the code being profiled and the sample rate, potentially a few hundred MB of extra RAM use on very large, complex binaries. I'd assume the RAM cost is similar given the data is the same (i.e. backtraces).

G3rn0ti 4y ago · 2 in thread

> our application was not actually handling events from that queue, so the size of that queue grew without bound

While new tools are great and I appreciate this nice write-up of how you can use BPF to find memory leaks, I wonder if they could have just guessed the above issue the minute after realizing that Valgrind did not report relevant issues. Actually the program just kept creating objects that were never used. With more context about the offending program, such design issues could be found by the responsible programmers by means of "thinking it through". What I mean is: sometimes complicated tooling distracts you so much from the actual problem that you miss the obvious.

xerxes901 (OP) 4y ago

Believe me, before reaching for this stuff there were several passes of staring at the code trying to find something by “thinking it through” as you say. But at the end of the day, we’re human, and fixing bugs by inspection is just not always a realistic strategy.

Matthias247 4y ago

I think it depends on how much context you have. For apps where I have written most of the code, it's usually easy to think it through and review some code paths. However, the situation is vastly different if you run 90% unfamiliar code, as often happens in larger organizations where everyone has only contributed a tiny part of things. When you run millions of lines of code in production, it's not reasonable to review everything in order to be able to fix it; you need to use hints and tools to narrow down the source of the issue. That can be logs, metrics, debugging tools, profilers, etc.

mmoll 4y ago · 1 in thread

I suspect Valgrind's massif would have helped (massifly). It shows memory usage over time, but also where each fraction of that memory was allocated.

eska 4y ago

My thought as well. I was able to fix a years-old memory leak at a prior company using massif within an hour. It’s a really great suite of tools!

matt123456789 4y ago · 1 in thread

The author describes using eBPF to trace malloc/free refs as a solution to the program properly freeing all heap objects before exiting, which was enlightening to me. Would it have been possible to issue a kill -9 to the program in the middle of execution while using valgrind to see this info as well? Or is it more to the point that eBPF is cleaner and allows you to see many more snapshots of memory allocations while the program is still running?

xerxes901 (OP) 4y ago

Off the top of my head I don't think this would work, because Valgrind needs the atexit hooks to run?

Probably some other signal whose default action is to terminate the program, and which our app isn't handling, might've worked though.

nikanj 4y ago

Author uses jemalloc to confirm malloc allocations are unfreed, then later speculates the allocations might be something not visible to Valgrind, e.g. mmap.

I’ve often made similar mistakes, where data from step 1 already rules out a hypothesis for step 2, but I’m too sleep-deprived and desperate to realize it. Debugging production issues is the worst.

kubb 4y ago

It's insane how much ad hoc engineering and random details like compiler flags were required to get the location where the unfreed memory was allocated. It's likely that an experienced team was on it for several days (unless they already had experience with all the tools used).

It's also crazy how the bug could be tied back to an unbounded queue that was backing up. It seems like the wrapper library should be designed in a way where not handling the queue events is hard to do, meanwhile the experts walked right into that.

jjluoma 4y ago

I wonder if statistics provided by librdkafka (available also with confluent-kafka-go) could have been used to solve the issue with less effort.

https://github.com/edenhill/librdkafka/blob/master/STATISTIC...
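
As a sketch of how such statistics might be watched, the function below checks whether the top-level `replyq` counter keeps rising across successive statistics payloads. To the best of my knowledge, STATISTICS.md describes `replyq` as the number of ops waiting in the queue for the application to serve with rd_kafka_poll(). The function name and the sample payloads in the usage note are made up, and whether this counter would actually have exposed the bug in the article is an assumption.

```python
import json


def replyq_is_growing(stats_json_samples, min_samples=3):
    """True if `replyq` strictly increases across consecutive samples.

    `stats_json_samples` is a list of JSON strings, oldest first, as they
    might be delivered by the client's statistics callback.
    """
    values = [json.loads(sample)["replyq"] for sample in stats_json_samples]
    if len(values) < min_samples:
        return False  # too little data to call it a trend
    return all(later > earlier for earlier, later in zip(values, values[1:]))
```

For example, feeding it `['{"replyq": 10}', '{"replyq": 250}', '{"replyq": 9000}']` returns `True`, which would be a reasonable trigger for an alert that nothing is draining the queue.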

GnarfGnarf 4y ago

On an unrelated note, I am a Zendesk customer and absolutely love the app. Zendesk makes customer support fun!

sam0x17 4y ago

Fun side note: I once had to debug a GC stuttering issue in Crystal, and was delighted to find that the language was so damn open that I could just monkey-patch the actual allocator to print debug information whenever an allocation was made.
