My problem is likely in one of our addons, but this kind of debugging, this whole genre of problem solving is entirely beyond me. How do I get to this level? What do I need to learn? To study?
It's just a little depressing to read something like this and see how far the road ahead goes, despite how far I've already traveled...
Tools: valgrind and gdb are obvious. But don't forget your compiler! Crank up the warnings, and look through LLVM-clang's -fsanitize=<foo> and warning options. (Also, if you're already on OpenBSD, check out the "S" flag to malloc; if you're on Solaris, check out, well, the blog post.) Finally, Boehm's conservative garbage collector has a "find memory leaks" mode, which looks useful for those cases where you can't get valgrind working. If all else fails, shovel through the memory dump looking for repeated patterns.
Testing: try to reproduce the problem; the first iteration may look something like "it runs out of memory after 36 hours". Then simplify: for instance, the author of the article could have asked "does this still happen if the server closes the connection immediately, without sending any data" and would have found the bug very quickly. (Of course, you're likely to ask a lot of wrong questions before hitting on the right one; experience and a full knowledge of the system you're working on are useful but not sufficient.) Questions like "does this happen more quickly if we ping 100 times per second instead of once every ten minutes" are often useful as well. (Finally, just printing memory usage every N seconds is helpful.)
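To make the "print memory usage every N seconds" idea concrete, here is a minimal sketch in C. It assumes a POSIX system with getrusage(); the function names peak_rss and log_memory_usage are made up for illustration. Note that ru_maxrss is a high-water mark (kilobytes on Linux, bytes on macOS), so with a leak you would expect it to keep climbing rather than level off.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/* Peak resident set size of this process so far.
 * Assumption: units are whatever the platform reports
 * (kilobytes on Linux, bytes on macOS).
 * Returns -1 if getrusage() fails. */
long peak_rss(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_maxrss;
}

/* Write one line to `out`. Call this from a timer or from the
 * main loop every N seconds and watch for steady growth. */
void log_memory_usage(FILE *out)
{
    assert(out != NULL);
    fprintf(out, "peak RSS: %ld\n", peak_rss());
    fflush(out);
}
```

Calling log_memory_usage(stderr) from the server's existing periodic tick is usually enough to see whether a simplified test case still leaks.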
Coding: be careful when writing code. The usual ways of improving code quality (e.g. code reviews) work to reduce memory leaks, too. Try to run a multiple-hour soak test every so often during development (preferably on a CI server); it's a lot easier to debug "hey, we suddenly run out of memory after yesterday's commits" than "well, something goes wrong in production". If you're doing new development, consider alternatives to malloc() - arena/pool allocation (e.g. libtalloc) is convenient and very fast if your memory use is tree-like (e.g. a connection owns a request owns some memory to sort the data before returning it). In C, goto a single chunk of cleanup-and-return code rather than duplicating the cleanup at every place where you exit from the function.
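The goto-cleanup advice might look roughly like this. It's only a sketch: read_first_line is a hypothetical helper invented for the example, and the 256-byte buffer is arbitrary. The point is that every early exit jumps to the same label, so no path can forget to release something.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a freshly allocated copy of the first
 * line of `path` (without the newline), or NULL on any failure.
 * The caller owns the result and must free() it. */
char *read_first_line(const char *path)
{
    char *result = NULL;  /* stays NULL unless everything succeeds */
    char *buf = NULL;
    FILE *fp = NULL;

    assert(path != NULL);

    fp = fopen(path, "r");
    if (fp == NULL)
        goto cleanup;

    buf = malloc(256);
    if (buf == NULL)
        goto cleanup;

    if (fgets(buf, 256, fp) == NULL)
        goto cleanup;

    buf[strcspn(buf, "\n")] = '\0';  /* strip trailing newline, if any */

    result = buf;  /* hand ownership to the caller */
    buf = NULL;    /* so the cleanup code below does not free it */

cleanup:
    free(buf);     /* free(NULL) is a no-op */
    if (fp != NULL)
        fclose(fp);
    return result;
}
```

Adding another resource later means one more line under the cleanup label, instead of touching every return statement.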
The fact that someone is saying they don't know where to start seems to indicate this isn't true.
- Use valgrind (or gdb)! Your segfault should be simpler to find than a memory leak, because you know what line the segfault happens on.
- If you have a value that's getting mangled (pointer getting overwritten by a write to another address) and you can't figure out why, use watchpoints to see when that address is getting touched. http://sourceware.org/gdb/onlinedocs/gdb/Set-Watchpoints.htm...
- Find a minimal program to reproduce the problem. It's gross, but I used to actually just take a copy of the code and cut things out until the bug stopped, then look at the last thing I cut. You can do this as a binary search - only run the first half, check for the bug, only run the second half, check for the bug, repeat on the buggy half.
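The binary-search idea in the last point can be sketched like this. Assumptions, all mine: the program can be split into n independent steps, exactly one of them triggers the bug, and you have some way to check "does running only steps lo..hi-1 reproduce it". The names find_culprit and demo_reproduces are made up, and demo_reproduces is just a stand-in for actually building and running the trimmed program.

```c
#include <assert.h>
#include <stddef.h>

/* Returns nonzero if running only steps [lo, hi) reproduces the bug. */
typedef int (*repro_fn)(int lo, int hi);

/* Narrow [0, n) down to the single step that triggers the bug.
 * Assumes exactly one culprit and that steps can run independently. */
int find_culprit(int n, repro_fn reproduces)
{
    int lo = 0, hi = n;

    assert(n > 0 && reproduces != NULL);

    while (hi - lo > 1) {
        int mid = lo + (hi - lo) / 2;
        if (reproduces(lo, mid))
            hi = mid;  /* bug is in the first half */
        else
            lo = mid;  /* otherwise it must be in the second half */
    }
    return lo;
}

/* Stand-in for "run the trimmed program": pretends step 13 is buggy. */
int demo_reproduces(int lo, int hi)
{
    return lo <= 13 && 13 < hi;
}
```

With 20 steps this takes five checks instead of up to twenty. In real life the "one culprit, independent steps" assumption often fails, and then you're back to cutting by hand.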
As I said, segfaults are a lot easier than this kind of problem (not that they're easy when you start out). Don't be discouraged! I would help out too, but you'd need to send everything to reproduce the bug (client code, server code, server platform, etc.)
That's not gross.
http://en.wikipedia.org/wiki/Strace
http://en.wikipedia.org/wiki/Lsof
http://en.wikipedia.org/wiki/Vmstat
http://en.wikipedia.org/wiki/Netstat
http://en.wikipedia.org/wiki/DTrace
http://en.wikipedia.org/wiki/Tcpdump
http://en.wikipedia.org/wiki/Magic_SysRq_key
https://perf.wiki.kernel.org/index.php/Main_Page
...
The OS ships with a big bunch of tools that very few developers know. Sysadmins know more of them, but they often don't understand the OS well and use the tools without fully understanding their output.
Some confirmation of what I have written here is the fact that Joyent forked OpenSolaris to create an OS precisely to make it easier to do things of this kind:
http://wiki.smartos.org/display/DOC/Why+SmartOS+-+ZFS%2C+KVM...
In 2005, Sun Microsystems open sourced Solaris, its renowned Unix operating system, eventually to be released as a distribution called OpenSolaris. Among the earliest adopters and most effective advocates of OpenSolaris was Ben Rockwood, who wrote The Cuddletech Guide to Building OpenSolaris in June, 2005 – the first of his many important contributions to the nascent OpenSolaris community. Meanwhile, Joyent's CTO Jason Hoffman was frustrated by the inability of most operating systems to answer seemingly-simple questions like: "Why is the server down? When will it be back up? ... Now that it's back up, why is my database still slow?"
Jason knew that these questions would be a lot easier to answer on Solaris-based systems, and recognized Sun's open-sourcing initiative as a huge opportunity.
Tools like those described in the article are handy, but aren't absolutely necessary. They save a lot of time, but the same effects can usually be gotten by more laborious means.
You have a segfault. You should know where in the code it's occurring already; it's either an access to bad memory with the instruction pointer (IP) at the point of access, or it's an attempt to execute code with the IP pointing at the bad memory, in which case the top of the stack (or, depending on calling convention, one of the registers) normally contains the place where it came from (necessarily, since the code expected to be returned to).
There are ways to turn an instruction pointer into line number offset when you have appropriate debug info, if you can't get the program running under a debugger.
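One way to get at those instruction pointers without a debugger, as a rough sketch: install a SIGSEGV handler that dumps the return addresses. This assumes glibc's <execinfo.h> (backtrace and backtrace_symbols_fd), which not every libc provides; install_segv_handler and on_segv are names invented here. The printed addresses can then be fed to something like addr2line against a binary built with debug info.

```c
#define _GNU_SOURCE
#include <execinfo.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* On SIGSEGV, write the raw call stack to stderr and exit.
 * backtrace_symbols_fd() writes straight to the fd without malloc(). */
static void on_segv(int sig)
{
    void *frames[64];
    int n = backtrace(frames, 64);

    (void)sig;
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
    _exit(1);
}

/* Returns 0 on success, -1 on failure (same as sigaction()). */
int install_segv_handler(void)
{
    struct sigaction sa;
    void *warmup[1];

    /* Call backtrace() once up front so any lazy library loading
     * happens now rather than inside the signal handler. */
    backtrace(warmup, 1);

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_segv;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}
```

Call install_segv_handler() early in main(). If the stack itself is trashed the output may be garbage, which is one reason a real debugger or a core dump is still the better option when you can get one.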
Given the line number, segfaults can typically be split into three categories: plain bad logic, use after free, and memory corruption. The last is hardest to find IME, most easily done using a debugger and hardware breakpoints on memory address modifications, but you need a stable repro and a consistent memory allocator that gives predictable addresses for every rerun.
If any of the above is meaningless to you, it should give you some clues as to where you need to research.
a) build a `debug` version of node (building node from source creates a node_g version which is a debug version)
b) build a `debug` version of all of the c++ addons in your node_modules folder (node-gyp build -d for each addon)
c) start gdb with the debug version of node: `gdb ./node_g`
d) in gdb, run your node script using `run <script.js>` -- add any other options
e) wait for it to crash, and then type `bt` - you'll see the location of the crash which should give you a good starting place.
By the way, I think a jenandre used to work at my company.
- the "troubleshooter" mentality/thinking pattern
- extensive system knowledge
I haven't figured out how to teach #1, except maybe for "don't have anyone to ask for help" and #2 is self explanatory.
Cool writeup though!
A Java memory leak can be solved in a matter of minutes, or – if it's especially complex – in a couple of hours tops. You can take a heap dump and analyze it with Eclipse Memory Analyzer, and if you need allocation stack-traces, you instrument your code with VisualVM. All of this can be done remotely and without stopping the app.
Flight Recorder, which has recently been added to the HotSpot VM, even gives you instrumentation with hardly any performance penalty (though it requires a commercial license if used in production).
I know there are small-footprint Javas, but then you're maybe wandering off the beaten path...
I've seen two sources of memory leaks in Erlang based systems: 1) unbounded process message queues, and 2) passing binaries across process (pid) boundaries.
Many beginning erlangers run into these, and they're relatively easy to identify and correct. With a little practice, these become easy patterns to recognize and avoid.
As far as httpc, I'm unaware of that bug -- but I can say that I recently worked on a commercial product that leveraged httpc as a core component of the service, and it worked fine.
Edit: Actually, it loads fine in a private browsing tab, so it must be a bad interaction with some extension. Oh well.
Not supporting OP's definition of irony, whatever it is, just speculating how OP could've thought of it.
At the end of the day, the more you think this is impossible, the more likely your programs are to experience it. So please don't think that your program is immune to this because you use JavaScript.
This might be a well known solved problem, but I have heard it mentioned before.
Is it worth all the extra effort, when you could just go with a language that does GC at runtime? Sometimes it is, it depends on the use case and it depends on the people.
I've done a project rewrite from C to Java where the Java implementation performed a lot better and consumed less memory than the C one. Some of the performance gain was because I chose better algorithms and limited DB interaction, but some of the gain came from having immutability guarantees, whereas the C code would just copy a lot of data structures where immutability was not guaranteed. A lot of time in the C-based project was spent doing mallocs and frees and memcpys for nothing. This is poor project design, but poor design happens, and Java has some protection against that because it promotes encapsulation to a greater extent than C by default.
I am 100% certain that if the original project would've been better designed and managed, it would've kicked ass because having it in C would have allowed us to have a smaller memory footprint which would've meant a greater monetary profit in the end for this particular project over time due to system constraints.
What it comes down to is that if you have a good team that understands the required dev process of a mid-sized C project and is proficient enough to implement such a project without doing too much "quick fixin'", it can be worth it. If you're limited by the size and/or competence of the team (everyone can't be a rockstar. I certainly am not) or limited in turnaround time for the product, choosing C will probably not be in your best interest. But if you have the right people around and there's a monetary gain in doing things efficiently with the hardware you have, then C is still an awesome tool to have in your toolbox.
Most of the time, in the world of SaaS and web based solutions, using C doesn't make a lot of sense except for some bits of core functionality. That's why I like languages with good C bindings. Knowing e.g., Python and C, you really can get the best of both worlds.
IMHO, YMMV, &c
EDIT: inserted some newlines
The office photocopier broke down, so the manager called in a repairman. The repairman takes one look at the machine, draws an 'X' at the problem part, and hands the manager a bill for $500. The manager was shocked at the price, and demanded an itemized bill. The repairman simply wrote:
Marking the 'X' - $1
Knowing where to put the 'X' - $499
I made a very basic Node.js module in C++ with V8 and it was surprisingly difficult to make a good (idiomatic JS behaviour, believably bug-free) wrapper for a straightforward class and factory method. I say this coming from Boost Python and Luabind, where there are some tricky parts to bind complex classes, but simple ones are easy enough, and once written, obviously correct.
Perhaps a better call to action would be:
* Talk to us about how we can solve your problems
* Chat with us
* We can help you too
* What's up?
thanks for the details, very articulate and useful stuff.
(down vote me)