Tracking Down a Python Memory Leak

(benbernardblog.com)

150 points · bbernard · 9y ago · 42 comments

34 comments · 7 top-level

tantalor9y ago· 9 in thread

Spoiler alert, the leak is in libxml2, not Python code.

g-clef9y ago

That was my first thought when I saw the headline "I bet it's lxml". I have lost count of the number of times I've had to do special nonsense to handle lxml's memory leaks. (I've even had to go so far as to launch parsing code in a subprocess just so the part that created an etree object would quit & return all its RAM.)
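A minimal sketch of the subprocess workaround described above, assuming the parse result can be reduced to something small and serializable. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml, and the names CHILD_SCRIPT and parse_in_subprocess are hypothetical, not taken from the comment.

```python
import json
import subprocess
import sys

# Code run by the short-lived child interpreter. Everything the parser
# allocates (or leaks) lives only in that process.
CHILD_SCRIPT = """
import json, sys
import xml.etree.ElementTree as ET  # stand-in for lxml.etree

root = ET.fromstring(sys.stdin.read())
json.dump({"elements": sum(1 for _ in root.iter())}, sys.stdout)
"""


def parse_in_subprocess(xml_text):
    """Parse xml_text in a child process and return a small summary dict."""
    done = subprocess.run(
        [sys.executable, "-c", CHILD_SCRIPT],
        input=xml_text,
        capture_output=True,
        text=True,
        check=True,  # raise if the child exits with an error
    )
    # When the child exits, the OS reclaims all of its memory, leaked or not.
    return json.loads(done.stdout)


summary = parse_in_subprocess("<a><b/><c/></a>")
```

The cost is process startup plus serialization on every call, so this probably only pays off for large documents or long-running workers.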

This sort of thing is why calling C from Python is a marginal idea. Debugging such C is tough, especially if it manipulates Python objects in complex ways. There are many invariants of the Python data structures which must be carefully maintained by hand in C code. Getting one wrong will result in obscure bugs like the parent.

"Pickle", Python's old seralizer, had a similar leak problem. It has a cache. You can reuse a Pickle stream, which is done in interprocess communication. But the cache of previously-send objects wasn't being cleared at the end of each Pickle block. I found that, did a workaround, and submitted a bug report. Not sure it was ever fixed properly.
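A rough illustration of that kind of memo growth, assuming Python 3's pickle module on CPython rather than the older version the comment refers to. It is not a reproduction of the reported bug; it only shows that a reused Pickler keeps the objects it has already written alive until clear_memo() is called. The Message class is hypothetical.

```python
import io
import pickle
import weakref


class Message:
    """Hypothetical payload sent repeatedly over one pickle stream."""

    def __init__(self, payload):
        self.payload = payload


stream = io.BytesIO()
pickler = pickle.Pickler(stream)  # one Pickler reused for many dumps

msg = Message("hello")
probe = weakref.ref(msg)  # lets us observe whether msg is still alive
pickler.dump(msg)
del msg

# The memo remembers every object already written, so it still holds msg.
held_before_clear = probe() is not None

# Resetting the memo between blocks drops those references.
pickler.clear_memo()
released_after_clear = probe() is None
```

Calling clear_memo() between logical blocks also means later dumps re-serialize shared objects in full instead of emitting back-references.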

I still have a bug report in on CPickle, a "faster" implementation of Pickle written in C.[1] In a complicated situation with multiple threads using CPickle, memory becomes corrupted and the program will crash. It doesn't happen using Python Pickle, so I just quit using CPickle. The bug report got the usual "reproduce in a simpler situation" reply to make it go away, and the bug report remains open. It may be the same bug as this one [2] from 2012, although I doubt it.

For parsing HTML, I use html4lib. It's slower, but it's all in Python.

[1] http://bugs.python.org/issue23655 [2] http://bugs.python.org/issue12680

bbernardOP9y ago

I've had quite a few issues with cPickle myself, so I see what you mean.

Indeed, Python packages built over C extensions can be quite hard to debug, as seen with lxml. But what makes it even harder is the fact that lxml is partly built using Cython... so you deal with Python code, C code generated with Cython, and pure C code (libxml2).

(So many typos above. "html4lib" should be "html5lib". "Seralizer" should be "serializer". "Previously-send" should be "previously-sent". Sorry.)

No kidding. I actually once had a freaky memory leak of my own that I was never able to fix. I was in fact using libxml2 there, I wonder if that was the cause.

bbernardOP9y ago

There's more than one leak actually. It's quite a party :)

Exact same thing. Odd unsolved memory leak, using lxml for HTML parsing.

I think I'll stay away from it in the future. Unfortunate, since it's the fastest HTML parser in Python that I know of.

bbernardOP9y ago

In libxml2? That's a reasonable guess, but I would be open to other possibilities if I were you, hehe :) Stay tuned!

mwcampbell9y ago

The post's tags, displayed at the top, spoiled it for me.

guyzero9y ago· 5 in thread

"What's possible, though, is accumulating Python objects in memory and keeping strong references to them. For instance, this happens when we build a cache (for example, a dict) that we never clear. The cache will hold references to every single item and the GC will never be able to destroy them, that is, unless the cache goes out of scope."

Back when I worked with a Java memory profiling tool (JProbe!) we called these "lingerers". Not leaks, but the behaviour was similar.

I can't find the documentation now, but I had a similar problem with ASP.NET's Master Page. The example of using data binding to dynamically adjust the menus had the binding go to a backing Page instance. That seemed logical. Unfortunately after about 200 visits to the site, the whole thing fell over. Turns out master pages hold a reference to the backing page even once the whole page is rendered. This caused them to stack up in memory. The fix was to use a static method to provide the data. It was in a user note at the bottom of the page.

bbernardOP9y ago

I don't have that much experience in Java, but you're absolutely right. Those "lingerers" can happen in pretty much any language with a GC, but they're technically not leaks.

stinos9y ago

The cache will hold references to every single item and the GC will never be able to destroy them, that is, unless the cache goes out of scope

Had the reverse happen a while ago and it's nasty as well: some C++ objects were holding references to Python objects, but because the GC doesn't scan those memory regions (they're not Python-owned, after all), the Python objects would get GC'd and all hell broke loose when the C++ code tried to access the now-dead Python objects. The solution is forced 'lingering', i.e. applying some RAII: adding the Python objects to a global dict and removing them when the C++ objects go out of scope.

winstonewert9y ago

Ugh... No. That's not how Python's GC works.

Python is reference counted. This was happening because you didn't increment the reference count for python objects you were referencing from C++. Your RAII should increment/decrement reference counts on the object, not place objects into a global dict. That's the "correct" way to reference python objects.

Python's GC doesn't scan memory in the same way other languages do. Instead, it detects cycles between python objects. As long as you follow the reference counting rules correctly, you shouldn't have to worry about it. (Unless you need to detect cycles involving your C++ objects.)
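Both mechanisms can be observed from Python itself. The following is a sketch assuming CPython, with a hypothetical Node class; a C or C++ extension would get the same effect as the alias below by calling Py_INCREF and Py_DECREF on the objects it holds.

```python
import gc
import sys
import weakref


class Node:
    """Hypothetical object that can point at another Node."""

    def __init__(self):
        self.other = None


# 1. Reference counting: every extra reference bumps the count by one.
obj = Node()
count_before = sys.getrefcount(obj)
alias = obj  # roughly what Py_INCREF records on the C side
count_after = sys.getrefcount(obj)

# 2. Cycle detection: reference counts alone cannot free a cycle.
gc.disable()  # keep the automatic collector out of the way for the demo
a, b = Node(), Node()
a.other, b.other = b, a  # a <-> b form a reference cycle
probe = weakref.ref(a)
del a, b  # counts never reach zero because of the cycle
cycle_survives_refcounting = probe() is not None

gc.collect()  # the cycle detector finds and frees the pair
cycle_collected = probe() is None
gc.enable()
```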

kevin_thibedeau9y ago

Shouldn't the C++ notify Python of the extra references by incrementing the ref counts?

mwcampbell9y ago· 5 in thread

The JVM community tends to prefer pure Java implementations of everything, rather than wrapping existing C libraries as the Python and Ruby communities do. Some may see this as a bad thing, but it definitely has its benefits. One particularly relevant benefit in the context of this article is that the amount of code that can leak memory, in the conventional sense, is dramatically reduced. I suppose the same thing is happening in the Node.js ecosystem, though I don't recall if Node uses native code to parse XML.

dap9y ago

Ironically, C memory leaks are often significantly easier to debug than Java ones. In C, libraries like libumem basically do postmortem GC that can point straight to the leaking callstack (depending on how much debug info you can tolerate).

In GC'd languages, there's no real way for the VM to identify a leak.

greglindahl9y ago

If you don't mind potentially slow code, that's a fine thing. Once you've measured and discovered that you're losing out on a lot of performance, it's worth evaluating whether the risk of leaks can be baked away via careful testing, which doesn't appear to have been done at all in the library used in this blog post.

brianwawok9y ago

Well pure Java code is 10x to 100x faster than pure Python code. So you aren't exactly accepting slowness in that case.

thaunatos9y ago

A pure Python implementation of lxml would be exceedingly slow.

mwcampbell9y ago

Not with a good JIT compiler like PyPy.

gravypod9y ago· 3 in thread

> "But if we're strictly speaking about Python objects within pure Python code, then no, memory leaks are not possible - at least not in the traditional sense of the term. The reason is that Python has its own garbage collector (GC), so it should take care of cleaning up unused objects."

I have a hard time believing this. Java can have memory leaks so why couldn't Python?

zipfle9y ago

I think that the author is defining memory leaks as permanently out of scope but not deallocated memory. In that sense I don't know of anything in vanilla Python, or Java, that would qualify as a memory leak. In the more intuitive sense of a memory leak being any failure to make objects available to garbage collection, (such as by retaining references to them in an unexpected place) leading to unchecked increases in a program's memory footprint, memory leaks are possible in either language.

bbernardOP9y ago

Check out this SO thread: http://stackoverflow.com/questions/2017381/is-it-possible-to....

There are indeed many interpretations of what a memory leak is in Python.

In C/C++, you can forget to free memory, thus causing memory leaks. For example, you may call malloc(), but forget to call free(). In Python or Java, you can't do that; you don't need to explicitly "free" objects, as there's a GC.

Sure, you could leave rogue Python or Java objects in memory, but in my mind this isn't a "leak" in the same sense as a leak in C/C++. The Python interpreter (which is written in C) or some C extension may themselves cause real leaks, though.

I'm the author of the post. Maybe my reasoning was not explained clearly enough.

winstonewert9y ago

Different definitions of memory leak.

In the post's definition, a memory leak is memory you can no longer reach but is still allocated. That's eliminated by Java and Python.

Of course, you can still waste memory in Java or Python in a variety of ways. But that's a different definition of memory leak than the post is using.

dekhn9y ago· 2 in thread

I've used the gc module, with get_referrers and get_referents, to track down various leaks. This only really helps with Python-allocated objects.

It's trivial to end up with an unexpected strong reference. Weak references are the right way to deal with cache objects, imho.
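A minimal sketch of that approach, assuming CPython and a hypothetical Session class held by a hypothetical registry dict. gc.get_referrers lists the GC-tracked containers that point at an object, which is usually enough to spot the unexpected holder.

```python
import gc


class Session:
    """Hypothetical object that refuses to go away."""


registry = {}  # the "unexpected" holder we want to discover
session = Session()
registry["current"] = session

# Every GC-tracked container that directly references the session.
referrers = gc.get_referrers(session)
held_by_registry = any(r is registry for r in referrers)

# The opposite direction: what the registry itself points at.
referents = gc.get_referents(registry)
registry_points_at_session = any(r is session for r in referents)
```

In a real program the referrer list also includes things like module namespaces and stack frames, so some filtering is normally needed.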

dom09y ago

> Weak references are the right way to deal with cache objects, imho.

Yet, I disagree ;) Whether a weakref is the correct thing to use or not depends entirely on the purpose of the cache. I often find myself using caches where a weakref would not be very useful, because it would cool the cache a lot.
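Both views show up in a short sketch using weakref.WeakValueDictionary, assuming CPython; the Report class is hypothetical. The cache never keeps an object alive on its own, which avoids lingering objects but also means an entry disappears as soon as nobody else holds it.

```python
import gc
import weakref


class Report:
    """Hypothetical expensive-to-build object."""

    def __init__(self, name):
        self.name = name


cache = weakref.WeakValueDictionary()

report = Report("q1")
cache["q1"] = report  # the cache holds only a weak reference

hit_while_referenced = cache.get("q1") is report

del report  # last strong reference is gone
gc.collect()

# The entry vanished with the object: no lingering, but also a cold cache.
hit_after_release = cache.get("q1") is not None
```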

bbernardOP9y ago

Interesting approach!

It's hard to tell the difference between a real memory leak and Python objects being accumulated infinitely in memory - at least if we rely only on the memory use of a process. That's why we need to use either gc or objgraph as a first step.
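One possible first step with the gc module alone is to compare per-type object counts before and after the suspect code runs. This is a sketch assuming CPython; the Record class and the type_counts helper are hypothetical.

```python
import gc
from collections import Counter


def type_counts():
    """Count GC-tracked objects by type name after a full collection."""
    gc.collect()
    return Counter(type(o).__name__ for o in gc.get_objects())


class Record:
    """Hypothetical object that the program keeps accumulating."""

    def __init__(self, n):
        self.n = n


before = type_counts()
hoard = [Record(i) for i in range(1000)]  # simulated accumulation
after = type_counts()

# Counter subtraction keeps only the types whose count went up.
growth = after - before
```

If the process grows while no Python type shows matching growth, the memory is more likely being lost below the interpreter, for example in a C extension.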

badminton19y ago· 2 in thread

Reminds me of myself tracing a memory leak in a node app loading a core dump into an IllumOS VM with mdb_v8. Not so simple/friendly/happy after all.

(You could argue that I could have generated a heap snapshot with v8-profiler, but I was pressed for time.)

bcantrill9y ago

Would be curious for detail on your experiences; we do this a lot (we developed mdb_v8) and we've continued to extend/develop mdb_v8 to make it easier -- but trying to debug node memory growth is not something I would ever characterize as simple, friendly or happy (despite our best efforts).

bbernardOP9y ago

Looks like a lot of fun! :)

I agree that some memory leaks are rather hard to find.

module00009y ago· 1 in thread

tldr; libxml2's C implementation leaked memory, author tracked it down. Kudos to the author for their persistence in digging down to the root of the problem. A lot of people would throw their hands up and decide to recycle the process every <N> seconds rather than analyze it to the depth the author did.

bbernardOP9y ago

I'm the author of the post, so thanks a lot for your kind remarks.

Now, the problem appears to be in libxml2, but... it's only partly true. I assure you that the best is yet to come :)
