"Pickle", Python's old serializer, had a similar leak problem. It has a cache. You can reuse a Pickle stream, which is done in interprocess communication. But the cache of previously sent objects wasn't being cleared at the end of each Pickle block. I found that, did a workaround, and submitted a bug report. Not sure it was ever fixed properly.
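For anyone who hasn't run into this, here is a rough sketch of the cache in question using today's `pickle.Pickler`. It is not the original workaround, just an illustration: the `payload` object and the in-memory buffer are made up for the example. A reused pickler remembers every object it has written (and keeps it alive) until `clear_memo()` is called.

```python
import io
import pickle

buf = io.BytesIO()             # stand-in for a pipe or socket (assumption)
pickler = pickle.Pickler(buf)  # one pickler reused for several messages

payload = ["x" * 1000]         # hypothetical message body

pickler.dump(payload)
first_len = buf.tell()

# Same object again: the memo recognises it, so only a short
# back-reference is written. The memo also keeps `payload` alive.
pickler.dump(payload)
second_len = buf.tell() - first_len

# Clearing the memo between messages drops those references,
# and the next dump writes the object in full again.
pickler.clear_memo()
pickler.dump(payload)
third_len = buf.tell() - first_len - second_len

print(first_len, second_len, third_len)
```

The second message is tiny because it only points back at the first one, which is exactly why the memo has to hold on to everything it has seen.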
I still have a bug report in on cPickle, a "faster" implementation of Pickle written in C.[1] In a complicated situation with multiple threads using cPickle, memory becomes corrupted and the program crashes. It doesn't happen using Python Pickle, so I just quit using cPickle. The bug report got the usual "reproduce in a simpler situation" reply to make it go away, and the bug report remains open. It may be the same bug as this one [2] from 2012, although I doubt it.
For parsing HTML, I use html5lib. It's slower, but it's all in Python.
[1] http://bugs.python.org/issue23655 [2] http://bugs.python.org/issue12680
Indeed, Python packages built over C extensions can be quite hard to debug, as seen with lxml. But what makes it even harder is the fact that lxml is partly built using Cython... so you deal with Python code, C code generated with Cython, and pure C code (libxml2).
I think I'll stay away from it in the future. Unfortunate, since it's the fastest HTML parser in Python that I know of.
Back when I worked with a Java memory profiling tool (JProbe!) we called these "lingerers". Not leaks, but the behaviour was similar.
Had the reverse happen a while ago and it's nasty as well: some C++ objects were holding references to Python objects, but because the GC doesn't scan those memory regions (they're not Python-owned, after all), the Python objects would get GC'd and all hell broke loose when the C++ code tried to access the now-dead Python objects. The solution is forced 'lingering', i.e. applying some RAII: adding the Python objects to a global dict and removing them when the C++ objects go out of scope.
Python is reference counted. This was happening because you didn't increment the reference count for the Python objects you were referencing from C++. Your RAII should increment/decrement reference counts on the object, not place objects into a global dict. That's the "correct" way to reference Python objects.
Python's GC doesn't scan memory in the same way other languages do. Instead, it detects cycles between Python objects. As long as you follow the reference counting rules correctly, you shouldn't have to worry about it. (Unless you need to detect cycles involving your C++ objects.)
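To make the refcount-versus-cycle-detector split concrete, here is a small sketch from the Python side. It assumes CPython (other implementations free objects differently), and the `Node` class is invented for the demo. The automatic collector is switched off so the two mechanisms can be seen separately.

```python
import gc
import weakref

class Node:
    """Hypothetical object used only for this demo."""

gc.disable()  # keep the cycle detector from running on its own

# No cycle: dropping the last reference frees the object immediately.
a = Node()
probe_a = weakref.ref(a)
del a
freed_by_refcount = probe_a() is None

# A cycle: the two objects keep each other's refcount above zero.
x, y = Node(), Node()
x.other, y.other = y, x
probe_x = weakref.ref(x)
del x, y
alive_before_collect = probe_x() is not None

# Only the cycle detector can reclaim them.
gc.collect()
freed_by_gc = probe_x() is None

gc.enable()
print(freed_by_refcount, alive_before_collect, freed_by_gc)
```

Nothing here scans foreign memory, which is why a C++ object has to hold a real counted reference to keep a Python object alive.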
In GC'd languages, there's no real way for the VM to identify a leak.
I have a hard time believing this. Java can have memory leaks, so why couldn't Python?
There are indeed many interpretations of what a memory leak is in Python.
In C/C++, you can forget to free memory, thus causing memory leaks. For example, you may call malloc(), but forget to call free(). In Python or Java, you can't do that; you don't need to explicitly "free" objects, as there's a GC.
Sure, you could leave rogue Python or Java objects in memory, but in my mind this isn't a "leak" in the same sense as a leak in C/C++. The Python interpreter (which is written in C) or some C extension may themselves cause real leaks, though.
I'm the author of the post. Maybe my reasoning was not explained clearly enough.
In the post's definition, a memory leak is memory you can no longer reach but is still allocated. That's eliminated by Java and Python.
Of course, you can still waste memory in Java or Python in a variety of ways. But that's a different definition of memory leak than the one the post is using.
It's trivial to end up with an unexpected strong reference. Weak references are the right way to deal with cache objects, imho.
Yet, I disagree ;) Whether a weakref is the correct thing to use or not depends entirely on the purpose of the cache. I often find myself using caches where weakref would not be very useful, because it would cool the cache a lot.
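Both sides of this show up in a few lines. A minimal sketch, assuming CPython and a made-up `Page` class: the plain dict keeps its entry forever (the unexpected strong reference), while the `weakref.WeakValueDictionary` loses its entry as soon as nobody else holds the object (the cold cache).

```python
import gc
import weakref

class Page:
    """Hypothetical cached value."""
    def __init__(self, url):
        self.url = url

strong_cache = {}                              # entries stay until removed
weak_cache = weakref.WeakValueDictionary()     # entries vanish with the value

p1 = Page("/a")
p2 = Page("/b")
strong_cache["/a"] = p1
weak_cache["/b"] = p2

# The rest of the program lets go of both pages.
del p1, p2
gc.collect()

still_in_strong = "/a" in strong_cache   # cache alone keeps the page alive
still_in_weak = "/b" in weak_cache       # entry is gone, next lookup is a miss

print(still_in_strong, still_in_weak)
```

So a weak cache only helps for objects something else is already keeping around; if the cache is meant to be the owner, it needs strong references plus some eviction policy.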
It's hard to tell the difference between a real memory leak and Python objects accumulating indefinitely in memory, at least if we rely only on the memory use of a process. That's why we need to use either gc or objgraph as a first step.
(You could argue that you could generate a heap snapshot with v8-profiler, but I was pressed for time.)
I agree that some memory leaks are rather hard to find.
Now, the problem appears to be in libxml2, but... it's only partly true. I assure you that the best is yet to come :)