Python extensions should be lazy (opens in new tab)

(gauge.sh)

115 points0x63_Problems1y ago63 comments

63 comments

35 comments · 7 top-level

lalaland11251y ago· 19 in thread

Optimizing Python extensions is becoming increasingly important as Python is used in more and more compute intensive environments.

The key for optimizing a Python extension is to minimize the number of times you have to interact with Python.

A couple of other tips in addition to what this article provides:

1. Object pooling is quite useful as it can significantly cut down on the number of allocations.

2. Be very careful about tools like pybind11 that make it easier to write extensions for Python. They come with a significant amount of overhead. For critical hotspots, always use the raw Python C extension API.

3. Use numpy arrays whenever possible when returning large lists to Python. A python list of python integers is amazingly inefficient compared to a numpy array of integers.

RhysU1y ago

> Optimizing Python extensions is becoming increasingly important as Python is used in more and more compute intensive environments.

I have always loved how the trick to making Python better eventually comes down to not writing Python.

Spivak1y ago

Is this not expected? You're never going to have any language with the kind of dynamism that Python/Ruby/JS have while also having performant number crunching simply because Python has to do more significantly more work for the same line of code. You could envision a world where a JIT could recognize cases where all that dynamism falls away and you can generate code similar to what you would get in the equivalent C but that's just a fancy way to not write Python again. You would be writing in this informal not super well defined restricted subset of Python that JITs cleanly.

The problem time immortal is language complexity vs the ability to hint to your compiler that it can make much stronger assumptions about your code than it has to assume naturally which is where we got __slots__. And there's lots of easy wins you could get in Python that eliminate a significant amount of dynamism-- you could tell your compiler that you'll never shadow names for example, that this list is fixed size, that you don't want number promotion but they all require adding stuff to the language to indicate that.

When you're looking from the bottom up you end up making different trade-offs. Because while you get nice primitives that generate very tight assembly when you need that dynamism you end up having this object model that exists in the abstract that you orchestrate and poke at but don't really see, like gobject. Ironically, HN's love-to-hate language C++ gives you both simultaneously but at the cost of a very complicated language.

1 more reply

stabbles1y ago

Why not go all the way and limit the times you have to interact with Python to zero ;)

mkoubaa1y ago

Because if your users want python, you have to convince them they don't want it. If you fail, you'll have made optimized code that nobody uses.

Another strategy is to actually serve your users

coldtea1y ago

Because then you have the warts of the new language, and the pain of migrating code, which could be 100s or millions of lines, to worry about...

1 more reply

0x63_ProblemsOP1y ago

Totally agree, keeping the interface with the extension as thin as possible makes sense.

I hadn't considered object pooling in this context, it might be more involved since each node has distinct data but for my use case it might still be a performance win.

Have you ever used pyo3 for rust bindings? I haven't measured the overhead but I have been assuming that it's worth the tradeoff vs. rolling my own.

(I'm the author)

hansvm1y ago

My last workplace used pyo3 for a project. It was slower than vanilla Python, and you picked up all the normal compiled-language problems like slow builds and cross-compilation toolchains.

I wouldn't take away from that observation that pyo3 is slow (it was just a poor fit; FFI for miniscule amounts of work), but the fact that the binding costs were higher than vanilla Python computations suggests that the overhead is (was?) meaningful. I don't know how it compares to a hand-written extension.

2 more replies

tomjakubowski1y ago

re: 3, Python has a native numeric array type https://docs.python.org/3/library/array.html

raymondh1y ago

We should probably get rid of that. It is old (predating numpy) and has limited functionality. In almost every case I can think of, you would be better off with numpy.

1 more reply

f33d51731y ago

array is for serializing to/from binary data. It isn't useful for returning from a library because the only way a python programmer can consume it is by converting into python objects, at which point there is no efficiency benefit. numpy has a library of functions for operating directly on the referenced data, as well as a cottage industry of libraries that will take a numpy array as input. Obviously someone might end up casting it to a list anyways, but there is at least the opportunity for them to not do that.

2 more replies

sgarland1y ago

Re: 3, you can also use Python’s array.array in some circumstances. If you have heterogeneous types, don’t need multiple dimensions, and don’t need Fortran memory layout, they’re a good choice IMO, and one that doesn’t require pulling in 3rd party packages.

jmkr1y ago

I have thought Python's arrays have been overlooked for years. So much so that people call a list an array.

alkh1y ago

Re: 2, is there any good repo with raw C Python API that can be used as a reference for someone who is not too proficient in C? I took a look at numpy but it seems too complicated for me

maxmorlocke1y ago

I've found rapidfuzz to be a good, digestable C/Python integration. It's especially nice as the algorithms implemented in C frequently have good pseudocode or other language representations, so you can reference really well. The docs are in reasonable shape as well:

https://github.com/rapidfuzz/RapidFuzz

jay-barronville1y ago

You mind elaborating on what exactly you’re looking for? Maybe I can help point you in the right direction, but right now, it’s not clear given your description.

lifthrasiir1y ago

> 2. Be very careful about tools like pybind11 that make it easier to write extensions for Python. They come with a significant amount of overhead. For critical hotspots, always use the raw Python C extension API.

Agrees on the broader point (and I don't like pybind11 that much anyway), but the raw Python C extension API is often hard to use correctly. I would suggest that you should at least have a rough idea about how higher-level libraries like pybind11 would translate to the C API, so that you can recognize performance pitfalls in advance.

> 3. Use numpy arrays whenever possible when returning large lists to Python. A python list of python integers is amazingly inefficient compared to a numpy array of integers.

Or use the `array` module in the standard library if that matters. numpy is not a small library and has quite a large impact on the initial startup time. (Libraries like PyTorch are even much worse to be fair, though.)

dumah1y ago

3. The memoryview interface is often a good solution.

lalaland11251y ago

I think you mean the buffer interface?

I think the buffer interface is too complex to provide directly to users. I think an API that returns numpy arrays is simpler and easier to understand.

1 more reply

jmkr1y ago

I discovered memoryview when looking at the JACK Python library. It is pretty neat. But also one of those things I wouldn't have known to look for.

raymondh1y ago· 4 in thread

This is an impressive post showing some nice investigative work that isolates a pain point and produces a performant work-around.

However, the conclusion is debatable. Not everyone has this problem. Not everyone would benefit from the same solution.

Sure, if your data can be loaded, manipulated, and summarized outside of Python land, then lazy object creation is a good way to go. But then you're giving up all of the Python tooling that likely drove you to Python in the first place.

Most of the Python ecosystem from sets and dicts to the standard library is focused on manipulating native Python objects. While the syntax supports method calls to data encapsulated elsewhere, it can be costly to constantly "box and unbox" data to move back and forth between the two worlds.

0x63_ProblemsOP1y ago

First off, thank you for all your contributions to Python!

I completely take your point that there are many places where this approach won't fit. It was a surprise for me to trace the performance issue to allocations and GC, specifically because it is rare.

WRT boxing and unboxing, I'd imagine it depends on access patterns primarily - given I was extracting a small portion of data from the AST only once each, it was a good fit. But I can imagine that the boxing and unboxing could be a net loss for more read-heavy use cases.

jhylton1y ago

You could create a custom C type that wrapped an arbitrary AST node and dynamically created values for attributes when you accessed them. The values would also be wrappers around the next AST node, and they could generate new AST nodes on writes. Python objects would be created on traversal, but each one would be smaller. It wouldn’t use Python lists to handle repeated fields It seems like a non-trivial implementation, but not fundamentally hard.

The analogy with numpy doesn’t seem quite right, as Raymond observes, because numpy depends on lots of builtin operations that operate on the underlying data representation. We don’t have any such code for the AST. You’ll still want to write Python code to traverse, inspect, and modify the AST.

1 more reply

coldtea1y ago

>However, the conclusion is debatable. Not everyone has this problem. Not everyone would benefit from the same solution.

Everyone would benefit from developers being more performance minded and not doing uneccesarry work though! Especially Python who has long suffered with performance issues.

Love your work btw!

BiteCode_dev1y ago

No. Days only 24h. If you focus on perfs, you leave something else.

Python is python because people cared about other things for many years.

2 more replies

jay-barronville1y ago· 3 in thread

Evan, just a tip…

When linking to code on GitHub in an article like this, for posterity, it’s a good idea to link based on a specific commit instead of a branch.

It might be a good idea to change your link to the `Py_CompileStringObject()` function in CPython’s `Python/pythonrun.c` [0] to a commit-based link [1].

[0]: https://github.com/python/cpython/blob/main/Python/pythonrun...

[1]: https://github.com/python/cpython/blob/967a4f1d180d4cd669d5c...

garbagepatch1y ago

Tip: in Github, press `y` when looking at a `/blob/branch/something` url to turn it into the current commit.

0x63_ProblemsOP1y ago

Thanks for pointing this out, the link is now updated!

jay-barronville1y ago

Sure thing. Great article, by the way!

Side note: Your tool, Tach, seems interesting…you might want to ask @dang [0] via email [1] if he’d be willing to add your submission [2] to the second-chance pool [3] (maybe also provide a clearer and more technical explanation of the tool and its key features).

[0]: https://news.ycombinator.com/user?id=dang

[1]: mailto:hn@ycombinator.com

[2]: https://news.ycombinator.com/item?id=41171593

[3]: https://news.ycombinator.com/item?id=26998308

formerly_proven1y ago· 1 in thread

> In the case of ASTs, one could imagine a kind of ‘query language’ API for Python that operates on data that is owned by the extension - analogous to SQL over the highly specialized binary representations that a database would use. This would let the extension own the memory, and would lazily create Python objects when necessary.

You could make the API transparently lazy, i.e. ast.parse creates only one AstNode object or whatever and when you ask that object for e.g. its children those are created lazily from the underlying C struct. To preserve identity (which I assume is something users of ast are more likely to rely on than usual) you'd have to add some extra book-keeping to make it not generate new objects for each access, but memoize them.

0x63_ProblemsOP1y ago

This seems like it could be implemented without much trouble for consumers, but I actually think for the common case of full AST traversal you'd still want to avoid building objects for the nodes while traversing.

That is to say, ast.NodeVisitor living in Python is part of the problem for use cases like mine. I need the extension to own the traversal as well so that I can avoid building objects except for the result set (which is typically a very small subset). That was what led me to imagine a query-like interface instead, so that Python can give concise traversal instructions.

pdhborges1y ago· 1 in thread

How much time did the PyAST_mod2obj actually take? The rewritte is 16x faster but the article doesn't make it clear if most of the speedup came from switching to the ruff parser (specially because it puts the GC overhead at only 35% of the runtime).

0x63_ProblemsOP1y ago

That's a good question. I don't have an easy way to rerun the comparison since this happened actually a while ago, but I do remember some relevant numbers.

In the first iteration of the Rust extension, I actually used the parser from RustPython. Although I can't find it at the moment, I think the RustPython parser was actually benchmarked as worse than the builtin ast parse (when both returned Python objects).

Even with this parser, IIRC the relevant code was around 8-11x faster when it avoided the Python objects. Apart from just the 35% spent in GC itself, the memory pressure appeared to be causing CPU cache thrashing (`perf` showed much poorer cache hit rates). I'll admit though that I am far from a Valgrind expert, and there may have been another consequence of the allocations that I missed!

truth_seeker1y ago

Preloading jemalloc binary can help here. Of course it wont be as efficient as using numpy 2.x especially for dealing with larger datasets.

jemalloc also gave good results with NodeJS and Ruby projects i did.

hackan1y ago

Nice article!

But I couldn't help but notice that when `_PyCompile_AstOptimize` fails (<0), then `arena` is never freed. I think this is bug :thinking:.

j / k navigate · click thread line to collapse

63 comments

35 comments · 7 top-level

lalaland11251y ago· 19 in thread

Optimizing Python extensions is becoming increasingly important as Python is used in more and more compute intensive environments.

The key for optimizing a Python extension is to minimize the number of times you have to interact with Python.

A couple of other tips in addition to what this article provides:

1. Object pooling is quite useful as it can significantly cut down on the number of allocations.

3. Use numpy arrays whenever possible when returning large lists to Python. A python list of python integers is amazingly inefficient compared to a numpy array of integers.

RhysU1y ago

> Optimizing Python extensions is becoming increasingly important as Python is used in more and more compute intensive environments.

I have always loved how the trick to making Python better eventually comes down to not writing Python.

Spivak1y ago

1 more reply

stabbles1y ago

Why not go all the way and limit the times you have to interact with Python to zero ;)

mkoubaa1y ago

Because if your users want python, you have to convince them they don't want it. If you fail, you'll have made optimized code that nobody uses.

Another strategy is to actually serve your users

coldtea1y ago

Because then you have the warts of the new language, and the pain of migrating code, which could be 100s or millions of lines, to worry about...

1 more reply

0x63_ProblemsOP1y ago

Totally agree, keeping the interface with the extension as thin as possible makes sense.

I hadn't considered object pooling in this context, it might be more involved since each node has distinct data but for my use case it might still be a performance win.

Have you ever used pyo3 for rust bindings? I haven't measured the overhead but I have been assuming that it's worth the tradeoff vs. rolling my own.

(I'm the author)

hansvm1y ago

My last workplace used pyo3 for a project. It was slower than vanilla Python, and you picked up all the normal compiled-language problems like slow builds and cross-compilation toolchains.

2 more replies

tomjakubowski1y ago

re: 3, Python has a native numeric array type https://docs.python.org/3/library/array.html

raymondh1y ago

We should probably get rid of that. It is old (predating numpy) and has limited functionality. In almost every case I can think of, you would be better off with numpy.

1 more reply

f33d51731y ago

2 more replies

sgarland1y ago

jmkr1y ago

I have thought Python's arrays have been overlooked for years. So much so that people call a list an array.

alkh1y ago

Re: 2, is there any good repo with raw C Python API that can be used as a reference for someone who is not too proficient in C? I took a look at numpy but it seems too complicated for me

maxmorlocke1y ago

https://github.com/rapidfuzz/RapidFuzz

jay-barronville1y ago

You mind elaborating on what exactly you’re looking for? Maybe I can help point you in the right direction, but right now, it’s not clear given your description.

lifthrasiir1y ago

> 3. Use numpy arrays whenever possible when returning large lists to Python. A python list of python integers is amazingly inefficient compared to a numpy array of integers.

dumah1y ago

3. The memoryview interface is often a good solution.

lalaland11251y ago

I think you mean the buffer interface?

I think the buffer interface is too complex to provide directly to users. I think an API that returns numpy arrays is simpler and easier to understand.

1 more reply

jmkr1y ago

I discovered memoryview when looking at the JACK Python library. It is pretty neat. But also one of those things I wouldn't have known to look for.

raymondh1y ago· 4 in thread

This is an impressive post showing some nice investigative work that isolates a pain point and produces a performant work-around.

However, the conclusion is debatable. Not everyone has this problem. Not everyone would benefit from the same solution.

0x63_ProblemsOP1y ago

First off, thank you for all your contributions to Python!

I completely take your point that there are many places where this approach won't fit. It was a surprise for me to trace the performance issue to allocations and GC, specifically because it is rare.

jhylton1y ago

1 more reply

coldtea1y ago

>However, the conclusion is debatable. Not everyone has this problem. Not everyone would benefit from the same solution.

Everyone would benefit from developers being more performance minded and not doing uneccesarry work though! Especially Python who has long suffered with performance issues.

Love your work btw!

BiteCode_dev1y ago

No. Days only 24h. If you focus on perfs, you leave something else.

Python is python because people cared about other things for many years.

2 more replies

jay-barronville1y ago· 3 in thread

Evan, just a tip…

When linking to code on GitHub in an article like this, for posterity, it’s a good idea to link based on a specific commit instead of a branch.

It might be a good idea to change your link to the `Py_CompileStringObject()` function in CPython’s `Python/pythonrun.c` [0] to a commit-based link [1].

[0]: https://github.com/python/cpython/blob/main/Python/pythonrun...

[1]: https://github.com/python/cpython/blob/967a4f1d180d4cd669d5c...

garbagepatch1y ago

Tip: in Github, press `y` when looking at a `/blob/branch/something` url to turn it into the current commit.

0x63_ProblemsOP1y ago

Thanks for pointing this out, the link is now updated!

jay-barronville1y ago

Sure thing. Great article, by the way!

[0]: https://news.ycombinator.com/user?id=dang

[1]: mailto:hn@ycombinator.com

[2]: https://news.ycombinator.com/item?id=41171593

[3]: https://news.ycombinator.com/item?id=26998308

formerly_proven1y ago· 1 in thread

0x63_ProblemsOP1y ago

pdhborges1y ago· 1 in thread

0x63_ProblemsOP1y ago

That's a good question. I don't have an easy way to rerun the comparison since this happened actually a while ago, but I do remember some relevant numbers.

truth_seeker1y ago

Preloading jemalloc binary can help here. Of course it wont be as efficient as using numpy 2.x especially for dealing with larger datasets.

jemalloc also gave good results with NodeJS and Ruby projects i did.

hackan1y ago

Nice article!

But I couldn't help but notice that when `_PyCompile_AstOptimize` fails (<0), then `arena` is never freed. I think this is bug :thinking:.

j / k navigate · click thread line to collapse