Sensible type-annotated Python code could be so much faster if it didn't have to assume everything could change at any time. Most things don't change, and if they do, they change on startup (e.g. ORM bindings).
  class SomeClass:
      def __init__(self):
          self.x = 0
      def SomeMethod(self):
          q = self.x
          # do stuff with q, because otherwise you're dereferencing self.x all the damn time
i don't understand what you think is nuts about this. it's an interpreted language and the word `self` is not special in any way (it's just convention - you can call the first param to a method anything you want). so there's no way for the interpreter/compiler/runtime to know you're accessing a field of the class itself (let alone that that field isn't a computed property or something like that).
lots of hot takes that people have (like this one) are rooted in just a fundamental misunderstanding of the language and programming languages in general <shrugs>.
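To make the point about `self` concrete, here is a small sketch. The class and attribute names are made up for illustration. It shows that the first parameter can be called anything, and that an attribute read can run arbitrary code, which is why caching it in a local changes how often that code runs.

```python
class Counter:
    """Toy class; names are hypothetical, chosen only for this example."""

    def __init__(this):              # "this" works exactly like "self"
        this._reads = 0

    @property
    def x(this):
        # A computed attribute: every access runs this code.
        this._reads += 1
        return 42

    def some_method(this):
        q = this.x                   # one property call, cached in a local
        return q + q                 # no further attribute lookups


c = Counter()
result = c.some_method()
reads_after_cached = c._reads        # the property ran once

total = c.x + c.x                    # the property runs twice more
reads_after_uncached = c._reads
```

Because `x` could be a property like this one, a runtime cannot silently fold repeated `this.x` reads into one without first proving that it is a plain field.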
https://github.com/abilian/p2w
NB: some preliminary results:
p2w is 4.03x SLOWER than gcc (geometric mean)
p2w is 5.50x FASTER than cpython (geometric mean)
p2w is 1.24x FASTER than pypy (geometric mean)
Definitely, but then it wouldn't be Python. One of the core principles of Python's design is to be extremely dynamic, and that anything can change at any time.
There are many other, pretty good, strictly dynamically typed languages which work just as well if not better than Python, for many purposes.
And when Python is a mainstream language on top of which large, globally known websites, AI tools, core system utilities, etc are built, we should give up the purity angle and be practical.
Even the new performance push in Python land is a reflection of this. A long time ago some optimizations were refused in order to not complicate the default Python implementation.
It is called type hints, and is already there. TS typing doesn't bring any perf benefits over plain JS.
  class Foo:
      __slots__ = ("a", "b")
      a: int
      b: float
there are multiple issues with Python that prevent optimizations:
* a user can define a subtype `class my_int(int)`, so you cannot optimize the layout of `class Foo`
* the builtin `int` is a big-int like number and `float` is a boxed object, so operations on them are branchy and allocating.
and the fact that Foo is mutable and that `id(foo.a)` has to produce something complicates things further.
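A quick sketch of those obstacles, reusing the `Foo` and `my_int` names from the comment above. The `__add__` override is an assumption added only to make the subtype's effect visible.

```python
class Foo:
    __slots__ = ("a", "b")
    a: int
    b: float


class my_int(int):
    """A user-defined int subtype with behaviour the annotation cannot predict."""

    def __add__(self, other):
        return "surprise"


foo = Foo()
foo.a = my_int(1)        # accepted: annotations are not enforced at runtime
foo.b = "not a float"    # also accepted

added = foo.a + 1        # dispatches to my_int.__add__, not int.__add__
big = 2 ** 100           # ints are arbitrary precision, not machine words
```

So even with `__slots__` and annotations, a compiler cannot assume that `foo.a` is a plain machine integer.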
Then it wouldn't be Python any more.
Even type annotations, though useful, can get in the way for certain tasks.
Betting on things like these to speed things up would be a mistake, since it would kind of force you to follow that style.
Anything that accelerates things should rely on run-time data, not on type annotations that won't change.
As far as I can tell, it only ever existed to make PyPy possible, and was only defined/specified in terms of PyPy's needs.
You could make this clean break and call it Python 4 but frankly I fear it won't be Python anymore.
All the dynamism from Python should stay where it is.
Just JIT and remember a type maybe, but do not force a type from a type hint or such things.
As a minimum, I would say not relying on that is the correct thing. You could exploit it, but not force it to change the semantics.
TL;DR: SPy is a variant of Python specifically designed to be statically compilable while retaining a lot of the "useful" dynamic parts of Python.
The effort is led by Antonio Cuni, Principal Software Engineer at Anaconda. Still very early days but it seems promising to me.
Great idea, but I'm not convinced that they learned anything from the Python 2 to 3 transition, so I wouldn't hold my breath.
If you want a language system without contempt for backward compatibility, you're probably better off with Java/C++/JavaScript/etc. (though using JS libraries is like building on quicksand.) Bit of a shame since I want to like Python/Rust/Swift/other modern-ish languages, but it turns out that formal language specifications were actually a pretty good idea. API stability is another.
It has nothing to do with whether the list is empty. It has nothing to do with lists at all. It's the behaviour of default arguments.
It happens at the time that the function object is created, which is during runtime.
You only notice because lists are mutable. You should already prefer not to mutate parameters, and it especially doesn't make sense to mutate a parameter that has a default value because the point of mutating parameters is that the change can be seen by the caller, but a caller that uses a default value can't see the default value.
The behaviour can be used intentionally. (I would argue that it's overused intentionally; people use it to "bind" loop variables to lambdas when they should be using `functools.partial`.)
If you're getting got by this, you're fundamentally expecting Python to work in a way that Pythonistas consider not to make sense.
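A minimal sketch of the behaviour discussed above. The function names are made up. It shows the default being created once, the usual `None` workaround, and the loop-variable binding case with both the default-argument trick and `functools.partial`.

```python
from functools import partial


def append_to(item, bucket=[]):      # the list is created once, at def time
    bucket.append(item)
    return bucket


first = append_to(1)
second = append_to(2)                # same list object as the first call


def append_to_fresh(item, bucket=None):
    if bucket is None:               # build a new list on each call instead
        bucket = []
    bucket.append(item)
    return bucket


# Binding loop variables: late-binding closures vs. defaults vs. partial.
late = [lambda: i for i in range(3)]
via_default = [lambda i=i: i for i in range(3)]
via_partial = [partial(str, i) for i in range(3)]
```

Calling each function in `late` gives the final value of `i`, while the other two lists capture the value at the time each callable was created.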
https://github.com/python/cpython/blob/3.14/Lib/json/encoder...
The default value is evaluated once, and accessing a parameter is much cheaper than accessing a global.
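A sketch of that idiom, with made-up function names. It is not the code from the linked file. It only shows that the name moves from the function's global lookups to its local variables.

```python
def count_slow(items):
    total = 0
    for item in items:
        total += len(item)           # "len" is resolved as a global/builtin
    return total


def count_fast(items, _len=len):     # default evaluated once, at def time
    total = 0
    for item in items:
        total += _len(item)          # "_len" is a local variable
    return total


words = ["ab", "cde", ""]
slow_total = count_slow(words)
fast_total = count_fast(words)

# Names looked up outside the function vs. names stored as locals.
slow_names = count_slow.__code__.co_names
fast_names = count_fast.__code__.co_names
fast_locals = count_fast.__code__.co_varnames
```

How much this saves depends on the interpreter version, so treat it as a micro-optimisation to measure rather than assume.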
Still churning on it, will probably publish it and do a proper blog post once I've built something interesting with the language itself.
This would be a potential case for a new major version number.
The more likely reason is that there simply hasn't been that big a push for it. Ruby was dog slow before the JIT and Rails was very popular, so there was a lot of demand and room for improvement. PHP was the primary language used by Facebook for a long time, and they had deep pockets. JS powers the web, so there's a huge incentive for companies like Google to make it faster. Python never really had that same level of investment, at least from a performance standpoint.
To your point, though, the C API has made certain types of optimizations extremely difficult, as the PyPy team has figured out.
A worthwhile JIT is a fully optimizing compiler, and that is the hard part. Language semantics are much less important - dynamic languages aren’t particularly harder here, but the performance roof is obviously just much lower.
Although your general sentiment is something I agree with (if it's going to be painful, do it and get it over with), I don't believe anybody knew or could've guessed what the reaction of the ecosystem would be.
Your last point about being able to change internals more freely is also great in theory but very difficult (if not impossible) to achieve in practice.
I don't know. Having maintained some small projects that were free and open source, I saw the hostility and entitlement that can come from that position. And those projects were a speck of dust next to something like Python. So I think the core team is doing the best they can. It was always going to be damned if you do, damned if you don't.
Slight tangent: if Claude can decimate IBM stock price by migrating off Cobol for cheap, surely we can do Python 2 to 3 now, too?
About the internals: we sort of missed an opportunity there, but back then they also didn't quite know what they were doing (or at least we have better ideas of what's useful today). And making the step from 2 to 3 even bigger might have been a bad idea?
Since the switch we have seen enormous companies being built from scratch. There is no reason for anyone to be complaining about it being too hard to upgrade in 2026
It wasn't until much later (I would say 3.4 or 3.5?) that we had good tooling to allow for migrating from Python 2 to Python 3 gradually, which is what most tools needed to do.
The final thing that made Python upgrading easy was making a bunch of changes (along with stuff like six) so that you could write code that would run identically in Python 2 and Python 3. That lets you do refactors over time, little cleanups, and not have the huge "move to Python 3" commit.
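For readers who never saw that era, single-source code looked roughly like the sketch below. `text_type` mimics the kind of alias `six` provides, and `host_of` is a made-up helper.

```python
import sys

if sys.version_info[0] >= 3:
    from urllib.parse import urlparse    # Python 3 location
    text_type = str
else:
    from urlparse import urlparse        # Python 2 location
    text_type = unicode                  # noqa: F821 (only defined on Python 2)


def host_of(url):
    """Return the network location of a URL on either major version."""
    return urlparse(url).netloc


host = host_of("https://example.com/path")
```

The rest of the module can then use `urlparse` and `text_type` without caring which interpreter is running it.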
The switch had nothing to do with Python's rise in popularity though; it was because of NumPy and later PyTorch being adopted by data scientists and later for machine learning tasks that themselves became very popular. Python's popularity rose alongside those.
> There is no reason for anyone to be complaining about it being too hard to upgrade in 2026
The "complaints" are about unnecessary and pointless breakage that made upgrading very difficult for many codebases for years. That by now most of these codebases have been abandoned, upgraded, or left on Python 2 until the end of time doesn't mean these pains didn't happen, nor that inflicting them on users was a good idea just because some largely unrelated external factors made the language popular several years later.
It's widely regarded as a disaster for good reason, and it forced some corrections in Python to fix it. Just because it's fine now does not mean it was always fine.
  if sys.version_info.major == 2:
      import old
  else:
      import new
Or worse, people used try/except in their imports.
Anyway, you can already try free-threaded builds that have the GIL disabled, but my experience is that most of your dependencies won't work.
Even the main driver for Python 3, the bytes-Unicode split, has unfortunately turned out to be sub-optimal. Python essentially bet on UTF-32 (with space-saving optimisations), while everyone else has chosen UTF-8.
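The "space-saving optimisations" can be observed from Python itself. This sketch assumes CPython 3.3 or later, where strings use 1, 2 or 4 bytes per character depending on the widest character present. The exact sizes are an implementation detail, so only the ordering is meaningful.

```python
import sys

n = 1000
ascii_size = sys.getsizeof("a" * n)              # 1 byte per character
bmp_size = sys.getsizeof("\u0394" * n)           # 2 bytes per character
astral_size = sys.getsizeof("\U0001F600" * n)    # 4 bytes per character

# The same non-ASCII text encoded as UTF-8 for comparison.
utf8_len = len(("\u0394" * n).encode("utf-8"))   # 2 bytes each in UTF-8
```

A single character outside the Basic Multilingual Plane switches the whole string to the 4-byte layout, which is where the comparison with UTF-32 comes from.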
Similarly, I don't entirely understand refcount elimination; I've seen the codegen difference, but since the codegen happens at build time, does this mean each opcode is possibly split into two (or more?) stencils, with and without removed increfs/decrefs? With so many opcodes and their specialized variants, how many stencils are there now?
https://open.spotify.com/show/1PGRfdrLEwgXjQbPBNk1pW
pablo and Łukasz
Thanks for your interest. This is something we could improve on. We were supposed to document the JIT better in 3.15, but right now we're crunching for the 3.15 release. I'll try to get to updating the docs soon if there's enough interest. PEP 744 does not document the new frontend.
I wrote a somewhat high-level overview here in a previous blog post https://fidget-spinner.github.io/posts/faster-jit-plan.html#...
> does this mean each opcode is possibly split into two (or more?) stencils, with and without removed increfs/decrefs?
This is a great question, the answer is not exactly! The key is to expose the refcount ops in the intermediate representation (IR) as one single op. For example, BINARY_OP becomes BINARY_OP, POP_TOP (DECREF), POP_TOP (DECREF). That way, instead of optimizing for n operations, we just need to expose refcounting of n operations and optimize only 1 op (POP_TOP). Thus, we just need to refactor the IR to expose refcounting (which was the work I divided up among the community).
If you have any more questions, I'm happy to answer them either in public or email.
I also did some reading and experiments, so quickly talking about things I've found out re: refcount elimination:
Previously given an expression `c = a + b`, the compiler generated a sequence of two LOADs (that increment the inputs' refcounts), then BINARY_OP that adds the inputs and decrements the refcounts afterwards (possibly deallocating the inputs).
But if the optimizer can prove that the inputs will definitely still have existing references after the addition finishes (like when `a` and `b` are local variables, or when an operand is immortal, like the constant `5` in `a+5`), then the entire incref/decref pair can be skipped. So in the new version, the DECREF part of BINARY_OP was split into separate uops, which the optimizer can then transform into POP_TOP_NOP.
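Here is how I picture that transformation, as a toy model. The tuples, the `eliminate_refcounts` function and the `known_alive` set are all invented for illustration and are not CPython's actual IR or optimizer code.

```python
# Toy model only: not CPython's real data structures.
def eliminate_refcounts(trace, known_alive):
    """Replace POP_TOP on operands known to stay alive with POP_TOP_NOP."""
    optimized = []
    for op, arg in trace:
        if op == "POP_TOP" and arg in known_alive:
            optimized.append(("POP_TOP_NOP", arg))   # decref proven unnecessary
        else:
            optimized.append((op, arg))
    return optimized


# c = a + b, with the decrefs exposed as separate ops after BINARY_OP.
trace = [
    ("LOAD", "a"),
    ("LOAD", "b"),
    ("BINARY_OP", "+"),
    ("POP_TOP", "a"),
    ("POP_TOP", "b"),
    ("STORE", "c"),
]

optimized = eliminate_refcounts(trace, known_alive={"a", "b"})
partially = eliminate_refcounts(trace, known_alive={"a"})
```

The point of the toy is that the pass only needs to understand one op (`POP_TOP`), no matter how many different operations produce it.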
And I'm assuming that although splitting an op this much would normally cost some performance (as the compiler can't optimize the pieces as well anymore), in this case it's usually worth it, as the optimization almost always succeeds. Even when it doesn't, the uops are still generated in several variants for various TOS cache (which is basically registers) states, so they still often codegen into just 1-2 opcodes on x86.
One thing I don't entirely understand, but that's super specific from my experiment, not sure if it's a bug or special case: I looked at tier2 traces for `for i in lst: (-i) + (-i)`, where `i` is an object of custom int-like class with overloaded methods (to control which optimizations happen). When its __neg__ returns a number, then I see a nice sequence of
_POP_TOP_INT_r32, _r21, _r10.
But when __neg__ returns a new instance of the int-like class, then it emits
_SPILL_OR_RELOAD_r31, _POP_TOP_r10, _SPILL_OR_RELOAD_r01, _POP_TOP_r10, etc.
Is there some specific reason why the "basic" pop is not specialized for TOS cache? Is it because it's the same opcode as in tier1, and it's just not worth it as it's optimized into specialized uops most of the time; or is it that it can't be optimized the same way because of the decref possibly calling user code?
https://discuss.python.org/t/pep-744-jit-compilation/50756/8... here's one thing
I do think you can also just outright ask questions about it on the forums and you'll get some answers.
At the end of the day there's only so many people working on this though.
I love playing with compilers for fun, so maybe I can shed some light. I’ll explain it in a simplified way for everyone’s benefit (going to ignore the stack):
When an object is passed between functions in Python, it doesn’t get copied. Instead, a reference to the object’s memory address is sent. This reference acts as a pointer to the object’s data. Think of it like a sticky note with the object’s memory address written on it. Now, imagine throwing away one sticky note every time a function that used a reference returns.
When an object has zero references, its memory can be freed and reused. Ensuring the number of references, or the “reference count”, is always accurate is therefore a big deal. Getting it wrong is often the source of memory leaks, but I wouldn’t attribute a speed-up to it (only if it replaces GC, then yes).
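The sticky-note picture can be checked directly. This sketch assumes CPython, where `sys.getrefcount` exposes the count. The function names are made up, and the absolute numbers vary by version, so only the difference is compared.

```python
import sys


def identity_inside(obj):
    # The parameter is another reference to the same object, not a copy.
    return id(obj)


def extra_references(obj, holders):
    before = sys.getrefcount(obj)
    for _ in range(3):
        holders.append(obj)          # three more "sticky notes"
    return sys.getrefcount(obj) - before


data = [1, 2, 3]
same_object = identity_inside(data) == id(data)
added = extra_references(data, [])
```

Once `holders` goes away at the end of the call, those three references are dropped again and the count returns to where it started.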
There is a story that Python is harder to optimize than, say, TypeScript, with Python's flexibility and the C API getting mentioned. Maybe, if the list of troublesome Python features was out there, programmers could know to avoid those features with the promise of activating the JIT when it can prove the feature is not in use. This could provide a way out of the current Python hard-to-JIT trap. It's just a gist of an idea, but certainly an interesting first step would be to hear from the JIT people which Python features they find troublesome.
[1] https://fidget-spinner.github.io/posts/faster-jit-plan.html
I think __del__ is tricky though. In theory __del__ is not meant to be reliable. In practice CPython reliably calls it cuz it reference counts. So people know about it and use it (though I've only really seen it used for best effort cleanup checks)
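The "best effort cleanup check" use looks something like this sketch. `Resource` and `events` are hypothetical names, and the immediate call on `del` is CPython behaviour, not a language guarantee.

```python
import gc

events = []


class Resource:
    def __init__(self, name):
        self.name = name
        self.closed = False

    def close(self):
        self.closed = True

    def __del__(self):
        # Best-effort check only; other implementations may run this late or never.
        if not self.closed:
            events.append("leaked: " + self.name)


r = Resource("db")
del r            # CPython: the refcount hits zero and __del__ runs right away
gc.collect()     # elsewhere a collection may be needed before it runs

ok = Resource("file")
ok.close()
del ok           # closed properly, so nothing is recorded
```

Code that relies on the first `del` firing immediately works on CPython and is exactly the kind of thing that makes changing the memory model risky.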
In a world where more people were using PyPy we could have pressure from that perspective to avoid leaning into it. And that would also generate more pressure to implement code that is performant in "any" system.
A big part of the problem is that much of the power of the Python ecosystem comes specifically from extensions/bindings written in languages with manual (C) or RAII/ref-counted (C++, Rust) memory management, and having predictable Python-level cleanup behavior can be pretty necessary to making cleanup behavior in bound C/C++/Rust objects work. Breaking this behavior or causing too much of a performance hit is basically a non-starter for a lot of Python users, even if doing so would improve the performance of "pure" Python programs.
Doesn't FinalizationRegistry let you do exactly that?
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...
This is more pedantry than a serious question. JavaScript has `WeakRef`; sure, it'd be cumbersome and inefficient because you'd need to manually make and poll each thing you wanted to observe, but could it not be said that it does provide a view on deallocations?
> Using str.frobnicate prevents TurboJit on line 63
The PSF is primarily a political advocacy organisation, so it wouldn't make sense for them to use the money for Python.
See https://github.com/numpy/numpy/issues/30416 for example. It's not being updated for compatibility with new versions of Python.
That is not remotely the case for anyone who produces quality work.
If you care about quality you absolutely can guide a machine to produce that for you without writing a single line of code yourself.
And I expect the amount of guidance needed will continue to drop.
In my experience the people who care the most about code readability tend to be the people most opinionated on having the right abstractions, which are historically not available in Go.
I recently read an interview about implementing free-threading and getting modifications through the ecosystem to really enable it: https://alexalejandre.com/programming/interview-with-ngoldba...
The guy said he hopes the free-threaded build'll be the only one in "3.16 or 3.17". I wonder if that should apply to the JIT too, or how the JIT and interpreter interact.
Having to have thread safe code all over the place just for the 1% of users who need to have multi-threading in Python and can't use subinterpreters for some reason is nuts.
Way more than 1% of the community, particularly of the community actively developing Python, wants free-threaded. The problem here is that the Python community consists of several different groups:
1. Basically pure Python code with no threading
2. Basically pure Python with appropriate thread safety
3. Basically pure Python code with already broken threaded code, just getting lucky for now
4. Mixed Python and C/C++/Rust code, with appropriate threading behavior in the C or C++ components
5. Mixed Python and C or C++ code, with C and C++ components depending on GIL behavior
Group 1 gets slightly reduced performance. Groups 2 and 4 get a major win with free-threaded Python, being able to use threading through their interfaces to C/C++/Rust components. Group 3 is already writing buggy code and will probably see worse consequences from their existing bugs. Group 5 will have to either avoid threading in their Python code or rewrite their C/C++ components.
Right now, a big portion of the Python language developer base consists of Groups 2 and 4. Group 5 is basically perceived as holding Python-the-language and Python-the-implementations back.
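For what "appropriate thread safety" in Group 2 means in practice, here is a minimal sketch. `SafeCounter` and `hammer` are made-up names. The shared state is guarded by an explicit lock, so the result does not depend on whether a GIL is present.

```python
import threading


class SafeCounter:
    """Hypothetical example class; guards its state with an explicit lock."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:             # correctness does not depend on the GIL
            self._value += 1

    def value(self):
        with self._lock:
            return self._value


def hammer(counter, n_threads=4, n_increments=5000):
    def worker():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value()


total = hammer(SafeCounter())
```

Group 3 code is the same thing without the lock: an unguarded `self._value += 1` is a read-modify-write that can lose updates, and it only appears to work when the interpreter happens not to switch threads in the middle.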
Microsoft used to do this for their C runtime library.
I think the GIL provides python with a great guarantee, I would probably prefer single-thread performance improvements over multithreading in python to be honest.
Anyway if I need performance, Python would probably not be my first choice
blueberry (aarch64)
  Description: Raspberry Pi 5, 8GB RAM, 256GB SSD
  OS: Debian GNU/Linux 12 (bookworm)
  Owner: Savannah Ostrowski
ripley (x86_64)
  Description: Intel i5-8400 @ 2.80GHz, 8GB RAM, 500GB SSD
  OS: Ubuntu 24.04
  Owner: Savannah Ostrowski
jones (aarch64)
  Description: Apple M3 Pro, 18GB RAM, 512GB SSD
  OS: macOS
  Owner: Savannah Ostrowski
prometheus (x86_64)
  Description: AMD Ryzen 5 3600X @ 3.80GHz, 16GB RAM
  OS: Windows 11 Pro
  Owner: Savannah Ostrowski
Like, this is a big deal to get a project to a state where volunteers are spun up and actively breaking down tasks and getting work done, no? It's a Python JIT, something I know next to nothing about (as do most application developers), which tells one how difficult this must have been.
The funding was Microsoft employing most of the team. They were laid off (or at least, moved onto different projects), apparently because they weren't working on AI.
`from __future__ import time_travel`
But I do agree that it would be a bit clearer to talk in terms of time taken rather than speedup % i.e. instead of "20% slowdown to over 100% speedup" it's clearer to say "takes between 50% and 125% of the original time". (Especially since people very often say things like "3 times faster", which technically means 4 times as fast, when they should say "3 times as fast"; "takes 1/3 of the time" is unambiguous.)
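The conversion is easy to get wrong in your head, so here it is spelled out. Both function names are made up for this example, and a "slowdown" is read as a reduction in speed, which is the reading that gives the 125% figure above.

```python
def relative_time(speedup_percent):
    """Fraction of the original time taken, given a speed change in percent.

    +100 means "100% faster" (twice as fast); -20 means "20% slower".
    """
    return 1.0 / (1.0 + speedup_percent / 100.0)


def times_faster_to_times_as_fast(n):
    # "n times faster", read literally: the original speed plus n times more.
    return n + 1


slowest = relative_time(-20)       # 20% slowdown
fastest = relative_time(100)       # 100% speedup
literal = times_faster_to_times_as_fast(3)
```

So "3 times faster", read literally, means a quarter of the original time, while "3 times as fast" means a third of it.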
> By using only a single instruction and two tables, we only increase the interpreter by a size of 1 instruction, and also keep the base interpreter ultra fast. I affectionally call this mechanism dual dispatch.
I really do hope they'll write that better explanation one day because this sounds pretty intriguing all on its own.
Kudos to those involved in making it happen.