Sensible type-annotated python code could be so much faster if it didn't have to assume everything could change at any time. Most things don't change, and if they do they change on startup (e.g. ORM bindings).
class SomeClass
def init(self)
self.x = 0
def SomeMethod(self)
q = self.x
## do stuff with q, because otherwise you're dereferencing self.x all the damn timei don't understand what you think is nuts about this. it's an interpreted language and the word `self` is not special in any way (it's just convention - you can call the first param to a method anything you want). so there's no way for the interpreter/compiler/runtime to know you're accessing a field of the class itself (let alone that that field isn't a computed property or something like that).
lots of hottakes that people have (like this one) are rooted in just a fundamental misunderstanding of the language and programming languages in general <shrugs>.
A common approach is hidden classes that work much like classes in other languages. Reading a simple int property just reads bytes at an offset from the object pointer directly. Upon entry to the method bits of the object are tested and if the object is not known to be simple it escapes into the full dynamic machinery.
I don't know if those exact techniques would work for Python but this is not an either-or situation.
See also: modern Objective-C msg_Send which is so fast on modern hardware for the fast-path it is rarely a performance bottleneck. Despite being able to add dynamic subclasses or message forward at runtime.
The name `self` is a convention, yes, but interestingly in python methods the first parameter is special beyond the standard "bound method" stuff. See for example PEP 367 (New Super) for how `super()` resolution works (TL;DR the super function is a special builtin that generates extra code referencing the first parameter and the lexically defining class)
Funnily enough I’ve found Python to be excellent for modelling my problem domain with Pydantic (so far basically unparalleled, open for suggestions in Go/Rust), while the language also gets out of my way when I get creative with list expressions and the like. So overall, still it is extremely productive for the work I’m doing, I just need to spin up more containers in prod.
And what prevents someone from designing such a language?
https://github.com/abilian/p2w
NB: some preliminary results:
p2w is 4.03x SLOWER than gcc (geometric mean)
p2w is 5.50x FASTER than cpython (geometric mean)
p2w is 1.24x FASTER than pypy (geometric mean)Definitely, but then it wouldn't be Python. One of the core principles of Python's design is to be extremely dynamic, and that anything can change at any time.
There are many other, pretty good, strictly dynamically typed languages which work just as well if not better than Python, for many purposes.
And when Python is a mainstream language on top of which large, globally known websites, AI tools, core system utilities, etc are built, we should give up the purity angle and be practical.
Even the new performance push in Python land is a reflection of this. A long time ago some optimizations were refused in order to not complicate the default Python implementation.
It is called type hints, and is already there. TS typing doesn't bring any perf benefits over plain JS.
class Foo:
__slots__ = ("a", "b")
a: int
b: float
there are multiple issues with Python that prevent optimizations:* a user can define subtype `class my_int(int)`, so you cannot optimize the layout of `class Foo`
* the builtin `int` and `float` are big-int like numbers, so operations on them are branchy and allocating.
and the fact that Foo is mutable and that `id(foo.a)` has to produce something complicates things further.
Therefore Python has no use for TS-like superset, because it already has facilities for static analysis with no bearing on runtime, which is what TS provides.
Then it wouldn't be Python any more.
Even type annotations, though useful, can get in the way for certain tasks.Betting on things like these to speed up things would be a mistake, since it would kind of force you to follow that style.
Anything that accelearates things should rely on run-time data, not on type annotations that won't change.
As far as I can tell, it only ever existed to make PyPy possible, and was only defined/specified in terms of PyPy's needs.
You could make this clean break and call it Python 4 but frankly I fear it won't be Python anymore.
I think it's not so much that Pytest is using obscure language features (decorators are cool and the obvious choice for a lot of this kind of stuff) but that it wants too much magic to happen in terms of how the "fixtures" automatically connect together. I would think that "Explicit is better than implicit" and "Simple is better than complex" go double for tests. But things like `pytest.mark.parametrize` are extremely useful.
All the dynamism from Python should stay where it is.
Just JIT and remember a type maybe, but do not force a type from a type hint or such things.
As a minimum, I would say not relying on that is the correct thing. You could exploit it, but not force it to change the semantics.
TL;DR: SPy is a variant of Python specifically designed to be statically compilable while retaining a lot of the "useful" dynamic parts of Python.
The effort is led by Antonio Cuni, Principal Software Engineer at Anaconda. Still very early days but it seems promising to me.
Great idea, but I'm not convinced that they learned anything from the Python 2 to 3 transition, so I wouldn't hold my breath.
If you want a language system without contempt for backward compatibility, you're probably better off with Java/C++/JavaScript/etc. (though using JS libraries is like building on quicksand.) Bit of a shame since I want to like Python/Rust/Swift/other modern-ish languages, but it turns out that formal language specifications were actually a pretty good idea. API stability is another.
It has nothing to do with whether the list is empty. It has nothing to do with lists at all. It's the behaviour of default arguments.
It happens at the time that the function object is created, which is during runtime.
You only notice because lists are mutable. You should already prefer not to mutate parameters, and it especially doesn't make sense to mutate a parameter that has a default value because the point of mutating parameters is that the change can be seen by the caller, but a caller that uses a default value can't see the default value.
The behaviour can be used intentionally. (I would argue that it's overused intentionally; people use it to "bind" loop variables to lambdas when they should be using `functools.partial`.)
If you're getting got by this, you're fundamentally expecting Python to work in a way that Pythonistas consider not to make sense.
It's just slightly annoying having to work around this by defaulting to None.
https://github.com/python/cpython/blob/3.14/Lib/json/encoder...
Default value is evaluated once, and accessing parameter is much cheaper than global
Still churning on it, will probably publish it and do a proper blog post once I've built something interesting with the language itself.
Similarly, I don't entirely understand refcount elimination; I've seen the codegen difference, but since the codegen happens at build time, does this mean each opcode is possibly split into two (or more?) stencils, with and without removed increfs/decrefs? With so many opcodes and their specialized variants, how many stencils are there now?
https://open.spotify.com/show/1PGRfdrLEwgXjQbPBNk1pW
pablo and Łukasz
Thanks for your interest. This is something we could improve on. We were supposed to document the JIT better in 3.15, but right now we're crunching for the 3.15 release. I'll try to get to updating the docs soon if there's enough interest. PEP 744 does not document the new frontend.
I wrote a somewhat high-level overview here in a previous blog post https://fidget-spinner.github.io/posts/faster-jit-plan.html#...
> does this mean each opcode is possibly split into two (or more?) stencils, with and without removed increfs/decrefs?
This is a great question, the answer is not exactly! The key is to expose the refcount ops in the intermediate representation (IR) as one single op. For example, BINARY_OP becomes BINARY_OP, POP_TOP (DECREF), POP_TOP (DECREF). That way, instead of optimizing for n operations, we just need to expose refcounting of n operations and optimize only 1 op (POP_TOP). Thus, we just need to refactor the IR to expose refcounting (which was the work I divided up among the community).
If you have any more questions, I'm happy to answer them either in public or email.
I also did some reading and experiments, so quickly talking about things I've found out re: refcount elimination:
Previously given an expression `c = a + b`, the compiler generated a sequence of two LOADs (that increment the inputs' refcounts), then BINARY_OP that adds the inputs and decrements the refcounts afterwards (possibly deallocating the inputs).
But if the optimizer can prove that the inputs definitely will have existing references after the addition finishes (like when `a` and `b` are local variables, or if they are immortals like `a+5`), then the entire incref/decref pair could be ignored. So in the new version, the DECREFs part of the BINARY_OP was split into separate uops, which are then possibly transformed into POP_TOP_NOP by the optimizer.
And I'm assuming that although normally splitting an op this much would usually cost some performance (as the compiler can't optimize them as well anymore), in this case it's usually worth it as the optimization almost always succeeds, and even if it doesn't, the uops are still generated in several variants for various TOS cache (which is basically registers) states so they still often codegen into just 1-2 opcodes on x86.
One thing I don't entirely understand, but that's super specific from my experiment, not sure if it's a bug or special case: I looked at tier2 traces for `for i in lst: (-i) + (-i)`, where `i` is an object of custom int-like class with overloaded methods (to control which optimizations happen). When its __neg__ returns a number, then I see a nice sequence of
_POP_TOP_INT_r32, _r21, _r10.
But when __neg__ returns a new instance of the int-like class, then it emits
_SPILL_OR_RELOAD_r31, _POP_TOP_r10, _SPILL_OR_RELOAD_r01, _POP_TOP_r10, etc.
Is there some specific reason why the "basic" pop is not specialized for TOS cache? Is it because it's the same opcode as in tier1, and it's just not worth it as it's optimized into specialized uops most of the time; or is it that it can't be optimized the same way because of the decref possibly calling user code?
https://discuss.python.org/t/pep-744-jit-compilation/50756/8... here's one thing
I do think you can also just outright ask questions about it on the forums and you'll get some answers.
At the end of the day there's only so many people working on this though.
I love playing with compilers for fun, so maybe I can shed some light. I’ll explain it in a simplified way for everyone’s benefit (going to ignore the stack):
When an object is passed between functions in Python, it doesn’t get copied. Instead, a reference to the object’s memory address is sent. This reference acts as a pointer to the object’s data. Think of it like a sticky note with the object’s memory address written on it. Now, imagine throwing away one sticky note every time a function that used a reference returns.
When an object has zero references, it can be freed from memory and reused. Ensuring the number of references, or the “reference count” is always accurate is therefore a big deal. It is often the source of memory leaks, but I wouldn’t attribute it to a speed up (only if it replaces GC, then yes).
Although your general sentiment is something I agree with(if it's going to be painful do it and get it over with), I don't believe anybody knew or could've guessed what the reaction of the ecosystem would be.
Your last point about being able to change internals more freely is also great in theory but very difficult(if not impossible) to achieve in practice.
I don't know. Having maintained some small projects that were free and open source, I saw the hostility and entitlement that can come from that position. And those projects were a spec of dust next to something like Python. So I think the core team is doing the best they can. It was always going to be damned if you do, damned if you don't.
Slight tangent: if Claude can decimate IBM stock price by migrating off Cobol for cheap, surely we can do Python 2 to 3 now, too?
About the internals: we sort of missed an opportunity there, but back then there also didn't quite know what they were doing (or at least we have better ideas of what's useful today). And making the step from 2 to 3 even bigger might have been a bad idea?
In my experience, the problem had always been maintaining the business logic and any integrations with third-party software that also may be running legacy code-bases or have been abandoned. It can get quite complicated, from what I've seen. Now of course if you're talking about well maintained code-bases with 100%, or close to 100% test coverage, and that includes the integration part along with having the ability to maintain the user experience and/or user interface then yes it becomes a relatively easy process of "just write the code". But, in my experience, this has never been the case.
For the 2.x code-bases I maintain, the customers simply doesn't want to pay for any of it. They might choose to at a later time, but so far it has been more cost effective for them to pay me to maintain that legacy code than pay to have it migrated. Other customers have different needs and thus budget differently.
I'll refrain from judging if 2 to 3 was a missed opportunity or not. I believe the core team does actually know what they're doing and that any decision would've been criticized.
It does much better with good tests. In my case the output was a statically generated website, so I could just say 'make the same website, given these inputs'.
Since the switch we have seen enormous companies being built from scratch. There is no reason for anyone to be complaining about it being too hard to upgrade in 2026
It wasn't until much later (I would say 3.4 or 3.5?) that we had good tooling to allow for migrating from Python 2 to Python 3 gradually, which is what most tools needed to do.
The final thing that made Python upgrading easy was making a bunch of changes (along with stuff like six) so that you could write code that would run identically in Python 2 and Python 3. That lets you do refactors over time, little cleanups, and not have the huge "move to Python 3" commit.
The switch had nothing to do with Python's rise in popularity though, it was because of NumPy and later PyTorch being adopted by data scientist and later machine learning tasks that themselves became very popular. Python's popularity rose alongside those.
> There is no reason for anyone to be complaining about it being too hard to upgrade in 2026
The "complaints" are about unnecessary and pointless breakage, that was very difficult for many codebases to upgrade for years. That by now most of these codebases have been either abandoned, upgraded or decided to stick with Python2 until the end of time doesn't mean these pains didn't happen nor that the language's developers inflicting them to their users were a good idea because some largely unrelated external factors made the language popular several years later.
In case people have forgotten: python 3.3 through 3.5 (and 3.6 I think) each had to reintroduce something that was removed to make the upgrade easier. Jumping from 2.7 to 3.3 (or higher depending on what you needed) was the recommended route because of this, it was less work than going to 3.0, 3.1, or 3.2
You don't realise it but you have actually made my point. People are acting like was a suicidal, harmful move that fucked over Python and ... it didn't matter because Python went on to become insanely popular. Particularly hilarious is that some people have tried to argue that Perl somehow did it better.
In the end your Python 2 project was either worth porting and you did it, or it wasn't worth it and you didn't. Python has gone on to be an undeniable success and it is nearly unthinkable that anyone would use Python 2 for anything. It is absolutely fucking insane for people to still be whining about this in this day and age
Its widely regarded as a disaster for good reason, that forced some corrections in python to fix it. Just because its fine now, does not mean it was always fine
if sys.version_info.major == 2:
import old
else:
import new
Or worse, people used try/except in their imports.Anyway you can already try freethreaded builds that have the GIL disabled, but my experience is that most of your dependencies won't work.
Even the main driver for Python 3, the bytes-Unicode split, has unfortunately turned out to be sub-optimal. Python essentially bet on UTF-32 (with space-saving optimisations), while everyone else has chosen UTF-8.
How so? Python3 strings are unicode and all the encoding/decoding functions default to utf-8. In practice this means all the python I write is utf-8 compatible unicode and I don't ever have to think about it.
It did nothing of the sort. UTF-8 is the default source file encoding and has been the target for many APIs. It likely would have been the default for all I/O stuff if we lived in a world where Windows had functioning Unicode in the terminal the whole time and didn't base all its internal APIs on UTF-16.
I assume you're referring to the internal representation of strings. Describing it as "UTF-32 with space-saving optimizations" is missing the point, and also a contradiction in terms. Yes, it is a system that uses the same number of bytes per character within a given string (and chooses that width according to the string contents). This makes random access possible. Doing anything else would have broken historical expectations about string slicing. There are good arguments that one shouldn't write code like that anyway, but it's hard to identify anything "sub-optimal" about the result except that strings like "I'm learning 日本語" use more memory than they might be able to get away with. (But there are other strings, like "ℍℯℓ℗", that can use a 2-byte width while the UTF-8 encoding would add 3 bytes per character.)
There is a story that Python is harder to optimize than, say, Typescript, with Python flexibility and the C API getting mentioned. Maybe, if the list of troublesome Python features was out there, programmers could know to avoid those features with the promise of activating the JIT when it can prove the feature is not in use. This could provide a way out of the current Python hard-to-JIT trap. It's just a gist of an idea, but certainly an interesting first step would be to hear from the JIT people which Python features they find troublesome.
[1] https://fidget-spinner.github.io/posts/faster-jit-plan.html
I think __del__ is tricky though. In theory __del__ is not meant to be reliable. In practice CPython reliably calls it cuz it reference counts. So people know about it and use it (though I've only really seen it used for best effort cleanup checks)
In a world where more people were using PyPy we could have pressure from that perspective to avoid leaning into it. And that would also generate more pressure to implement code that is performant in "any" system.
A big part of the problem is that much of the power of the Python ecosystem comes specifically from extensions/bindings written in languages with manual (C) or RAII/ref-counted (C++, Rust) memory management, and having predictable Python-level cleanup behavior can be pretty necessary to making cleanup behavior in bound C/C++/Rust objects work. Breaking this behavior or causing too much of a performance hit is basically a non-starter for a lot of Python users, even if doing so would improve the performance of "pure" Python programs.
Doesn't FinalizationRegistry let you do exactly that?
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...
> A conforming JavaScript implementation, even one that does garbage collection, is not required to call cleanup callbacks. When and whether it does so is entirely down to the implementation of the JavaScript engine. When a registered object is reclaimed, any cleanup callbacks for it may be called then, or some time later, or not at all. It's likely that major implementations will call cleanup callbacks at some point during execution, but those calls may be substantially after the related object was reclaimed. Furthermore, if there is an object registered in two registries, there is no guarantee that the two callbacks are called next to each other — one may be called and the other never called, or the other may be called much later. There are also situations where even implementations that normally call cleanup callbacks are unlikely to call them:
I remember a couple of years ago (well probably around 2021) reading about GC exposure concerns and seeing some line in some TC39 doc like "users should not have visibility into collection" but if we've shipped weakrefs sounds like we're not thinking about that anymore
This is more pedantry than a serious question. JavaScript has WeakReference, sure it'd be cumbersome and inefficient because you'd need to manually make and poll each thing you wanted to observe, but could it not be said that it does provide a view on deallocations?
Note that 90% of the uses for them actually shouldn't be using them, usually for subtle reasons. It's always a big cause for debate.
> Using str.frobnicate prevents TurboJit on line 63
> By using only a single instruction and two tables, we only increase the interpreter by a size of 1 instruction, and also keep the base interpreter ultra fast. I affectionally call this mechanism dual dispatch.
I really do hope they'll write that better explanation one day because this sounds pretty intriguing all on its own.
I recently read an interview about implementing free-threading and getting modifications through the ecosystem to really enable it: https://alexalejandre.com/programming/interview-with-ngoldba...
The guy said he hopes the free-threaded build'll be the only one in "3.16 or 3.17", I wonder if that should apply to the JIT too or how the JIT and interpreter interact.
Having to have thread safe code all over the place just for the 1% of users who need to have multi-threading in Python and can't use subinterpreters for some reason is nuts.
Way more than 1% of the community, particularly of the community actively developing Python, wants free-threaded. The problem here is that the Python community consists of several different groups:
1. Basically pure Python code with no threading
2. Basically pure Python with appropriate thread safety
3. Basically pure Python code with already broken threaded code, just getting lucky for now
4. Mixed Python and C/C++/Rust code, with appropriate threading behavior in the C or C++ components
5. Mixed Python and C or C++ code, with C and C++ components depending on GIL behavior
Group 1 gets a slightly reduced performance. Groups 2 and 4 get a major win with free-threaded Python, being able to use threading through their interfaces to C/C++/Rust components. Group 3 is already writing buggy code and will probably see worse consequences from their existing bugs. Group 5 will have to either avoid threading in their Python code or rewrite their C/C++ components.
Right now, a big portion of the Python language developer base consists of Groups 2 and 4. Group 5 is basically perceived as holding Python-the-language and Python-the-implementations back.
Native code can already be multi-threaded so if you are using Python to drive parallelized native code, there's no win there. If your Python code is the bottleneck, well then you could have subinterpreters with shared buffers and locks. If you really need to have shared objects, do you actually need to mutate them from multiple interpreters? If not, what about exploring language support for frozen objects or proxies?
The only thing that free threading gives you is concurrent mutations to Python objects, which is like, whatever. In all my years of writing Python I have never once found myself thinking "I wish I could mutate the same object from two different threads".
Microsoft used to do this for their C runtime library.
For instance, this would have been a valid complaint:
"Users who don't need free threading will now suffer a performance penalty for their single-threaded code."
That is true. But if you are currently using multiple threads, code that was correct before will still be correct in the free threaded build, and code that was incorrect before will still be incorrect.
I think the GIL provides python with a great guarantee, I would probably prefer single-thread performance improvements over multithreading in python to be honest.
Anyway if I need performance, Python would probably not be my first choice
The PSF is primarily a political advocacy organisation, so it wouldn't make sense for them to use the money for Python.
See https://github.com/numpy/numpy/issues/30416 for example. It's not being updated for compatibility with new versions of Python.
Kudos to those involved into making it happen.
Like this is a big deal to get a project to a state where volunteers are spun up and actively breaking tasks and getting work done, no? It's a python JIT something I know next to nothing about — as do most application developers — which tells one how difficult this must have been.
The funding was Microsoft employing most of the team. They were laid off (or at least, moved onto different projects), apparently because they weren't working on AI.
This would be a potential case for a new major version number.
The more likely reason is that there simply hasn't been that big a push for it. Ruby was dog slow before the JIT and Rails was very popular, so there was a lot of demand and room for improvement. PHP was the primary language used by Facebook for a long time, and they had deep pockets. JS powers the web, so there's a huge incentive for companies like Google to make it faster. Python never really had that same level of investment, at least from a performance standpoint.
To your point, though, the C API has made certain types of optimizations extremely difficult, as the PyPy team has figured out.
Or lack of incentive?
Alot of big python projects that does machine learning and data processing offloads the heavy data processing from pure python code to libraries like numpy and pandas that take advantage of C api binding to do native execution.
But the main problem was actually that pypy was never adopted as “the JIT” mechanism. That would have made a huge difference a long time ago and made sure they evolved in lock step.
A worthwhile JIT is a fully optimizing compiler, and that is the hard part. Language semantics are much less important - dynamic languages aren’t particularly harder here, but the performance roof is obviously just much lower.
Including simply implementing the slow parts in C, such as the high performance machine learning ecosystem that exists in Python.
blueberry (aarch64)
Description: Raspberry Pi 5, 8GB RAM, 256GB SSD
OS: Debian GNU/Linux 12 (bookworm)
Owner: Savannah Ostrowski
ripley (x86_64)
Description: Intel i5-8400 @ 2.80GHz, 8GB RAM, 500GB SSD
OS: Ubuntu 24.04
Owner: Savannah Ostrowski
jones (aarch64)
Description: Apple M3 Pro, 18GB RAM, 512GB SSD
OS: macOS
Owner: Savannah Ostrowski
prometheus (x86_64)
Description: AMD Ryzen 5 3600X @ 3.80GHz, 16GB RAM
OS: Windows 11 Pro
Owner: Savannah OstrowskiThat is not remotely the case for anyone who produces quality work.
If you care about quality you absolutely can guide a machine to produce that for you without writing a single line of code yourself.
And I expect the amount of guidance needed will continue to drop.
In my experience the people who care the most about code readability tend to be the people most opinionated on having the right abstractions, which are historically not available in Go.
I just happen to read what the machine writes, which is a way to both learn and inspect. So yes, I care about the code being relatively (and I stress relatively) easy to read. Go is ok there.
`from future import time_travel`
But I do agree that it would be a bit clearer to talk in terms of time taken rather than speedup % i.e. instead of "20% slowdown to over 100% speedup" it's clearer to say "takes between 50% and 125% of the original time". (Especially since people very often say things like "3 times faster", which technically means 4 times as fast, when they should say "3 times as fast"; "takes 1/3 of the time" is unambiguous.)