[0] https://www.pypy.org/contact.html [1] https://www.pypy.org/posts/2022/11/pypy-and-conda-forge.html [2] https://www.pypy.org/download.html [3] https://www.pypy.org/contact.html
Moving to PyPy definitely sped things up a bit. Not as much as I'd hoped; the time probably all goes into string-keyed dict lookups and dict management. I may recode it as a radix tree. It's hard to work out in advance how different that would be: people have optimised the core data structures pretty well.
Moving up from regular Python was trivial. Most of the dev time went into fixing pip3 for PyPy on Debian, not knowing which apt packages to install, amid a lot of "stop using pip" messaging.
I’m sure it’s better if you’re deploying an appliance that you hand off and never touch again, but for evolving modern Python servers it’s not well suited.
I still haven't figured out how to beat this dragon. All suggestions welcome!
Hi, I'm one of the people that look after this bit of Debian (and it's exactly the same in Ubuntu, FWIW).
It's like that to solve a problem (of course, everything has a reason). The idea is that Debian provides a Python that's deeply integrated into Debian packages. But if you want to build your own Python from source, you can. What you build will use site-packages, so it won't have any overlap with Debian's Python.
Unfortunately, while this approach was designed to be something all package-managed distributions could do, nobody else has adopted it, and consequently the code to make it work has never been pushed upstream. So, it's left as a Debian/Ubuntu oddity that confuses people. Sorry about that.
My recommendations are: 1. If you want more control over your Python than you get from Debian's package-managed python, build your own from source (or use a docker image that does that). 2. Deploy your apps with virtualenvs or system-level containers per app.
Since PEP 665 was rejected, the Python ecosystem continues to lack a reasonable package manager, and the lack of hash-based lock files prevents building on top of the current Python project/package managers.
Docker
There's no docs so obviously this might not be for you. But the software does work, and is efficient. It's been executed many many millions of times now.
The chance of colliding on the 64-bit space is low if the hash distributes evenly, so you just yolo it.
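To put a rough number on "low", here is a minimal sketch using the standard birthday approximation. It assumes the hash really is uniform over the 64-bit space; the function name and the key counts are only illustrative.

```python
import math


def collision_probability(n_keys: int, bits: int = 64) -> float:
    """Approximate chance that at least two of n_keys uniform hashes collide."""
    space = 2.0 ** bits
    # Birthday approximation: p ~= 1 - exp(-n(n-1) / (2 * space)).
    # expm1 keeps precision when the exponent is tiny.
    return -math.expm1(-n_keys * (n_keys - 1) / (2.0 * space))


for n in (10**6, 10**8, 10**9):
    print(f"{n:>13,} keys -> p ~ {collision_probability(n):.3g}")
```

Under that assumption the risk stays tiny for millions of keys and only becomes noticeable around a billion.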
Cool. Is the performance here something you would like to pursue? If so could you open an issue [0] with some kind of reproducer?
I need to find out how to instrument the seek/add cost of threads against the shared dict under a lock.
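One possible way to instrument this, sketched with only the standard library: wrap the lock so it accumulates how long threads wait to acquire it and how long they hold it. `TimedLock`, `worker`, and the key set are hypothetical names for illustration, not taken from the program being discussed.

```python
import threading
import time


class TimedLock:
    """Wraps a Lock and accumulates time spent waiting for it and holding it."""

    def __init__(self):
        self._lock = threading.Lock()
        self.wait_s = 0.0
        self.hold_s = 0.0
        self.acquisitions = 0

    def __enter__(self):
        t0 = time.perf_counter()
        self._lock.acquire()
        self._t_acquired = time.perf_counter()
        # Safe to update the counters here: this thread holds the lock.
        self.wait_s += self._t_acquired - t0
        self.acquisitions += 1
        return self

    def __exit__(self, *exc):
        self.hold_s += time.perf_counter() - self._t_acquired
        self._lock.release()
        return False


def worker(counts, lock, keys):
    for key in keys:
        with lock:
            counts[key] = counts.get(key, 0) + 1


counts = {}
lock = TimedLock()
keys = [f"key{i % 1000}" for i in range(50_000)]
threads = [threading.Thread(target=worker, args=(counts, lock, keys)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"acquisitions={lock.acquisitions} wait={lock.wait_s:.3f}s hold={lock.hold_s:.3f}s")
```

The timing calls add overhead of their own, so the ratio of wait to hold time is more informative than the absolute numbers.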
My gut feel is that if I inlined things instead of calling out to functions I'd probably shave off a bit more too. So saying "slower than expected" may be unfair, because there are limits to how much you can speed this kind of thing up. That's why I wondered if alternate data structures were a better fit.
It's variable-length string indexes into lists/dicts of integer counts. The advantage of a radix trie would be finding the record in semi-constant time relative to the length in bits of the strings, and they do form prefix sets.
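As a rough illustration of the idea, here is a plain per-character trie holding integer counts. It is deliberately not a compressed, bit-level radix trie, and `CountingTrie` and its methods are hypothetical names. Lookup cost depends on key length, and the prefix sets come for free.

```python
class TrieNode:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}
        self.count = 0


class CountingTrie:
    """Maps string keys to integer counts; keys sharing a prefix share nodes."""

    def __init__(self):
        self.root = TrieNode()

    def add(self, key, n=1):
        node = self.root
        for ch in key:
            nxt = node.children.get(ch)
            if nxt is None:
                nxt = node.children[ch] = TrieNode()
            node = nxt
        node.count += n

    def _find(self, prefix):
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def get(self, key):
        node = self._find(key)
        return node.count if node else 0

    def prefix_total(self, prefix):
        """Sum the counts of every key that starts with prefix."""
        start = self._find(prefix)
        if start is None:
            return 0
        total, stack = 0, [start]
        while stack:
            node = stack.pop()
            total += node.count
            stack.extend(node.children.values())
        return total


trie = CountingTrie()
for key in ("10.0.0", "10.0.1", "10.0.1", "10.1"):
    trie.add(key)
print(trie.get("10.0.1"), trie.prefix_total("10.0"), trie.prefix_total("10."))
```

Whether this beats a built-in dict is an open question: in pure Python every step is a dict lookup anyway, so it would need measuring on the real data.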
By definition if you lift something it is going to go up, but what does this mean?
Some engines can't build and deploy all imports.
Some engines demand syntactic sugar to do their work. Pypy doesn't
I'm very curious about where the line is/should be.
Create venv and activate it and install packages:
python3 -m venv venv
source venv/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install wheel
pip install -r requirements.txt
I wanted a similar one-liner that I could use on a fresh Ubuntu machine so I can try out PyPy easily in the same way. After a bit of fiddling, I came up with this monstrosity which should work with both bash and zsh (though I only tested it on zsh):
Create venv and activate it and install packages using pyenv/pypy/pip:
if [ -d "$HOME/.pyenv" ]; then rm -Rf $HOME/.pyenv; fi && \
curl https://pyenv.run | bash && \
DEFAULT_SHELL=$(basename "$SHELL") && \
if [ "$DEFAULT_SHELL" = "zsh" ]; then RC_FILE=~/.zshrc; else RC_FILE=~/.bashrc; fi && \
if ! grep -q 'export PATH="$HOME/.pyenv/bin:$PATH"' $RC_FILE; then echo -e '\nexport PATH="$HOME/.pyenv/bin:$PATH"' >> $RC_FILE; fi && \
if ! grep -q 'eval "$(pyenv init -)"' $RC_FILE; then echo 'eval "$(pyenv init -)"' >> $RC_FILE; fi && \
if ! grep -q 'eval "$(pyenv virtualenv-init -)"' $RC_FILE; then echo 'eval "$(pyenv virtualenv-init -)"' >> $RC_FILE; fi && \
source $RC_FILE && \
LATEST_PYPY=$(pyenv install --list | grep -P '^ pypy[0-9\.]*-\d+\.\d+' | grep -v -- '-src' | tail -1) && \
LATEST_PYPY=$(echo $LATEST_PYPY | tr -d '[:space:]') && \
echo "Installing PyPy version: $LATEST_PYPY" && \
pyenv install $LATEST_PYPY && \
pyenv local $LATEST_PYPY && \
pypy -m venv venv && \
source venv/bin/activate && \
pip install --upgrade pip && \
pip install wheel && \
pip install -r requirements.txt
Maybe others will find it useful.
So if you have PyPy already on your machines:
pypy -m venv venv && \
source venv/bin/activate && \
pip install --upgrade pip && \
pip install wheel && \
pip install -r requirements.txt
It was not that bad after all; my initial thought was, do I need all of the above just to initiate the project? :D
printf 'layout python\npip install --upgrade pip pip-tools setuptools wheel\npip-sync\n' > .envrc
When you cd into a given project, it'll activate the venv, upgrade to non-ancient versions of pip etc. with support for the latest PEPs (i.e. `pyproject.toml` support on a new Python 3.9 env), and verify the latest pinned packages are present. It's just too useful not to have.
direnv stdlib
This command (or this link: https://direnv.net/man/direnv-stdlib.1.html) prints many useful functions that can be used in the `.envrc` shell script that is loaded when entering directories, ranging from support for many languages, to `dotenv` support, to `on_git_branch` for e.g. syncing deps when switching feature branches.
Check it out if you haven't. I've been using it for more years than I can count, and being able to cd from a PHP project to a Ruby project to a Python project with ease really helps with context switching.
pypy3 -m venv venv
source venv/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install wheel
pip install -r requirements.txt
Not very different...
I am still working on it, but the main issue for now is psycopg support. I had to install psycopg2cffi in my test environment, but that will probably prevent me from using PyPy for running our test suite, because psycopg2cffi does not have the same features and versions as psycopg2. This means either we switch our prod to PyPy, which won't be possible because I am very new in this team and that would be seen as a big, risky change by the others, or we keep in mind that the tests do not run on the exact same runtime as the production servers (which might let bugs go unnoticed and reach production, or produce failing tests for code that would otherwise work in a live environment).
I think if I ever started a Python project right now, I'd probably try to use PyPy from the start, since (at least for web development) there do not seem to be any downsides to using it.
Anyways, thank you very much for your hard work !
[1]: https://www.psycopg.org/psycopg3/docs/basic/install.html
With CPython, I was frustrated with how slow it was, and complained about it to the people I was working with, PyPy was a simple upgrade that sped up my code to the point where it was comfortable to work with.
I am still using this library that I wrote
https://paulhoule.github.io/gastrodon/
to visualize RDF data so even if I make my RDF model in Java I am likely to load it up in Python to explore it. I don’t know if they are using PyPy but there is at least one big bank that has people using Gastrodon for the same purpose.
https://paulhoule.github.io/gastrodon/
which makes it very easy to visualize RDF data with Jupyter by turning SPARQL results into data frames.
Here are two essays I wrote using it
https://ontology2.com/essays/LookingForMetadataInAllTheWrong...
https://ontology2.com/essays/PropertiesColorsAndThumbnails.h...
People often think RDF never caught on but actually there are many standards that are RDF-based such as RSS, XMP, ActivityPub and such that you can work on quite directly with RDF tools.
Beyond that I’ve been on a standards committee for ISO 20022 where we’ve figured out, after quite a few years of looking at the problem, how to use RDF and OWL as a master standard for representing messages and schemas in financial messaging. In the project that needed PyPy we were converting a standard represented in EMOF into RDF. Towards the end of last year I figured out the right way to logically model the parts of those messages and the associated schema with OWL. That is on its way to becoming one of those ISO standard documents that unfortunately cost 133 Swiss francs. I also figured out that it is possible to do the same for many messages defined with XSLT, and I’m expecting to get some work applying this to a major financial standard; I think there will be some source code and a public report on that.
Notably, the techniques I use address quite a few problems with the way most people use RDF. In particular, many RDF users don’t use the tools available to represent ordered collections. A notable example where this causes trouble is Dublin Core for document (say book) metadata, where you can’t represent the order of the authors of a paper, which is something the authors usually care about a great deal. XMP adapts the Dublin Core standard enough to solve this problem, but with the techniques I use you can use RDF to do anything any document database can, though some SPARQL extensions would make it easier.
So the good: It apparently now supports Python 3.9? Might want to update your front page, it only mentions Python 3.7.
The bad: It only supports Python 3.9, we use newer features throughout our code, so it'd be painful to even try it out.
Maybe the site is not up to date ?
Personally I don't use PyPy for anything, though I have followed it with interest. Most of the things I need to go faster are numerical, so Numba and Cython seem more appropriate.
Edit; typo
I don’t use it.
Why would I use it, what’s the compelling benefit?
These two weird tricks tend to work wonders, though.
> A fast, compliant alternative implementation of Python
Performance without compromising too much on compatibility seems to be the main benefit. There is a talk on the YouTube channel «Pycon Sweden» from 5 years ago where the host showed some impressive speed gains for his workload (parsing black box dumps from planes).
Haven’t used it in a bit mostly because I’ve been working on projects that haven’t had the same bottleneck, or that rely on incompatible extensions.
Thank you for your work on the project!
> that rely on incompatible extensions.
Which ones? Is using conda an option? We have had more luck getting binary packages into their build pipelines than getting projects to build wheels for PyPI.
The biggest blocker for me for 'defaulting' to PyPy is a) issues when dealing with CPython extensions and how quite often it ends up being a significant effort to 'port' more complex applications to PyPy b) the muscle memory for typing 'python3' instead of 'pypy3'.
Speed up of 30x - 40x. The highest speedup on those that require logic in the transformation. (lot of function calls, numerical operations and dictionary lookups).
Python is fun to work with (except classes…), but it's just sooo slow. PyPy can be a life saver.
[1] https://blog.transitapp.com/how-we-shrank-our-trip-planner-t... [2] https://blog.transitapp.com/how-we-built-the-worlds-pretties...
We also use pypy3 to accelerate rdflib parsing and serialization of various RDF formats. See for example [3].
Thanks to you and the whole PyPy team!
1. https://github.com/tgbugs/dockerfiles/blob/6f4ad5d873b7ab267...
2. https://github.com/tgbugs/dockerfiles/blob/6f4ad5d873b7ab267...
3. https://github.com/SciCrunch/sparc-curation/blob/0fdf393e26f...
I just reran one of my usual benchmarks and I see 2mins for pypy3 (pypy 7.3.12 python 3.10.12) peak memory usage about 8gigs, 4.8mins for python3.11 (3.11.4) peak memory usage about 3.6gigs (2.4x speedup). On another computer running the exact same workload I see 6.3mins and 19mins (3x speedup) with the same peak memory usage.
I don't have any numbers on the dataset pipelines because I never ran them in production on cpython and went straight to pypy3. It is easy for me to switch between the two implementations in this context so I could run a side by side comparison (with the usual caveat that it would be completely non-rigorous).
I also have some internal notes related to a project that I didn't list because it isn't public, isn't in production, and the benchmarks were collected quite a while ago, but I see a 4x increase in throughput when pulling large amounts of data from a postgresql database from 20mbps on cpython 3.6 to 80mbps on pypy3.
Basically I'm using SciPy exclusively for the optimization routine:
* minimize(method="SLSQP") [0]
* A list comprehension which calls ~10-500 pre-fitted PchipInterpolator [1] functions and stores the values as a np.array().
The Pchip functions (and their first derivatives) are used in the main opt function as well as in several constraints.
Most jobs took about 10 seconds, but the long tail might sometimes take up to 10 min. I tried PyPy 3.8 (7.3.9) and saw similar compute times on the shorter jobs, but roughly 2x slower compute times on the heavier jobs. This obviously was not what I expected, but I had very limited experience with PyPy and didn't know how to debug further.
Eventually python 3.10 came around and gave 1.25x speed increase, and then 3.11 which gave another 1.6-1.7x increase which gave a decent ~2x cumulative speedup, but the occasional heavy jobs still stay in the 5 min range and would have been nicer in the 10-30s obviously.
Still I would like to say that trying pypy out was a quite smooth experience, staying within scipy land, took me half a day to switch and benchmark. But if anyone else has experience with pypy and scipy, knowing some obvious pitfalls, it would be much appreciated to hear.
[0] https://docs.scipy.org/doc/scipy/reference/optimize.minimize...
[1] https://docs.scipy.org/doc/scipy/reference/generated/scipy.i...
I'm currently doing multi-agent reinforcement learning research using RLlib, which is part of Ray. I tried to install a PyPy environment for it. It failed because Ray doesn't provide a wheel for it:
Could not find a version that satisfies the requirement ray (from versions: none)
My hunch is that even if Ray did provide that, there would have been some other roadblock that would have prevented me from using PyPy.
The modern debugging tools available in other IDEs work fine with PyPy (and have for years), so I guess that must be a Wing issue.
FWIW, since I've seen it mentioned, we've also been using psycopg2cffi to access Postgres sources.
The product now lives (at least partially) as Datastream on GCP (https://cloud.google.com/datastream/docs/overview). I'm not sure though if it's still running on PyPy.
I could try and connect with the folks still working on it, if you're interested.
Quite often you just want to thank somebody, or say that you would prefer something a certain way, that you don't understand why it is the way it is, or that it would be cool to have this or that. But opening a ticket on GitHub feels like wasting the maintainer's time, and giving feedback about what you would like to see or what you do and don't like feels entitled, because, well, you can do it yourself, you can fork, etc.
It would need to be low friction for both sides. Preferably with no way to respond so that there's zero pressure and little time waste for maintainers.
Mail feels like you want something, it works for thank you but still feels bad on receiving end when you just ignore them.
SQL Alchemy actually points to PyPy in its recommendations of things to try in ORM performance. https://docs.sqlalchemy.org/en/20/faq/performance.html#resul...
For PostgreSQL, psycopg2 is not supported. psycopg2cffi is largely unmaintained, and the 2.9.0 version on PyPI lags behind psycopg2: it lacks the `psycopg2.sql` module, and empty result sets raise a RuntimeError in Python 3.7+. The latest commit on GitHub does have these changes [1]. Psycopg 3 [2] and pg8000 [3] (as user tlocke mentioned elsewhere) are viable alternatives provided you aren't stuck with older versions of PostgreSQL. I have to continue using psycopg2cffi until I can upgrade an old PostgreSQL 9.4 database.
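For code that has to run on both interpreters, one common pattern is to try psycopg2 first and fall back to psycopg2cffi's compatibility shim. This is a minimal sketch: `load_postgres_driver` is a hypothetical helper, and it assumes the `psycopg2cffi.compat.register()` hook described in that project's README is available in the installed version.

```python
import importlib


def load_postgres_driver():
    """Return a module exposing the psycopg2 API, or None if neither is installed."""
    try:
        return importlib.import_module("psycopg2")
    except ImportError:
        pass
    try:
        compat = importlib.import_module("psycopg2cffi.compat")
    except ImportError:
        return None
    # Assumption: register() aliases psycopg2cffi as `psycopg2` in sys.modules.
    compat.register()
    return importlib.import_module("psycopg2")


driver = load_postgres_driver()
print("driver:", getattr(driver, "__name__", None))
```

The fallback only hides the import difference; the feature gaps described above still apply once psycopg2cffi is the module actually in use.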
For Microsoft SQL Server, pymssql does not support PyPy [4]. It's under new maintainership so it might gain support in the future. pypyodbc hasn't had any activity since 2022, and no new PyPI release since 2021 [5]. The datatypes returned can differ between libodbc1 versions. On Ubuntu 18.04 in particular: empty string columns are returned as a single space, integer columns are returned as a Decimal. Also, if you encounter a mysterious HY010 error ("Function sequence error"), you may need to upgrade libodbc1 to v2.3.7+ from v2.3.4 using the Microsoft repos.
[1]: https://github.com/chtd/psycopg2cffi [2]: https://pypi.org/project/psycopg/ [3]: https://pypi.org/project/pg8000/ [4]: https://github.com/pymssql/pymssql/pull/517 [5]: https://pypi.org/project/pypyodbc/
I was hoping to see some improvement in ORM performance (SQLAlchemy 1.3), mainly on the bookkeeping side. Currently the app is about 60% Python app wait time and 40% DB wait time. We have a handful of noisy areas which emit a lot of statements (updating 1 row at a time, 10000 times, via the ORM for example).
I also tried cProfile to drill down, but as I've seen noted on Stack Overflow, the profiler has a larger impact on PyPy than on CPython.
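To see how much the profiler itself distorts timings on a given interpreter, one option is to time the same function with and without cProfile. The sketch below uses a made-up `workload` full of small calls and dict lookups; running it under both CPython and PyPy would give a rough overhead ratio for each.

```python
import cProfile
import time


def workload(n=200_000):
    """Stand-in workload: many small calls and dict lookups."""
    counts = {}
    for i in range(n):
        key = str(i % 1000)
        counts[key] = counts.get(key, 0) + 1
    return len(counts)


def timed(fn):
    t0 = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - t0


plain_result, plain_s = timed(workload)

profiler = cProfile.Profile()
profiled_result, profiled_s = timed(lambda: profiler.runcall(workload))

print(f"plain={plain_s:.3f}s profiled={profiled_s:.3f}s "
      f"overhead={profiled_s / plain_s:.1f}x")
```

If the overhead ratio is large, the profile is more useful for relative call counts than for absolute timings.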
If we could use pypy, while still using those packages, I think it'd be the go-to interpreter. Why can't pypy optimize everything else, and leave the C stuff as-is?
How does pypy handle packages written in other languages, like rust? can I use pypy if I depend on Pydantic?
NumPy is itself written in C and C++, so it is strongly tied to the C API and has a complicated build process. Some things work and some don't (I haven't tried recently). If you're invested in numerical Python you should most likely not use PyPy but go for something like Cython (as SciPy does).
For psycopg apparently you can use psycopg2cffi (never tried).
> How does pypy handle packages written in other languages, like rust? can I use pypy if I depend on Pydantic?
PyO3 supports pypy so everything should be fine.
For c-extensions see https://www.pypy.org/posts/2018/09/inside-cpyext-why-emulati...
We would like to be able to "just JIT" better. But for that we need feedback about what is still unreasonably slow, and resources to work on improving it. Right now PyPy is on a shoe-string budget of volunteers.
For rust, like CPython, use PyO3, which works with PyPy.
I am not sure about Pydantic. Sounds like a topic for someone to investigate on their codebase and tell us how PyPy does.
There is still the lag though, Python 3.10 was out for quite a while before PyPy supported 3.10.
We use the PyPy provided downloads (Linux x86 64 bit) because it's easier to maintain multiple versions simultaneously on Ubuntu servers. The PyPy PPA does not allow this. I try to keep the various projects using the latest stable version of PyPy as they receive maintenance, and we're currently transitioning from 3.9/v7.3.10 to 3.10/v7.3.12.
Thank you for all of the hard work providing a JITed Python!
PyPy is pretty well stress-tested by the competitive programming community.
https://codeforces.com/contests has around 20-30k participants per contest, with contests happening roughly twice a week. I would say around 10% of them use python, with the vast majority choosing pypy over cpython.
I would guesstimate at least 100k lines of pypy is written per week just from these contests. This covers virtually every textbook algorithm you can think of and were automatically graded for correctness/speed/memory. Note that there's no special time multiplier for choosing a slower language, so if you're not within 2x the speed of the equivalent C++, your solution won't pass! (hence the popularity of pypy over cpython)
The sheer volume of advanced algorithms executed in pypy gives me huge amount of confidence in it. There was only one instance where I remember a contestant running into a bug with the jit, but it was fixed within a few days after being reported: https://codeforces.com/blog/entry/82329?#comment-693711 https://foss.heptapod.net/pypy/pypy/-/issues/3297.
New edit from that previous comment: there's now a Legendary Grandmaster (Elo rating > 3000, ranking 33 out of hundreds of thousands) who almost exclusively uses PyPy: https://codeforces.com/submissions/conqueror_of_tourist
Competitive Programming needs a lot of speed to compete with the C++ submissions, really cool that there are Contestants using Python to win.
Thank you for your amazing work!
The performance of PyPy over CPython saved us loads and loads of time and thus $$$s, from what I can recall.
If I could just `pip3 install pypy` and then set an environment variable to use it or something like that then I'd give it a try. It does feel a bit like adding a jet pack to a rowing boat though. I know some people use Python in situations where the performance requirement isn't "I literally don't care" but surely not very many?
Obviously if it was the default that would be fantastic.
rtx use python@pypy3.10
This downloaded and installed PyPy v3.10 in a few seconds and created an .rtx.toml file in the current directory that ensures when I run python in that directory I get that version of PyPy.
PyPy should have become the standard implementation; it would have saved a lot of the investment going into a faster Python.
I try to shill PyPy all the time, but thanks to the outdated website and the odd attachment to Heptapod (at least put something on GitHub for discoverability's sake), devs who won't bother to look any further than a GitHub page frown at me, thinking PyPy is an outdated and inactive project.
PyPy is one of the most ambitious projects in open source history, and the lack of publicity makes me scream internally.
Also, you might want to flag the libraries that technically "work" but still require an extremely long and involved build process. For example, I recently started the process of installing Pandas with pip in a PyPy venv and it was stuck on `Getting requirements to build wheel ...` for a very long time, like 20+ minutes.
I'm rarely using python in places at work where it would suit it (lots of python usage, but they're more on the order of short run tools), but I'm always looking for chances and always using it for random little personal things.
(nevertheless, PyPy is impressive :-) )
So we've made it configurable to run some instances with PyPy, which was able to work through the data in real time, i.e. without generating a lag in the data stream. The downside of using PyPy was increased memory usage (4-8x), which isn't really a problem. An actual problem that I didn't really track down was that the test suite (running pytest) was taking 2-3 times longer with PyPy than with CPython.
A few months ago I upgraded the system to run with CPython 3.11 and the performance improvements of 10-20% that come with that version now actually allowed us to drop Pypy and only run CPython. Which is more convenient and makes the deployment and configuration less complex.
We eventually rewrote the profiler tool in Rust for additional speedups, but as mentioned for the verification engine, it's probably too complicated to ever do that so we really appreciate drop-in tools like PyPy that can speed up our code.
[1]: https://github.com/StanfordLegion/legion/blob/master/tools/l...
[2]: https://github.com/StanfordLegion/legion/blob/master/tools/l...
That said, if I do ever run into a situation where I need my code to perform better, PyPy is high on my list of things to try. It’s nice to know it’s an option.
Also, in my day job we use PyPy in all our Python deployments. To be fair, until now I thought that everybody developed in Python, tested in PyPy for an easy speed boost, and only went back to CPython if PyPy was slower.
I would be interested in seeing benchmarks where PyPy is compared with more recent versions of CPython. https://www.pypy.org/ currently shows a comparison with CPython 3.7, but recent releases of CPython (3.11+) put a lot of effort into performance which is important to take into account.
Things like https://github.com/gevent/gevent/issues/676 and the fix at https://github.com/gevent/gevent/commit/f466ec51ea74755c5bee... indicate to me that there are subtleties on how PyPy's memory management interacts with low-level tweaks like gevent that have relied on often-implicit historical assumptions about memory management timing.
Not sure if this is limited to gevent, either - other libraries like Sentry, NewRelic, and OpenTelemetry also have low-level monkey-patched hooks, and it's unclear whether they're low-level enough that they might run into similar issues.
For a stack without any monkey-patching I'd be overjoyed to use PyPy - but between gevent and these monitoring tools, practically every project needs at least some monkey-patching, and I think that there's a lack of clarity on how battle-tested PyPy is with tools like these.
1. The same naive deserialization and dict processing code ran much faster with PyPy.
2. Conveniently, PyPy also tolerated some broken surrogate pairs in Twitter's UTF8 stream, which threw exceptions when trying to decode the same events with the regular Python interpreter.
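For comparison, here is a hedged sketch of how the regular interpreter can be told to tolerate the same kind of input in Python 3, using the built-in `surrogatepass` error handler. The sample bytes and the `decode_tolerant` helper are made up for illustration; they are not the original pipeline's code.

```python
# \xed\xa0\xbd is a lone high surrogate (U+D83D) written out as UTF-8,
# the kind of broken pair a strict decoder rejects.
raw = b'{"text": "hi \xed\xa0\xbd"}'


def decode_tolerant(data: bytes) -> str:
    """Decode UTF-8, letting lone surrogates through instead of raising."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("utf-8", errors="surrogatepass")


text = decode_tolerant(raw)

# Lone surrogates cannot be re-encoded strictly, so swap them for "?"
# before handing the string to anything that writes UTF-8.
clean = text.encode("utf-8", errors="replace").decode("utf-8")
print(repr(text), repr(clean))
```

Bytes that are invalid for other reasons would still raise here; `errors="replace"` on the decode side is the blunter alternative.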
I've had some web service code where I wished I could easily swap to PyPy, but these were conservative projects using Apache + mod_wsgi daemons with SE-Linux. If there were a mod_wsgi_pypy that could be a drop-in replacement, I would have advocated for trials/benchmarking with the ops team.
Most other performance-critical work for me has been with combinations of numpy, PyOpenCL, PyOpenGL, and various imaging codecs like `tifffile` or piping numpy arrays in/out of ffmpeg subprocesses.
I've deployed using the pypy:3.9 image on Docker.
One thing I did notice is that it was significantly faster on my local machine vs when I tried to deploy it using an AWS lambda/fargate. I know this is because of virtualization/virtual-cpu, but there was not much I could do to improve it.
Two reasons for my hesitation:
1) Cpython is fast enough for most things I need to do. The speed improvement from Pypy is either not enough or not necessary.
2) Lingering doubts about subtle incompatibility (in terms of library support) that I might have to spend hours getting to the bottom of.
I already work long hours and don’t have bandwidth to tinker. With CPython, although it is slow, I can be assured it is the standard surface that everyone targets and that I can google solutions for.
It’s the subtle things that I waste a lot of time on. It’s analogous to an Ubuntu user trying to use Red Hat. They’re both Linuxes, but the way things are done is different enough that they trip you up.
The only way to get out of this quandary is for Pypy to be a first class citizen. Guido will never endorse this so this means a bunch of us will always have hesitation putting it into production systems.
But while programming as a hobby at home, mostly small-scale simulations, PyPy is my default interpreter for Python. It seems PyPy has a sweet spot on code written relying heavily on OOP style, with a lot of method calls and self invocation. I consistently get 8-10x speed improvements.
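A small, non-rigorous way to check this on your own machine: run the same method-heavy toy simulation under each interpreter and compare the timings. The `Particle` class and the parameters below are invented for illustration; the script only uses the standard library, so it should run unchanged on CPython and PyPy.

```python
import platform
import time


class Particle:
    def __init__(self, x, v):
        self.x = x
        self.v = v

    def step(self, dt):
        self.x += self.v * dt
        if self.x < 0.0 or self.x > 1.0:
            self.bounce()

    def bounce(self):
        # Reverse direction and clamp back into the unit interval.
        self.v = -self.v
        self.x = min(max(self.x, 0.0), 1.0)


def simulate(n_particles=1000, n_steps=200, dt=0.01):
    particles = [Particle(i / n_particles, 0.5 - i / n_particles)
                 for i in range(n_particles)]
    for _ in range(n_steps):
        for p in particles:
            p.step(dt)
    return sum(p.x for p in particles)


t0 = time.perf_counter()
checksum = simulate()
elapsed = time.perf_counter() - t0
print(f"{platform.python_implementation()} {platform.python_version()}: "
      f"{elapsed:.3f}s checksum={checksum:.6f}")
```

A single short run like this includes JIT warm-up on PyPy, so longer runs or repeated calls give a fairer picture.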
I was close to trying pypy on a production django deployment (which gets ~100k views a month), but given that the tiny AWS EC2 instance we're running it on is memory bound, the increased pypy memory usage made it impractical to do so.
Nowadays, to be honest, everything that I need to be fast in Python is largely numerical code which either calls out to C/C++ (via numpy or some ML library) or which I use numba for. And these are either slower with PyPy or won't work.
HTTP web servers are notoriously slow in Python (even the fastest ones like falcon), but I found they either didn't play nicely with PyPy or weren't any faster, in large part because if the API does any kind of "heavy lifting" they can't be truly concurrent.
The big obstacle is that for a while we would have multiple execution environments. It’s not like we could flip a switch and all Dockerfiles are using PyPy.
Plus I don’t think AWS Lambda supports it.
If I could go back in time, we would use it from the beginning.
So... thanks for not doing that.