There appears to be roughly the same code structure, ported to every language, while for some languages, arbitrary optimizations are introduced (such as using `array` instead of `list` in Python).
But nobody working in Python uses matrix multiplication code written in Python. They use NumPy, which is a de facto standard library for people working in the relevant fields. It's as much part of "Python" as list comprehensions.
Without taking such real-world conventions into account, such comparisons say essentially nothing about the languages involved (and their all-important ecosystems).
So this shouldn't be taken as "how fast does a real-world Python program do at matrix multiplication", since of course no one writes real-world programs doing matrix multiplication in pure Python. But it can show the relative speed of pure Python at purely computational tasks.
But that's irrelevant if nobody uses "pure Python" for computational tasks.
It's like asking "how well do these languages run on a Lisp machine from 1979?". It simply has no relevance to real-world considerations today.
If this were the only concern, it should be valid to create a blob of binary and call that function for the optimal performance. (Python's ctypes makes this very easy, for example.) So you want an idiomatic solution instead, and Numpy for matrix computation is considered idiomatic in Python, even more than the pure Python code.
It also means that adding performance to an existing Python program requires dropping into a different language, which is not only complicated, but also requires engineers capable in both Python and C (or similar).
People use matrix multiplication libraries (often written in Assembly) from every language if they really care about performance. That's because such libraries incorporate 100 PhD theses' worth of tricks that no individual can hope to reinvent in the course of solving another problem. There is absolutely nothing special about Python in this context.
> It also means that adding performance to an existing Python program requires dropping into a different language
As stated above, this applies to all languages. BLAS routines used for serious numerical work are hand-vectorized Assembly fine-tuned for each processor architecture, written by a few hyper-experts who do nothing else.
Nobody who needs performant matrix multiplication from C thinks "hey, let me just write two nested loops".
It's actually not that bad. I think it's part of the reason Python became so popular, it's fairly easy to write C code and expose it via python.
You need libraries to do _anything_ in Python. It's interpreted, so literally any call you make in Python will eventually make it back to something written in a compiled language (like a call to NumPy commands).
It is supposed to show the performance of a language when you have to implement a new algorithm in the language. This may happen if you can't find the algorithm in existing libraries. If you don't like matmul, ignore it and focus on nqueen, sudoku and bedcov.
It doesn't make sense to allow "warmup" time for them unless your expected application is a server which for most of the time will be running "warm" (and even then with scalable containers that assumption may not even be true in some cases). For servers, however, what matters is mostly how fast its HTTP library is and how good the async IO is... check Techempower benchmarks for that: https://www.techempower.com/benchmarks/#hw=ph&test=fortune&s...
https://www.oracle.com/java/technologies/security-in-java.ht...
Why should that run time startup cost be ignored?
The difference is mostly in matmul.
I see how C, for an n x n matrix, does 2 allocations, while rust does n+1. C's matrix rows are right next to each other, rusts are probably all over the place. Didn't look at nim or zig.
Maybe a slice of slice of double would perform better than a vec of vec of double? Then again, an argument can be made that rust pushes you to vec, so this impl is more honest for how a beginner would do it. Otoh, C has the optimization so why doesn't rust?
i sent a pr to fix it
Edit: I have tried making an iterator-based version to elide bound checks, but had to resort to unsafe, and it's barely 50% faster than the original rust version (not as fast as C): https://gist.github.com/anisse/6b580628206293ef242faa7db6219...
Edit 2: updated, and my rust iterator version now ~equivalent to C with no unsafe.
Edit 3: too late, the repo has been updated with an other iterator-based version that is just as fast.
[0] - https://docs.aws.amazon.com/lambda/latest/dg/dotnet-native-a...
C# also has arrays-of-arrays, and could (should) be written in the same manner.
check out how a modern language deals with this stuff https://bun.sh/docs/api/ffi#usage
PHP added JIT in 8.0, and these math-heavy tasks can take advantage of it. It's not trivial to fine tine JIT configuration though.
In PHP 8.4 (scheduled Nov 2024), there is a major upgrade to JIT as well.
First, the graph is misleading, stacking times with languages that have half the implementation, they appear faster, until you dig in. I'd suggest producing an alternate graph that shows only the implemented puzzles in every language, or make a unique graph for every language:puzzle.
Second, the examples are taken from rosetta code and are not necessarily what would be the best implementation, or even close to the best implementation, for benchmarking purposes.
Finally, those examples should be reproduced across various hardware platforms, I'm on arm64 Darwin myself, but you might find different results on Intel platforms due to the various compiler optimizations available based on the hardware.
More benchmarks would be interesting to see, such as actual real world operations, e.g. opening a file, reading it, parsing json, opening a socket server, etc.
Recently there have been some decent improvements to CPython's speed, but there are real upper limits to how fast you can make an interpreter. CPython will need JIT compilation if it is ever to break out of its current speed bracket.
JavaScript has had a feature complete JIT reference implementation since 2008 which is a major part of the reason JS applications exploded so much in the 2010s.
nqueen vs. sudoku: 0.531
matmul vs. sudoku: 0.362
matmul vs. nqueen: 0.127
1. https://www.cs.utexas.edu/users/flame/pubs/blis1_toms_rev3.p...
This is only true when three matrices are independent of each other, and also why C has a `restrict` qualifier that enables this assumption. The benchmark itself has no such assumption because all three variables are defined as `double **`, and it can be verified by assembly outputs. Clang's excellent performance in matmul is probably much more due to autovectorization.
The Julia matmul implementation has its rows and columns flipped though - unlike C, Julia uses row-major matrices. This has large implications for speed.
Also, the code may be much faster if you enable SIMD in the function, which is disabled in the code because a) the code unnecessarily checks bounds at every index instead of at the top of the function, and b) float SIMD is opt-in since SIMD changes the rounding
It would be nice to see those other languages in a chart that doesn't include the slower four. Alternatively, you could also show those slower four with "broken" columns like this https://peltiertech.com/broken-y-axis-in-excel-chart/
The choice of stacked bar charts is also strange because the languages are not all comparable
> Every language has nqueen and matmul implementations. Some languages do not have sudoku or bedcov implementations. In addition, I implemented most algorithms in plb2 and adapted a few contributed matmul and sudoku implementations in plb. As I am mostly a C programmer, implementations in other languages may be suboptimal and there are no implementations in functional languages. Pull requests are welcomed!
So the point still stands that many charts doing one thing each would be better than fewer charts doing many things each
Given that its become increasingly more common for CPUs to have both Performance & Efficency cores … how do benchmarks ensure they are only being run on the P-cores?
I believe Game Mode will push the processes to e cores to keep a consistent game play without thermal throttling.
Why would you think that?
Maybe the "naive" implementation just shows how fast the easily removed hotspot in your code is going to run.
"Swap the order of two statements and see the Java code slow down … Swap globals for local variables in a function and see the Python code speed up. Swap language implementations and see the C code speed up."
https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
Including time to JIT compile is questionable, why not also include time to compile the compiled languages?
Would .NET with F# make big difference here?
I'm little surprised Java beat .NET. Is that typical these days?
There are only 20 thousand array allocations in total, not a lot. Javascript also has these many arrays allocated/deallocated but it is 4 times as fast.
Here are naive line-by-line transliterations from an original C program:
https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
Here exhaustively-optimised + multicore + vector-instruction programs are included:
https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
Maybe I made an accidental optimization in my language translation, or maybe there are some operations that are much slower in JS and these benchmarks didn't hit any of them.
https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
"How source code size is measured"
https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
OP - are you interested in pull requests adding support for other languages?
However, translating Sudoku and N-Queens into a similar problem that you can feed into a solver can get you a long way. Even better, you can move that solver into whatever language gives you the best optimizations that you can work. Even better, there are almost certainly optimizations in common solvers that you don't want to deal with implementing on your on.