Benchmarking 20 programming languages on N-queens and matrix multiplication

(github.com)

163 points · attractivechaos · 2y ago · 186 comments

125 comments · 37 top-level

p-e-w2y ago· 10 in thread

What is this supposed to demonstrate?

There appears to be roughly the same code structure, ported to every language, while for some languages, arbitrary optimizations are introduced (such as using `array` instead of `list` in Python).

But nobody working in Python uses matrix multiplication code written in Python. They use NumPy, which is a de facto standard library for people working in the relevant fields. It's as much part of "Python" as list comprehensions.

Without taking such real-world conventions into account, such comparisons say essentially nothing about the languages involved (and their all-important ecosystems).

tsimionescu2y ago

Generally, the point of language benchmarks is to show how the languages compare at solving the same problem, without external libraries. Including external libraries is pointless since any language can call any library, ultimately, so at best you'd be comparing the FFI overhead.

So this shouldn't be taken as "how fast does a real-world Python program do at matrix multiplication", since of course no one writes real-world programs doing matrix multiplication in pure Python. But it can show the relative speed of pure Python at purely computational tasks.

p-e-w2y ago

> But it can show the relative speed of pure Python at purely computational tasks.

But that's irrelevant if nobody uses "pure Python" for computational tasks.

It's like asking "how well do these languages run on a Lisp machine from 1979?". It simply has no relevance to real-world considerations today.

3 more replies

lifthrasiir2y ago

> Generally, the point of language benchmarks is to show how the languages compare at solving the same problem, without external libraries.

If this were the only concern, it should be valid to create a blob of binary and call that function for the optimal performance. (Python's ctypes makes this very easy, for example.) So you want an idiomatic solution instead, and Numpy for matrix computation is considered idiomatic in Python, even more than the pure Python code.

1 more reply

morepedantic2y ago

It demonstrates that Python needs libraries like NumPy. Few problems are more heavily optimized than matrix multiplication in practice, so comparing matrix multiplication benchmarks across languages with NumPy is not representative of real-world performance for most programming use cases.

It also means that adding performance to an existing Python program requires dropping into a different language, which is not only complicated, but also requires engineers capable in both Python and C (or similar).

p-e-w2y ago

> It demonstrates that Python needs libraries like NumPy.

People use matrix multiplication libraries (often written in Assembly) from every language if they really care about performance. That's because such libraries incorporate 100 PhD theses' worth of tricks that no individual can hope to reinvent in the course of solving another problem. There is absolutely nothing special about Python in this context.

> It also means that adding performance to an existing Python program requires dropping into a different language

As stated above, this applies to all languages. BLAS routines used for serious numerical work are hand-vectorized Assembly fine-tuned for each processor architecture, written by a few hyper-experts who do nothing else.

Nobody who needs performant matrix multiplication from C thinks "hey, let me just write two nested loops".

3 more replies

lifthrasiir2y ago

You don't need any C knowledge to use numpy. In fact, its conceptual similarity with Matlab is possibly the single most important reason for its popularity. Many other problems do need specialized treatments that would indeed require other languages, but numpy is not a good counterexample.

1 more reply

doix2y ago

> It also means that adding performance to an existing Python program requires dropping into a different language, which is not only complicated, but also requires engineers capable in both Python and C (or similar).

It's actually not that bad. I think it's part of the reason Python became so popular, it's fairly easy to write C code and expose it via python.

RHSeeger2y ago

> It demonstrates that Python needs libraries like NumPy.

You need libraries to do _anything_ in Python. It's interpreted, so literally any call you make in Python will eventually make it back to something written in a compiled language (like a call to NumPy commands).

attractivechaosOP2y ago

> What is this supposed to demonstrate?

It is supposed to show the performance of a language when you have to implement a new algorithm in the language. This may happen if you can't find the algorithm in existing libraries. If you don't like matmul, ignore it and focus on nqueen, sudoku and bedcov.

ggm2y ago

So you'd be measuring the speed of loops over calls out to C or Fortran (the NAG library) or their vectorisation, and only testing the cost of data representation changes.

Atotalnoob2y ago· 8 in thread

Aren’t JIT languages at a disadvantage since they are benchmarked through the CLI rather than using a benchmarking library to allow JIT to warmup?

brabel2y ago

For one-time problem runs, the JIT languages in practice will not be given time to warm up. All that matters for a user is how fast the application is in practice. It's not about "making it fair" for languages, it's about measuring how fast they go from nothing to results.

It doesn't make sense to allow "warmup" time for them unless your expected application is a server which for most of the time will be running "warm" (and even then with scalable containers that assumption may not even be true in some cases). For servers, however, what matters is mostly how fast its HTTP library is and how good the async IO is... check Techempower benchmarks for that: https://www.techempower.com/benchmarks/#hw=ph&test=fortune&s...

nithril2y ago

Going from nothing to a result rarely happens in real life. I rarely see someone start and stop a program per unit task (like piping commands).

1 more reply

igouy2y ago

Some language implementations have run-time startup costs (such as a bytecode verifier) that other language implementations do not incur at run time.

https://www.oracle.com/java/technologies/security-in-java.ht...

Why should that run time startup cost be ignored?

jakobnissen2y ago

Yes, but the author claims the longest JIT warmup is 0.3 seconds, so it's not an important issue in these benchmarks that take several seconds.

lifthrasiir2y ago

I strongly suspect that the author may have confused the JIT warmup (hard to measure, as you need to ensure that the performance figures have reached a stable point) with the startup overhead (easy to measure).
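To make the distinction concrete: startup overhead can be estimated by timing a program that does nothing, whereas warmup has to be read off a series of per-iteration timings. The sketch below covers only the second part. The function name `first_stable_index` and the 10% tolerance are assumptions chosen for illustration, not anything the benchmark uses.

```python
from statistics import median

def first_stable_index(timings, tail=5, tolerance=0.10):
    """Return the first index from which every timing stays within
    `tolerance` (relative) of the median of the last `tail` timings.
    Returns None if there are too few samples or the series never settles."""
    if len(timings) < tail:
        return None
    steady = median(timings[-tail:])  # estimate of the warmed-up time
    for start in range(len(timings)):
        if all(abs(t - steady) <= tolerance * steady for t in timings[start:]):
            return start
    return None
```

Iterations before the returned index would count as warmup under this (fairly crude) definition; a noisy machine can easily make the series never settle.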

1 more reply

nithril2y ago

0.3s is significant for a task that only takes 1.14s

1 more reply

borissk2y ago

What's CLI?

zogrodea2y ago

Presumably command line interface/terminal. I can type `dotnet run` in the terminal and provide some options to run a .NET program, for example.

hyperman12y ago· 6 in thread

I wondered why rust is so far behind C/nim/zig, they should have similar behaviour.

The difference is mostly in matmul.

I see how C, for an n x n matrix, does 2 allocations, while Rust does n+1. C's matrix rows are right next to each other; Rust's are probably all over the place. Didn't look at nim or zig.

Maybe a slice of slice of double would perform better than a vec of vec of double? Then again, an argument can be made that rust pushes you to vec, so this impl is more honest for how a beginner would do it. Otoh, C has the optimization so why doesn't rust?
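The layout difference described here (one allocation per row versus one contiguous buffer) can be sketched in Python with the standard `array` module. This is only an illustration of the indexing scheme, assuming square n x n matrices; the helper names `flatten` and `matmul_flat` are made up for this example, and CPython will not show the same cache effects that C or Rust would.

```python
from array import array

def flatten(rows):
    """Copy a list-of-lists matrix into one contiguous row-major buffer of doubles."""
    return array("d", (x for row in rows for x in row))

def matmul_flat(a, b, n):
    """Multiply two n x n row-major flat matrices.
    Element (i, j) of each matrix lives at offset i * n + j."""
    c = array("d", [0.0] * (n * n))  # a single allocation for the whole result
    for i in range(n):
        for k in range(n):
            aik = a[i * n + k]
            for j in range(n):
                c[i * n + j] += aik * b[k * n + j]
    return c
```

With a list of lists, each row is a separate object that may live anywhere on the heap; with the flat buffer, row i + 1 starts directly after row i.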

nrabulinski2y ago

The Rust code is very unidiomatic, and not only because of the Vec of Vecs, which, even if it’s the obvious naive approach, no one experienced would choose over a flat slice; the implementation itself is very naive and unidiomatic.

hgomersall2y ago

Also it's using checked indexing, which apart from being not idiomatic, is also going to slow things down. A fairer comparison would be to use the unchecked indexing variants.

2 more replies

attractivechaosOP2y ago

FYI: the Rust version has been updated to avoid unnecessary bounds checks in PR #4. It now matches the C version in performance.

tiehuis2y ago

Zig had a similar issue which I made an MR to fix. I didn't actually notice Rust had the same issue but that's probably just my forgotten knowledge with how the vec! macros expand when nested.

spense2y ago

rust is equivalent to C when using static allocation

i sent a pr to fix it

Aissen2y ago

Doesn't your PR monomorphize the function every time N is changed? I realize it's simpler since it keeps the structure, simplifies allocation of arrays, and elides bounds checks. But it explodes generated code size, and the matrix size can't be changed at runtime, which doesn't really match C.

Edit: I have tried making an iterator-based version to elide bound checks, but had to resort to unsafe, and it's barely 50% faster than the original rust version (not as fast as C): https://gist.github.com/anisse/6b580628206293ef242faa7db6219...

Edit 2: updated, and my Rust iterator version is now ~equivalent to C with no unsafe.

Edit 3: too late, the repo has been updated with another iterator-based version that is just as fast.

1 more reply

metaltyphoon2y ago· 6 in thread

Why not run C# as AOT? It’s a one line change.

borissk2y ago

Do people usually run C# code AOT compiled?

romanovcode2y ago

If it's for something stupid like benchmark comparison - yes.

metaltyphoon2y ago

I don't know about people, but I do pretty much everywhere I can. It makes such a difference that AWS is heavily investing in Lambdas that use AOT[0].

[0] - https://docs.aws.amazon.com/lambda/latest/dg/dotnet-native-a...

neonsunset2y ago

It’s gaining popularity for CLI and serverless, some people use it to replace C++ for writing native libraries and plugins as well.

mlhpdx2y ago

Yes, everywhere I possibly can. Particularly for targeting CLI programs and Linux daemons.

xnorswap2y ago

Or use BenchmarkDotNet which, among other things to get an accurate benchmark, does JIT warmup outside of measurement.

( https://github.com/dotnet/BenchmarkDotNet ).

akashcoach2y ago· 6 in thread

Why is C# so slow for matmul compared to Java?

useerup2y ago

I believe it's because the C# version has been written using rectangular arrays. This requires every array access to use a multiplication. The Java version uses an array of arrays and hoists the inner array out before accessing it in the inner loop.

C# also has arrays-of-arrays, and could (should) be written in the same manner.

iruoy2y ago

I've just done this and it has been merged. The benchmarks table and image haven't been updated yet. But this should bring the C# result to ~2s instead of 4.67s

akashcoach2y ago

Thanks for the explanation. Yes, this makes sense to me.

mlhpdx2y ago

The implementation isn’t using any modern C# features?

neonsunset2y ago

Unfortunately it doesn't. The newest and hottest way to do this is to either use the bespoke matmul from System.Numerics.Tensors or at least use Vector<T> for SIMD (which is trivial and not "the last mile" optimization it often seems to be).

akashcoach2y ago

Which modern features do you mean? Java and C# code looks similar to me.

theteapot2y ago· 5 in thread

PHP results: I was stupid enough to write some scientific code in PHP once, so I know how slow it can be, mostly around array access and manipulation. But if you're going to try, use the HHVM interpreter. It's much faster and is a drop-in replacement for the PHP interpreter. Hack (https://hacklang.org/) uses that under the hood by default.

EmberTwin2y ago

HHVM is not a drop-in replacement for a PHP interpreter. The semantics of Hack and PHP have diverged, typically in the direction of eliminating dynamic behavior from Hack that existed in PHP (examples: string -> function coercion, the PHP dual vector-hash-table array type, and non-throwing out-of-bounds array accesses are all gone from Hack). The semantics changes both simplify static analysis and make it easier to JIT fast code.

habibur2y ago

I put the pure math code in a C extension for PHP. Building a C extension for PHP is easier than for most other high-level languages. And then things get blazingly fast.

brrrrrm2y ago

doesn't look that easy... zend_parse_parameters? pre-baked configure + make scripts?

check out how a modern language deals with this stuff https://bun.sh/docs/api/ffi#usage

1 more reply

Ayesh2y ago

The state of PHP 8.3 and especially 8.4 is a lot better than HHVM. They have diverged quite a lot to a point that it's no longer a drop-in replacement.

PHP added JIT in 8.0, and these math-heavy tasks can take advantage of it. It's not trivial to fine-tune the JIT configuration though.

In PHP 8.4 (scheduled Nov 2024), there is a major upgrade to JIT as well.

geek_at2y ago

Also, since PHP was built for web requests, it builds the cache on the first run. It would be interesting to see all languages of the benchmark with a second run.

keyle2y ago· 3 in thread

I love a good benchmark, thanks for putting this together! However, I have a bit of feedback.

First, the graph is misleading: because times are stacked, languages that have only half the implementations appear faster until you dig in. I'd suggest producing an alternate graph that shows only the puzzles implemented in every language, or make a unique graph for every language:puzzle.

Second, the examples are taken from rosetta code and are not necessarily what would be the best implementation, or even close to the best implementation, for benchmarking purposes.

Finally, those examples should be reproduced across various hardware platforms, I'm on arm64 Darwin myself, but you might find different results on Intel platforms due to the various compiler optimizations available based on the hardware.

More benchmarks would be interesting to see, such as actual real world operations, e.g. opening a file, reading it, parsing json, opening a socket server, etc.

cm21872y ago

Also it may or may not be relevant to not exclude the warm up time for JIT languages. If it is for a calculation you will do continuously, including the warm up time may bias the results materially (I know for .net it can be significant).

attractivechaosOP2y ago

Clarification: the examples are not taken from Rosetta Code. Only the n-queens algorithm is inspired by (but still different from) a Rosetta Code implementation; otherwise this benchmark has nothing to do with Rosetta Code.

WhereIsTheTruth2y ago

There is one for data processing here: https://github.com/jinyus/related_post_gen

latenightcoding2y ago· 3 in thread

These are not the best benchmarks, but Python is indeed as slow as Perl, which I find insane considering that Python has 100-1000 more people working on the interpreter and performance has been a big emphasis the last few years.

sneed_chucker2y ago

Well, for a long time speed was not a priority at all for the team developing cpython. In fact, Python 3 was still a bit slower than Python 2 until a few versions ago.

Recently there have been some decent improvements to CPython's speed, but there are real upper limits to how fast you can make an interpreter. CPython will need JIT compilation if it is ever to break out of its current speed bracket.

JavaScript has had a feature complete JIT reference implementation since 2008 which is a major part of the reason JS applications exploded so much in the 2010s.

d0mine2y ago

The benchmark is not representative of how Python is actually used in practice. It ignores the existence of libraries (explicitly). For example, if NumPy or PyTorch were used for matmul, the results would be completely different.

latenightcoding2y ago

Perl people would use PDL or Math::GSL. I look into these benchmarks because I'm interested in VMs, and yeah the Python VM can be quite slow if you are not calling C code.

shpx2y ago· 3 in thread

You should add a chart of the number of gzip'd bytes of source code.

morepedantic2y ago

IMO, uncompressed bytes is a better representation, because it can be used to compare relative expressive power for the particular problem. I'd bet Python cleans house here, but the write-only languages are a wild card.

lifthrasiir2y ago

"Expressive power" is a very subjective term, and uncompressed size is a bad proxy as it includes too many variables specific to coding conventions. Compressed size with a stupid enough algorithm (here gzip) is meant to reduce these variables. The true Kolmogorov complexity in comparison can't be computed, and too smart algorithms can start to infer enough about the language itself.

1 more reply

dawnofdusk2y ago

Why would uncompressed bytes be better? Using a good compression algorithm better approximates the statistical entropy of the code which is at least correlated with e.g., Kolmogorov complexity.
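For reference, both candidate metrics from this sub-thread are cheap to compute. The sketch below uses Python's standard `gzip` module; the function name `source_sizes` is hypothetical, and nothing here is claimed to match how any existing benchmark site measures source size.

```python
import gzip

def source_sizes(source: str):
    """Return (raw_bytes, gzipped_bytes) for a piece of source code."""
    raw = source.encode("utf-8")
    # mtime=0 keeps the gzip header deterministic across runs.
    compressed = gzip.compress(raw, compresslevel=9, mtime=0)
    return len(raw), len(compressed)
```

Note that gzip adds a fixed header and trailer, so very short programs can come out larger compressed than raw; the comparison is only meaningful for sources of reasonable length.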

2 more replies

borissk2y ago· 3 in thread

N-queens and matrix benchmark without Fortran and Wolfram Mathematica?

smitty11102y ago

I was just popping in to make a similar comment. Fortran is probably doing the heavy lifting for most of the entries here, you might as well show how much your language-specific overhead is by including it.

bjourne2y ago

No, it isn't. OP is benchmarking language-native implementations of all algos.

Haemm0r2y ago

Fortran the king of matrices :-)

daxfohl2y ago· 3 in thread

Odd that nqueens and sudoku have a high correlation but matmul seems to be largely doing its own thing.

nqueen vs. sudoku: 0.531

matmul vs. sudoku: 0.362

matmul vs. nqueen: 0.127
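The comment does not say which coefficient was used or whether the timings were transformed first. Assuming a plain Pearson correlation over per-language timings, the computation could look like the sketch below (the function name `pearson` is chosen for this example).

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)
```

With a handful of very slow interpreters in the data, a Pearson coefficient on raw seconds is dominated by those outliers, so correlating log-times or ranks may give a rather different picture.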

brrrrrm2y ago

matmul is old and useful -- there's a lot of hardware on a chip that makes it run much faster (prefetch, vectorization, instruction parallelism) and some of these languages have optimizations to expose those things automatically.

gnufx2y ago

Those will not simply give you close to optimal performance (>~90% of peak), even, say, POWER10's 4x4 matmul instruction. You need a structure matching the micro-architecture -- the cache and register structure [1]. That's not a triply-nested loop. Any remotely decent compiler will unroll and vectorize the micro-kernel appropriately, but you may still have to resort to assembler-level prefetch fiddling for the last 10s of percent performance (specifically on avx512). Compilers may recognize the loop structure and replace it with an optimal-ish implementation. I've an idea that FORTRAN H extended did that, but Polly does it in clang.

1. https://www.cs.utexas.edu/users/flame/pubs/blis1_toms_rev3.p...

yxhuvud2y ago

And some languages perform an array bounds check on each array indexing operation, which basically stops some of the above even if the optimizer can do all of it. Crystal is an example of that, where it would be possible and quite straightforward to write a specialized matrix class instead of the very generic dynamic array implementation.

1letterunixname2y ago· 3 in thread

For the love of god, use a log-scale graph when plotting tiny values alongside large values, please. :)

xnorswap2y ago

Another alternative here is to plot the reciprocal (Op / Sec) which has the benefit of both the more natural "Bigger is Better" and easier comparison of the fastest runtimes.

igouy2y ago

"Bigger is Better" may be "more natural" for Op / Sec but secs is "more natural" for "comparison of the fastest runtimes".

airstrike2y ago

Or just make two charts ;-)
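The two alternatives suggested in this sub-thread (a log scale and a reciprocal rate) are simple transforms of the same timings. A minimal sketch, with the function name `chart_views` and the dictionary keys invented for this example:

```python
from math import log10

def chart_views(seconds_by_lang):
    """For each language, derive the values a log-scale chart and a
    'bigger is better' chart would plot from the measured seconds."""
    views = {}
    for lang, seconds in seconds_by_lang.items():
        if seconds <= 0:
            raise ValueError(f"non-positive timing for {lang}")
        views[lang] = {
            "seconds": seconds,
            "log10_seconds": log10(seconds),   # compresses the slow outliers
            "runs_per_second": 1.0 / seconds,  # spreads out the fast runtimes
        }
    return views
```

The log view keeps the "smaller is faster" reading while making the fast languages distinguishable; the reciprocal view flips the direction, which is the trade-off being debated above.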

lifthrasiir2y ago· 2 in thread

> It is obvious that c[i], b[k] and a[i][k] can be moved out of the inner loop to reduce the frequency of matrix access. [...] However, most other languages cannot optimize this nested loop. If we manually move a[i][k] to the loop above it, we can often improve their performance.

This is only true when three matrices are independent of each other, and also why C has a `restrict` qualifier that enables this assumption. The benchmark itself has no such assumption because all three variables are defined as `double **`, and it can be verified by assembly outputs. Clang's excellent performance in matmul is probably much more due to autovectorization.
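The manual hoisting described in the quoted README passage can be sketched in Python as follows. Both functions assume list-of-lists matrices and the i, k, j loop order implied by the quote; the function names are made up for this example. The aliasing caveat does not arise here because `c` is freshly allocated and therefore cannot overlap `a` or `b`.

```python
def matmul_naive(a, b):
    """Triple loop with every element accessed through full indexing."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            for j in range(p):
                c[i][j] += a[i][k] * b[k][j]
    return c

def matmul_hoisted(a, b):
    """Same loops, with c[i], a[i][k] and b[k] moved out of the inner loop."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        ci, ai = c[i], a[i]          # row lookups hoisted out of the k loop
        for k in range(m):
            aik, bk = ai[k], b[k]    # invariant inside the j loop
            for j in range(p):
                ci[j] += aik * bk[j]
    return c
```

If `c` were allowed to be the same object as `a`, caching `aik` before the inner loop could change the result, which is exactly why a C compiler cannot do this on its own for plain `double **` arguments.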

yxhuvud2y ago

In some cases there are other reasons for hoisting the dereference too. For example, Crystal will check if the array access is out of bounds, and by hoisting variables that check will be done a lot less often, which can have huge effects for code that does a lot of that, like matmul.

cm21872y ago

Doesn’t that get mostly optimised away by the cpu branch prediction?

2 more replies

jakobnissen2y ago· 2 in thread

Neat!

The Julia matmul implementation has its rows and columns flipped though - unlike C, Julia uses column-major matrices. This has large implications for speed.

Also, the code may be much faster if you enable SIMD in the function, which is disabled in the code because a) the code unnecessarily checks bounds at every index instead of at the top of the function, and b) float SIMD is opt-in since SIMD changes the rounding

attractivechaosOP2y ago

Just came here to say this has been fixed in PR #2. The figure has been updated accordingly.

mafuy2y ago

Yes. For this type of test, Julia should be nearly the exact same speed as C unless there is a mistake in the code. Which, to be fair, can happen very easily, especially with implicit typing.

airstrike2y ago· 2 in thread

Very interesting. My two cents: because of the stacked bars for php, ruby, perl and py:cpy, it's impossible to compare the other languages in that first chart. All the chart says is that the benchmark is much slower in those 4 languages relative to "all other languages we tested".

It would be nice to see those other languages in a chart that doesn't include the slower four. Alternatively, you could also show those slower four with "broken" columns like this https://peltiertech.com/broken-y-axis-in-excel-chart/

tmtvl2y ago

The second chart is basically the first chart with the slow 4 removed (well, not removed but at such a scale that they're irrelevant).

airstrike2y ago

Ah, duh, I see that now. That's what I get for commenting too quickly ;-)

The choice of stacked bar charts is also strange because the languages are not all comparable

> Every language has nqueen and matmul implementations. Some languages do not have sudoku or bedcov implementations. In addition, I implemented most algorithms in plb2 and adapted a few contributed matmul and sudoku implementations in plb. As I am mostly a C programmer, implementations in other languages may be suboptimal and there are no implementations in functional languages. Pull requests are welcomed!

So the point still stands that many charts doing one thing each would be better than fewer charts doing many things each

1 more reply

alberth2y ago· 2 in thread

> "Timing on Apple M1 Macbook Pro"

Given that it's become increasingly common for CPUs to have both Performance & Efficiency cores … how do benchmarks ensure they are only being run on the P-cores?

vlovich1232y ago

Sibling answered on macOS. Their scheduler prefers by default to run everything on performance cores and reserves efficiency cores for background OS tasks. On Linux you could pin the CPU affinity either upfront through taskset or programmatically at runtime (e.g. I have 32 cores and 0-15 are performance while the rest are efficiency).
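On Linux, the "upfront" variant would be something like `taskset -c 0-15 ./prog`, and the programmatic variant can be sketched with Python's `os.sched_setaffinity`. The core range 0-15 is just the commenter's own machine, and the helper name `pin_to_cores` is invented for this example; which core IDs are performance cores has to be looked up per system.

```python
import os

def pin_to_cores(wanted):
    """Restrict the current process to those cores in `wanted` that it is
    already allowed to run on. Returns the resulting set of core IDs, or
    None on platforms without the affinity API (e.g. macOS, Windows)."""
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)   # 0 means "the calling process"
    target = allowed & set(wanted)
    if not target:
        return allowed                  # nothing usable requested; leave as-is
    os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0)

if __name__ == "__main__":
    # Assumption for illustration: cores 0-15 are the performance cores.
    print(pin_to_cores(range(16)))
```

Child processes inherit the affinity mask, so pinning a small driver script before it launches each benchmark would cover every language in one place.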

sroussey2y ago

It actually takes a bit of effort to run on the efficiency cores.

I believe Game Mode will push the processes to e cores to keep a consistent game play without thermal throttling.

bjourne2y ago· 2 in thread

Having checked the C implementations of the algorithms, I'm a little skeptical because the implementations aren't optimized. Tiling matrix multiplication exploiting SIMD could easily improve performance 100-fold or more. At those speeds the cost of memory transfer usually dominates, so the languages that give you the most fine-grained control over how data is laid out in memory tend to win. And it may not be the same languages that are now winning in your benchmark.
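For readers unfamiliar with tiling, the loop structure looks roughly like the sketch below. It shows only the blocking, not SIMD, and in pure Python it will not be faster than the plain triple loop; the function name `matmul_tiled` and the default tile size of 32 are arbitrary choices for illustration.

```python
def matmul_tiled(a, b, n, tile=32):
    """Blocked n x n matrix product on list-of-lists matrices.
    Each (ii, kk, jj) step works on tile x tile sub-blocks so that, in a
    compiled language, the working set stays in cache."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                # Multiply one block of a by one block of b.
                for i in range(ii, min(ii + tile, n)):
                    ci, ai = c[i], a[i]
                    for k in range(kk, min(kk + tile, n)):
                        aik, bk = ai[k], b[k]
                        for j in range(jj, min(jj + tile, n)):
                            ci[j] += aik * bk[j]
    return c
```

Production BLAS kernels add several more levels on top of this (packing, register blocking, hand-written micro-kernels), which is where most of the large speedups come from.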

nerdponx2y ago

I think a benchmark of "naive" implementations is interesting too, because it shows you how fast your code is usually going to run, not how fast it theoretically could run at its best.

igouy2y ago

> shows you how fast your code is usually going to run

Why would you think that?

Maybe the "naive" implementation just shows how fast the easily removed hotspot in your code is going to run.

"Swap the order of two statements and see the Java code slow down … Swap globals for local variables in a function and see the Python code speed up. Swap language implementations and see the C code speed up."

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
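The Python half of that quote is easy to try. In CPython, names local to a function are accessed by index while globals go through a dictionary lookup, so the two functions below do the same work at different speeds. This is a minimal sketch with made-up names; no timings are quoted here because they depend on the interpreter version and machine.

```python
import timeit

counter_limit = 200_000

def sum_with_globals():
    """Accumulate into a module-level variable on every iteration."""
    global total
    total = 0
    for i in range(counter_limit):
        total += i
    return total

def sum_with_locals(limit=counter_limit):
    """Same loop, but every name touched in the loop is a local."""
    total = 0
    for i in range(limit):
        total += i
    return total

if __name__ == "__main__":
    for fn in (sum_with_globals, sum_with_locals):
        print(fn.__name__, timeit.timeit(fn, number=5))
```

A benchmark written at module level, outside any function, would pay the global-lookup cost throughout, which is one way a "naive" port can end up measuring an easily removed hotspot.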

montebicyclelo2y ago· 2 in thread

The reality is that it would be very hard to find Python code that does not use NumPy (or some tensor lib), for matmul.

Including time to JIT compile is questionable, why not also include time to compile the compiled languages?

igouy2y ago

"Nonetheless, because most benchmarks run for several seconds, including the startup time does not greatly affect the results."

https://github.com/attractivechaos/plb2#startup-time

montebicyclelo2y ago

Yes, I saw that and still consider the methodology questionable. A fairer approach might be: time from the cli, including compilation, for compiled languages. (Or warmup the jit compiled code.)

1 more reply

account-52y ago· 2 in thread

Glad to see dart in there, it's normally overlooked. Not a bad result. I wonder what speed it would be if it had been AOT compiled instead of JIT.

igouy2y ago

fwiw

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

adamredwoods2y ago

And LuaJit surprised me quite a bit!

FrustratedMonky2y ago· 2 in thread

It has .NET generic.

Would .NET with F# make big difference here?

I'm little surprised Java beat .NET. Is that typical these days?

Kuinox2y ago

> I'm little surprised Java beat .NET. Is that typical these days?

Not at all, this is just a bad benchmark: it measures from the CLI run with no specific flags, which is just terrible for cold starts.

airstrike2y ago

+1 curious about F#

raister2y ago· 2 in thread

I would expect Mojo to perform better in these settings.

jorgelopes2y ago

Mojo is on the list and performed well. Have a look

vsskanth2y ago

It's 2x slower than C in matmul, which is something I wouldn't expect Mojo to be slow at.

1 more reply

jodrellblank2y ago· 1 in thread

Doesn’t include Prolog which has decent constraint solver answers for n-queens and sudoku, which are pretty fast but I don’t know how they would compare in benchmarks:

https://www.metalevel.at/queens/

https://www.metalevel.at/sudoku/

trenchgun2y ago

Make a pull request.

pasc18782y ago· 1 in thread

See also https://benchmarksgame-team.pages.debian.net/benchmarksgame/... for more algorithms and code hand tuned as well as plain.

igouy2y ago

Thanks, and they know: "Plb2 complements the Computer Language Benchmark Games."

neonsunset2y ago· 1 in thread

C# code should be using an official package for GEMM which is System.Numerics.Tensors, it is pure C# but runs at max hw efficiency (it is also idiomatic). Or at least use Vector<T> instead of scalar operations. I’d expect this applies to most other popular languages here too.

attractivechaosOP2y ago

System.Numerics.Tensors is disqualified because it uses a different algorithm. If you have a faster matmul implementation in Vector<T>, a PR will be much appreciated. Thank you in advance.

GrumpySloth2y ago· 1 in thread

A curious thing about Swift: after https://github.com/attractivechaos/plb2/pull/23, the matrix multiplication example is comparable to C and Rust. However, I don’t see a way to idiomatically optimise the sudoku example, whose main overhead is allocating several arrays each time solve() is called. Apparently, in Swift there is no such thing as static array allocation. That’s very unfortunate.

attractivechaosOP2y ago

Figure updated. Now swift is pretty fast on nqueen+matmul but it has the longest green bar (i.e. longest running time for sudoku). This looks ... interesting.

There are only 20 thousand array allocations in total, not a lot. JavaScript also has this many arrays allocated/deallocated but it is 4 times as fast.

mtreis862y ago· 1 in thread

A benchmark I would like to see is a comparison of languages in terms of how fast they are to beginners vs experts. I've been thinking about how to design it to get that result. What I think would work is taking something like these simple puzzles and have maybe a hundred people write up different solutions, so we can compare them using the programmer's level of expertise as one of the factors.

igouy2y ago

How would you decide who were beginners and who were experts?

Here are naive line-by-line transliterations from an original C program:

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

Here exhaustively-optimised + multicore + vector-instruction programs are included:

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

naet2y ago· 1 in thread

Surprised how well JS (node and others) seem to come out when I've had firsthand experience of switching from JS to Go to speed up an algorithm type question and had the Go version crank through a bruteforce much much faster.

Maybe I made an accidental optimization in my language translation, or maybe there are some operations that are much slower in JS and these benchmarks didn't hit any of them.

lifthrasiir2y ago

Go's AOT compiler is actually not that sophisticated (comparable to -O1 in most C compilers, possibly even worse). Was your program running only for a fraction of a second? Then the JIT may not have fully warmed up, and even a basic AOT compiler has a better chance to win.

Jaxan2y ago· 1 in thread

One thing I always wonder with these benchmarks is: do we want to know the best possible runtime for a language or the runtime of idiomatic/average implementations? Even with C there are loads of compiler flags (and compilers) to choose from.

igouy2y ago

The data has nothing definite to say about language implementations that were not measured. Let's focus on what we might say about those that were measured.

wslh2y ago· 1 in thread

Nowadays I would add lines of code and syntax complexity to the benchmark. It is an apples-and-oranges comparison, but you could prefer clarity over performance depending on the circumstances. Also, it is relatively easy to do with today's language tools.

igouy2y ago

For example:

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

"How source code size is measured"

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

brightball2y ago· 1 in thread

I wonder if JRuby or Truffle Ruby would have a significant impact on this one for Ruby?

igouy2y ago

Please make and publish those measurements!

wk_end2y ago· 1 in thread

This is great, thanks for posting this.

OP - are you interested in pull requests adding support for other languages?

attractivechaosOP2y ago

Of course! Please implement at least nqueen and matmul as they have been implemented in every language in the benchmark.

sakras2y ago· 1 in thread

How come not all benchmarks appear on all languages? For example, Zig's bar appears to be lower than C's by virtue of sudoku and bedcov being missing.

tmtvl2y ago

Someone needs to write sudoku and bedcov for Zig.

nnx2y ago

Nice to see Go performing well. Would be interesting to see the results with PGO enabled.

taeric2y ago

This is an area where "solvers" are heavily under utilized. For the absolute fastest solution, bespoke implementations are almost certainly required.

However, translating Sudoku and N-Queens into a similar problem that you can feed into a solver can get you a long way. Even better, you can move that solver into whatever language gives you the best optimizations that you can work with. On top of that, there are almost certainly optimizations in common solvers that you don't want to deal with implementing on your own.

_giorgio_2y ago

Given the times we're in, it would be interesting to test some libraries too, like numpy and pytorch

quelsolaar2y ago

Great post: I'd be interested to see the difference between different C compilers, and to see what the difference is if the C code is compiled with a C++ compiler, if possible.

doug_durham2y ago

I think this shows the value of programmer productivity over performance at all costs. Python is one of the most popular languages despite having performance issues for complex algorithms. Users value clarity and ease of expression over performance. That's why Python is primarily used as glue code in these complex tasks.

2 more replies


p-e-w2y ago· 10 in thread

What is this supposed to demonstrate?

There appears to be roughly the same code structure, ported to every language, while for some languages, arbitrary optimizations are introduced (such as using `array` instead of `list` in Python).

Without taking such real-world conventions into account, such comparisons say essentially nothing about the languages involved (and their all-important ecosystems).

tsimionescu2y ago

p-e-w2y ago

> But it can show the relative speed of pure Python at purely computational tasks.

But that's irrelevant if nobody uses "pure Python" for computational tasks.

It's like asking "how well do these languages run on a Lisp machine from 1979?". It simply has no relevance to real-world considerations today.

3 more replies

lifthrasiir2y ago

> Generally, the point of language benchmarks is to show how the languages compare at solving the same problem, without external libraries.

1 more reply

morepedantic2y ago

p-e-w2y ago

> It demonstrates that Python needs libraries like NumPy.

> It also means that adding performance to an existing Python program requires dropping into a different language

Nobody who needs performant matrix multiplication from C thinks "hey, let me just write two nested loops".

3 more replies

lifthrasiir2y ago

1 more reply

doix2y ago

It's actually not that bad. I think it's part of the reason Python became so popular, it's fairly easy to write C code and expose it via python.

RHSeeger2y ago

> It demonstrates that Python needs libraries like NumPy.

attractivechaosOP2y ago

> What is this supposed to demonstrate?

ggm2y ago

So you'd be measuring the speed of loops over calls out to C or Fortran (NAG library) or their vectorisation, and only testing the cost of data representation changes.

Atotalnoob2y ago· 8 in thread

Aren’t JIT languages at a disadvantage since they are benchmarked through the CLI rather than using a benchmarking library to allow the JIT to warm up?

brabel2y ago

nithril2y ago

Going from nothing to a result rarely happens in real life. I rarely see anyone start and stop a program per unit task (like piping commands).

1 more reply

igouy2y ago

Some language implementations have run time startup costs (bytecode verifier) that are not required at run time by other language implementations.

https://www.oracle.com/java/technologies/security-in-java.ht...

Why should that run time startup cost be ignored?

jakobnissen2y ago

Yes, but the author claims the longest JIT warmup is 0.3 seconds, so it's not an important issue in these benchmarks that take several seconds.

lifthrasiir2y ago

1 more reply

nithril2y ago

0.3s is significant for a task that only takes 1.14s

1 more reply

borissk2y ago

What's CLI?

zogrodea2y ago

Presumably command line interface/terminal. I can type `dotnet run` in the terminal and provide some options to run a .NET program, for example.

hyperman12y ago· 6 in thread

I wondered why Rust is so far behind C/Nim/Zig; they should have similar behaviour.

The difference is mostly in matmul.

I see how C, for an n x n matrix, does 2 allocations, while Rust does n+1. C's matrix rows are right next to each other; Rust's are probably all over the place. Didn't look at Nim or Zig.
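To make the layout point concrete, here is a minimal sketch in Python (chosen only for brevity; it is not the benchmark's code, and all function names are made up for illustration). One version stores the matrix as n separate row objects, the other as a single flat buffer where element (i, j) lives at offset i * n + j. Python lists hold references to boxed floats, so this shows only the indexing scheme, not the real cache behaviour of C or Rust.

```python
def matmul_nested(a, b):
    """Multiply two n x n matrices stored as lists of row lists."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]  # n separate row objects
    for i in range(n):
        for k in range(n):
            aik = a[i][k]
            for j in range(n):
                c[i][j] += aik * b[k][j]
    return c


def matmul_flat(a, b, n):
    """Multiply two n x n matrices stored row-major in one flat list."""
    c = [0.0] * (n * n)  # one buffer; element (i, j) lives at i * n + j
    for i in range(n):
        for k in range(n):
            aik = a[i * n + k]
            for j in range(n):
                c[i * n + j] += aik * b[k * n + j]
    return c


def flatten(m):
    """Copy a nested matrix into one row-major flat list."""
    return [x for row in m for x in row]
```

In a language with unboxed arrays, the flat version keeps consecutive rows adjacent in memory, which is the property the comment attributes to the C implementation.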

nrabulinski2y ago

hgomersall2y ago

Also it's using checked indexing, which apart from not being idiomatic, is also going to slow things down. A fairer comparison would be to use the unchecked indexing variants.

2 more replies

attractivechaosOP2y ago

FYI: Rust has been updated to avoid unnecessary bounds checks in PR #4. It now matches the C version in performance.

tiehuis2y ago

Zig had a similar issue which I made an MR to fix. I didn't actually notice Rust had the same issue but that's probably just my forgotten knowledge with how the vec! macros expand when nested.

spense2y ago

rust is equivalent to C when using static allocation

i sent a pr to fix it

Aissen2y ago

Edit 2: updated, and my rust iterator version now ~equivalent to C with no unsafe.

Edit 3: too late, the repo has been updated with another iterator-based version that is just as fast.

1 more reply

metaltyphoon2y ago· 6 in thread

Why not run C# as AOT? It’s a one line change.

borissk2y ago

Do people usually run C# code AOT compiled?

romanovcode2y ago

If it's for something stupid like benchmark comparison - yes.

metaltyphoon2y ago

I don't know about other people, but I do pretty much everywhere I can. It makes such a difference that AWS is heavily investing in Lambdas that use AOT[0].

[0] - https://docs.aws.amazon.com/lambda/latest/dg/dotnet-native-a...

neonsunset2y ago

It’s gaining popularity for CLI and serverless, some people use it to replace C++ for writing native libraries and plugins as well.

mlhpdx2y ago

Yes, everywhere I possibly can. Particularly for targeting CLI programs and Linux daemons.

xnorswap2y ago

Or use BenchmarkDotNet which, among other things to get an accurate benchmark, does JIT warmup outside of measurement.

( https://github.com/dotnet/BenchmarkDotNet ).
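The general idea of excluding warmup from the measurement can be sketched in a few lines. This is a generic harness written in Python for illustration only; it is not BenchmarkDotNet's API, and `time_after_warmup`, `warmup` and `repeat` are hypothetical names.

```python
import time


def time_after_warmup(workload, warmup=3, repeat=5):
    """Return the best wall-clock time of `workload` after untimed warmup runs."""
    for _ in range(warmup):
        workload()  # untimed: lets a JIT (or caches) settle before measuring
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        workload()
        best = min(best, time.perf_counter() - start)
    return best
```

Timing the whole process from the command line, as the benchmark does, instead folds startup and warmup into the reported number.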

akashcoach2y ago· 6 in thread

Why is C# so slow for matmul compared to Java?

useerup2y ago

C# also has arrays-of-arrays, and could (should) be written in the same manner.

iruoy2y ago

I've just done this and it has been merged. The benchmarks table and image haven't been updated yet. But this should bring the C# result to ~2s instead of 4.67s

akashcoach2y ago

Thanks for the explanation. Yes, this makes sense to me.

mlhpdx2y ago

The implementation isn’t using any modern C# features?

neonsunset2y ago

akashcoach2y ago

Which modern features do you mean? Java and C# code looks similar to me.

theteapot2y ago· 5 in thread

EmberTwin2y ago

habibur2y ago

I put the pure math code in a C extension for PHP. Building a C extension for PHP is easier than for most other high-level languages. And then things get blazingly fast.

brrrrrm2y ago

doesn't look that easy... zend_parse_parameters? pre-baked configure + make scripts?

check out how a modern language deals with this stuff https://bun.sh/docs/api/ffi#usage

1 more reply

Ayesh2y ago

The state of PHP 8.3 and especially 8.4 is a lot better than HHVM. They have diverged quite a lot to a point that it's no longer a drop-in replacement.

PHP added JIT in 8.0, and these math-heavy tasks can take advantage of it. It's not trivial to fine-tune the JIT configuration though.

In PHP 8.4 (scheduled Nov 2024), there is a major upgrade to JIT as well.

geek_at2y ago

Also, since PHP was built for web requests, it builds the cache on the first run. It would be interesting to see all languages of the benchmark with a second run.

keyle2y ago· 3 in thread

I love a good benchmark, thanks for putting this together! However, I have a bit of feedback.

Second, the examples are taken from rosetta code and are not necessarily what would be the best implementation, or even close to the best implementation, for benchmarking purposes.

More benchmarks would be interesting to see, such as actual real world operations, e.g. opening a file, reading it, parsing json, opening a socket server, etc.

cm21872y ago

attractivechaosOP2y ago

WhereIsTheTruth2y ago

There is one for data processing here: https://github.com/jinyus/related_post_gen

latenightcoding2y ago· 3 in thread

sneed_chucker2y ago

Well, for a long time speed was not a priority at all for the team developing cpython. In fact, Python 3 was still a bit slower than Python 2 until a few versions ago.

JavaScript has had a feature complete JIT reference implementation since 2008 which is a major part of the reason JS applications exploded so much in the 2010s.

d0mine2y ago

latenightcoding2y ago

Perl people would use PDL or Math::GSL. I look into these benchmarks because I'm interested in VMs, and yeah the Python VM can be quite slow if you are not calling C code.

shpx2y ago· 3 in thread

You should add a chart of the number of gzip'd bytes of source code.

morepedantic2y ago

lifthrasiir2y ago

1 more reply

dawnofdusk2y ago

Why would uncompressed bytes be better? Using a good compression algorithm better approximates the statistical entropy of the code which is at least correlated with e.g., Kolmogorov complexity.
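For anyone who wants to try this, a minimal sketch of measuring raw versus gzip'd source size with Python's standard library follows. The function name `source_sizes` is made up for illustration, and this is not necessarily how any existing benchmark site measures code size.

```python
import gzip


def source_sizes(source):
    """Return (raw_bytes, gzipped_bytes) for a source-code string."""
    raw = source.encode("utf-8")
    # mtime=0 keeps the gzip header deterministic across runs.
    packed = gzip.compress(raw, mtime=0)
    return len(raw), len(packed)
```

Repetitive boilerplate shrinks a lot under compression, so the gzip'd size penalises verbosity less than a plain byte or line count does.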

2 more replies

borissk2y ago· 3 in thread

N-queens and matrix benchmark without Fortran and Wolfram Mathematica?

smitty11102y ago

bjourne2y ago

No, it isn't. OP is benchmarking language-native implementations of all algos.

Haemm0r2y ago

Fortran the king of matrices :-)

daxfohl2y ago· 3 in thread

Odd that nqueens and sudoku have a high correlation but matmul seems to be largely doing its own thing.

nqueen vs. sudoku: 0.531

matmul vs. sudoku: 0.362

matmul vs. nqueen: 0.127
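How these figures were computed is not stated in the thread. One plausible reading is a Pearson correlation over the per-language timings, which can be sketched as follows (an assumption, not the commenter's actual method; `pearson` is an illustrative helper).

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)
```

Calling `pearson(nqueen_secs, sudoku_secs)` with two lists of timings in the same language order would give one such coefficient. With a few very slow languages in the data, correlating the logarithms of the timings instead may give a noticeably different picture.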

brrrrrm2y ago

gnufx2y ago

1. https://www.cs.utexas.edu/users/flame/pubs/blis1_toms_rev3.p...

yxhuvud2y ago

1letterunixname2y ago· 3 in thread

For the love of god, log graph tiny values with large values please. :)

xnorswap2y ago

Another alternative here is to plot the reciprocal (Op / Sec) which has the benefit of both the more natural "Bigger is Better" and easier comparison of the fastest runtimes.
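Both suggestions are a one-line transform of the raw timings. A small sketch, assuming the timings are held in a dict from language name to seconds (the function name and data shape are illustrative, not taken from the repo):

```python
import math


def chart_values(timings):
    """Map each language to (runs_per_second, log10_seconds)."""
    return {
        lang: (1.0 / secs, math.log10(secs))
        for lang, secs in timings.items()
    }
```

The reciprocal spreads out the fastest entries, while the log scale keeps both the fastest and the slowest entries readable on one axis.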

igouy2y ago

"Bigger is Better" may be "more natural" for Op / Sec but secs is "more natural" for "comparison of the fastest runtimes".

airstrike2y ago

Or just make two charts ;-)

lifthrasiir2y ago· 2 in thread

yxhuvud2y ago

cm21872y ago

Doesn’t that get mostly optimised away by the cpu branch prediction?

2 more replies

jakobnissen2y ago· 2 in thread

Neat!

The Julia matmul implementation has its rows and columns flipped though - unlike C, Julia uses column-major matrices. This has large implications for speed.
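To see why the loop order matters, here is a small sketch (in Python, purely illustrative, with a made-up function name) that lists the flat memory offsets visited by a two-level loop nest under each layout.

```python
def offsets_visited(nrows, ncols, column_major, inner_is_row):
    """List flat offsets in visit order for a two-level loop nest."""
    def offset(i, j):
        # Column-major: consecutive rows of one column are adjacent.
        # Row-major: consecutive columns of one row are adjacent.
        return i + j * nrows if column_major else i * ncols + j

    out = []
    if inner_is_row:
        # Inner loop walks down a column: the row index changes fastest.
        for j in range(ncols):
            for i in range(nrows):
                out.append(offset(i, j))
    else:
        # Inner loop walks along a row: the column index changes fastest.
        for i in range(nrows):
            for j in range(ncols):
                out.append(offset(i, j))
    return out
```

For a 2 x 3 column-major matrix, letting the row index vary fastest visits offsets 0 through 5 in order, while letting the column index vary fastest jumps around with a stride of 2. A loop nest copied unchanged from a row-major language ends up with the strided pattern.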

attractivechaosOP2y ago

Just come here to say this has been fixed in PR #2. The figure has been updated accordingly.

mafuy2y ago

Yes. For this type of test, Julia should be nearly the exact same speed as C unless there is a mistake in the code. Which, to be fair, can happen very easily, especially with implicit typing.

airstrike2y ago· 2 in thread

tmtvl2y ago

The second chart is basically the first chart with the slow 4 removed (well, not removed but at such a scale that they're irrelevant).

airstrike2y ago

Ah, duh, I see that now. That's what I get for commenting too quickly ;-)

The choice of stacked bar charts is also strange because the languages are not all comparable

So the point still stands that many charts doing one thing each would be better than fewer charts doing many things each

1 more reply

alberth2y ago· 2 in thread

> "Timing on Apple M1 Macbook Pro"

Given that it's become increasingly common for CPUs to have both Performance & Efficiency cores … how do benchmarks ensure they are only being run on the P-cores?

vlovich1232y ago

sroussey2y ago

It actually takes a bit of effort to run on the efficiency cores.

I believe Game Mode will push the processes to e cores to keep a consistent game play without thermal throttling.

bjourne2y ago· 2 in thread

nerdponx2y ago

I think a benchmark of "naive" implementations is interesting too, because it shows you how fast your code is usually going to run, not how fast it theoretically could run at its best.

igouy2y ago

> shows you how fast your code is usually going to run

Why would you think that?

Maybe the "naive" implementation just shows how fast the easily removed hotspot in your code is going to run.

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

montebicyclelo2y ago· 2 in thread

The reality is that it would be very hard to find Python code that does not use NumPy (or some tensor lib), for matmul.

Including time to JIT compile is questionable, why not also include time to compile the compiled languages?

igouy2y ago

"Nonetheless, because most benchmarks run for several seconds, including the startup time does not greatly affect the results."

https://github.com/attractivechaos/plb2#startup-time

montebicyclelo2y ago

Yes, I saw that and still consider the methodology questionable. A fairer approach might be: time from the CLI, including compilation, for compiled languages. (Or warm up the JIT-compiled code.)

1 more reply

account-52y ago· 2 in thread

Glad to see dart in there, it's normally overlooked. Not a bad result. I wonder what speed it would be if it had been AOT compiled instead of JIT.

igouy2y ago

fwiw

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

adamredwoods2y ago

And LuaJit surprised me quite a bit!

FrustratedMonky2y ago· 2 in thread

It has .NET generic.

Would .NET with F# make a big difference here?

I'm a little surprised Java beat .NET. Is that typical these days?

Kuinox2y ago

> I'm a little surprised Java beat .NET. Is that typical these days?

Not at all, this is just a bad benchmark: it measures from the CLI run with no specific flags, which is just terrible for cold starts.

airstrike2y ago

+1 curious about F#

raister2y ago· 2 in thread

I would expect Mojo to perform better in these settings.

jorgelopes2y ago

Mojo is on the list and performed well. Have a look

vsskanth2y ago

It's 2x slower than C in matmul, which is something I wouldn't expect Mojo to be slow at.

1 more reply

jodrellblank2y ago· 1 in thread

Doesn’t include Prolog which has decent constraint solver answers for n-queens and sudoku, which are pretty fast but I don’t know how they would compare in benchmarks:

https://www.metalevel.at/queens/

https://www.metalevel.at/sudoku/

trenchgun2y ago

Make a pull request.

pasc18782y ago· 1 in thread

See also https://benchmarksgame-team.pages.debian.net/benchmarksgame/... for more algorithms and code hand tuned as well as plain.

igouy2y ago

Thanks, and they know: "Plb2 complements the Computer Language Benchmark Games."

neonsunset2y ago· 1 in thread

attractivechaosOP2y ago

System.Numerics.Tensors is disqualified because it uses a different algorithm. If you have a faster matmul implementation in Vector<T>, a PR will be much appreciated. Thank you in advance.

GrumpySloth2y ago· 1 in thread

attractivechaosOP2y ago

Figure updated. Now swift is pretty fast on nqueen+matmul but it has the longest green bar (i.e. longest running time for sudoku). This looks ... interesting.

There are only 20 thousand array allocations in total, not a lot. JavaScript also has this many arrays allocated/deallocated but it is 4 times as fast.

mtreis862y ago· 1 in thread

igouy2y ago

How would you decide who were beginners and who were experts?

Here are naive line-by-line transliterations from an original C program:

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

Here exhaustively-optimised + multicore + vector-instruction programs are included:

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

naet2y ago· 1 in thread

Maybe I made an accidental optimization in my language translation, or maybe there are some operations that are much slower in JS and these benchmarks didn't hit any of them.

lifthrasiir2y ago
