How does Clang 2.7 hold up in 2021?

(gist.github.com)

163 points · tbodt · 5y ago · 32 comments

27 comments · 9 top-level

kevmo314 · 5y ago · 6 in thread

> It is possible in theory that code that's less carefully optimized exhibits different behavior, or that the benchmarks chosen here are simply not as amenable to compiler optimization as they could be

This seems like a rather important point that's glossed over. Typical code is not often as optimized and meticulously written. It would be nice to see how much compilers have improved there.

Narishma · 5y ago

It's not glossed over. It's mentioned multiple times in the article.

bombcar · 5y ago

Yeah, I would suspect that code that has "algorithms" is already pretty well optimized before it hits the compiler; the place where gains would be seen would be in "enterprise" or business code - or something like OpenOffice.

Harder to benchmark.

mpweiher · 5y ago

> harder to benchmark

And typically much harder to care about, as in: those applications are not stuck in algorithms that benefit from compiler optimisation.

They're waiting for the database. Or the user. Or on the graphics library.

josephg · 5y ago

Yeah I agree. It would be really interesting to see the performance difference of a larger program - firefox, the linux kernel, postgres, or maybe clang itself.

Unfortunately it might be hard to get the same program to compile with both compilers without a bit of work.

bsaul · 5y ago

Isn't LLVM 11 able to build software from the LLVM 2.7 era?

flohofwoe · 5y ago

More work for the optimizer (because the code hasn't been manually "pre-optimized") most likely means longer compile times too though, so the relation between "twice the compile time for 15% speedup" might not improve much, and the optimizer might spend a lot of time on code that actually doesn't need optimizing (because it's not "hot" code).

I agree though, it would be great to see the same experiment on other code bases.

einpoklum · 5y ago · 3 in thread

It's not clear that the author of that post used `-march=native -mtune=native`. And if they didn't, that could account for the odd results.
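
One way to see what those flags change is to ask the compiler which SIMD level it was targeting. The sketch below is only an illustration: `compiled_simd_level` is a hypothetical helper, and it relies on the `__SSE2__`, `__AVX2__` and `__AVX512F__` predefined macros that GCC and Clang set according to the selected target.

```cpp
#include <string>

// Hypothetical helper: reports the highest x86 SIMD level this translation
// unit was compiled for, based on macros GCC and Clang predefine.
std::string compiled_simd_level() {
#if defined(__AVX512F__)
    return "AVX-512F";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__SSE2__)
    return "SSE2";
#else
    return "baseline (no x86 SIMD macro detected)";
#endif
}
```

On a typical x86-64 build without `-march=native` this would be expected to report SSE2, since that is part of the x86-64 baseline; with `-march=native` on a recent CPU it would likely report something newer.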

my123 · 5y ago

In practice, you can almost never use that for desktop and tablet software.

Most of your users would not be able to use the software otherwise, which is not a small problem.

aktenlage · 5y ago

If they did, the article would really need a distinction between the speedup by new hardware features (which the old compiler cannot know) and hardware-independent smarter optimizations.

Since the author seems to care about the latter, I assume they did not use those flags.

brokensegue · 5y ago

unless they ran it on hardware from clang 2.7's era.

uep · 5y ago · 2 in thread

I have a simple C++ raytracer I wrote by going through Ray Tracing in One Weekend. I have not even made an attempt to optimize it. I really only made it parallel by splitting it up into tiles.

Clang 10 was able to automatically vectorize the code, so it performs >2x as fast as GCC 8.3. To be fair to GCC, I'm using my distro's GCC, but I built a newer Clang for C++ coroutine support.
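
For readers who want to try this themselves, here is a minimal sketch of the kind of loop auto-vectorizers usually handle: a counted loop over contiguous floats with no dependency between iterations. `scale_add` is a hypothetical function, not code from the raytracer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: out[i] += k * in[i] for every element.
// Each iteration is independent, which is what makes the loop a good
// candidate for auto-vectorization.
void scale_add(std::vector<float>& out, const std::vector<float>& in, float k) {
    assert(out.size() == in.size());
    const std::size_t n = out.size();
    float* dst = out.data();
    const float* src = in.data();
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] += k * src[i];
    }
}
```

Compiling with Clang's `-Rpass=loop-vectorize` or GCC's `-fopt-info-vec` should print a remark when the loop is vectorized, which is quicker than reading objdump output, though whether it fires depends on the compiler version and optimization level.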

slacka · 5y ago

Are you sure? Modern clang and gcc both have auto-vectorizers. clang's is enabled by default.[1] gcc only enables its vectorizer at -O3, or at lower levels with an explicit '-ftree-vectorize'.[2] For my use case, I've seen the most improvements with clang + openmp + polly, requiring code changes along with hinting. Good news if your analysis is correct.

As for the article, I'm surprised Cache and Meshlets are 5% slower in 11 than in 2.7. It would be useful to know what caused this regression.

[1] https://llvm.org/docs/Vectorizers.html

[2] https://gcc.gnu.org/projects/tree-ssa/vectorization.html

uep · 5y ago

Am I sure about what? If it is auto-vectorizing? Yes. If the performance difference at O2 for both compilers is that dramatic? Yes. If the vectorization is the ultimate difference in the performance? No, not really.

I looked at the disassembly with objdump. I tend to build with both clang and GCC regularly, for some reason I like comparing them. Since I'm sending many rays and bounces, a 50% reduction in time is very noticeable, so I looked at the generated code. I mentioned the GCC version because it is slightly unfair to compare a very new clang to GCC from a few years back. The GCC output has some vectorization as well, but the clang output seems to generate smaller code with more vectorization. It would be interesting to compare it side-by-side on godbolt, but I'd have to cut-and-paste a bunch of files to do so, and it's not a priority at the moment.

Maybe I should have responded to another comment here. The intention of my previous comment was to bolster the idea that more typical naive and less-optimized code might benefit more than already-optimized code like in the article. 3d math in general is obviously a domain that can benefit from vectorization more than most.

Another fun find was that sharing the PRNG state among threads destroyed performance. I have other higher-priority side projects, so I haven't had a chance to investigate why yet: whether it was something like the cache line bouncing between cores (I wouldn't be surprised if the PRNG was the hottest code in the whole program), or a cascading effect on the generated code. A lot of my code is visible to the compiler for the ray tracing hot path, so it's also possible it broke inlining or some other compiler optimizations.
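
If the cause does turn out to be contention on shared state, one common workaround is to give every thread its own generator. The sketch below is an assumption about what that could look like, not the actual raytracer code; `random_double` is a hypothetical helper in the style of the one used in Ray Tracing in One Weekend.

```cpp
#include <cassert>
#include <random>

// Hypothetical helper: returns a random double in [lo, hi).
// The generator is thread_local, so each thread seeds and advances its own
// state and no cache line holding PRNG state is shared between cores.
double random_double(double lo, double hi) {
    assert(lo < hi);
    thread_local std::mt19937_64 rng(std::random_device{}());
    std::uniform_real_distribution<double> dist(lo, hi);
    return dist(rng);
}
```

This does not distinguish between the two explanations above; it only removes the sharing, so comparing timings before and after would be one way to test the cache-line theory.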

FartyMcFarter · 5y ago · 2 in thread

I would expect Proebsting's law to hit a wall faster than Moore's law, simply because software performance is better understood than physics.

Perhaps someone could compare FORTRAN compilers to get a longer term view.

ChrisLomont · 5y ago

Both general relativity and quantum field theories make predictions that match experiments to around 12 digits of accuracy.

I doubt anything in software performance comes within many orders of magnitude of that.

FartyMcFarter · 5y ago

Accurate predictions are not enough to fulfill Moore's law.

person_of_color · 5y ago · 2 in thread

Does anyone have a good resource/book on how to do close to the metal benchmarking?

matt_d · 5y ago

I can definitely recommend https://book.easyperf.net/perf_book

The author's blog has been consistently great throughout the years, https://easyperf.net/notes/

See also microarchitectural performance analysis tools & readings, https://github.com/MattPD/cpplinks/blob/master/performance.t... and "Comments on timing short code sections on Intel processors", http://sites.utexas.edu/jdm4372/2018/07/23/comments-on-timin...

sesuximo · 5y ago

Intel's VTune manual and Agner Fog's manuals cover this (and much more).

person_of_color · 5y ago · 2 in thread

Wow, 10 years for only 15%.

gameswithgo · 5y ago

To be fair, even the less optimized areas of meshoptimizer are very low-level code. Probably not much optimization to be done. I've seen this in other domains too: I have some graphics/art code that is very low-level C#, and going from .NET 4.6 to the latest .NET Core, which brings huge performance gains in normal enterprise code, does nothing for it. Which makes sense: it's all carefully thought-out loops on arrays, not much to be done.

But that does bring up the point: would we be better served by just writing lower-level code when needed and turning off all these optimizations for faster compile times?

dtech · 5y ago

A better compiler means fewer spots need hand-optimizing. Machine time is cheap compared to human labor. I suspect the break-even comes very quickly in favor of slower but better compilers.

vendiddy · 5y ago · 1 in thread

Are bigger optimizations to be had in the design of higher-level languages that are easier for compilers to optimize?

As an extreme example, I imagine dynamic languages are hard to optimize because the compiler can make few assumptions about the code.

(Have little knowledge of compilers so correct me if I'm wrong.)

samus · 5y ago

Higher level languages often use complex data structures at runtime that are a maze of pointers and thus suffer from bad cache locality. Such languages benefit the most from language-specific high-level optimizations. Haskell for example uses strictness analysis to eliminate pointless lazy evaluation, and loop fusion to combine calls to `filter`, `map` and friends, thereby avoiding building up intermediary lists.

An ahead-of-time compiler can do very little for dynamic languages. It could try to apply high-level optimizations, but as you say they are few and far between, and hard. A just-in-time compiler that optimizes hot paths at runtime is usually the way to go. Unfortunately, JITs are quite a bit more complex, and most dynamic languages did not have them for a long time.
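
To make the loop fusion point concrete, here is a hand-written C++ sketch of the transformation (the Haskell compiler does this automatically; both function names here are hypothetical). The unfused version mirrors `map` applied to the result of `filter` and materializes an intermediate list, while the fused version makes one pass.

```cpp
#include <vector>

// Unfused: "filter" builds an intermediate vector, then "map" walks it again.
std::vector<int> squares_of_evens_unfused(const std::vector<int>& xs) {
    std::vector<int> evens;  // intermediate list that fusion would eliminate
    for (int x : xs) {
        if (x % 2 == 0) evens.push_back(x);
    }
    std::vector<int> out;
    for (int x : evens) {
        out.push_back(x * x);
    }
    return out;
}

// Fused: one pass over the input, no intermediate allocation.
std::vector<int> squares_of_evens_fused(const std::vector<int>& xs) {
    std::vector<int> out;
    for (int x : xs) {
        if (x % 2 == 0) out.push_back(x * x);
    }
    return out;
}
```

Both return the same result; the fused form simply avoids the extra allocation and the second traversal.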

chalst · 5y ago

I'm not all that surprised by the small improvement on regular C++ code: the last decade hasn't seen radical changes in how this is done; compiler innovation has been elsewhere with only the SIMD story seen in this article. I was surprised by the lousy build times, though.

The choice of WSL2 as platform introduces a few confounders, especially filesystem performance, which might distort the differences between build times in particular. If someone wants to get a better understanding of what's going on, maybe a breakdown of where the time is spent or performing the benchmarks on other platforms would be a good idea.

KirillPanov · 5y ago

> This takes me back to "The death of optimizing compilers" by David J. Bernstein

DJB is Daniel J. Bernstein
