I am finding it extremely hard to envision a circumstance where this is a bottleneck for anything. Care to clarify the context?
I also find the struct layout really odd; why not just move c?
Your benchmarks are also probably broken: branch predictors use global state, so they will almost certainly predict correctly the way you've set things up. You need to repopulate a significantly sized array with randomly chosen values on each run. You can't reuse the same array because its pattern will be learned, and you can't use a short array because it will be predicted from global history.
For some additional context, these structs are packed wire-line protocol structs. In reality the padding bytes are full of other fun details.
For me, a wrapper includes the original thing and just wraps stuff around it. How does this work here? May also be a question to the author, I suppose now...
clang warning: flexible array members are a C99 feature [-Wc99-extensions]
is this still the case?
"Don't."
The end.
(Basically, as the story shows, with this kind of micro-optimization you may or may not beat the compiler but you're almost certainly wasting your time compared with more effective optimization methods, like rethinking the problem.)
Still, a business case can exist when the library is heavily utilized, or when a compiler isn't able to produce sufficiently fast code.
There are also cases, primarily in finance, where single-threaded low latency distinguishes competing groups. Some of those guys count every nanosecond.
The techniques described here (and in other places) are universally applicable.
Try as I might, I could not beat GCC [2], which used non-vectorized code. I chalk it up to not knowing how best to write optimized x86 code anymore (it's been years since I did any real assembly language programming) and I might be hitting some scheduling or pipeline issues, I just don't know.
[1] I described the code years ago here: http://boston.conman.org/2004/06/09.2
[2] I beat clang easily though.
When that isn't the case, it is just wasting money.
> you're almost certainly wasting your time compared with more effective optimization methods
Some rare cases absolutely do exist where specific micro-optimizations such as these may be useful - I'm not arguing that they don't.
Even in those cases, though, you're far more likely to achieve significant performance gains by taking a step back and re-examining your high level goals and your approach to the problem.
I'm not saying that eliminating a branch or two is never going to be useful. I'm just saying you should focus your attention elsewhere first.