The cost of dynamic vs. static dispatch in C++

(eli.thegreenplace.net)

125 points · mekishizufu · 12y ago · 70 comments

49 comments · 15 top-level

rottyguy12y ago· 7 in thread

anything similar for higher level languages (c# or the likes)?

andor12y ago

The Java HotSpot VM can still optimize for this case if the virtual call resolves to only a few classes most of the time. Several virtual methods can be inlined, but of course there's still an extra step compared to static dispatch: the class of the current object has to be compared against the classes of the inlined methods. If no matching method is inlined, control needs to be passed back to the VM.

pmjordan12y ago

A fascinating article on this type of optimisation:

http://www.azulsystems.com/blog/cliff/2010-04-08-inline-cach...

army12y ago

This is less of a problem for systems with JIT compilation (including Java, C#, etc). They can recompile the code at runtime, which allows some nice tricks for virtual calls. They can turn a virtual call into a regular call with inline caching (http://en.wikipedia.org/wiki/Inline_caching), or can even compile a specialized version of the code for a given type and inline the entire virtual function.

kevingadd12y ago

The .NET and Java VMs are in fact able to do some inlining on virtual/interface calls and do other sorts of smart dispatch. So the cost of a virtual method in .NET and Java is not necessarily equivalent to the cost in C++.

MichaelGG12y ago

I tried a simple example[1] in C#, .NET 4.5, both 32 and 64-bit, just looping and calling an Add method. Adding the keyword virtual increased runtimes by over 200%. JVMs might do this, but the CLR's codegen doesn't.

An old blog post by one of the CLR engineers[2] states:

"We don't inline across virtual calls. The reason for not doing this is that we don't know the final target of the call. We could potentially do better here (for example, if 99% of calls end up in the same target, you can generate code that does a check on the method table of the object the virtual call is going to execute on, if it's not the 99% case, you do a call, else you just execute the inlined code), but unlike the J language, most of the calls in the primary languages we support, are not virtual, so we're not forced to be so aggressive about optimizing this case."

I guess things haven't changed. My testing with the CLR indicates that for best performance, you should make sure your IL is already inlined. The CLR does much better with huge function bodies.

1: http://pastebin.com/98c7Bt7f

2: http://blogs.msdn.com/b/davidnotario/archive/2004/11/01/2503...
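
The guarded fast path described in the quoted CLR post (check the object's type, run the inlined code in the common case, otherwise make a normal virtual call) can be written out by hand. The following is only a minimal C++ sketch of the idea, not what any JIT actually emits: `Shape`, `Triangle`, `Square` and `sides_guarded` are hypothetical names, and `typeid` stands in for the cheaper method-table comparison a runtime would use.

```cpp
#include <cassert>
#include <typeinfo>

struct Shape {
    virtual ~Shape() = default;
    virtual int sides() const = 0;
};

struct Triangle final : Shape {
    int sides() const override { return 3; }
};

struct Square final : Shape {
    int sides() const override { return 4; }
};

// Hypothetical hand-written guard: if the dynamic type is the one we
// expect to see most often, make a qualified (non-virtual) call that
// the compiler is free to inline; otherwise fall back to ordinary
// virtual dispatch.
inline int sides_guarded(const Shape& s) {
    if (typeid(s) == typeid(Triangle)) {
        return static_cast<const Triangle&>(s).Triangle::sides();
    }
    return s.sides();
}
```

Calling `sides_guarded` on a `Triangle` takes the guarded branch and on a `Square` takes the fallback; both return the same value the plain virtual call would.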

CmonDev12y ago

I think there is usually something else to optimize before this becomes a problem. And if it becomes a problem you need a low-level language anyway?

MichaelGG12y ago

But in cases of needless virtual calls (doesn't Java default to virtual for some strange reason?) it may be a quick and easy win.

Additionally, it's not always so easy to drop to a low-level language. If your architecture is enormous and complicated, it might be totally unfeasible to change languages for hot parts.

pslam12y ago· 6 in thread

A big extra cost of virtual functions in the underlying CPU not mentioned in the article: they effectively create a branch target dependency on a pointer chase. Put another way:

1) The virtual function address lookup requires a load from an address which is itself loaded. If neither location is cached, this has the unavoidable latency of two uncached memory accesses. Even at best, this incurs two cached L1 accesses, which is about 8-16 cycles on modern architectures.

2) The function call itself is dependent on the final address loaded above. None of that can proceed until the branch address is known. If cached, all is good and the core correctly predicts execution of a large number of instructions. Best case, the core may still block predicted execution shortly after due to running out of non-dependent instructions, until it knows for sure the address it should have branched to. Worst case, the branch can't proceed until the two memory accesses complete.

In any case, nearly all of this is dwarfed by the cost to the compiled code itself: in most cases you can't inline, so simple transformations which could eliminate the function call altogether can't happen.

nkurz12y ago

> Best case, the core may still block predicted execution shortly after due to running out of non-dependent instructions, until it knows for sure the address it should have branched to. Worst case, the branch can't proceed until the two memory accesses complete.

You seem very familiar with these issues, but this doesn't sound right to me. Maybe I'm not understanding your terminology, but don't all modern processors support speculative execution? All instructions (including dependent) are executed, but the results are held in the Reorder Buffer until the branch choice is confirmed. If this is still a large issue, why don't Eli's measurements show it to be?

pslam12y ago

If the branch target is an address loaded from memory, and there is no cached result for the branch instruction, then there's no way it can predict which instruction to execute next. The target could be anywhere in valid memory.

The reason the measurements don't show it is the micro-benchmark will be predicting very well. In fact it's quite difficult to defeat prediction even for giant codebases, and you probably have bigger issues with L1 thrashing at that point. The more subtle problem is even with prediction, there's a (quite high) limit to the number of unretired speculated instructions. Again, a micro-benchmark won't show that up - you'd need a large function in the inner loop.

I'm making it sound like there's no cost to virtual functions in real applications, but it's there, usually measurable and every little adds up. If anything, I think a better reason to not simply spray "virtual" everywhere is it demonstrates that the author didn't understand the data structures they created.

MichaelGG12y ago

Can profile-guided optimization realise that a certain virtual function almost always resolves to a specific implementation and have a conditional check to inline or optimize when needed?

I'm not overly experienced with complicated OO systems, but sometimes it seems the OO is just an abstraction for convenience, but runtime will always take a particular path.

seanmcdirmid12y ago

Note you've just described inline caching [1]. It was a big research topic in the 90s; I'm sure this is pretty much a non-issue these days.

[1] http://en.wikipedia.org/wiki/Inline_caching

froydnj12y ago

Microsoft's PGO/LTCG implementation does just this. GCC can do something similar as well.

adamtj12y ago

My understanding is that good virtual machines basically do this sort of profiling and optimization at runtime and JIT compile specializations as necessary.

Does anybody know why JIT isn't done in classically AOT compilers? Is JIT overhead generally higher than cost savings of the optimizations?

alextingle12y ago· 5 in thread

    for (unsigned i = 0; i < N; ++i) {
      for (unsigned j = 0; j < i; ++j) {
        obj->tick(j);
      }
    }

I wouldn't go quite so far as to say that benchmarks with tight inner loops like this are completely useless, but they are nearly so.

The author is clearly aware that the real world of performance is much bigger & more complex than his simple Petri dish. Credit to him for mentioning that. It's also really refreshing to see him analysing the optimised assembly.

The trouble with this approach is that it's tempting to draw simple conclusions. In this case, you might be tempted to conclude "CRTP always faster than virtual dispatch", when the truth is likely to be much more situation dependent.

I have seen a biggish project go through a lot of effort to switch to CRTP, only to see a negligible performance impact.

eliben12y ago

And I have seen projects whose performance was crippled by layers upon layers of endless virtual calls. YMMV ;-)

army12y ago

Agreed, for almost all code it doesn't matter, but for the remaining small fraction it's worth thinking about these things. It sounds pretty insane to go with a blanket approach of removing virtual calls throughout an entire codebase without understanding which ones are the problematic ones. Especially since some ways of solving the problem could potentially lead to other problems like increased compiled code size.

I've seen plenty of software (especially systems software) that does spend much of its time in tight inner loops. Pulling out all the optimization stops there can give measurable gains. I've personally seen measurable gains on real applications from tricks like reordering branches so that the more predictable branches go first.

alextingle12y ago

Yes, of course. But the real challenge is to determine that it's your virtual indirection that's causing the problem. Once you know what's causing the problem, fixing it is (relatively) easy.

The danger is that benchmarks like this encourage naïve programmers to use complex constructs as a matter of course, when simpler would usually be better. "Premature optimisation is the root of all evil" and all that...

ksk12y ago

I have never heard of any project where virtual calls are the dominant factor in performance.

Are there any open source projects amongst your examples?

banachtarski12y ago

Also, CRTP prevents you from storing all derived objects in a single container, since the underlying types are now heterogeneous. There are also more restrictions present in terms of slicing, casting, among other things that render CRTP a poor choice in many situations.
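
The container restriction is easy to see in code. A minimal sketch, with all type names hypothetical: each CRTP-derived class gets its own unrelated base type, so there is no single pointer type a container could hold, whereas classes sharing one virtual base can sit side by side.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>
#include <vector>

// CRTP: the base is a template, so every derived class has a distinct base.
template <typename Derived>
struct TickerCRTP {
    unsigned tick(unsigned n) { return static_cast<Derived*>(this)->tick_impl(n); }
};
struct DoubleTicker : TickerCRTP<DoubleTicker> {
    unsigned tick_impl(unsigned n) { return 2 * n; }
};
struct TripleTicker : TickerCRTP<TripleTicker> {
    unsigned tick_impl(unsigned n) { return 3 * n; }
};

// The two CRTP bases are different types, so no common pointer type exists.
static_assert(!std::is_same<TickerCRTP<DoubleTicker>, TickerCRTP<TripleTicker>>::value,
              "CRTP bases are distinct types");

// Virtual dispatch: one common base, so one container can hold both.
struct Ticker {
    virtual ~Ticker() = default;
    virtual unsigned tick(unsigned n) = 0;
};
struct VDouble : Ticker { unsigned tick(unsigned n) override { return 2 * n; } };
struct VTriple : Ticker { unsigned tick(unsigned n) override { return 3 * n; } };

// Stores both kinds of object in a single vector and calls each virtually.
inline unsigned tick_all(unsigned n) {
    std::vector<std::unique_ptr<Ticker>> items;
    items.push_back(std::unique_ptr<Ticker>(new VDouble));
    items.push_back(std::unique_ptr<Ticker>(new VTriple));
    unsigned total = 0;
    for (auto& item : items) total += item->tick(n);
    return total;
}
```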

In the end, it's just another tool which is the right one in particular circumstances, and the wrong one in all others.

Taniwha12y ago· 3 in thread

I worked on a serious x86 clone once - we took a lot of real-world traces and ran them through our various microarchitectures to see how they would fly. Dynamic C++ dispatch was interesting: normally you expect something like

   mov r1, n(bp) ; get vtable
   mov r2, n(r1) ; get method pointer
   call (r2)     ; call

that's a really bad pipe break: a double indirect load and a call - but branch prediction may be your friend ...

However some of the code we saw (I think it came from a Borland compiler)

   mov r1, n(bp) ; get vtable
   push n(r1)    ; get method pointer
   ret           ; call

an extra memory write/read, but always caught in L1, and on the register-poor x86 it saves a register, right? ... but on most CPUs of the time you're screwed for branch prediction - CPUs had a return cache, a cheap way to predict the branch target of a return - by doing a return without a call you've popped the return cache, leaving it in a bad state - EVERY return in an enclosing method is going to mispredict as well - the code will run, but slowly

mappu12y ago

I use push/ret idiom all the time to stdcall off the stack.. did not realise there was a return cache, that's very interesting.

Taniwha12y ago

depends on the CPU - but it's a relatively trivial thing to build (especially because unlike other caches it's a stack). On x86s a return nominally is ALWAYS a bad pipe bubble: a pop followed by an indirect jump - the pop gets resolved at the end of its micro-op and the jump wants to be resolved early on so as to start decoding the next instruction

In the end it can't hurt to generate a bad jump prediction off of the return cache, it's no worse than being idle - the effect of messing with the cache though can cause it to always fail so as a result you get no advantage from it

Taniwha12y ago

(I should add - it's an x86, you're really register poor - sometimes you do have to do stuff like that - but if you have a register "mov reg, a;jmp (reg)" is better than "push a;ret")

kbutler12y ago· 2 in thread

"If anything doesn’t feel right, or just to make (3) more careful, use low-level counters to make sure that the amount of instructions executed and other such details makes sense given (2)."

This is explicit support for confirmation bias.

See Feynman's discussion of measuring the charge of the electron in Cargo Cult Science:

"Why didn't they discover the new number was higher right away? It's a thing that scientists are ashamed of—this history—because it's apparent that people did things like this: When they got a number that was too high above Millikan's, they thought something must be wrong—and they would look for and find a reason why something might be wrong. When they got a number close to Millikan's value they didn't look so hard. And so they eliminated the numbers that were too far off, and did other things like that..."

http://neurotheory.columbia.edu/~ken/cargo_cult.html

nkurz12y ago

And as an alternative, would you suggest laboriously using low level counters to verify that every measurement you think is correct is indeed correct? Given finite resources, what's a better approach than concentrating on the apparent anomalous measurements? I'm not sure I see the parallel.

tezka12y ago

it was funny you felt the need to post your wisdom both here and under the actual post.

tomp12y ago· 2 in thread

Instead of devirtualization, a simpler optimization, which would also help in the dynamic case, is simple loop hoisting of the method pointer fetch. Instead of doing

    while(...) {
      (obj->vtable[0])(...)
    }

we could have

    void(*fn)(...) = obj->vtable[0]
    while(...) {
      fn(...)
    }

which would avoid two indirections per inner-loop iteration! Actually, I'm almost sure that is what LuaJIT would do, and many other high-level programming languages could perform this optimization as well. However, maybe C is too low-level to be able to do that, and I don't know about C++.
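
For reference, the hoisting in the pseudocode above can be made compilable with a hand-rolled vtable. This is only a sketch under stated assumptions: `Obj`, `Method`, `kVTable` and both `run_*` functions are hypothetical names, and the hoisted version is valid only because nothing inside the loop changes `obj->vtable`.

```cpp
#include <cassert>

struct Obj;
using Method = void (*)(Obj*, unsigned);

struct Obj {
    const Method* vtable;  // hypothetical explicit vtable, as in the pseudocode
    unsigned long sum;
};

inline void add_tick(Obj* self, unsigned j) { self->sum += j; }

static const Method kVTable[] = {&add_tick};

// Fetches the method pointer from the vtable on every iteration.
inline void run_unhoisted(Obj* obj, unsigned n) {
    for (unsigned j = 0; j < n; ++j) {
        (obj->vtable[0])(obj, j);
    }
}

// Fetches the method pointer once, before the loop. The call is still
// indirect, so the callee is not inlined.
inline void run_hoisted(Obj* obj, unsigned n) {
    Method fn = obj->vtable[0];
    for (unsigned j = 0; j < n; ++j) {
        fn(obj, j);
    }
}
```

Both loops compute the same result; the hoisted one just drops the two loads from the loop body while leaving the indirect call in place.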

eliben12y ago

That would save the indirection, but I hope the article shows that by far the biggest cost comes from the lack of inlining. The latter would not be solved by your function pointer.

jheriko12y ago

this is not about C... C++ certainly. C has no such thing as a virtual function or dynamic dispatch (unless you implement it yourself).

jheriko12y ago· 2 in thread

it's interesting to see a breakdown of this - especially using modern compilers on the intel platform.

did you try the intel compiler? for raw low level optimisation it sometimes massively outperforms the ms, gcc or clang versions...

i'd imagine these problems are worse on ARM chips, and dynamic dispatch is even less effective there - certainly on PPC architectures I've seen much worse performance than on similarly powered Intels in precisely this situation. the caches are smaller and slower...

i'm not 100% sure but i think i've seen virtual calls 'devirtualised' by the MS compiler a couple of years ago... I might be thinking of something else though, it was a while back now. I was unpicking some CRTP mess in something that /was not performance critical in any way/...

pmjordan12y ago

You may be thinking of this: IIRC the standard recommends that compilers omit dynamic dispatch when the dynamic type is known at compile time - this essentially boils down to the case where a virtual method call follows creation of the object with 'new' or as an automatic variable. In my experience, this is commonly implemented correctly in compilers.

The other case where the dynamic type is known is in the constructor itself of course.
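
Both cases can be illustrated in a few lines. A minimal sketch with hypothetical `Base` and `Derived` types: inside the `Base` constructor the dynamic type is `Base`, so the virtual call resolves to `Base::name`; for the automatic variable the compiler can see the exact type and may bind the call directly, although whether it does is up to the implementation.

```cpp
#include <cassert>
#include <string>

struct Base {
    std::string seen_in_ctor;
    // During construction of the Base subobject the dynamic type is Base,
    // so this call resolves to Base::name even when building a Derived.
    Base() { seen_in_ctor = name(); }
    virtual ~Base() = default;
    virtual std::string name() const { return "Base"; }
};

struct Derived : Base {
    std::string name() const override { return "Derived"; }
};

inline std::string name_of_local() {
    Derived d;        // automatic variable: exact type visible at compile time
    return d.name();  // the compiler may call Derived::name without a vtable lookup
}
```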

eliasmacpherson12y ago

thanks again for your comment the other day about memcpy(), I am after finding a deep and rich seam of optimisation out of it!

cma12y ago· 2 in thread

I'd like to see a comparison of calling a dynamically linked function call vs a non-dynamically linked virtual call.

Dynamic linking has more indirection than you might expect because the function addresses can't always just be put at the call site during the library load (the places where you would want to write the address can be in code that is read-only mmapped to aid in sharing memory between processes and to avoid loading unused stuff from disk).

zwieback12y ago

In an ideal world the OS could still replace the call sites with straight calls to the loaded library, circumventing a jump table altogether. I don't remember what this is called, maybe something like a thunk, but I've seen it happen in the debugger where the first call causes a fault which rewrites the call site with the target address and subsequent calls are straight to the lib. This can work even if the chunk of code containing the call sites is shared and readonly, as long as the OS can override that.

pmjordan12y ago

The calls into jump tables are generally static, so the jump table itself can be prefetched. The jump table code is then a regular function pointer call, which is also monomorphic and so can be reliably predicted. I'd expect the impact to be small compared to a regular monomorphic function pointer call.

vicaya12y ago· 2 in thread

This _could_ be another case of premature optimization, as gcc 4.9+ could automagically devirtualize non-overridden virtual functions. icc could do that for years.

nkurz12y ago

That's not the way the phrase 'premature optimization' is usually used. Usually, it means spending time optimizing something that is not a limiting factor, or that otherwise will not make a difference in the final result. Keeping your code simple in the hope that eventually it will become fast is something else, probably falling closer to 'Sufficiently Smart Compiler' http://c2.com/cgi/wiki?SufficientlySmartCompiler.

pmjordan12y ago

Presumably this needs to be done at link-time? (And you'd have to disable it if you're planning to load code dynamically)

nly12y ago· 1 in thread

When you think you can use CRTP instead of virtual dispatch in your program, you didn't need virtual dispatch to begin with... you needed a generic algorithm to operate over your object classes. That's exactly what run_crtp() is, the CRTPInterface class is completely redundant except that it provides some degree of compile-time concept checking (which we'll hopefully get in C++17)

Virtual dispatch is useful for type erasure, when using abstract types from plugins, DLLs or generally "somebody else's code". IMHO, the valid use cases within a standalone program are actually fairly small.
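
As a rough illustration of the generic-algorithm point, the benchmark loop can be written as a plain function template with no CRTP base at all. `PlainCounter` and `run_generic` are hypothetical names, not the article's code; any type with a suitable `tick(unsigned)` member would work, and a type without one simply fails to compile at the call.

```cpp
#include <cassert>

// Hypothetical object class: no base class, no virtual functions.
struct PlainCounter {
    unsigned long total = 0;
    void tick(unsigned n) { total += n; }
};

// Generic runner: the call to tick() is resolved statically for each T,
// so it can be inlined just as in the CRTP version.
template <typename T>
void run_generic(T& obj, unsigned n) {
    for (unsigned i = 0; i < n; ++i) {
        for (unsigned j = 0; j < i; ++j) {
            obj.tick(j);
        }
    }
}
```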

jamesaguilar12y ago

Unit testing is my #1 use for virtual functions. "somebody else's code" a.k.a. standard ML modules is a distant second.

berkut12y ago· 1 in thread

I've done benchmarks on this fairly recently, and with the functions actually doing a lot of work (ray intersection for a raytracer), I saw practically no difference between CRTP and Virtual Functions:

http://imagine-rt.blogspot.co.uk/2013/08/c-virtual-function-...

And this was with billions of calls to the functions...

blt12y ago

Yes, the penalty is most glaring for calls that do a tiny amount of work. Imagine if

  String.charAt(int index)

was a virtual call inside of strlen().

gjm1112y ago· 1 in thread

So he found that dynamic dispatch was a lot more expensive. Fair enough and not very surprising. But let's quantify it a bit in absolute terms. The dynamic version of the code took 1.25s to run, during which time it performed approximately 8 x 10^8 virtual function calls. That translates to a cost per call of 1.5 nanoseconds.

From which my takeaway would be: In inner-loopy code for which an extra nanosecond or so per call is critical, you should avoid virtual function calls. For anything else, don't worry about it.
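
The arithmetic behind that figure can be checked directly. In the sketch below, N = 40000 is an assumption chosen to match the "approximately 8 x 10^8" call count for the nested loop quoted elsewhere in the thread (N(N-1)/2 inner iterations); the 1.25 s runtime is the number given above. Note that dividing total runtime by the call count gives the time per loop iteration, which includes everything the loop does and not only the dispatch.

```cpp
#include <cassert>
#include <cstdint>

// Number of inner-loop iterations for
//   for (i = 0; i < n; ++i) for (j = 0; j < i; ++j) ...
constexpr std::uint64_t inner_calls(std::uint64_t n) { return n * (n - 1) / 2; }

// Assumed loop bound (hypothetical; picked so the count is about 8 x 10^8).
constexpr std::uint64_t kAssumedN = 40000;

// Average time per call in nanoseconds, given total seconds and call count.
inline double ns_per_call(double seconds, std::uint64_t calls) {
    return seconds * 1e9 / static_cast<double>(calls);
}
```

With these inputs the result is a little over 1.5 ns per call, consistent with the estimate above.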

MichaelGG12y ago

1.5 nanoseconds per call in the best case. In some huge monstrosity where you've got to go chase down object headers not in the cache, things may be quite different.

vinkelhake12y ago

This is a nice article and props for including and dissecting generated assembler!

A key thing here is that inlining is what enables zero-cost abstractions in C++. A virtual call is slower than a regular call, but the main problem is that it builds a barrier that effectively stops inlining.

It'll be interesting to see how devirtualization in GCC will do for real world programs.

namuol12y ago

Observation: the intricacies of our technologies are growing to such complexity that analysis of the things we once had a direct hand in the design of plays out much like the analysis of some sort of mysterious natural phenomenon.

simfoo12y ago

I really like the "Mandatory precaution about benchmarks", it's spot on
