http://www.azulsystems.com/blog/cliff/2010-04-08-inline-cach...
An old blog post by one of the CLR engineers[1] states:
"We don't inline across virtual calls. The reason for not doing this is that we don't know the final target of the call. We could potentially do better here (for example, if 99% of calls end up in the same target, you can generate code that does a check on the method table of the object the virtual call is going to execute on, if it's not the 99% case, you do a call, else you just execute the inlined code), but unlike the J language, most of the calls in the primary languages we support, are not virtual, so we're not forced to be so aggressive about optimizing this case."
I guess things haven't changed. My testing with the CLR indicates that for best performance, you should make sure your IL is already inlined. The CLR does much better with huge function bodies.
1: http://pastebin.com/98c7Bt7f 2: http://blogs.msdn.com/b/davidnotario/archive/2004/11/01/2503...
Additionally, it's not always so easy to drop to a low-level language. If your architecture is enormous and complicated, it might be totally unfeasible to change languages for hot parts.
1) The virtual function address lookup requires a load from an address which is itself loaded. If neither location is cached, this has the unavoidable latency of two uncached memory accesses. Even at best, this incurs two cached L1 accesses, which is about 8-16 cycles on modern architectures.
2) The function call itself is dependent on the final address loaded above. None of that can proceed until the branch address is known. If cached, all is good and the core correctly predicts execution of a large number of instructions. Best case, the core may still block predicted execution shortly after due to running out of non-dependent instructions, until it knows for sure the address it should have branched to. Worst case, the branch can't proceed until the two memory accesses access.
In any case, nearly all of this is dwarfed by the cost to the compiled code itself: in most cases you can't inline, so simple transformations which could eliminate the function call altogether can't happen.
You seem very familiar with these issues, but this doesn't sound right to me. Maybe I'm not understanding your terminology, but don't all modern processors support speculative execution? All instructions (including dependent) are executed, but the results are held in the Reorder Buffer until the branch choice is confirmed. If this is still a large issue, why don't Eli's measurements show it to be?
The reason the measurements don't show it is the micro-benchmark will be predicting very well. In fact it's quite difficult to defeat prediction even for giant codebases, and you probably have bigger issues with L1 thrashing at that point. The more subtle problem is even with prediction, there's a (quite high) limit to the number of unretired speculated instructions. Again, a micro-benchmark won't show that up - you'd need a large function in the inner loop.
I'm making it sound like there's no cost to virtual functions in real applications, but it's there, usually measurable and every little adds up. If anything, I think a better reason to not simply spray "virtual" everywhere is it demonstrates that the author didn't understand the data structures they created.
I'm not overly experienced with complicated OO systems, but sometimes it seems the OO is just an abstraction for convenience, but runtime will always take a particular path.
Does anybody know why JIT isn't done in classically AOT compilers? Is JIT overhead generally higher than cost savings of the optimizations?
for (unsigned i = 0; i < N; ++i) {
for (unsigned j = 0; j < i; ++j) {
obj->tick(j);
}
}
I wouldn't go quite so far as to say that benchmarks with tight inner loops like this are completely useless, but they are nearly so.The author is clearly aware that the real world of performance is much bigger & more complex than his simple Petri dish. Credit to him for mentioning that. It's also really refreshing to see him analysing the optimised assembly.
The trouble with this approach is that it's tempting to draw simple conclusions. In this case, you might be tempted to conclude "CRTP always faster than virtual dispatch", when the truth is likely to be much more situation dependent.
I have seen a biggish project go though a lot of effort to switch to CRTP, only to see a negligible performance impact.
I've seen plenty of software (especially systems software) that does spend much of it's time in tight inner loops. Pulling out all the optimization stops there can give measurable gains. I've personally seen measurable gains on real applications from tricks like reordering branches so that the more predictable branches go first.
The danger is that benchmarks like this encourage naïve programmers to use complex constructs as a matter of course, when simpler would usually be better. "Premature optimisation is the root of all evil" and all that...
Are there any open source projects amongst your examples?
In the end, it's just another tool which is the right one in particular circumstances, and the wrong one in all others.
mov r1, n(bp) ; get vtable
mov r2, n(r2) ; get method pointer
call (r2) ; call
that's a really bad pipe break a double indirect load and a call - but branch prediction may be your friend ...However some of the code we saw (I think it came from a Borland compiler)
mov r1, n(bp) ; get vtable
push n(r2) ; get method pointer
ret ; call
an extra memory write/read but always caught in L1 and on the register poor x86 it saves a register right> ... but on most CPUs of the time you're screwed for the branch prediction - CPUs had a return cache, a cheap way to predict the branch target of a return - by doing a return without a call you've popped the return cache leaving it in a bad state - EVERY return in an enclosing method is going to mispredict as well - the code will run, but slowlyIn the end it can't hurt to generate a bad jump prediction off of the return cache, it's no worse than being idle - the effect of messing with the cache though can cause it to always fail so as a result you get no advantage from it
This is explicit support for confirmation bias.
See Feynman's discussion of measuring the charge of the electron in Cargo Cult Science:
"Why didn't they discover the new number was higher right away? It's a thing that scientists are ashamed of—this history—because it's apparent that people did things like this: When they got a number that was too high above Millikan's, they thought something must be wrong—and they would look for and find a reason why something might be wrong. When they got a number close to Millikan's value they didn't look so hard. And so they eliminated the numbers that were too far off, and did other things like that..."
while(...) {
(obj->vtable[0])(...)
}
we could have void(*fn)(...) = obj->vtable[0]
while(...) {
fn(...)
}
which would avoid two redirections per inner loop! Actually, I'm almost sure that is what LuaJIT would do, and many other high-level programming languages could perform this optimization as well. However, maybe C is too low-level to be able to do that, and I don't know about C++.did you try the intel compiler? for raw low level optimisation it sometimes massively out performs the ms, gcc or clang versions...
i'd imagine these problems are worse on ARM chips, and dynamic dispatch is even less effective there - certainly on PPC architectures I've seen much worse performance than on similarly powered Intels in precisely this situation. the caches are less and slower...
i'm not 100% but i think i've seen virtual calls 'devirtualised' by the MS compiler a couple of years ago... I might be thinking of something else though, it was a while back now. I was unpicking some CRTP mess in something that /was not performance critical in anyway/...
The other case where the dynamic type is known is in the constructor itself of course.
Dynamic linking has more indirection than you might expect because the function addresses can't always just be put at the call site during the library load (the places where you would want to write the address can be in code that is read-only mmapped to aid in sharing memory between processes and to avoid loading unused stuff from disk).
Virtual dispatch is useful for type erasure, when using abstract types from plugins, DLLs or generally "somebody elses code". IMHO, the valid use cases within a standalone program are actually fairly small.
http://imagine-rt.blogspot.co.uk/2013/08/c-virtual-function-...
And this was with billions of calls to the functions...
String.charAt(int index)
was a virtual call inside of strlen().From which my takeaway would be: In inner-loopy code for which an extra nanosecond or so per call is critical, you should avoid virtual function calls. For anything else, don't worry about it.
A key thing here is that inlining is what enables zero-cost abstractions in C++. A virtual call is slower than a regular call, but the main problem is that it builds a barrier that effectively stops inlining.
It'll be interesting to see how devirtualization in GCC will do for real world programs.