@cmuratori: "Actually virtual function dispatch is a major problem on a number of platforms because it is extremely cache unfriendly."
I have no doubt that this is true, but it's not the whole story.
On any modern CPU, a virtual function call is within epsilon of a direct call under two assumptions:
1. The call site and vtable are in cache (which is true if the call site is commonly executed).
2. The vtable is the same every time.
If both of these hold, then any modern CPU will be able to predict the call. I encourage anyone who finds this hard to believe to test it and report back their findings.
That last condition seems unnecessarily restrictive, but it holds for a few common use cases. For example, if you have an OpenGL renderer and a Direct3D renderer, and you implement them in OO style as two implementations behind a common interface, then almost all of the time you will be calling only one of them for the life of the program.
A call through this interface will be almost as cheap as a standard call, and about as cheap as a call to a shared library.
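To make that concrete, here's a minimal sketch of the renderer case (the GLRenderer/D3DRenderer class names and drawFrame method are invented for illustration): the implementation is chosen once at startup, so the vtable loaded at the call site, and hence the call target, is the same on every call, and the indirect branch predictor learns it almost immediately.

#include <cstdio>

struct Renderer {
    virtual ~Renderer() {}
    virtual void drawFrame() = 0;
};

struct GLRenderer : Renderer {
    void drawFrame() override { std::puts("GL frame"); }
};

struct D3DRenderer : Renderer {
    void drawFrame() override { std::puts("D3D frame"); }
};

int main() {
    // The implementation is chosen once at startup; after that the vtable
    // (and hence the call target) never changes for the life of the program.
    Renderer *r = new GLRenderer();
    for (int frame = 0; frame < 10; ++frame) {
        r->drawFrame();   // same target every iteration, so easily predicted
    }
    delete r;
    return 0;
}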
Having said that, there are a few common situations where virtual dispatch performs poorly.
1. Uberstedt's example of the heterogeneous container is a case in point: in that situation, almost every call is mispredicted (there's a sketch of this after the list). Unfortunately, heterogeneous containers are the case that OOP people seem to love the most.
2. The phrase "modern CPU" is an important caveat. There are a surprisingly large number of important platforms out there whose CPUs do not do speculative or out-of-order execution, including most last-generation consoles and mobile devices. These platforms matter, and on them even the "good" case above won't perform well.
3. If the vtable isn't in cache, then performance dies along with it. However, we tend to ignore cases like this because of the 90/10 rule: if the vtable isn't in cache, then the call site presumably isn't in the 10% of the code that is executed 90% of the time. I wish that were always true, but I suspect it isn't in big, over-complicated game engines like Unreal.
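To illustrate point 1, here's a minimal sketch of the heterogeneous-container case (the Shape/Circle/Square names are made up for illustration): because the dynamic type changes from element to element, the single call site in the loop keeps switching targets, and the indirect branch predictor gets it wrong most of the time.

#include <cstdio>
#include <cstdlib>
#include <vector>

struct Shape {
    virtual ~Shape() {}
    virtual float area() const = 0;
};

struct Circle : Shape {
    float r = 1.0f;
    float area() const override { return 3.14159f * r * r; }
};

struct Square : Shape {
    float s = 1.0f;
    float area() const override { return s * s; }
};

int main() {
    std::vector<Shape *> shapes;
    for (int i = 0; i < 10000; ++i) {
        // Interleave the types so consecutive calls rarely share a target.
        if (std::rand() & 1) shapes.push_back(new Circle());
        else                 shapes.push_back(new Square());
    }

    float total = 0.0f;
    for (Shape *s : shapes) {
        total += s->area();   // one call site, constantly changing targets
    }
    std::printf("total area = %f\n", total);

    for (Shape *s : shapes) delete s;
    return 0;
}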
But hold on a moment: is this actually a problem specific to vtables? Suppose we translate Uberstedt's example into a switch statement instead:
for (int i = 0; i < count; ++i) {
    switch (structs[i].type_key) {
        case TYPE_1:
        {
            doSomethingStrange1();
            break;
        }
        case TYPE_2:
        {
            doSomethingStrange2();
            break;
        }
        // etc
    }
}
Haven't we just swapped an unpredictable call for an unpredictable branch? Yes, it's a bit more cache-friendly, and a mispredicted branch may be slightly cheaper than a mispredicted call, but it's still not cheap by any stretch.
Incidentally, interpreter writers are extremely familiar with this problem, because opcode dispatch is essentially the same thing. There are several ways to get around the problem, such as the switch-of-gotos method:
#define DISPATCH() \
    i = nextInstruction(); \
    switch (i) { case OP_ADD: goto L_ADD; case OP_SUB: goto L_SUB; /* etc */ }

void run()
{
    enum opcode i;
L_START:
    DISPATCH()
L_ADD:
    {
        value x = pop();
        value y = pop();
        push(y + x);
    }
    DISPATCH()
L_SUB:
    {
        value x = pop();
        value y = pop();
        push(y - x);
    }
    DISPATCH()
    // etc
}
This gives the CPU a better chance at prediction because there isn't just one hard-to-predict indirect branch any more; there's one per opcode handler, and some of them are easier to predict than others.
GCC's computed gotos are also a popular feature among interpreter writers. In high-performance interpreters, it's common to use different techniques on different platforms. It pays to go to some trouble to optimise this, because it's the interpreter's innermost loop.
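As a rough illustration of the computed-goto flavour, here's a minimal, self-contained sketch (it relies on the GCC/Clang labels-as-values extension, and the toy opcodes and stack are invented): each handler ends with its own indirect jump through a table of label addresses, which is roughly what the switch-of-gotos version compiles down to anyway.

#include <cstdio>

enum opcode { OP_PUSH1, OP_ADD, OP_SUB, OP_HALT };

int main()
{
    // A toy program: 1 1 + 1 -  =>  1
    const opcode program[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_PUSH1, OP_SUB, OP_HALT };
    const opcode *pc = program;

    int stack[16];
    int *sp = stack;   // points one past the top of the stack

    // One table entry per opcode; &&label takes the address of a label (GCC/Clang only).
    static void *dispatch_table[] = { &&L_PUSH1, &&L_ADD, &&L_SUB, &&L_HALT };

    // Each handler ends with its own indirect jump, so each jump site
    // accumulates its own branch-prediction history.
    #define DISPATCH() goto *dispatch_table[*pc++]

    DISPATCH();
L_PUSH1:
    *sp++ = 1;
    DISPATCH();
L_ADD:
    {
        int x = *--sp;
        int y = *--sp;
        *sp++ = y + x;
    }
    DISPATCH();
L_SUB:
    {
        int x = *--sp;
        int y = *--sp;
        *sp++ = y - x;
    }
    DISPATCH();
L_HALT:
    std::printf("result = %d\n", sp[-1]);
    return 0;

    #undef DISPATCH
}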
So while I have no doubt that Casey is 100% correct that virtual function dispatch is a performance problem on many platforms, I strongly suspect that in many cases it's hiding an algorithmic issue. (I have a rant about binary search which you'll probably all get to see some time.)
Implementing the same algorithm some other way may well have the same problems, but they are just better hidden. In that sense, Kladdehelvete has a good point, echoed by Casey in his "DOS wasn't good design" comment.
@skeeto: "The biggest performance issue, besides the interrupt itself, would probably be blowing the instruction pipeline when performing the dispatch."
I'm not so sure about that. Last time I checked, the biggest performance issue with calling an interrupt on a modern x86-class CPU was manipulating and switching the stack. Things may have changed since I last checked, of course.