@cmuratori: "Actually virtual function dispatch is a major problem on a number of platforms because it is extremely cache unfriendly."
I have no doubt that this is true, but it's not the whole story.
On any modern CPU, a virtual function call is within epsilon of a direct call under two assumptions:
1. The call site and vtable are in cache (which is true if the call site is commonly executed).
2. The vtable is the same every time.
If both of these hold, then any modern CPU will be able to predict the call. I encourage anyone who finds this hard to believe to test it and report back their findings.
That last condition seems unnecessarily restrictive, but it holds for a few common use cases. For example, if you have an OpenGL renderer and a Direct3D renderer, and you implement them in OO style as two implementations behind a common interface, then almost all of the time you will be calling only one of them for the life of the program.
A call through this interface will be almost as cheap as a standard call, and about as cheap as a call to a shared library.
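To make that concrete, here's a minimal sketch of the renderer case (the GLRenderer/D3DRenderer class names and drawFrame method are invented for illustration): the implementation is chosen once at startup, so the vtable loaded at the call site, and hence the call target, is the same on every call, and the indirect branch predictor learns it almost immediately.

#include <cstdio>

struct Renderer {
    virtual ~Renderer() {}
    virtual void drawFrame() = 0;
};

struct GLRenderer : Renderer {
    void drawFrame() override { std::puts("GL frame"); }
};

struct D3DRenderer : Renderer {
    void drawFrame() override { std::puts("D3D frame"); }
};

int main() {
    // The implementation is chosen once at startup; after that the vtable
    // (and hence the call target) never changes for the life of the program.
    Renderer *r = new GLRenderer();
    for (int frame = 0; frame < 10; ++frame) {
        r->drawFrame();   // same target every iteration, so easily predicted
    }
    delete r;
    return 0;
}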
Having said that, there are a few common situations where virtual dispatch performs poorly.
1. Uberstedt's example of the heterogeneous container is a case in point: in that situation, almost every call is mispredicted (there's a sketch of this after the list). Unfortunately, heterogeneous containers are the case that OOP people seem to love the most.
2. The phrase "modern CPU" is an important caveat. There are a surprisingly large number of important platforms out there whose CPUs do not do speculative or out-of-order execution, including most last-generation consoles and mobile devices. These platforms matter, and on them even the "good" case above won't perform well.
3. If the vtable isn't in cache, then performance dies along with it. However, we tend to ignore cases like this because of the 90/10 rule: if the vtable isn't in cache, then the call site presumably isn't in the 10% of the code that is executed 90% of the time. I wish that were always true, but I suspect it isn't in big, over-complicated game engines like Unreal.
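To illustrate point 1, here's a minimal sketch of the heterogeneous-container case (the Shape/Circle/Square names are made up for illustration): because the dynamic type changes from element to element, the single call site in the loop keeps switching targets, and the indirect branch predictor gets it wrong most of the time.

#include <cstdio>
#include <cstdlib>
#include <vector>

struct Shape {
    virtual ~Shape() {}
    virtual float area() const = 0;
};

struct Circle : Shape {
    float r = 1.0f;
    float area() const override { return 3.14159f * r * r; }
};

struct Square : Shape {
    float s = 1.0f;
    float area() const override { return s * s; }
};

int main() {
    std::vector<Shape *> shapes;
    for (int i = 0; i < 10000; ++i) {
        // Interleave the types so consecutive calls rarely share a target.
        if (std::rand() & 1) shapes.push_back(new Circle());
        else                 shapes.push_back(new Square());
    }

    float total = 0.0f;
    for (Shape *s : shapes) {
        total += s->area();   // one call site, constantly changing targets
    }
    std::printf("total area = %f\n", total);

    for (Shape *s : shapes) delete s;
    return 0;
}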
But hold on a moment: is this actually a problem specific to vtables? Suppose we translate Uberstedt's example into a switch statement instead:
for (int i = 0; i < count; ++i) {
    switch (structs[i].type_key) {
        case TYPE_1:
        {
            doSomethingStrange1();
            break;
        }
        case TYPE_2:
        {
            doSomethingStrange2();
            break;
        }
        // etc
    }
}
Haven't we just swapped an unpredictable call for an unpredictable branch? Yes, it's a bit more cache-friendly, and a mispredicted branch may be slightly cheaper than a mispredicted call, but it's still not cheap by any stretch.
Incidentally, interpreter writers are extremely familiar with this problem, because opcode dispatch is essentially the same thing. There are several ways to get around the problem, such as the switch-of-gotos method:
#define DISPATCH() \
    i = nextInstruction(); \
    switch (i) { case OP_ADD: goto L_ADD; case OP_SUB: goto L_SUB; /* etc */ }

void run()
{
    enum opcode i;
L_START:
    DISPATCH()
L_ADD:
    {
        value x = pop();
        value y = pop();
        push(y + x);
    }
    DISPATCH()
L_SUB:
    {
        value x = pop();
        value y = pop();
        push(y - x);
    }
    DISPATCH()
    // etc
}
This gives the CPU a better chance at prediction because there isn't just one hard-to-predict indirect branch any more; there's one per opcode handler, and some of them are easier to predict than others.
GCC's computed gotos are also a popular feature among interpreter writers. In high-performance interpreters, it's common to use different techniques on different platforms. It pays to go to some trouble to optimise this, because it's the interpreter's innermost loop.
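As a rough illustration of the computed-goto flavour, here's a minimal, self-contained sketch (it relies on the GCC/Clang labels-as-values extension, and the toy opcodes and stack are invented): each handler ends with its own indirect jump through a table of label addresses, which is roughly what the switch-of-gotos version compiles down to anyway.

#include <cstdio>

enum opcode { OP_PUSH1, OP_ADD, OP_SUB, OP_HALT };

int main()
{
    // A toy program: 1 1 + 1 -  =>  1
    const opcode program[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_PUSH1, OP_SUB, OP_HALT };
    const opcode *pc = program;

    int stack[16];
    int *sp = stack;   // points one past the top of the stack

    // One table entry per opcode; &&label takes the address of a label (GCC/Clang only).
    static void *dispatch_table[] = { &&L_PUSH1, &&L_ADD, &&L_SUB, &&L_HALT };

    // Each handler ends with its own indirect jump, so each jump site
    // accumulates its own branch-prediction history.
    #define DISPATCH() goto *dispatch_table[*pc++]

    DISPATCH();
L_PUSH1:
    *sp++ = 1;
    DISPATCH();
L_ADD:
    {
        int x = *--sp;
        int y = *--sp;
        *sp++ = y + x;
    }
    DISPATCH();
L_SUB:
    {
        int x = *--sp;
        int y = *--sp;
        *sp++ = y - x;
    }
    DISPATCH();
L_HALT:
    std::printf("result = %d\n", sp[-1]);
    return 0;

    #undef DISPATCH
}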
So while I have no doubt that Casey is 100% correct that virtual function dispatch is a performance problem on many platforms, I strongly suspect that in many cases it's hiding an algorithmic issue. (I have a rant about binary search which you'll probably all get to see some time.)
Implementing the same algorithm some other way may well have the same problems, but they are just better hidden. In that sense, Kladdehelvete has a good point, echoed by Casey in his "DOS wasn't good design" comment.
@skeeto: "The biggest performance issue, besides the interrupt itself, would probably be blowing the instruction pipeline when performing the dispatch."
I'm not so sure about that. Last time I checked, the biggest performance issue with calling an interrupt on a modern x86-class CPU was manipulating and switching the stack. Things may have changed since I last checked, of course.