Handmade Hero»Forums»Code
popcorn
70 posts
Explanation for the inline function slow down?
Not sure if this is relevant or if it explains the reason

"The problem with inline functions is that they replicate the function body every time they are called. Each use of an inline function thus makes the kernel executable bigger. A bigger executable means more cache misses, and that slows things down."

from http://lwn.net/Articles/82495/

It looks like it inlines the loop interior two times, once for U and once for V? Is that right?
Mārtiņš Možeiko
2559 posts / 2 projects
In this case inlining is not the problem, in the sense that it "replicates" the function body. The compiler's optimizer should still optimize the replicated code.

Look at this example:
inline int inc(int x)
{
  return x + 1;
}

...

int a = ...;
a = inc(a);
a = inc(a);
a = inc(a);
a = inc(a);
a = inc(a);
a = inc(a);

When inlining the inc function, you don't expect the compiler to just replicate the function body and stop:
int a = ...;
a = a + 1;
a = a + 1;
a = a + 1;
a = a + 1;
a = a + 1;
a = a + 1;

You expect the compiler to optimize the inlined code:
int a = ...;
a = a + 6;

So inlining doesn't increase code size in this example. It actually reduces it, because adding the constant 6 to an integer variable is much shorter machine code than calling a function six times.

And that is what all the modern compilers I am aware of (msvc, clang, gcc) will do in this example.

As for why it happened in HH - it's because MSVC is not very good at optimizing. Simple as that. I compiled yesterday's code with clang, and it produced approximately the same code that MSVC produces after Casey manually inlined the functions. So clang with function calls (yesterday's code) generates more or less the same code (performance-wise) as MSVC after the manual inlining (today's code).

Why is clang so much better when MSVC isn't? I don't know. I do know that a lot of people are working on the clang optimizer so it can optimize these kinds of situations as much as possible. For more information on this topic, see the presentation "Zero-Cost Abstractions and Future Directions for Modern Optimizing Compilers" by Chandler Carruth:
* slides - http://llvm.org/devmtg/2012-11/Carruth-OptimizingAbstractions.pdf
* video - http://llvm.org/devmtg/2012-11/vi...arruth-OptimizingAbstractions.mp4

In the general case, of course, you don't want the compiler to inline large functions too much, because of the limited size of the CPU instruction cache. And compilers know that.
Dghelneshi
3 posts
Fabian Giesen's explanation makes a lot of sense:
https://twitter.com/rygorous/status/596949831517507584

In the slow version, the two values were stored in a struct. The X part was dependent on I (the loop variable) and the Y part wasn't. Unfortunately, it seems like MSVC moves loop-invariant stuff out of the loop before expanding the struct to two separate values. It cannot move the struct, as part of it depends on I.

In the fast version we already have the values separated into two floats, so MSVC moves the Y part out of the loop, as it should.

It is essentially an unfortunate ordering of the optimization routines in the compiler.
popcorn
70 posts
That's good to hear, because at my work I use inline and gcc a lot, and I don't want to go back and redo everything just to get the fastest speed possible.

Thanks! I'll watch the video.
Melesie
1 post
A few weeks back I did some testing based on Chandler Carruth's talks on abstractions, which recommend pass-by-value, to see how well that advice applies to MSVC. It turns out MSVC is much better at optimizing reference passing than value passing for structures. It was a contrived example that gcc and clang managed to vectorize automatically. VC++ only managed to copy pairs of floats using double instructions (at least, that's how I explain the use of movsd when passing 4-float vectors; not sure if that was the point), but the copies stayed there even after the functions had been inlined, making the value-passing version about three times slower than the reference-passing one, if I remember right. I suspect the reference-passing version was held up by the cache, though, since the SIMD version took the same number of cycles.