Casey's breakdown of the keyword "static"

Mārtiņš Možeiko

#25851

January 27, 2022

Inlining is about the extra optimizations the compiler can potentially do. It often lets the compiler remove large pieces of unused code or eliminate common subexpressions. As a result less code is generated, which means more of it fits into the CPU instruction cache and there are fewer bytes to execute, so the code runs faster.

For example: https://godbolt.org/z/6T1KebExb

On the right side, where inlining is forbidden (or just not possible in general), the compiler generates much more code. On the left side, where everything is inlined, the whole thing collapses to a single instruction, mov eax, 450, which is far more efficient.
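To make the effect concrete, here is a minimal sketch of the kind of code where this happens. It is not the code behind the godbolt link; the names sum_to and compute are made up for illustration, and the constants were picked so the result comes out to 450.

```c
/* Hypothetical helper, not taken from the linked example.
   'static' gives it internal linkage, so the compiler sees every caller. */
static int sum_to(int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += i;
    return total;
}

/* With optimizations on, a compiler will typically inline sum_to(10),
   evaluate the loop at compile time (0 + 1 + ... + 9 = 45) and leave
   just a constant load for this function. */
int compute(void)
{
    return 10 * sum_to(10); /* 10 * 45 = 450 */
}
```

If you paste this into Compiler Explorer with -O2 and then mark sum_to with a GCC/Clang attribute such as __attribute__((noinline)), you should be able to compare the two outputs yourself. The exact assembly depends on compiler and version.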

If you're asking how the code cache works, it works the same way as the data cache. Every time the CPU wants to read the next instruction bytes to execute, they must be in the L1 cache. If they aren't, it stalls while they are fetched from a slower cache level or from main memory. That's why it prefetches them ahead of time - same way how data prefetching works. Nothing different there.

Edited by Mārtiņš Možeiko on January 27, 2022, 6:53pm

Replying to da447m (#25850)

da447m

#25852

January 27, 2022

Thanks for the explanation. Basically that inline calculated the whole loop at compile time and replaced with the result? What if that's not possible to know at compile time? Does inline even do any good in that case?

That's why it prefetches them ahead of time - same way how data prefetching works. Nothing different there.

In the case of data, it is conceptually clearer how to optimize for the cache, e.g. by not bloating structs with a lot of stuff that can be processed separately, etc.

How do you reason about doing the same with functions? I'm aware this is a hard question and kinda unrelated to hmh. Is this even a thing in Data Oriented Design? Or is "hey, try to make your functions compact" really the best we can do?

Is Casey's code using some sort of data oriented design that is noticeable? Or is he for now simply getting logic done first?

Edited by da447m on January 27, 2022, 8:20pm

Replying to mmozeiko (#25851)

Mārtiņš Možeiko

#25853

January 27, 2022

Basically that inline calculated the whole loop at compile time and replaced with the result?

yes, exactly

What if that's not possible to know at compile time? Does inline even do any good in that case?

In that case inlining will often not help and can even hurt the performance of your code, because every inlined copy makes the code bigger and puts more pressure on the instruction cache. That's why forcing all code to be inlined is bad for performance. The compiler does its best to figure out what to inline and what not to, but it cannot always make the correct decision. To help with this there is such a thing as "profile-guided optimization" (PGO). First you compile the code with the compiler inserting special instrumentation, then you run the binary on real data, and the instrumentation records how often, where, and how functions are called. Then you compile again and ask the compiler to use this information to decide what to inline and what not. Because the compiler now knows which code is called and how often, it can make better decisions: inline frequently called code, don't inline rarely called code, and so on.
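As a rough sketch of that workflow, here is a small piece of code with a common path and a rare path, plus the GCC flags for a PGO build in a comment. The function names, the file name app.c and the input file are all hypothetical. The flags shown are GCC's; other compilers have their own equivalents.

```c
#include <stddef.h>

/* Rare path in this made-up workload: negative inputs. */
static long handle_negative(int v)
{
    return -(long)v;
}

/* Common path in this made-up workload. */
static long handle_regular(int v)
{
    return (long)v * 2;
}

/* The compiler cannot know from the source alone which branch is hot.
   A profile from a real run can tell it, and it can use that when
   deciding what to inline and how to lay out the code. */
long process(const int *values, size_t count)
{
    long total = 0;
    for (size_t i = 0; i < count; i++) {
        if (values[i] < 0)
            total += handle_negative(values[i]);
        else
            total += handle_regular(values[i]);
    }
    return total;
}

/*
 * Possible GCC workflow, assuming this lives in a hypothetical app.c
 * together with a main() that feeds process() representative data:
 *
 *   gcc -O2 -fprofile-generate app.c -o app   (instrumented build)
 *   ./app < typical_input.txt                 (run, writes profile data)
 *   gcc -O2 -fprofile-use app.c -o app        (rebuild using the profile)
 */
```

The profile is only as good as the data you run the instrumented binary on, so the training run should look like real usage.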

Usually you don't think much about inlining, because you have very little control over it. Just mark as many functions as you can "static" (so the compiler knows nobody from other TUs will use them) and that's it.
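A minimal sketch of what "static" changes here, with made-up function names. Both helpers do the same thing and only the linkage differs.

```c
/* Internal linkage: only this translation unit can call it. Once every
   call in this file has been inlined, the compiler is free to drop the
   standalone copy entirely. */
static int square_internal(int x)
{
    return x * x;
}

/* External linkage: some other translation unit might call it, so the
   compiler has to emit a standalone copy even if it also inlines the
   calls it can see here. */
int square_external(int x)
{
    return x * x;
}

int sum_of_squares(int a, int b)
{
    return square_internal(a) + square_external(b);
}
```

In the generated assembly you would typically still find a square_external symbol but no square_internal one, though that depends on the compiler and the optimization level.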

But if you're serious about having the compiler optimize inlining automatically, you should do PGO builds. Examples of large projects that do this are Chrome and Firefox. Chrome reports that this increases performance by up to something like 10%: https://blog.chromium.org/2020/08/chrome-just-got-faster-with-profile.html

Replying to da447m (#25852)