Lesson: Das keyboards are horrible.
Recap of last episode and today's agenda
Prep work for getting pre-optimization vs post-optimization cycle counts
Add cycle counting to DrawRectangleSlowly
... ~350 vs ~50 cycles per pixel!
How long *should* it take to fill each pixel? Let's count up all the intrinsics and their throughputs...
... How can we automate this counting process?
Answer: Override the intrinsics with macros that add to some counter variables
Oops, there's still some SIMDizing left to do here...
Use _mm_add_ps to increment PixelPx by 4 instead of scalar adds (2-3 cycles better)
dx and dy can be baked into PixelPx and PixelPy (2 cycles better)
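As a rough illustration of the two changes above, here is a minimal sketch in C. It assumes an x86/x64 target with SSE, and the helper name FillPixelPx and its parameters are hypothetical, not taken from the episode's code. The origin offset is folded into the starting vector once, and a single _mm_add_ps then advances all four lanes per iteration.

```c
#include <xmmintrin.h> // SSE intrinsics; assumes an x86/x64 target

// Hypothetical helper: writes the X coordinates (relative to OriginX) of
// Count pixels starting at MinX, four at a time.
// Assumption: Count is a multiple of 4.
static void FillPixelPx(float *Out, int Count, float MinX, float OriginX)
{
    // Lanes hold x, x+1, x+2, x+3 with the origin already subtracted,
    // so no per-pixel "dx" subtraction is needed inside the loop.
    // _mm_set_ps takes its arguments from the highest lane to the lowest.
    __m128 PixelPx = _mm_set_ps(MinX + 3.0f - OriginX,
                                MinX + 2.0f - OriginX,
                                MinX + 1.0f - OriginX,
                                MinX + 0.0f - OriginX);
    __m128 Four_4x = _mm_set1_ps(4.0f);

    for(int X = 0; X < Count; X += 4)
    {
        _mm_storeu_ps(Out + X, PixelPx);
        // One SIMD add advances all four lanes, replacing four scalar adds.
        PixelPx = _mm_add_ps(PixelPx, Four_4x);
    }
}
```

Calling FillPixelPx(Out, 8, 10.0f, 2.5f) on an 8-float array would leave Out[0] at 7.5 and Out[7] at 14.5.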
Should we loft PixelPx and PixelPy axis multiply/add calculation out of the inner loop?
Maybe loft just the multiplies but not the add? Hmm...
... try lofting the multiplications. (1-2 cycles worse)
Note: Texture fetches can't be done in SIMD
Fabian on why _mm_maskmoveu_si128 is so slow. Don't use it! It bypasses the cache.
Adding a #define for each intrinsic to count operations (_mm_add_ps, _mm_mul_ps, etc)
Start setting up the intrinsic #defines to count operations
Preprocessor cleverness that handles the fact that intrinsics often take other intrinsics as params
Define load/store to nothing
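The counting trick described in the last few entries can be sketched roughly as follows. This is not the code from the stream: the fake_m128 type, the counter names, and CountOnePixelGroup are all hypothetical, and the sketch deliberately leaves out the SSE header so the macros do not clash with the real intrinsics. Each macro still evaluates its arguments, which is one way to handle intrinsics nested inside other intrinsics.

```c
// Stand-in for __m128 so the sketch compiles without any SSE header.
// Assumption: while counting, only the number of operations matters,
// not their results.
typedef int fake_m128;

static int Count_mm_add_ps;
static int Count_mm_mul_ps;
static int Count_mm_sqrt_ps;

// Each macro bumps its counter and still evaluates its arguments, so
// intrinsics nested inside the arguments are counted as well.
#define _mm_add_ps(a, b) (++Count_mm_add_ps, (void)(a), (void)(b), (fake_m128)0)
#define _mm_mul_ps(a, b) (++Count_mm_mul_ps, (void)(a), (void)(b), (fake_m128)0)
#define _mm_sqrt_ps(a)   (++Count_mm_sqrt_ps, (void)(a), (fake_m128)0)

// Loads and stores are not counted as ALU work, so they add to no counter.
#define _mm_loadu_ps(p)     ((void)(p), (fake_m128)0)
#define _mm_storeu_ps(p, v) ((void)(p), (void)(v))

// Hypothetical stand-in for one pass through the pixel loop body.
static void CountOnePixelGroup(void)
{
    fake_m128 A = 0;
    fake_m128 B = 0;
    float Dest[4] = {0};

    // sqrt(A*A + B*B): two multiplies, one add, one sqrt.
    fake_m128 R = _mm_sqrt_ps(_mm_add_ps(_mm_mul_ps(A, A), _mm_mul_ps(B, B)));
    _mm_storeu_ps(Dest, R);
}
```

After one call to CountOnePixelGroup, the counters read 2 multiplies, 1 add, and 1 sqrt, which can be checked by hand against the expression.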
Mini-rant about the compiler not doing instruction/intrinsic instrumentation automatically
We've got counts!
Double check that counts make sense
Multiply counts by throughputs to get total latency estimate
_mm_castps_si128 latency is difficult to know.
Looking up the processor core type in Windows
_mm_and_ps and bitwise ops are 1/3 cycle on Nehalem
Use a macro to sum up the latency*counts to get a rough throughput total
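A minimal sketch of that summing step, assuming the per-intrinsic counts have already been gathered. The macro name ACCUMULATE_OP, the function EstimateCycles, and the 1.0 cycles-per-op figures for add and multiply are placeholders for illustration only; the 1/3 cycle figure for bitwise ops is the one quoted in these notes for Nehalem.

```c
// Adds Count * CyclesPerOp to a running total (hypothetical helper macro).
#define ACCUMULATE_OP(Total, Count, CyclesPerOp) \
    ((Total) += (double)(Count) * (CyclesPerOp))

// Returns a rough cycle estimate for one group of four pixels.
// Assumption: the 1.0 values are placeholders; look up the real
// throughputs for your own processor before trusting the result.
static double EstimateCycles(int AddCount, int MulCount, int AndCount)
{
    double Total = 0.0;

    ACCUMULATE_OP(Total, AddCount, 1.0);       // placeholder throughput
    ACCUMULATE_OP(Total, MulCount, 1.0);       // placeholder throughput
    ACCUMULATE_OP(Total, AndCount, 1.0 / 3.0); // bitwise ops, per the notes

    return Total;
}
```

Dividing the result by 4 gives a per-pixel figure to compare against the measured cycles per pixel. The sum assumes the operations run one after another, so it is only a rough guide.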
Well, isn't that fancy: the measured cycle count is lower than our summed estimate. Instructions are likely executing on multiple ALUs per cycle
How many units are in a Nehalem core?
On the limitations of executing multiple instructions per clock
We're quite close to the max theoretical throughput.
Memory latency probably isn't hurting performance
Make an #if toggle for the intrinsic measurement code
How much is gamma (sqrt) costing us?
A troubling visual artifact appears around our hero...
Aha! An issue with the linear/sRGB code
Gamma is costing only ~6 cycles
This is a reasonably optimized pixel loop
Agenda for next session: Optimize outside/around the pixel loop.
stelar7 Q: Is this what you were looking for?
Nehalem diagram: Only one FPU?
grumpygiant256 Q: Worth timing the load/stores with no ALU ops to see how much we're memory bound?
thesizik Q: You counted _mm_and_ps wrong.
ieee754 Q: Are you doing pre-multiplied alpha? (Yes)
tenbroya Q: Could you run the game with task manager open?
jayp2 Q: Will this game only work for your specific processor?
toppstv Q: Are you going to update the yellow background textures?
braincruser Q: The texture fetch should be an L1 cache fetch.
0xwid Q: In an alternate universe where nobody cares for art, do you think optimization would still be a focus for developers?
miblo Q: Any idea why my cores get maxed out when running Handmade Hero with the XCB platform layer?
robotchocolatedino Q: Why wasn't there a greater speed increase after removing gamma correction?
marumoto Q: How will we split up the drawing onto multiple cores?
dingernalt2 Q: What's the floating head?
nothings2 Q: Question about _mm_sqrt_ps and common subexpression elimination
thesizik Q: What's that drum-like background noise?
jayp2 Q: Do you see all the questions?
thevaber Q: Can rdtsc be inaccurate with CPUs that vary their cycle rate?
cubercaleb Q: How does the CPU do things ahead of time if things are supposed to be done in order?
ttbjm Q: Do you expect a 16x speedup from multi-threading?
gasto5 Q: How do you select the instruction set for optimizing?
nothings2 Q: Aren't the Unity hardware survey results pretty different than the Steam ones?
captainkraft Q: What are the gains you get by writing your own software renderer vs using SDL, GPUs, etc?
jayp2 Q: Can a processor work through different types of calculations in a single cycle?
ca2dev Q: What kinds of things can be delegated to the GPU?