is presently its sole maintainer,
You can support him:
Review of last session: cycle counting code
Fabian: Counting instructions, even with throughput numbers, is not accurate; it does not properly take into account the CPU's ability to overlap different ops
The accurate way is to write a tool to simulate the CPU, as Casey did for the XB360
... or use the Intel Architecture Code Analyzer (IACA)
Overview on how to use IACA with the code
Marking sections with IACA_START and IACA_END
Modifying build.bat to include the iaca directory
For Linux/Unix compatibility, mind your case when including files
Running the IACA command line
Reading the IACA results
Trying to decipher the meanings of the letters in the IACA table
IACA can output graphs?
IACA reports max throughput of 86.60 cycles
There may be some more room for optimization...
IACA is pretty nice!
Adding some macros to turn IACA on/off
Thanks to Fabian for the suggestion
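A minimal sketch of such an on/off switch, assuming the `IACA_START` / `IACA_END` markers named above come from IACA's `iacaMarks.h` and that the iaca directory is on the include path. `USE_IACA`, `BEGIN_ANALYSIS`, `END_ANALYSIS` and `SumPixels` are hypothetical names for illustration, not the ones used in the episode.

```c
#include <stdint.h>

/* Hypothetical switch: pass -DUSE_IACA=1 (or /DUSE_IACA=1) to emit markers. */
#ifndef USE_IACA
#define USE_IACA 0
#endif

#if USE_IACA
#include <iacaMarks.h>  /* assumed to be reachable via the include path */
#define BEGIN_ANALYSIS IACA_START
#define END_ANALYSIS   IACA_END
#else
/* Markers compile away to nothing in normal builds. */
#define BEGIN_ANALYSIS
#define END_ANALYSIS
#endif

/* Stand-in loop; the real target would be the renderer's inner loop. */
uint32_t SumPixels(const uint32_t *Pixels, int Count)
{
    uint32_t Sum = 0;
    for (int I = 0; I < Count; ++I)
    {
        BEGIN_ANALYSIS;   /* start marker at the top of the loop body */
        Sum += Pixels[I];
    }
    END_ANALYSIS;         /* end marker right after the loop */
    return Sum;
}
```

With the switch off, the function behaves like ordinary code, so the markers can stay in the source between analysis runs.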
Fabian: bilinear and squaring don't need floating point
Move the sRGB->linear conversion after the bilinear
Bake normalization into color
Works fine (not much improvement)
Remove a number of multiply ops by keeping things in 0-255 space (no improvement)
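A scalar sketch of what baking the normalization into the color could look like, assuming texels arrive as floats in 0-255 and the color is in 0-1. The function names are hypothetical and the real code is SIMD; this only illustrates where the multiply goes.

```c
/* Straightforward version: normalize the texel to 0-1, then modulate. */
float ModulateNormalized(float Texel255, float Color)
{
    float Inv255 = 1.0f / 255.0f;
    float Texel01 = Texel255 * Inv255;  /* multiply per pixel */
    return Texel01 * Color;             /* second multiply per pixel */
}

/* Baked version: fold 1/255 into the color once, outside the pixel loop. */
float BakeColor(float Color)
{
    return Color * (1.0f / 255.0f);
}

/* Per pixel, only one multiply remains; the texel stays in 0-255 space. */
float ModulateBaked(float Texel255, float BakedColor)
{
    return Texel255 * BakedColor;
}
```

Both paths give the same result up to floating-point rounding, which is also why a compiler may already perform a similar transformation on its own.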
Diff IACA output from the run with the removed multiplies and the one prior
Getting rid of 43 instructions did not improve throughput reported by IACA
Seems to be doing the same number of multiplies either way
Compiler was smart enough to do the transformations?
What other optimizations could we do?
Use _mm_mullo_epi16 / _mm_mulhi_epi16 to do the square operations more wide prior to the FP conversion?
Blackboard: Mask out A and G, which leaves R and B aligned to the 16-bit SIMD boundaries
Blackboard: _mm_mullo_epi16 vs. _mm_mulhi_epi16
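As a small illustration of the difference between the two intrinsics: each 16-bit by 16-bit multiply produces a 32-bit product, and `_mm_mullo_epi16` keeps its low 16 bits while `_mm_mulhi_epi16` keeps the high 16 bits (treating the inputs as signed). `Mul16` is a hypothetical helper that only looks at lane 0.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Multiplies A*B in 16-bit lanes and returns both halves of the
   32-bit product from lane 0. */
void Mul16(int16_t A, int16_t B, uint16_t *Lo, uint16_t *Hi)
{
    __m128i VA = _mm_set1_epi16(A);
    __m128i VB = _mm_set1_epi16(B);

    __m128i LoV = _mm_mullo_epi16(VA, VB);  /* low 16 bits of each product */
    __m128i HiV = _mm_mulhi_epi16(VA, VB);  /* high 16 bits (signed multiply) */

    *Lo = (uint16_t)_mm_extract_epi16(LoV, 0);
    *Hi = (uint16_t)_mm_extract_epi16(HiV, 0);
}
```

For an 8-bit channel value the square is at most 255*255 = 65025, which fits entirely in the low 16 bits, so the high half is zero in that case.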
Problem: this will square our alpha as well
We'll have to use another instruction to handle alpha
Bitshifting / masking to pull the components from their 16-bit lanes
Wrong result! It's Q&A, but let's try to debug first...
Issue found: Should be masking 16-bit, not 8-bit
Better, but still a strange result
How to avoid squaring the alpha?
Just pull the alpha out prior to the squaring?
... that works fine
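The masking, squaring and alpha steps above can be sketched roughly as follows, assuming four packed 0xAARRGGBB pixels. `SquareTexels` and its output layout are hypothetical; the episode's code goes on to convert these values to floating point rather than storing them.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Squares R, G and B of four 0xAARRGGBB pixels in 16-bit lanes and
   returns alpha unsquared. */
void SquareTexels(const uint32_t Pixels[4],
                  uint32_t R2[4], uint32_t G2[4], uint32_t B2[4], uint32_t A[4])
{
    __m128i Sample     = _mm_loadu_si128((const __m128i *)Pixels);
    __m128i MaskFF     = _mm_set1_epi32(0x000000FF);
    __m128i MaskFF00FF = _mm_set1_epi32(0x00FF00FF);
    __m128i MaskFFFF   = _mm_set1_epi32(0x0000FFFF);

    /* Pull alpha out before squaring so it is never squared. */
    __m128i Alpha = _mm_srli_epi32(Sample, 24);

    /* Masking out A and G leaves R and B alone in their 16-bit lanes. */
    __m128i RB = _mm_and_si128(Sample, MaskFF00FF);
    /* Shift G down into the low lane; the mask drops alpha from the high lane. */
    __m128i G = _mm_and_si128(_mm_srli_epi32(Sample, 8), MaskFF);

    /* 16-bit squares: 255*255 = 65025 still fits in 16 bits. */
    __m128i RBsq = _mm_mullo_epi16(RB, RB);
    __m128i Gsq  = _mm_mullo_epi16(G, G);

    /* The squares are 16 bits wide, so the extraction mask must be
       0xFFFF, not 0xFF. */
    __m128i Rsq = _mm_srli_epi32(RBsq, 16);
    __m128i Bsq = _mm_and_si128(RBsq, MaskFFFF);

    _mm_storeu_si128((__m128i *)R2, Rsq);
    _mm_storeu_si128((__m128i *)G2, Gsq);
    _mm_storeu_si128((__m128i *)B2, Bsq);
    _mm_storeu_si128((__m128i *)A, Alpha);
}
```

Using an 8-bit mask when extracting the squared values would truncate anything above 255, which matches the kind of wrong result described above.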
Now let's convert everything to use the 16-bit squaring
... around 6 cycles improvement, but small visual problem with the bilinear
Found the issue: We were reading only from SampleA
Bilinear looks better, but still oddity with green fringing around the hero
Found the issue, looks good, but...
... we're actually 8 cycles worse now
Why? Let's run it through IACA
Throughput bottleneck: Inter-iteration? Good question for Fabian
Total number of micro-ops has gotten smaller (350 vs. 306 vs. 283), but throughput is worse
Could use the same technique when loading the destination, but probably not a good idea
cubercaleb Q: CP stands for Critical path
flaturated Q: IACA was showing Port 1 as the bottleneck, so reducing multiplies won't help
stelar7 Q: Inter-iteration means that run x of the loop depends on the prior run
butwhynot1 Q: Try hoisting out the TexturePitch/TextureMemory (several cycles improvement)
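A sketch of the hoisting idea, assuming a texture struct with a memory pointer and a pitch in bytes. The `texture` struct and `FetchColumn` are hypothetical stand-ins; only the `TexturePitch` / `TextureMemory` names come from the suggestion above.

```c
#include <stdint.h>

/* Hypothetical texture description. */
typedef struct
{
    void *Memory;
    int Pitch;  /* bytes per row */
} texture;

/* Copies Count texels from column X into Dest. */
void FetchColumn(const texture *Texture, int X, uint32_t *Dest, int Count)
{
    /* Hoisted: read the struct fields once into locals. Because the
       stores to Dest could alias *Texture, a compiler may otherwise
       reload the fields on every iteration. */
    const uint8_t *TextureMemory = (const uint8_t *)Texture->Memory;
    int TexturePitch = Texture->Pitch;

    for (int Y = 0; Y < Count; ++Y)
    {
        const uint8_t *TexelPtr =
            TextureMemory + Y * TexturePitch + X * (int)sizeof(uint32_t);
        Dest[Y] = *(const uint32_t *)TexelPtr;
    }
}
```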
roflraging Q: How do you support AVX? What about register saving through context switches?
mmozeiko Q: Replace sqrt with mul/rsqrt?
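One way to read this suggestion is sqrt(x) ≈ x * rsqrt(x), using the approximate reciprocal square root instruction. The sketch below assumes that interpretation; `ApproxSqrt4` is a hypothetical name, and the result is only an approximation (roughly 12 bits of precision).

```c
#include <xmmintrin.h>  /* SSE */

/* Approximates sqrt for four floats as x * rsqrt(x).
   Caveat: for x == 0 this computes 0 * infinity, which is NaN, so zero
   inputs would need separate handling. */
void ApproxSqrt4(const float In[4], float Out[4])
{
    __m128 X = _mm_loadu_ps(In);
    __m128 InvRoot = _mm_rsqrt_ps(X);       /* approximate 1/sqrt(x) */
    __m128 Root = _mm_mul_ps(X, InvRoot);   /* x * 1/sqrt(x) = sqrt(x) */
    _mm_storeu_ps(Out, Root);
}
```

Whether this is a win depends on which execution ports the multiply and rsqrt occupy on the target CPU, which is the kind of question IACA's port breakdown is meant to answer.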
Some comments on port 1 pressure from Fabian
robotchocolatedino Q: How can removing the sqrt help if it's done on the multiply port, not the adder port?