is presently its sole maintainer,
You can support him:
Open things up and recap
DrawRectangleSlowly: Increase efficiency
DrawRectangleHopefullyQuickly: Skip the preamble
Remove all unnecessary code
Look at what's happening
Make the edge testing code more explicit
Blackboard: See what's happening with these inner products
DrawRectangleHopefullyQuickly: Test U and V instead
Run the game
Make these U and V computations more efficient
Run the game and ensure that everything still blits fine
Flatten the routine
Blow out v4 Blended into scalar form
Take a close look at the routine and precompute InvTexelA
Blow out v4 Dest and Texel into scalar form
Flatten BilinearSample and SRGBBilinearBlend
Assess our situation
Unpack and optimise the Lerps
Run the game and annotate the code
That's everything flattened
Note that the code is faster
We have a nasty problem with the unpackings
Blackboard: What is our "wide" strategy?
Set the stage for SIMD
Consider solidifying texture boundaries
Leave it for today
braincruser Q: The way the code is written now you have a very long dependency chain (between instructions). Will you break down the code to remove it?
stelar7 Q: Why did you write float instead of real32 this stream?
stelar7 Q: Why use -O2 instead of -O3 or -Ofast (possibly with -fverbose-asm)?
garryjohanson Q: Do you ever use exclusive or operations to avoid pipeline stalls? If not, what do you use?
g3rain1 Q: Aren't those square roots pretty expensive?
andsz_ Q: Will you make multiple SIMD backends? (SSE?/AVX/FMA versions)
davidthomas426 Q: You could loft some of those variables out one more loop
waterlimon Q: How expensive is the float<>int conversion compared to the rest of the workload?
davidthomas426 Q: Since xAxis and yAxis are usually perpendicular, should we special case for that? In the same vein, should we special-case for axis-aligned?
waterlimon Q: Does the compiler do any automatic SSE optimization (or have an option for it)?
stelar7 Q: sqrt_ss vs sqrt_ps vs sqrt_pd?
waterlimon Q: Would SSE allow doing sRGB using an exponent of 2.2 instead of approximating it with an exponent of 2, without a huge performance hit?
pseudonym73 Q: The main reason why you don't get automatic SIMD is precise exceptions. You probably need to tell the compiler that you don't need them
waterlimon Q: What happens if "/arch:AVX2" switch is enabled?
Look at this AVX-512 stuff
braincruser Q: FMA is fused multiply-add
andsz_ Q: Yeah, looks like different caps bits
Wrap things up
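
As a rough illustration of the "Test U and V instead" step listed above, here is a minimal sketch in C of replacing four per-edge tests with a range check on U and V. It assumes the two axes are perpendicular (a later question in the list notes they usually are). The names `v2`, `Inner` and `InsideByUV` are hypothetical, chosen for this example rather than taken from the episode's code.

```c
#include <stdbool.h>

/* Hypothetical 2D vector type for this sketch. */
typedef struct { float x, y; } v2;

/* Inner (dot) product of two 2D vectors. */
static float Inner(v2 a, v2 b)
{
    return a.x * b.x + a.y * b.y;
}

/*
 * Computes texture-space coordinates (U, V) of point P relative to a
 * rectangle defined by Origin, XAxis and YAxis, and reports whether P
 * lies inside it.
 *
 * ASSUMPTION: XAxis and YAxis are perpendicular and non-zero. For skewed
 * axes, U and V computed this way are not the true coordinates.
 */
static bool InsideByUV(v2 Origin, v2 XAxis, v2 YAxis, v2 P,
                       float *U, float *V)
{
    /* These two reciprocals depend only on the axes, so a real fill loop
       could compute them once, outside the per-pixel loop. */
    float InvXAxisLengthSq = 1.0f / Inner(XAxis, XAxis);
    float InvYAxisLengthSq = 1.0f / Inner(YAxis, YAxis);

    v2 d = { P.x - Origin.x, P.y - Origin.y };

    *U = InvXAxisLengthSq * Inner(d, XAxis);
    *V = InvYAxisLengthSq * Inner(d, YAxis);

    /* One range check on U and V stands in for the four edge tests. */
    return (*U >= 0.0f) && (*U <= 1.0f) &&
           (*V >= 0.0f) && (*V <= 1.0f);
}
```

Calling `InsideByUV` with `Origin = {10, 10}`, `XAxis = {4, 0}`, `YAxis = {0, 2}` and `P = {12, 11}` yields U = 0.5 and V = 0.5, so the point counts as inside. Since U and V are needed for texture sampling anyway, reusing them for the inclusion test avoids separate inner products per edge.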