Load up the code and consider optimisation
handmade_render_group.cpp: Comment out if(ShouldFill[I])
Blackboard: Interleaving four SIMD values
Blackboard: Establishing the order we need
handmade_render_group.cpp: Write the SIMD register names that we want to end up with
Internet: Intel Intrinsics Guide [see Resources]
Blackboard: _mm_unpackhi_epi32 and _mm_unpacklo_epi32
Blackboard: Using these operations to generate what we need
handmade_render_group.cpp: Name the registers in register order
Internet: Double-check the parameter order of the unpack operations
handmade_render_group.cpp: Start to populate the registers
Internet: Keeping in mind how often you move between __m128 and __m128i
handmade_render_group.cpp: Cast the Blended values from float to int
Use structured art to enable us to see what's happening
Debugger: Watch how our art gets shuffled
handmade_render_group.cpp: Produce the rest of the pixel values we need
Convert 32-bit floating point values to 8-bit integers
// TODO(casey): Set the rounding to something known
Blackboard: Using 8-bits of these 32-bit registers
handmade_render_group.cpp: Bitwise OR and Shift these values
Blackboard: How the shift operations work
handmade_render_group.cpp: Implement these shifts
Debugger: Take a look at the Out value
handmade_render_group.cpp: Break out the values
Debugger: Inspect these values
handmade_render_group.cpp: Fix the test case
Debugger: Inspect our stuff
handmade_render_group.cpp: Write Out to Pixel
Debugger: Crash and reload
Debugger: Note that we are writing unaligned
handmade_render_group.cpp: Issue _mm_storeu_si128 to cause the compiler to use the (unaligned) mov instruction
Recap and glimpse into the future
braincruser Q: Will the operations be reordered to reduce the number of ops and load / stores?
mmozeiko Q: You are calculating Out like or(or(or(r, g), b), a). Would it be better to do it like this: or(or(r, g), or(b, a)), so the first two ORs are not dependent on each other?
handmade_render_group.cpp: Write it the way mmozeiko suggests
uspred Q: Do you need to start with 32-bit floats? Is there further optimization that doesn't need the casting?
Blackboard: Multiplying floats vs Multiplying integers
mmozeiko Q: Same for texture bilinear adds together
handmade_render_group.cpp: Implement mmozeiko's suggestion
flaturated Q: Can you compile with /O2 to compare it to last week's performance?
brblackmer Q: Why did you make macros for your SIMD operations (mmSquare, etc.) vs making functions?
quikligames Q: Are these intrinsics the same on other operating systems or compilers, as long as it's using Intel architecture?
mmozeiko Q: Why do you say unaligned store is nasty? As far as I know, for latest Intel CPUs (at least starting from Ivy Bridge) unaligned load / store is not very expensive anymore (<5% difference)
plain_flavored Q: Is scalar access to __m128 elements still slow on Intel?
braincruser Q: The processor window is 192 instructions
gasto5 Q: I don't understand how one optimizes by using the intrinsic or function
mmozeiko Q: _mm_cvttps_epi32 always truncates. Would that be better than messing with rounding mode?
handmade_render_group.cpp: Switch to _mm_cvttps_epi32