Packing Pixels for the Framebuffer | Handmade Hero Episode Guide

2:05 Load up the code and consider optimisation

4:09 handmade_render_group.cpp: Comment out if(ShouldFill[I])

5:34 Blackboard: Interleaving four SIMD values

14:27 Blackboard: Establishing the order we need

15:46 handmade_render_group.cpp: Write the SIMD register names that we want to end up with

16:29 Internet: Intel Intrinsics Guide¹

17:23 Blackboard: _mm_unpackhi_epi32 and _mm_unpacklo_epi32

19:04 Blackboard: Using these operations to generate what we need
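
As a rough illustration of the blackboard discussion, the sketch below interleaves four channel registers into four per-pixel registers using two rounds of `_mm_unpacklo_epi32` and `_mm_unpackhi_epi32`. The names `Pixels4`, `InterleaveRGBA` and `Lane` are hypothetical and are not taken from the episode's source; the channel order R, G, B, A is assumed here for readability.

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <cstdint>

// Hypothetical container for four interleaved pixels (not from the episode's code).
struct Pixels4
{
    __m128i P0, P1, P2, P3;
};

// Input:  R = R0 R1 R2 R3, G = G0 G1 G2 G3, B = B0 B1 B2 B3, A = A0 A1 A2 A3
// Output: P0 = R0 G0 B0 A0, P1 = R1 G1 B1 A1, and so on.
inline Pixels4 InterleaveRGBA(__m128i R, __m128i G, __m128i B, __m128i A)
{
    // _mm_unpacklo_epi32(a, b) yields a0 b0 a1 b1;
    // _mm_unpackhi_epi32(a, b) yields a2 b2 a3 b3.
    __m128i R0B0R1B1 = _mm_unpacklo_epi32(R, B);
    __m128i R2B2R3B3 = _mm_unpackhi_epi32(R, B);
    __m128i G0A0G1A1 = _mm_unpacklo_epi32(G, A);
    __m128i G2A2G3A3 = _mm_unpackhi_epi32(G, A);

    // The second round interleaves the two intermediate pairs.
    Pixels4 Result;
    Result.P0 = _mm_unpacklo_epi32(R0B0R1B1, G0A0G1A1);
    Result.P1 = _mm_unpackhi_epi32(R0B0R1B1, G0A0G1A1);
    Result.P2 = _mm_unpacklo_epi32(R2B2R3B3, G2A2G3A3);
    Result.P3 = _mm_unpackhi_epi32(R2B2R3B3, G2A2G3A3);
    return Result;
}

// Helper for inspection: read one 32-bit lane (lane 0 is the lowest).
inline uint32_t Lane(__m128i Value, int Index)
{
    alignas(16) uint32_t Lanes[4];
    _mm_store_si128(reinterpret_cast<__m128i *>(Lanes), Value);
    return Lanes[Index];
}
```

Feeding it recognisable lane values (for example 0x00..0x03 for R, 0x10..0x13 for G) makes the shuffle easy to follow in a debugger.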

24:17 handmade_render_group.cpp: Name the registers in register order

25:15 Internet: Double-check the parameter order of the unpack operations

26:22 handmade_render_group.cpp: Start to populate the registers

26:52 Internet: Keeping in mind how often you move between __m128 and __m128i

28:39 handmade_render_group.cpp: Cast the Blended values from float to int
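
A cast between `__m128` and `__m128i` only reinterprets the bits; it does not convert the values. The minimal sketch below shows this with `_mm_castps_si128`. The helper names `ReinterpretAsInt` and `FirstLane` are hypothetical and exist only for illustration.

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <cstdint>

// Reinterpret the 128 bits of a float register as integers.
// No conversion happens, so 1.0f keeps its IEEE-754 bit pattern.
inline __m128i ReinterpretAsInt(__m128 Value)
{
    return _mm_castps_si128(Value);
}

// Helper for inspection: read the lowest 32-bit lane.
inline uint32_t FirstLane(__m128i Value)
{
    return static_cast<uint32_t>(_mm_cvtsi128_si32(Value));
}
```

With these helpers, `FirstLane(ReinterpretAsInt(_mm_set1_ps(1.0f)))` is 0x3F800000 (the bit pattern of 1.0f), whereas a value conversion such as `_mm_cvtps_epi32` would give 1.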

29:47 Use structured art to enable us to see what's happening

34:47 Debugger: Watch how our art gets shuffled

38:40 handmade_render_group.cpp: Produce the rest of the pixel values we need

41:43 Convert 32-bit floating point values to 8-bit integers

44:07 // TODO(casey): Set the rounding to something known

45:08 Blackboard: Using 8-bits of these 32-bit registers

47:32 handmade_render_group.cpp: Bitwise OR and Shift these values

50:27 Blackboard: How the shift operations work

52:44 handmade_render_group.cpp: Implement these shifts
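
The shift-and-OR step can be sketched roughly as follows. This assumes each input lane already holds an integer in the range 0..255 and that a pixel is laid out as 0xAARRGGBB; the function names `PackPixels` and `Lane` are hypothetical rather than the episode's own.

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <cstdint>

// Each lane of R, G, B and A is assumed to hold a value in 0..255.
// Assumed layout of each packed 32-bit pixel: 0xAARRGGBB.
inline __m128i PackPixels(__m128i R, __m128i G, __m128i B, __m128i A)
{
    // Shift each channel into its byte position within the 32-bit lane.
    __m128i ShiftedR = _mm_slli_epi32(R, 16);
    __m128i ShiftedG = _mm_slli_epi32(G, 8);
    __m128i ShiftedB = B; // blue stays in the low byte
    __m128i ShiftedA = _mm_slli_epi32(A, 24);

    // Combine the four channels with bitwise OR, one after another.
    return _mm_or_si128(_mm_or_si128(_mm_or_si128(ShiftedR, ShiftedG), ShiftedB), ShiftedA);
}

// Helper for inspection: read one 32-bit lane (lane 0 is the lowest).
inline uint32_t Lane(__m128i Value, int Index)
{
    alignas(16) uint32_t Lanes[4];
    _mm_store_si128(reinterpret_cast<__m128i *>(Lanes), Value);
    return Lanes[Index];
}
```

Because the shifted channels occupy disjoint bytes, OR behaves like addition here and the order of the ORs does not change the result.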

55:06 Debugger: Take a look at the Out value

57:33 handmade_render_group.cpp: Break out the values

58:22 Debugger: Inspect these values

58:35 handmade_render_group.cpp: Fix the test case

59:32 Debugger: Inspect our stuff

1:00:13 handmade_render_group.cpp: Write Out to Pixel

1:01:08 Debugger: Crash and reload

1:01:43 Debugger: Note that we are writing unaligned

1:04:22 Blackboard: Alignment

1:05:54 handmade_render_group.cpp: Issue _mm_storeu_si128 to cause the compiler to use the unaligned move instruction (movdqu)
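
A minimal sketch of the unaligned write, assuming the destination is a row of 32-bit pixels whose address is not guaranteed to be a multiple of 16 bytes. `StorePixels4` is a hypothetical helper name, not the episode's code.

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <cstdint>

// Write four packed 32-bit pixels to a destination with no alignment guarantee.
// _mm_store_si128 would require a 16-byte aligned address and can fault otherwise;
// _mm_storeu_si128 accepts any address.
inline void StorePixels4(uint32_t *Pixel, __m128i Out)
{
    _mm_storeu_si128(reinterpret_cast<__m128i *>(Pixel), Out);
}
```

For example, storing to `Row + 1` of a 16-byte aligned `uint32_t Row[8]` lands 4 bytes past an aligned boundary, which is the kind of address the aligned store would reject.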

1:07:23 Recap and glimpse into the future

1:08:30 Q&A

1:09:59 braincruser: Will the operations be reordered to reduce the number of ops and load / stores?

1:12:01 mmozeiko: You are calculating Out like or(or(or(r, g), b), a). Would it be better to do it like this: or(or(r, g), or(b, a)), so first two or's are not dependent on each other?

1:14:57 handmade_render_group.cpp: Write it the way mmozeiko suggests
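
The regrouping in mmozeiko's question can be sketched as below: the chained form makes each OR wait on the previous one, while the tree form lets the two inner ORs proceed independently. `OrChain`, `OrTree` and `Lane` are hypothetical names; whether the tree form is measurably faster depends on the compiler and CPU and is not claimed here.

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <cstdint>

// Chained form: or(or(or(r, g), b), a). Each OR depends on the one before it.
inline __m128i OrChain(__m128i R, __m128i G, __m128i B, __m128i A)
{
    return _mm_or_si128(_mm_or_si128(_mm_or_si128(R, G), B), A);
}

// Tree form: or(or(r, g), or(b, a)). The two inner ORs have no dependency on each other.
inline __m128i OrTree(__m128i R, __m128i G, __m128i B, __m128i A)
{
    return _mm_or_si128(_mm_or_si128(R, G), _mm_or_si128(B, A));
}

// Helper for inspection: read one 32-bit lane (lane 0 is the lowest).
inline uint32_t Lane(__m128i Value, int Index)
{
    alignas(16) uint32_t Lanes[4];
    _mm_store_si128(reinterpret_cast<__m128i *>(Lanes), Value);
    return Lanes[Index];
}
```

Both forms produce identical bits, since bitwise OR is associative and commutative; only the dependency chain differs.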

1:17:31 uspred: Do you need to start with 32-bit floats? Is there further optimization that doesn't need the casting?

1:18:21 Blackboard: Multiplying floats vs Multiplying integers

1:19:54 mmozeiko: Same for texture bilinear adds together

1:20:03 handmade_render_group.cpp: Implement mmozeiko's suggestion

1:23:00 flaturated: Can you compile /O2 to compare it to last week's performance?

1:23:16 brblackmer: Why did you make macros for your SIMD operations (mmSquare, etc.) vs making functions?

1:23:39 quikligames: Are these intrinsics the same on other operating systems or compilers, as long as it's using Intel architecture?

1:24:40 mmozeiko: Why do you say unaligned store is nasty? As far as I know, for latest Intel CPUs (at least starting from Ivy Bridge) unaligned load / store is not very expensive anymore (<5% difference)

1:26:25 plain_flavored: Is scalar access to __m128 elements still slow on Intel?

1:27:18 braincruser: The processor window is 192 instructions

1:28:01 gasto5: I don't understand how one optimizes by using the intrinsic or function

1:28:51 mmozeiko: _mm_cvttps_epi32 always truncates. Would that be better than messing with rounding mode?

1:30:45 handmade_render_group.cpp: Switch to _mm_cvttps_epi32
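
To make the difference concrete, the sketch below compares `_mm_cvtps_epi32`, which rounds according to the current MXCSR rounding mode, with `_mm_cvttps_epi32`, which always truncates toward zero. The wrapper names `ConvertCurrentMode` and `ConvertTruncate` are hypothetical; the comments assume the rounding mode has been left at its usual default of round-to-nearest.

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <cstdint>

// Converts using whatever rounding mode MXCSR currently specifies.
// With the usual default (round-to-nearest), 2.7f becomes 3.
inline int32_t ConvertCurrentMode(float Value)
{
    return _mm_cvtsi128_si32(_mm_cvtps_epi32(_mm_set1_ps(Value)));
}

// Always truncates toward zero, independent of the rounding mode.
// 2.7f becomes 2 and -2.7f becomes -2.
inline int32_t ConvertTruncate(float Value)
{
    return _mm_cvtsi128_si32(_mm_cvttps_epi32(_mm_set1_ps(Value)));
}
```

Truncation removes the dependence on external state, at the cost of biasing results downward unless 0.5 is added before the conversion.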

1:32:50 Wrap up
