Optimizing with SSE2 and AVX2
0:02 Recap and set the stage for the day
1:26 Run the program to show the current picture
4:17 Begin to implement the LANE_WIDTH == 4 versions for our various functions / operators [1]
13:57 Describe the _mm_xor_si128 instruction [2]
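As a rough illustration of the _mm_xor_si128 entry: the instruction XORs all 128 bits of its two operands, which for a 4-wide lane of 32-bit integers amounts to four independent XORs. The `lane_u32` wrapper and the `StoreLanes` helper below are assumptions made for this sketch, not code taken from the episode.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h> // SSE2 integer intrinsics

// Hypothetical 4-wide wrapper around an SSE2 integer register.
struct lane_u32 { __m128i V; };

// Bitwise XOR of all 128 bits, i.e. four independent 32-bit XORs.
inline lane_u32 operator^(lane_u32 A, lane_u32 B)
{
    lane_u32 Result;
    Result.V = _mm_xor_si128(A.V, B.V);
    return Result;
}

// Illustrative helper: copy the four lanes out to memory for inspection.
inline void StoreLanes(lane_u32 A, uint32_t Out[4])
{
    _mm_storeu_si128((__m128i *)Out, A.V);
}
```

Wrapping the intrinsic in an operator is what lets the same expression compile unchanged when `lane_u32` is a plain integer in the 1-wide build.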
16:52 Implement a full set of lane width-agnostic operators [3]
42:33 Fix up CastSampleRays() to convert everything to the correct lane width [α]
45:35 Introduce LaneV3FromV3() and continue fixing up CastSampleRays()
48:45 Implement the various lane_v3 functions / operators [4]
1:03:14 Implement scalar comparison operators [5]
1:12:16 Introduce AndNot() using _mm_andnot_si128 for ConditionalAssign() to use [6]
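A minimal sketch of how AndNot() can serve ConditionalAssign(), assuming a 4-wide `lane_u32` wrapper and masks whose lanes are either all ones or all zeros. The one detail worth remembering is that _mm_andnot_si128(A, B) computes (~A) & B, so the first operand is the one that gets inverted. The function bodies are a guess at the shape of the code, not a transcription of it.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h> // SSE2 integer intrinsics

// Hypothetical 4-wide wrapper around an SSE2 integer register.
struct lane_u32 { __m128i V; };

inline lane_u32 operator&(lane_u32 A, lane_u32 B) { return lane_u32{_mm_and_si128(A.V, B.V)}; }
inline lane_u32 operator|(lane_u32 A, lane_u32 B) { return lane_u32{_mm_or_si128(A.V, B.V)}; }

// (~A) & B: the FIRST operand is the inverted one.
inline lane_u32 AndNot(lane_u32 A, lane_u32 B) { return lane_u32{_mm_andnot_si128(A.V, B.V)}; }

// Per lane: keep Dest where the mask is zero, take Source where it is all ones.
inline void ConditionalAssign(lane_u32 *Dest, lane_u32 Mask, lane_u32 Source)
{
    *Dest = AndNot(Mask, *Dest) | (Mask & Source);
}
```

This branch-free select is what allows each lane to follow its own path through the ray loop without any per-lane `if`.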
1:23:00 Continue to implement our scalar functions
1:32:28 Double-check C's specification for comparison operators [7]
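Why the language rule matters here: relational operators in C yield 1 or 0 (true or false in C++), whereas SIMD compares yield an all-ones or all-zeros mask per lane. A 1-wide build that reuses the mask-and-blend formula therefore has to widen the result first. The sketch below assumes `lane_u32` is a plain 32-bit integer in the 1-wide build; `BlendWithRawMask` is a hypothetical helper that exists only to show what goes wrong otherwise.

```cpp
#include <cassert>
#include <cstdint>

// Assumed 1-wide definition: a lane is just a scalar.
typedef uint32_t lane_u32;

// Widen a 0/1 comparison result to a full-width mask before blending.
inline void ConditionalAssign(lane_u32 *Dest, lane_u32 Mask, lane_u32 Source)
{
    Mask = (Mask ? 0xFFFFFFFFu : 0u);
    *Dest = (~Mask & *Dest) | (Mask & Source);
}

// For contrast only: the same blend with the raw 0/1 mask left as-is.
inline lane_u32 BlendWithRawMask(lane_u32 Dest, lane_u32 Mask, lane_u32 Source)
{
    return (~Mask & Dest) | (Mask & Source);
}
```

With a raw mask of 1, only the lowest bit of Source is selected, so blending 10 into 5 produces 4 rather than 10.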
1:34:00 Continue to fix up compile errors
1:34:52 Implement scalar loading of materials using _mm_setr_ps [8]
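SSE2 has no gather instruction, so loading one material field per lane comes down to four scalar reads packed with _mm_setr_ps, whose first argument lands in lane 0 (_mm_set_ps takes its arguments in the opposite order). The `material` layout and the `GatherF32` signature below are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h> // SSE: _mm_setr_ps, _mm_storeu_ps

// Hypothetical material layout; only the stride between elements matters here.
struct material { float Specular; float EmitR, EmitG, EmitB; };

// Hypothetical 4-wide float wrapper.
struct lane_f32 { __m128 V; };

// Scalar "gather": read one float per lane from an array of structs.
// BasePtr points at the wanted field of element 0; Stride is the element size in bytes.
inline lane_f32 GatherF32(const void *BasePtr, size_t Stride, const uint32_t Indices[4])
{
    const char *Base = (const char *)BasePtr;
    lane_f32 Result;
    Result.V = _mm_setr_ps(*(const float *)(Base + Indices[0]*Stride),
                           *(const float *)(Base + Indices[1]*Stride),
                           *(const float *)(Base + Indices[2]*Stride),
                           *(const float *)(Base + Indices[3]*Stride));
    return Result;
}
```

Passing the address of a field plus the struct size as the stride lets one routine fetch any float member without a separate function per field.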
1:50:15 Continue to fix up compile errors [9]
1:59:59 Implement multiple permutations of MaskIsZeroed() and HorizontalAdd() [10]
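One plausible 4-wide shape for these two helpers, assuming `lane_u32` and `lane_f32` wrappers around SSE registers. The mask test relies on every lane being exactly all ones or all zeros, and the horizontal add simply spills to memory; neither is claimed to match the code written on stream.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h> // SSE2 (also pulls in the SSE float intrinsics)

// Hypothetical 4-wide wrappers.
struct lane_u32 { __m128i V; };
struct lane_f32 { __m128 V; };

// True when no lane is set. _mm_movemask_epi8 collects the top bit of each byte,
// so this assumes lanes hold proper masks (0 or 0xFFFFFFFF), not arbitrary values.
inline bool MaskIsZeroed(lane_u32 A)
{
    return _mm_movemask_epi8(A.V) == 0;
}

// Sum the four lanes into one scalar by storing them and adding in order.
inline float HorizontalAdd(lane_f32 A)
{
    float Lanes[4];
    _mm_storeu_ps(Lanes, A.V);
    return Lanes[0] + Lanes[1] + Lanes[2] + Lanes[3];
}
```

MaskIsZeroed() gives the ray loop its early-out once every lane has finished, and HorizontalAdd() collapses per-lane accumulators into the single value a pixel needs.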
2:06:20 Make RenderTile() pack the sRGB colour inline and initialise everything in scalar
2:11:39 Introduce Extract0() for RenderTile() to call
2:14:17 Change Materials, Planes and Spheres to be initialiser lists
2:18:47 Make the Entropy stuff work properly [11]
2:21:42 Run the program to see totally bogus results
2:22:18 Print out the lane width and flip LANE_WIDTH back to 1 so we can get that working again
2:32:07 Run the program in 1-wide lanes to see that this no longer works
2:34:23 Step through CastSampleRays() and inspect its values
2:43:07 Make the operator& for lane_v3 zero out the mask if needed
2:44:29 Run the program...
2:45:19 Bump the CPUCount back up
2:45:27 Run the program to see what's going on
2:46:58 Increase the LANE_WIDTH to 4
2:47:19 Run the program to see a bizarre picture
2:48:26 Switch back to the slow mode and step through CastSampleRays() to inspect its values
2:54:12 Fix ConditionalAssign() to cast rather than convert
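The cast-versus-convert distinction, sketched under the assumption that the float overload of ConditionalAssign() takes an integer mask. _mm_castsi128_ps reinterprets the same 128 bits as floats and generates no instruction, whereas _mm_cvtepi32_ps converts numerically, turning an all-ones lane (integer -1) into -1.0f, whose bit pattern 0xBF800000 is useless as a mask. The wrapper types are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h> // SSE2: _mm_castsi128_ps, _mm_cvtepi32_ps

// Hypothetical 4-wide wrappers.
struct lane_u32 { __m128i V; };
struct lane_f32 { __m128 V; };

// Per-lane select for floats driven by an integer mask.
inline void ConditionalAssign(lane_f32 *Dest, lane_u32 Mask, lane_f32 Source)
{
    // Reinterpret the mask bits as floats; a numeric conversion here would corrupt them.
    __m128 MaskPS = _mm_castsi128_ps(Mask.V);
    Dest->V = _mm_or_ps(_mm_andnot_ps(MaskPS, Dest->V),
                        _mm_and_ps(MaskPS, Source.V));
}
```

A convert in place of the cast still compiles cleanly, which is likely why this kind of bug only shows up as a wrong picture.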
2:54:54 Step back through CastSampleRays() to see more expected values
2:56:20 Run our program to see a better image
2:58:02 Compare our GatherF32_() functions
2:59:52 Step into CastSampleRays() and inspect the material values
3:03:16 Scrutinise our operator!= for lane_u32
3:04:39 Fix our operator!= for lane_u32 to use _mm_set1_epi32(0xFFFFFFFF) rather than _mm_setzero_si128()
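SSE2 offers an integer equality compare but no inequality compare, so operator!= has to invert the result of _mm_cmpeq_epi32. XOR with zero leaves the equality mask untouched, which makes != behave like ==; XOR with all ones flips every bit. A sketch with an assumed `lane_u32` wrapper:

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h> // SSE2 integer intrinsics

// Hypothetical 4-wide wrapper around an SSE2 integer register.
struct lane_u32 { __m128i V; };

// Per-lane "not equal": compare for equality, then invert the mask.
inline lane_u32 operator!=(lane_u32 A, lane_u32 B)
{
    lane_u32 Result;
    Result.V = _mm_xor_si128(_mm_cmpeq_epi32(A.V, B.V),
                             _mm_set1_epi32((int)0xFFFFFFFF)); // all ones, not zero
    return Result;
}
```

The explicit cast to int is there because _mm_set1_epi32 takes a signed argument.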
3:05:55 Step into CastSampleRays() to see that our lane mask is set properly
3:06:57 Run our program to see that we're now only a little bit wrong
3:08:08 Read through our scalar code for any obvious mistakes
3:18:07 Run our program on 1 lane, to compare our image with the 4 lane version
3:20:51 Rename Scatter to Specular and try to force all Specular values to 1
3:23:52 Run our program to see what that looks like
3:25:53 Revert those specular values and investigate whether the PureBounce, RandomBounce and RayDirection are being computed correctly
3:28:49 Step into the lane_v3 Lerp() to see what it produces
3:32:42 Check the normalisation of RayDirection
3:33:37 Step through RandomBilateral()
3:35:01 Step into LaneF32FromU32() and double-check what it is computing
3:36:48 Make LaneU32FromU32() cast its incoming u32 to an int when passing it to _mm_set1_epi32 [β]
3:38:43 Step back into RandomUnilateral() to see possibly more expected results
3:40:06 Assert in RandomUnilateral() that Result < 0.6f
3:41:06 Run the game and don't hit that assert, to determine that RandomUnilateral() is not producing the full range of values from 0 to 1
3:43:00 Make RandomUnilateral() shift down its terms by 1 [γ]
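One reading of "shift down its terms by 1", offered as an assumption rather than a transcript: _mm_cvtepi32_ps treats each lane as a signed integer, so random values with the top bit set convert to negative floats and the result never covers 0 to 1. Shifting the random bits right by one clears the sign bit, and dividing by the maximum value shifted the same way restores the range. The `random_series` struct and the xorshift constants are illustrative.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h> // SSE2

// Hypothetical 4-wide random state: one xorshift32 stream per lane (seeds must be nonzero).
struct random_series { __m128i State; };

// One xorshift32 step on all four lanes.
inline __m128i XOrShift32(random_series *Series)
{
    __m128i X = Series->State;
    X = _mm_xor_si128(X, _mm_slli_epi32(X, 13));
    X = _mm_xor_si128(X, _mm_srli_epi32(X, 17));
    X = _mm_xor_si128(X, _mm_slli_epi32(X, 5));
    Series->State = X;
    return X;
}

// Random floats in [0, 1].
inline __m128 RandomUnilateral(random_series *Series)
{
    // Logical shift right by 1 clears the sign bit, so the signed conversion is safe.
    __m128i Bits = _mm_srli_epi32(XOrShift32(Series), 1);
    __m128 Numerator = _mm_cvtepi32_ps(Bits);
    __m128 Denominator = _mm_set1_ps((float)(0xFFFFFFFFu >> 1));
    return _mm_div_ps(Numerator, Denominator);
}
```

The cost is one bit of randomness per sample, which is far below what a 24-bit float mantissa can represent anyway.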
3:44:29 Run and hit our assertion in RandomUnilateral()
3:44:34 Remove that assert and run the game to see a reasonable result
3:46:17 Run the program at full quality and compare our images [δ]
3:49:06 Step into CastSampleRays() to see that we do break out properly
3:49:20 Cast significantly fewer rays per pixel to determine that we are not over-casting
3:52:21 Step into CastSampleRays() and inspect the asm
3:55:07 Make CastSampleRays() count up the LoopsComputed for us to print out
4:00:35 Run our program and inspect its statistics to see a mere 10.61% wasted bounces
4:01:40 Q&A
4:02:40 thecodedragon You didn't replace &= and |= with the correct operator inside the function
4:03:03 popcorn0x90 Q: Is your beard fake? It grew pretty fast
4:03:11 Kelimion cmuratori: Not just day 3, but also 2017-11-19 (for the image filename)
4:03:38 pragmascrypt Q: 64 samples looked very smooth. Could you compare 64 samples with 4 wide to 64 samples 1 wide?
4:03:59 chrysos42 Q: Due to floating point precision, is there a significant difference between generating a random float by dividing 32 random bits by the max 32 bit integer vs dividing 24 random bits by the max 24 bit integer?
4:05:34 the_lyribolical_coach_b Q: You said in a much earlier stream that using operator overloads for SIMD could confuse the compiler, preferring to use macros. Why the change?
4:06:53 pragmascrypt Q: I was thinking maybe it does more samples than it should with 4 wide, so by comparing 64 samples 1 wide with 64 samples 4 wide maybe it would look different
4:07:12 Run the program on 1 lane and 64 RaysPerPixel and compare the images
4:09:23 groggeh Q: Nine women can't grow a baby any faster; would smaller packing potentially be better? Is the packing taking too much time? Just spit-balling
4:10:24 Enable CastSampleRays() to early-out as often as possible
4:14:40 Run the program to see that it is now twice as fast
4:16:00 Consider avoiding gathering for rays that haven't hit
4:16:59 Explicitly establish that the LaneMask is not zeroed before setting the Attenuation, Bounces and RayDirection
4:17:56 Run the program to see another speedup
4:19:02 Pull out the lane width-specific code into its own .h files, introducing 8-wide versions for everything [12]
4:22:17 Check out the _CMP* defines in immintrin.h [13] [ε]
4:27:08 Learn what "ordered" means in the context of these _CMP* defines [14]
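For the ordered / unordered distinction: an ordered predicate is false whenever either operand is NaN, while an unordered predicate is true in that case. The _CMP* defines (for example _CMP_LT_OQ versus _CMP_NLT_UQ) select between these behaviours for the AVX compare intrinsics. To stay buildable without AVX flags, the sketch below shows the same semantics with the fixed-predicate SSE compares; the two helper names are made up for the illustration.

```cpp
#include <cassert>
#include <limits>
#include <xmmintrin.h> // SSE float compares

// Ordered "less than": false in every lane if either input is NaN.
// Returns the 4-bit lane mask from _mm_movemask_ps (0xF means all lanes true).
inline int LessThanBits(float A, float B)
{
    return _mm_movemask_ps(_mm_cmplt_ps(_mm_set1_ps(A), _mm_set1_ps(B)));
}

// Unordered "not less than": true in every lane if either input is NaN.
inline int NotLessThanBits(float A, float B)
{
    return _mm_movemask_ps(_mm_cmpnlt_ps(_mm_set1_ps(A), _mm_set1_ps(B)));
}
```

The practical consequence is that "not less than" is not the same as "greater than or equal" once NaNs are involved, so the choice of predicate decides which side of a mask a NaN lane falls on.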
4:28:22 Continue to implement the 8-wide versions of our functions / operators [15]
4:36:07 Run the program in 8-wide lanes and crash immediately
4:38:21 Inspect the asm for RenderTile() to see that we are failing on the vunpcklps instruction, and investigate if it is an alignment issue
4:42:03 Search the Intel 64 and IA-32 Architectures Software Developer Manuals for vunpcklps [16]
4:44:33 Pass -arch:AVX2 on the build line to prevent the vunpcklps instruction from using bcst [17]
4:46:26 Run our program in 8-wide lanes to see that we are slower, more wasteful and darker
4:47:18 Fix our 8-wide HorizontalAdd()
4:48:10 Run our program to see that we are much better, and save off our images and statistics
4:52:28 That's about it for today
4:53:20 butwhynot1 Q: Do AVX512 now
4:54:04 That's it