Handmade Hero»Episode Guide
Finishing the Main SIMD Raycasting Loop

0:01 Recap and set the stage for the day finishing SIMD optimising the lighting
0:47 Toggle on the threading and add a b32 Hit to raycast_result for RayCast() to encode that a ray did not hit
5:57 Run the game to find that we're running at 32ms
6:29 Toggle off the threading
6:49 Run the game to see that we're running at 128ms per frame
7:18 Toggle on the threading
7:26 Run the game and consider our 32ms per frame rate
9:43 Excise from RayCast() the non-SIMD tRay code, and start to consider how to retire ray hits
11:42 Preserving ray hits vs traversing the spatial hierarchy, when threading
15:57 Enable RayCast() to record ray hits for each SIMD component before traversing the spatial hierarchy
20:02 Run the game to see that we're running at the same 32ms
20:37 Revert RayCast() to traverse the spatial hierarchy, applying the ray hit mask for each component, and streamline how this works, introducing f32_4x versions of &= and |=
26:58 Run the game to see that that's fine
27:06 Start to streamline the tRay setting code
28:56 Fix the CloseEnough check in RayCast()
30:05 Run the game to see not much difference
30:20 Introduce Select() to streamline the tRay setting code [1]
34:15 Run the game to see that we are at ~28ms per frame
34:44 Make RayCast() set the BoxIndex and BoxSurface in SIMD using Select()
42:30 Run the game and crash in ComputeLightPropagation()
44:45 Step into GetBox() to see that our BoxIndex is busted
46:33 Step through RayCast() to see what's happening
49:46 Make RayCast() actually set the BoxIndex and BoxSurfaceIndex
50:55 Run the game with the selection happening
51:09 Make RayCast() set the RayP in SIMD using a v3_4x version of Select()
53:38 Run the game to see that we are down to ~26ms
54:22 Add a TIMED_FUNCTION() in RayCast()
54:41 Run the game to consult the profiler
54:55 Add a TIMED_BLOCK() around the startup code in RayCast()
55:35 Run the game and consult the profiler to see that the startup cost is not high
56:00 Perform SampleHemisphere() in SIMD
1:01:22 Run the game to see that we're down to 22ms per frame
1:02:04 Temporarily make SampleHemisphere() use complete randomisation
1:02:20 Run the game to see that this would put us back up to 30ms per frame, and note why
1:04:08 Drop the RayCount down to 4 in ComputeLightPropagation()
1:04:25 Run the game and unexpectedly see no speed improvement
1:05:45 Remove variable suffixes in RayCast()
1:08:55 Consider removing the Depth loop in RayCast() and reposition the AnyTrue(Mask) test
1:10:31 Run the game and consider where to go from here
1:11:25 Inspect the assembly of RayCast()
1:14:32 Remove the Mask tests from RayCast() entirely
1:15:16 Run the game to see no real difference
1:15:40 Try removing the AnyTrue(tCheck)
1:15:55 Run the game to see that that would put us up to ~25ms per frame
1:16:51 Compute RayP at the very end of RayCast()
1:18:30 Run the game to see no difference
1:18:41 Replace RayP with tRay in RayCast()
1:20:40 Run the game to see no difference
1:21:01 Let RayCast() break if(AllTrue(Mask))
1:21:12 Run the game to see no difference
1:21:37 Toggle off the snake
1:21:47 Run the game with our consistently lit scene
1:22:28 Inline AccumulateSample() in ComputeLightPropagation()
1:25:37 Run the game to see no difference, and consider further improvements
1:27:06 Make SampleHemisphere() operate entirely SIMD, introducing RandomBilateral_4x() and versions of Inner(), LengthSq() and NOZ() that take v3_4x
1:41:35 A few words on _mm_rsqrt_ps and _mm_sqrt_ps [2]
1:44:20 Rename our new NOZ() to ApproxNOZ() for SampleHemisphere() to call, and introduce ApproxInvSquareRoot() using _mm_rsqrt_ps [3]
1:49:24 Run the game to see no difference
1:49:52 Inline SampleHemisphere() in ComputeLightPropagation()
1:51:51 Run the game at ~22ms per frame, and consider that this CPU rendered lighting is performant enough for us
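
As a rough illustration of the 1:41:35 and 1:44:20 entries, here is a minimal sketch comparing the approximate reciprocal square root (_mm_rsqrt_ps) against a full-precision square root and divide (_mm_sqrt_ps). The f32_4x struct, the F32_4x() constructor, Lane(), InvSquareRoot() and MaxRelativeError() are assumptions made for this example and may not match the episode's code. Only the use of _mm_rsqrt_ps inside ApproxInvSquareRoot() comes from the guide.

```c
#include <assert.h>
#include <xmmintrin.h> /* SSE: _mm_rsqrt_ps, _mm_sqrt_ps, _mm_div_ps */

/* Hypothetical 4-wide float wrapper; the episode's f32_4x may differ. */
typedef struct { __m128 P; } f32_4x;

/* Hypothetical constructor: lane 0 = A, lane 1 = B, and so on. */
static f32_4x F32_4x(float A, float B, float C, float D) {
    f32_4x Result;
    Result.P = _mm_setr_ps(A, B, C, D);
    return Result;
}

/* Read one lane back out, for inspection only. */
static float Lane(f32_4x A, int Index) {
    float V[4];
    assert(Index >= 0 && Index < 4);
    _mm_storeu_ps(V, A.P);
    return V[Index];
}

/* Fast per-lane estimate of 1/sqrt(x); Intel documents a maximum
   relative error of about 1.5 * 2^-12. */
static f32_4x ApproxInvSquareRoot(f32_4x A) {
    f32_4x Result;
    Result.P = _mm_rsqrt_ps(A.P);
    return Result;
}

/* Full-precision reference: a real square root followed by a divide. */
static f32_4x InvSquareRoot(f32_4x A) {
    f32_4x Result;
    Result.P = _mm_div_ps(_mm_set1_ps(1.0f), _mm_sqrt_ps(A.P));
    return Result;
}

/* Largest relative difference between the two versions over the four lanes.
   Assumes every lane of A is positive. */
static float MaxRelativeError(f32_4x A) {
    f32_4x Approx = ApproxInvSquareRoot(A);
    f32_4x Exact = InvSquareRoot(A);
    float Worst = 0.0f;
    for (int I = 0; I < 4; ++I) {
        float Diff = Lane(Approx, I) - Lane(Exact, I);
        if (Diff < 0.0f) Diff = -Diff;
        float Error = Diff / Lane(Exact, I);
        if (Error > Worst) Worst = Error;
    }
    return Worst;
}
```

Calling MaxRelativeError(F32_4x(1.0f, 4.0f, 16.0f, 100.0f)) should give a small value, well under one part in a thousand. An error that size is likely acceptable for normalising randomly sampled directions.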
1:52:52 Q&A
1:53:22 printf_armin How much % of the CPU does it drain?
1:56:07 tbodt_ Q: What version control system do you use?
1:56:50 nxsy Q: Can you look at the thread profile view when you lower the samples from 16 to 4 and don’t improve frame rate?
1:57:48 Temporarily toggle off VSync
1:58:46 Run the game and consult the profiler to determine that the pixel shader is too slow
2:02:17 mallesbixie Q: You cast the input value in the U32_4x loader to a float. Why is that? [4]
2:05:32 jamoflaw Q: How does the mask replace an if in the SIMD instructions?
2:05:46 Masking in SIMD [5, 6]
2:19:38 Read about PBLENDVB - 'Variable Blend Packed Bytes' in the Intel 64 and IA-32 Architectures Software Developer Manuals [7] and consult the Steam Hardware Survey [8] for instruction set use
2:25:23 alexkelbo Q: Why does the light flicker even when the cube is not moving?
2:25:42 vaualbus Q: Could we ship with SSE4, so we have _m256 and get more performance improvement?
2:25:52 sgtrumbi Q: What can you see in TaskManager's GPU tab?
2:27:25 longboolean Q: Do you have any tips on talking with non programmers about programming related concepts?
2:27:34 tbodt_ Q: How does your profiler work? Does it hook into the compiler or something?
2:27:51 alexkelbo Q: Could we compile one exe with SSE4 that falls back to SSE2 if it's not present on the CPU, or would we need to compile into several exe's?
2:28:50 Close up shop
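
For the 2:05:32 question about how a mask replaces an if, here is a minimal sketch of the usual SSE pattern. A comparison produces all-ones or all-zeros in each lane, and bitwise and / andnot / or then pick between two values per lane. The f32_4x struct, the helper names, and the argument order of Select() are assumptions made for this example. The guide names Select(), AnyTrue() and AllTrue() but does not show their definitions.

```c
#include <assert.h>
#include <xmmintrin.h> /* SSE: _mm_cmplt_ps, _mm_and_ps, _mm_andnot_ps, _mm_or_ps, _mm_movemask_ps */

/* Hypothetical 4-wide float wrapper; the episode's f32_4x may differ. */
typedef struct { __m128 P; } f32_4x;

/* Hypothetical constructor: lane 0 = A, lane 1 = B, and so on. */
static f32_4x F32_4x(float A, float B, float C, float D) {
    f32_4x Result;
    Result.P = _mm_setr_ps(A, B, C, D);
    return Result;
}

/* Read one lane back out, for inspection only. */
static float Lane(f32_4x A, int Index) {
    float V[4];
    assert(Index >= 0 && Index < 4);
    _mm_storeu_ps(V, A.P);
    return V[Index];
}

/* Per-lane comparison: a lane becomes all 1 bits where A < B, else all 0 bits. */
static f32_4x LessThan(f32_4x A, f32_4x B) {
    f32_4x Result;
    Result.P = _mm_cmplt_ps(A.P, B.P);
    return Result;
}

/* Per-lane "Mask ? B : A", computed as (~Mask & A) | (Mask & B). */
static f32_4x Select(f32_4x A, f32_4x Mask, f32_4x B) {
    f32_4x Result;
    Result.P = _mm_or_ps(_mm_andnot_ps(Mask.P, A.P), _mm_and_ps(Mask.P, B.P));
    return Result;
}

/* _mm_movemask_ps gathers the top bit of each lane into the low 4 bits of an int. */
static int AnyTrue(f32_4x Mask) { return _mm_movemask_ps(Mask.P) != 0; }
static int AllTrue(f32_4x Mask) { return _mm_movemask_ps(Mask.P) == 0xF; }

/* SIMD form of "if (tNew < tRay) tRay = tNew;" applied to four rays at once. */
static f32_4x KeepNearest(f32_4x tRay, f32_4x tNew) {
    f32_4x Mask = LessThan(tNew, tRay);
    return Select(tRay, Mask, tNew);
}
```

With tRay lanes of (5, 1, 3, 9) and tNew lanes of (2, 4, 3, 7), KeepNearest() keeps the smaller value in each lane without branching. On CPUs with SSE4.1, a single blend instruction such as _mm_blendv_ps can do the same selection, which is the family of instructions the PBLENDVB entry at 2:19:38 refers to.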