Handmade Hero Episode Guide
Converting Math Operations to SIMD

1:23 Recap yesterday's work
2:46 build.bat: Switch to -O2
4:22 Think about doing the TestPixel TIMED_BLOCK over a wider range
5:20 handmade_render_group.cpp: Move the timer around the for loops
5:50 Debugger: See that there are two loops that are more or less the same
6:26 handmade_platform.h: Number these DebugCycleCounters
6:49 handmade_render_group.cpp: Rename TestPixel to ProcessPixel and remove TIMED_BLOCK around DrawRectangleSlowly
7:35 Debugger: Look at the DEBUG CYCLE COUNTS
8:12 handmade_render_group.cpp: Introduce END_TIMED_BLOCK_COUNTED
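
(A sketch of the counted timed-block idea, with names modelled on the stream's debug counters; the exact definitions in handmade_platform.h may differ. The point is that the counted variant credits the block with the number of pixels processed instead of one hit per entry, so cycles/hit reads as cycles per pixel.)

    #include <intrin.h>                      // __rdtsc
    typedef unsigned long long uint64;       // stands in for the project's typedef

    #define BEGIN_TIMED_BLOCK(ID) uint64 StartCycleCount##ID = __rdtsc();

    // One hit per block entry
    #define END_TIMED_BLOCK(ID) \
        DebugGlobalMemory->Counters[DebugCycleCounter_##ID].CycleCount += __rdtsc() - StartCycleCount##ID; \
        ++DebugGlobalMemory->Counters[DebugCycleCounter_##ID].HitCount;

    // Counted variant: credit the block with Count hits (e.g. pixels processed)
    #define END_TIMED_BLOCK_COUNTED(ID, Count) \
        DebugGlobalMemory->Counters[DebugCycleCounter_##ID].CycleCount += __rdtsc() - StartCycleCount##ID; \
        DebugGlobalMemory->Counters[DebugCycleCounter_##ID].HitCount += (Count);
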
9:36 Debugger: See that the ProcessPixel count is now more accurate [243cy/h]
10:34 handmade_render_group.cpp: Write this in SIMD
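
(The general conversion pattern, sketched with placeholder names rather than the stream's actual variables: take the per-pixel scalar expression and run the same math on four floats per instruction.)

    #include <xmmintrin.h>

    // Scalar, per pixel:  Blended = (1 - t)*A + t*B
    // 4-wide: one channel of four pixels per instruction
    static __m128
    LerpWide(__m128 A_4x, __m128 B_4x, float t)
    {
        __m128 t_4x = _mm_set1_ps(t);
        __m128 InvT = _mm_sub_ps(_mm_set1_ps(1.0f), t_4x);
        return _mm_add_ps(_mm_mul_ps(InvT, A_4x),
                          _mm_mul_ps(t_4x, B_4x));
    }
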
16:35 Run and see that it's still producing the correct result
16:47 build.bat: Switch to -Od
17:27 Debugger: Inspect TexelAr
21:28 handmade_render_group.cpp: Continue transforming these Texel computations into SIMD
29:21 Run and note that we're running just fine [575cy/h]
29:46 handmade_render_group.cpp: Continue making these wide
37:14 Compile and see if we made any mistakes [557cy/h]
37:31 handmade_render_group.cpp: Do the rest of this wide, except for the Clamp
40:39 Intel Intrinsics Guide: _mm_sqrt_ps
41:11 handmade_render_group.cpp: Do _mm_sqrt_ps and continue converting to SIMD
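
(_mm_sqrt_ps takes one square root per lane; a tiny usage sketch:)

    #include <xmmintrin.h>

    // Four square roots in one instruction, replacing per-channel scalar square roots
    __m128 Values = _mm_setr_ps(0.25f, 1.0f, 4.0f, 9.0f);
    __m128 Roots  = _mm_sqrt_ps(Values);    // {0.5, 1.0, 2.0, 3.0}
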
43:39 Run and note that we are blitting correctly [427cy/h]
43:54 Debugger: Look at what Clamp01 does
47:17 Intel Intrinsics Guide: _mm_min_ps and _mm_max_ps
48:45 handmade_render_group.cpp: Do the Clamps wide [179cy/h]
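
(Clamp01 done wide is just a max against zero and a min against one, each a single instruction across all four lanes:)

    #include <xmmintrin.h>

    // Clamp every lane to [0, 1]
    static __m128
    Clamp01Wide(__m128 Value)
    {
        Value = _mm_max_ps(Value, _mm_set1_ps(0.0f));
        Value = _mm_min_ps(Value, _mm_set1_ps(1.0f));
        return Value;
    }
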
50:02 Run and note that the game is already running faster
50:47 Reflect on the straightforwardness of this work
51:54 Consider what's left to convert to SIMD
52:46 handmade_render_group.cpp: Do PixelP wide
54:16 Run and note how fast it's running [124cy/h]
56:18 Debugger: Investigate what the compiler is doing with those 50 cycles
1:02:54 handmade_render_group.cpp: Finish doing the SIMD here
1:07:32 Run and note that we're creeping forwards [121cy/h]
1:08:06 Recap and glimpse into the future of doing the Loads and Repack in SIMD
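
(The load/repack direction hinted at here, sketched with SSE2 integer intrinsics; this is the general idea, not the code that eventually lands on stream: fetch four packed 32-bit texels in one load, then peel the channels apart with masks, shifts, and integer-to-float conversion. PixelPtr is a placeholder name.)

    #include <emmintrin.h>

    // Load four packed BGRA texels at once and unpack two of the channels, 4-wide
    static void
    UnpackFourTexels(unsigned int *PixelPtr, __m128 *Blue, __m128 *Green)
    {
        __m128i Pixels = _mm_loadu_si128((__m128i *)PixelPtr);   // 4 x 32-bit texels
        __m128i MaskFF = _mm_set1_epi32(0xFF);
        *Blue  = _mm_cvtepi32_ps(_mm_and_si128(Pixels, MaskFF));
        *Green = _mm_cvtepi32_ps(_mm_and_si128(_mm_srli_epi32(Pixels, 8), MaskFF));
    }
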
1:11:08 🗩 Q&A
1:11:32 🗪 kknewkles How do you cover multiple CPU technologies intrinsic-wise? Preprocessor switches on dedicated intrinsics for each? Also, whom to read on ASM? I'm thinking Mike Abrash?
1:13:09 🗪 houb_ We have come from 385 cycles to 123. Does something like the 80%-20% rule apply? Do you think we will get down to 50 cycles?
1:15:22 🗪 maexono The way we use mmSquare, does it calculate the argument twice?
1:15:41 Debugger: Determine if the compiler is doing common subexpression elimination for these multiplies
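
(For context, mmSquare is a small macro along these lines; pasting an expression argument twice is what prompts the common-subexpression question.)

    #include <xmmintrin.h>

    #define mmSquare(a) _mm_mul_ps(a, a)

    // Passing an expression pastes it twice:
    //     mmSquare(_mm_sub_ps(X, Y))
    // expands to
    //     _mm_mul_ps(_mm_sub_ps(X, Y), _mm_sub_ps(X, Y))
    // With optimizations on, the compiler is generally able to fold the duplicate
    // subtraction (common subexpression elimination); storing the subexpression
    // in a local first avoids relying on that.
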
1:21:11 Deep, concentrated investigation
1:25:54 Look at how fast the game's running
1:26:19 🗪 cvaucher Where do OpenCL and other GPGPU frameworks fit into optimization? It seems like if something is SIMD-able, it could just be done wider on a GPU. Are there workloads that are better suited to the CPU and SIMD?
1:29:06 🗪 garlandobloom We have optimizations still on?
1:29:19 🗪 gasto5 Why are there optimizing options in the compiler if one will end up typing SIMD functions?
1:31:01 🗪 quylthulg Do you know of the _mm_setr_ps intrinsic (and _pd etc.) - note the r in setr? It loads the values in reverse order, i.e. in the order that is more intuitive
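
(The difference in a nutshell: _mm_set_ps lists lanes from high to low, _mm_setr_ps lists them from low to high, so the arguments read in lane order.)

    #include <xmmintrin.h>

    // Both produce lane0 = 0, lane1 = 1, lane2 = 2, lane3 = 3
    __m128 A = _mm_set_ps (3.0f, 2.0f, 1.0f, 0.0f);   // arguments high lane first
    __m128 B = _mm_setr_ps(0.0f, 1.0f, 2.0f, 3.0f);   // "r": arguments in lane order
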
1:31:38 🗪 garlandobloom When do you think we will thread the renderer?
1:31:57 🗪 goodoldmalk Possibly misguided question, is there a way to overload operators to use SIMD instructions instead?
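
(It is possible in C++; a hypothetical wrapper type, not code from the stream:)

    #include <xmmintrin.h>

    // A thin wrapper lets the usual operators forward to SSE intrinsics
    struct f32_4x { __m128 V; };

    inline f32_4x operator+(f32_4x A, f32_4x B) { return {_mm_add_ps(A.V, B.V)}; }
    inline f32_4x operator*(f32_4x A, f32_4x B) { return {_mm_mul_ps(A.V, B.V)}; }
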
1:32:45 🗪 digitaldomovoi Is padding and alignment still something you have to concern yourself with? I remember doing SIMD in the mid 2000s, and SIMD was essentially worthless (much of the time) if your data wasn't aligned
1:33:43 🗪 digitaldomovoi Addendum: By "concern yourself", I mean, is it something the compiler now handles more autonomously when you "engage" SIMD
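
(The part that still shows up in source is the load/store choice: _mm_load_ps requires a 16-byte-aligned pointer, _mm_loadu_ps works on any address, historically at a penalty that has shrunk on newer CPUs. A sketch:)

    #include <xmmintrin.h>

    // Aligned load: the pointer must be 16-byte aligned
    static alignas(16) float Aligned[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    static __m128 LoadAligned(void)   { return _mm_load_ps(Aligned); }

    // Unaligned load: works anywhere (historically slower)
    static float Unaligned[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    static __m128 LoadUnaligned(void) { return _mm_loadu_ps(Unaligned + 1); }
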
1:34:15 🗪 kil4h Will you generate asm for NEON (if you port to ARM, of course)? GCC seems to be pretty bad at generating correct code with intrinsics (from my experience on Android)
1:35:03 🗪 culver_fly How would you know if doing something will speed up the code? Especially when it's a fairly large change to the codebase and when time is limited, I find myself reluctant to perform such optimizations in fear of introducing bugs
1:36:46 🗪 miblo What do you think you'll next want to convert to SIMD, in case I want to practise over the weekend?
1:38:52 🗪 flaturated Can you compile it -Od and show how SIMD has helped there?
1:39:32 🗪 kknewkles Would it be a good exercise (albeit a large one) to study a simple CPU and write some software for it? Arduino or something ancient? I wanted to learn coding for GBA for a while
1:41:04 🗪 kknewkles Let's rephrase: what CPU would you advise to study that would be simple enough yet representative enough of the general stuff you should know about when working with CPUs?
1:42:52 🗪 theitchyninja How long have you been working on this and when do you think you will finish?
1:43:29 🗪 gasto5 Are you going to optimize gameplay code as well?
1:43:45 🗪 houb_ Have you heard of the JayStation2 Project from Jaymin Kessler, working with the Raspberry Pi 2 B+?
1:44:03 🗩 Close things down with a recap of the week's optimisation work
1:48:03 🗩 Shout out to the mods