Wide Unpacking and Masking

0:00:25 Overview of optimization work
0:01:30 Recap where we were yesterday
0:01:50 Current issue: Black bars
0:03:20 Blackboard: Writing correct values to destination
0:05:35 It's ok to do all operations for all pixels
0:06:52 Blackboard: Another option: Combine old/new values
0:08:14 Blackboard: Build a mask
0:09:00 Masking out the invalid new values
0:10:50 Making sure we save the original destination
0:11:38 Haven't SIMD-ized the load yet, deal with OriginalDest differently
0:12:55 Problem with WriteMask: Haven't computed it yet!
0:14:00 Use cheesy set macros to set WriteMask
0:14:16 Handmade Hero: A Bit Garish edition
0:15:20 Fixing the 'problem': Mi macro for uint setting
0:16:00 Another thing: Fabian's rounding mode comment
0:16:57 Some work to do with the last for(I) loop
0:19:34 The explicit version of unrolling the loop
0:22:00 Checking we're still working: under 100 cycles now
0:23:10 Doing the destination the same way
0:23:50 Just saved more cycles moving things out
0:24:35 Fixing the WriteMask nonsense
0:25:38 SSE Comparison Operations
0:26:20 Blackboard: Comparisons for wide operations
0:29:43 Using comparisons to generate WriteMask directly
0:31:50 Working WriteMask with wide operations
0:32:10 Problem: can't get rid of if entirely...
0:32:40 Solution: Clamp U and V
0:33:40 Get rid of the if entirely!
0:33:54 Handmade Hero: Uniformly Stretchy Edition
0:34:05 Fixing the bug: U/V copypasta typo
0:35:05 Doing the texel fetch wide as well
0:37:30 Not optimizing yet, just translating to SIMD
0:39:45 Adjusting the texture fetch to use the wide values
0:40:30 Converting the fetch coord by truncating
0:42:00 Getting fX and fY by subtraction
0:43:30 All correct, under 70 cycles
0:44:10 No longer need to initialize the Texel values
0:46:00 Everything in SIMD now but texel loads
0:46:50 Blackboard: Unpacking the color data
0:48:30 Pulling out colors using masks and shifting
0:53:20 Blackboard: The matrix of sample reads
0:55:00 Packing the sample data into 4-wide registers
0:55:48 Some crazy emacs macro kung-fu
0:56:50 Doing the Texels the same way as Dest
0:58:05 Working texel read, and...almost 50cy/pixel
0:59:25 What if there's nothing in the mask?
1:01:19 Q&A
1:02:03 grumpygiant256 Could you not just align the X coord to a 4-pixel boundary up front, and thereby use aligned loads and stores?
1:03:03 garlandobloom Are you pulling this code over into ground splats soon?
1:05:15 ostrovskivlad Is it me or after this whole SIMD conversion the cycles per pixel are much more consistent?
1:05:44 ifingerbangedurcat I have kind of missed the past few days, I'm wondering if doing CPU intrinsics exclusively for SSE2 in your game code is bad or are we targetting SSE2? For example, should we wrap everything into platform-specific files so its easier to target other platforms?
1:08:35 flyingsand What does it mean for intrinsics that don't have a specified throughput?
1:08:51 kelimion Instead of loading the destination first would it be faster to skip that and instead do a masked write e.g. _mm_maskmoveu_si128
1:11:56 tobeypeters Would it be a good idea to just use SIMD for all our math operations in all our programs?
1:15:36 flyingsand Example of an intrinsic with no throughput: _mm_cmpgt_ps
1:21:00 grumpygiant Agner Fog says the throughput is 1
1:22:16 mrstone56 [What is latency vs throughput?]
1:22:46 themarsala What is the end goal of the optimization, trying to get below a certain threshold, or just to get everything converted?
1:23:54 tobeypeters Does size of variables and stuff matter to SIMD, like 32bit vs 64bit?
1:25:45 hellotanjent Is the SSE code doing any cache prefetch or hinting stuff yet?
1:27:12 allaizn Couldn't we use a half-float instead of floats as we don't need that much precision with only 255 discrete values?
1:28:50 ttbjm Is the normal map code going to be converted to SIMD?
1:29:27 End of the stream

Wide Unpacking and Masking

Masking the write:

In SIMD, doing operations "4-wide" means that a single wide (packed) operation works on four pixels at once. The arithmetic is therefore the same whether one, two, three or all four of those pixels are actually wanted; the only place the difference matters is in reading from and writing to memory.

The way to make sure we only write the pixels we are meaningfully operating on is to mask out the ones we aren't. Instead of doing a conditional check on every loop iteration, we build a mask whose lanes are all 1 bits where we'll keep the pixel and all 0 bits where we'll throw it out. If we're operating on four pixels at once and two of them hang off the edge, the mask might look like:

[0x00000000,0x00000000,0xFFFFFFFF,0xFFFFFFFF]
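One way such a mask could be produced without a per-pixel branch is with a wide comparison, which yields all 1s in each lane where the test passes and all 0s where it fails. The sketch below is an illustration only, not the code from the stream: the helper name `build_write_mask` and the parameters `x` and `max_x` are hypothetical, and it assumes the four lanes hold consecutive X coordinates with `max_x` as an exclusive right edge. Which lanes come out as zero depends on which edge the group of four overhangs.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: lane i is 0xFFFFFFFF when pixel (x + i) lies
   left of max_x, and 0x00000000 when it hangs off the right edge. */
static __m128i build_write_mask(int x, int max_x)
{
    /* X coordinate of each of the four pixels: x, x+1, x+2, x+3 */
    __m128i lane_x = _mm_add_epi32(_mm_set1_epi32(x),
                                   _mm_setr_epi32(0, 1, 2, 3));

    /* Signed 32-bit compare, one result per lane: all 1s if lane_x < max_x */
    return _mm_cmplt_epi32(lane_x, _mm_set1_epi32(max_x));
}
```

With `x = 6` and `max_x = 8`, pixels 6 and 7 are kept and pixels 8 and 9 are masked out.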

By doing a bitwise AND of this mask with the pixel data we generate, we mask out the invalid values: the 0s in the mask knock out any bits set in our data, while the 1s leave the values we want to keep untouched.
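To see what the wide AND does to each lane, here is a scalar sketch that emulates it one 32-bit lane at a time. The function name `mask_lanes` and the array layout are assumptions made for illustration; a real 4-wide version would do all four ANDs in a single instruction.

```c
#include <stdint.h>

/* Hypothetical helper: emulates one 4-wide bitwise AND, lane by lane.
   A lane whose mask is 0x00000000 becomes 0; a lane whose mask is
   0xFFFFFFFF passes through unchanged. */
static void mask_lanes(const uint32_t pixels[4], const uint32_t mask[4],
                       uint32_t out[4])
{
    for (int i = 0; i < 4; ++i) {
        out[i] = pixels[i] & mask[i];
    }
}
```

Applying the example mask to four arbitrary pixel values zeroes the first two lanes and leaves the last two exactly as they were.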

We still need to preserve the destination as it was. The easiest way to do that is to remember what the destination looked like beforehand and use those values wherever we knocked out values in our data. So we generate an inverted mask, which for the example above would look like:

[0xFFFFFFFF,0xFFFFFFFF,0x00000000,0x00000000]

Using the same AND technique, we can pull out the destination values that should remain unchanged. We then combine them with the valid pixel values we kept using the first mask, by means of a bitwise OR. In every lane one of the two sets of values has been zeroed, so the OR effectively copies the other value through with no interference.
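Put together, the load, the two ANDs and the OR might be sketched with SSE2 intrinsics roughly as follows. This is an illustration under assumptions, not the stream's code: the function name `masked_store4` and its parameters are hypothetical, and `_mm_andnot_si128` is used here to apply the inverted mask in one step instead of building that mask separately.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: writes four pixels to dest, but only in the lanes
   where write_mask is all 1s; the other lanes keep their old contents. */
static void masked_store4(uint32_t *dest, __m128i new_pixels,
                          __m128i write_mask)
{
    /* Remember what the destination looked like before (unaligned load) */
    __m128i original_dest = _mm_loadu_si128((const __m128i *)dest);

    /* Keep the new values only where the mask is set */
    __m128i kept_new = _mm_and_si128(write_mask, new_pixels);

    /* _mm_andnot_si128(a, b) computes (~a) & b, i.e. the inverted mask
       applied to the original destination */
    __m128i kept_old = _mm_andnot_si128(write_mask, original_dest);

    /* Each lane is zero in one of the two operands, so OR merges them */
    _mm_storeu_si128((__m128i *)dest, _mm_or_si128(kept_new, kept_old));
}
```

Note that this still reads and writes all 16 bytes at `dest`, including the masked-out lanes, so that memory has to be valid to access even where the old values are simply written back.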