Wide Unpacking and Masking

0:00:25 Overview of optimization work

0:01:30 Recap where we were yesterday

0:01:50 Current issue: Black bars

0:03:20 Blackboard: Writing correct values to destination

0:05:35 It's ok to do all operations for all pixels

0:06:52 Blackboard: Another option: Combine old/new values

0:08:14 Blackboard: Build a mask

0:09:00 Masking out the invalid new values
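
The mask idea from the blackboard segments above can be sketched roughly as follows. This is a minimal illustration, not the code written on stream: the function name blend_with_mask is hypothetical, and it assumes each 32-bit lane of the mask is either all ones (keep the new pixel) or all zeros (keep the original destination pixel).

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: per 32-bit lane, pick new_px where write_mask is
   all ones and old_px where write_mask is all zeros. */
static __m128i blend_with_mask(__m128i write_mask, __m128i new_px, __m128i old_px)
{
    /* (mask & new) | (~mask & old); _mm_andnot_si128(a, b) computes (~a) & b */
    return _mm_or_si128(_mm_and_si128(write_mask, new_px),
                        _mm_andnot_si128(write_mask, old_px));
}
```

Because the blend is pure bitwise logic, all four pixels can be computed unconditionally and the mask decides afterwards which results reach the destination.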

0:10:50 Making sure we save the original destination

0:11:38 Haven't SIMD-ized the load yet, deal with OriginalDest differently

0:12:55 Problem with WriteMask: Haven't computed it yet!

0:14:00 Use cheesy set macros to set WriteMask

0:14:16 Handmade Hero: A Bit Garish edition

0:15:20 Fixing the 'problem': Mi macro for uint setting

0:16:00 Another thing: Fabian's rounding mode comment

0:16:57 Some work to do with the last for(I) loop

0:19:34 The explicit version of unrolling the loop

0:22:00 Checking we're still working: under 100 cycles now

0:23:10 Doing the destination the same way

0:23:50 Just saved more cycles moving things out

0:24:35 Fixing the WriteMask nonsense

0:25:38 SSE Comparison Operations

0:26:20 Blackboard: Comparisons for wide operations

0:29:43 Using comparisons to generate WriteMask directly

0:31:50 Working WriteMask with wide operations
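
As a rough sketch of how SSE comparisons can produce a write mask directly: each comparison intrinsic yields all ones in a lane where the test passes and all zeros where it fails, so the results can be ANDed together. The function name uv_write_mask and the exact bounds test (U and V both in [0, 1]) are assumptions for illustration, not necessarily what the stream's code does.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE float ops) */

/* Hypothetical helper: a lane is all ones where 0 <= u <= 1 and 0 <= v <= 1,
   and all zeros otherwise. */
static __m128i uv_write_mask(__m128 u, __m128 v)
{
    __m128 zero = _mm_set1_ps(0.0f);
    __m128 one  = _mm_set1_ps(1.0f);
    __m128 in_u = _mm_and_ps(_mm_cmpge_ps(u, zero), _mm_cmple_ps(u, one));
    __m128 in_v = _mm_and_ps(_mm_cmpge_ps(v, zero), _mm_cmple_ps(v, one));
    /* Reinterpret the float-typed mask as integer lanes; no bits change. */
    return _mm_castps_si128(_mm_and_ps(in_u, in_v));
}
```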

0:32:10 Problem: can't get rid of if entirely...

0:32:40 Solution: Clamp U and V

0:33:40 Get rid of the if entirely!
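
One way to read the clamp step: if U and V are forced into [0, 1] before they are turned into texture coordinates, every lane fetches from inside the texture even when its result will later be masked out, so no branch is needed to guard the fetch. A minimal sketch, with the hypothetical name clamp01:

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE float ops) */

/* Hypothetical helper: clamp each of the four lanes to the range [0, 1]. */
static __m128 clamp01(__m128 x)
{
    return _mm_min_ps(_mm_max_ps(x, _mm_set1_ps(0.0f)), _mm_set1_ps(1.0f));
}
```

The write mask would still need to be computed from the unclamped values, since after clamping every lane looks valid.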

0:33:54 Handmade Hero: Uniformly Stretchy Edition

0:34:05 Fixing the bug: U/V copypasta typo

0:35:05 Doing the texel fetch wide as well

0:37:30 Not optimizing yet, just translating to SIMD

0:39:45 Adjusting the texture fetch to use the wide values

0:40:30 Converting the fetch coord by truncating

0:42:00 Getting fX and fY by subtraction
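
The truncate-then-subtract step can be sketched like this: convert the float coordinate to an integer with truncation to get the texel index, convert that integer back to float, and subtract it from the original to leave the fractional part used for bilinear blending. The name split_coord is hypothetical, and the sketch assumes non-negative coordinates (for example, after clamping).

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: split four texture coordinates into integer texel
   indices and fractional blend weights. Assumes t >= 0 in every lane. */
static void split_coord(__m128 t, __m128i *whole, __m128 *frac)
{
    *whole = _mm_cvttps_epi32(t);                    /* truncate toward zero */
    *frac  = _mm_sub_ps(t, _mm_cvtepi32_ps(*whole)); /* t - floor(t) for t >= 0 */
}
```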

0:43:30 All correct, under 70 cycles

0:44:10 No longer need to initialize the Texel values

0:46:00 Everything in SIMD now but texel loads

0:46:50 Blackboard: Unpacking the color data

0:48:30 Pulling out colors using masks and shifting
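
A sketch of unpacking with masks and shifts: with four packed 32-bit pixels in one register, each channel can be isolated by shifting it down to the low byte and masking with 0xFF, then converted to float for the blend math. The struct wide_color, the function name unpack_argb, and the 0xAARRGGBB channel layout are assumptions made for this illustration.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical container: one float register per channel, four pixels wide. */
typedef struct { __m128 r, g, b, a; } wide_color;

/* Hypothetical helper: unpack four pixels, assumed to be laid out as
   0xAARRGGBB, into per-channel floats in the range 0..255. */
static wide_color unpack_argb(__m128i px)
{
    __m128i mask_ff = _mm_set1_epi32(0xFF);
    wide_color c;
    c.r = _mm_cvtepi32_ps(_mm_and_si128(_mm_srli_epi32(px, 16), mask_ff));
    c.g = _mm_cvtepi32_ps(_mm_and_si128(_mm_srli_epi32(px, 8), mask_ff));
    c.b = _mm_cvtepi32_ps(_mm_and_si128(px, mask_ff));
    c.a = _mm_cvtepi32_ps(_mm_and_si128(_mm_srli_epi32(px, 24), mask_ff));
    return c;
}
```

The mask on the alpha channel is redundant after a logical shift by 24, but keeping all four lines the same shape makes the pattern easier to check.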

0:53:20 Blackboard: The matrix of sample reads

0:55:00 Packing the sample data into 4-wide registers

0:55:48 Some crazy emacs macro kung-fu

0:56:50 Doing the Texels the same way as Dest

0:58:05 Working texel read, and...almost 50cy/pixel

0:59:25 What if there's nothing in the mask?
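
One possible answer to the empty-mask question, offered only as a sketch: collapse the mask to a scalar and skip the rest of the work for that group of four pixels when no lane is set. The helper name any_lane_set is hypothetical, and whether the early-out pays for itself depends on how often whole groups fall outside the shape.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: returns nonzero if any lane of the write mask is set.
   _mm_movemask_epi8 gathers the top bit of each of the 16 bytes into an int. */
static int any_lane_set(__m128i write_mask)
{
    return _mm_movemask_epi8(write_mask) != 0;
}
```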

1:01:19 Q&A

1:02:03 grumpygiant256: Could you not just align the X coord to a 4-pixel boundary up front, and thereby use aligned loads and stores?

1:03:03 garlandobloom: Are you pulling this code over into ground splats soon?

1:05:15 ostrovskivlad: Is it me or after this whole SIMD conversion the cycles per pixel are much more consistent?

1:05:44 ifingerbangedurcat: I have kind of missed the past few days. I'm wondering if doing CPU intrinsics exclusively for SSE2 in your game code is bad, or are we targeting SSE2? For example, should we wrap everything into platform-specific files so it's easier to target other platforms?

1:08:35 flyingsand: What does it mean for intrinsics that don't have a specified throughput?

1:08:51 kelimion: Instead of loading the destination first, would it be faster to skip that and instead do a masked write, e.g. _mm_maskmoveu_si128?

1:11:56 tobeypeters: Would it be a good idea to just use SIMD for all our math operations in all our programs?

1:15:36 flyingsand: Example of an intrinsic with no throughput: _mm_cmpgt_ps

1:21:00 grumpygiant: Agner Fog says the throughput is 1

1:22:16 mrstone56: [What is latency vs throughput?]

1:22:46 themarsala: What is the end goal of the optimization, trying to get below a certain threshold, or just to get everything converted?

1:23:54 tobeypeters: Does the size of variables and stuff matter to SIMD, like 32-bit vs 64-bit?

1:25:45 hellotanjent: Is the SSE code doing any cache prefetch or hinting stuff yet?

1:27:12 allaizn: Couldn't we use a half-float instead of floats as we don't need that much precision with only 255 discrete values?

1:28:50 ttbjm: Is the normal map code going to be converted to SIMD?

1:29:27 End of the stream
