I've got a simple question, but I'm not able to figure it out in a reasonable amount of time.
At the end of this video Casey fixes the issue that occurs when the rendered texture's width % 4 != 0. I've rewatched it and tried it, but how can this possibly work? :P
So, as far as I understand this is what's happening:
The mask is set to all 1s.
If (width & 3) > 0, this value is the number of pixels we have to adjust for, so we increment minX by this adjustment and shift the mask left by this value. Now when we go through the loop we always have 4 pixels to fill, so we don't miss any pixels at the right edge.
I get this so far, but by shifting to the left, instead of masking the first pixels, I would think this creates a discontinuity, for example 0xFFFF0000? (And the next pixels should all be filled.) Moreover, how do the first pixels that are discarded by increasing minX get filled? Aren't the wrong pixels masked in the first loop iteration? I mean, obviously not, but why not?
Edit: The minX is not incremented, but decremented. My first mistake. Now it does make sense, but it's still not working for me though.
Never mind, I think I understand it all. I should just go to sleep and debug it tomorrow.
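For anyone following along, here is a minimal scalar sketch of the adjustment as corrected in the edit above. It is not the code from the video: the helper name align_fill_start and the lane_mask array are made up for illustration, and it assumes the adjustment is the number of pixels needed to round the width up to a multiple of 4, with minX moved left by that amount.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the video): widen [*min_x, max_x) to the
   left so its width is a multiple of 4, and fill lane_mask[0..3] for the
   first group of 4 pixels. A lane of 0 means "do not write this pixel". */
static void align_fill_start(int *min_x, int max_x, uint32_t lane_mask[4])
{
    int width = max_x - *min_x;
    assert(width >= 0);

    int remainder = width & 3;                       /* same as width % 4 here */
    int adjustment = remainder ? 4 - remainder : 0;  /* padding pixels needed */

    for (int i = 0; i < 4; ++i) {
        /* The first `adjustment` pixels in memory are padding: mask them off. */
        lane_mask[i] = (i < adjustment) ? 0u : 0xFFFFFFFFu;
    }

    *min_x -= adjustment;                            /* decrement, not increment */
}
```

With minX = 10 and maxX = 16 the width is 6, so two padding pixels are added on the left: minX becomes 8 and the first two lanes of the mask are zero.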
You're just making the same mistake I did at first :) It's the register order vs. memory order problem, as always. Remember in _register order_ it might look like 0xFFFF0000, but in _memory order_ that's 00 00 FF FF, which is the way it comes out on the screen, meaning the _first two bytes_ are masked, not the _last two_.
So since the startup code wants to mask the _initial_ 0-2 pixels, it's a shift left, because it's the _low order bytes_, which are the first ones in memory order, that we want to _skip_, which means the mask should be zero there.
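The register order vs. memory order point is easy to check with a small sketch. The helper name bytes_in_memory_order is made up for illustration, and the result described below assumes a little-endian machine (which x86 is).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy the bytes of a 32-bit value into out[0..3]
   exactly as they sit in memory, lowest address first. */
static void bytes_in_memory_order(uint32_t value, unsigned char out[4])
{
    assert(out != NULL);
    memcpy(out, &value, sizeof value);
}
```

Passing 0xFFFF0000 (that is, 0xFFFFFFFF shifted left by 16 bits) gives the bytes 00 00 FF FF on a little-endian machine, so the left shift zeroes the bytes that come first in memory.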
Wow, that makes sense! I hadn't thought about memory order vs. register order. Thank you.
I keep discovering unknown unknowns every episode. Like with the "this is stupid" comment above the switch statement and _mm_slli_si128, I thought: "This _is_ stupid, I'm sure this is because it was nearly midnight." So I just wrote _mm_slli_si128(mask, adjustment*4); but it doesn't compile, obviously. So the switch is rather a clever trick, as opposed to what the comment suggests. :)
Yeah, the _mm_slli_si128 instruction is for shifting by an _immediate_, which means it has to be baked into the instruction stream as a known value, and cannot be variable. There are shifts that shift by another register, and we should probably be using those, but I was tired :P
You were not tired - you looked up whether there is a shift by a variable amount for the whole 128-bit register. There wasn't one - only the immediate form was available. Shifts by a variable amount are available only for narrower lanes - the 2x64 or 4x32 types.
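Since the byte count for _mm_slli_si128 has to be a compile-time constant, the usual workaround is the switch discussed above, with one case per possible shift. A minimal sketch, assuming SSE2 and an adjustment of 0 to 3 pixels; the helper name start_clip_mask is made up and this is not necessarily the exact code from the video:

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: mask for the first group of 4 pixels, with the
   lowest `adjustment` 32-bit lanes (the first ones in memory) cleared. */
static __m128i start_clip_mask(int adjustment)
{
    __m128i mask = _mm_set1_epi8(-1); /* all bits set */

    assert(adjustment >= 0 && adjustment <= 3);

    /* Each case bakes a constant byte count into the instruction. */
    switch (adjustment) {
        case 1: mask = _mm_slli_si128(mask, 1 * 4); break;
        case 2: mask = _mm_slli_si128(mask, 2 * 4); break;
        case 3: mask = _mm_slli_si128(mask, 3 * 4); break;
        default: break; /* 0: nothing to mask */
    }
    return mask;
}
```

Storing start_clip_mask(2) to memory with _mm_storeu_si128 gives two zero pixels followed by two all-ones pixels, which matches the memory-order reasoning earlier in the thread.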