Day 127: Aligning OutputBuffer for SIMD, Why are we masking?

Hi there,

I am following Handmade Hero and I found something interesting on Day 127.
Casey aligns the fill rectangle so that MinX and MaxX both land on a four-pixel boundary.

He sets StartClipMask and EndClipMask accordingly to avoid changing the pixels outside the _original_ fill boundaries.
Here is the code that aligns the rectangle and sets the masks:

        // All-ones masks: by default nothing is clipped.
        __m128i StartClipMask = _mm_set1_epi8(-1);
        __m128i EndClipMask = _mm_set1_epi8(-1);

        // StartClipMasks[n] zeroes the first n 32-bit pixels of a 4-wide chunk
        // (shifting the all-ones register left by n*4 bytes clears the low lanes).
        __m128i StartClipMasks[] =
        {
            _mm_slli_si128(StartClipMask, 0*4),
            _mm_slli_si128(StartClipMask, 1*4),
            _mm_slli_si128(StartClipMask, 2*4),
            _mm_slli_si128(StartClipMask, 3*4),
        };

        // EndClipMasks[n] keeps only the first n pixels of a chunk
        // (shifting right by (4-n)*4 bytes clears the high lanes).
        __m128i EndClipMasks[] =
        {
            _mm_srli_si128(EndClipMask, 0*4),
            _mm_srli_si128(EndClipMask, 3*4),
            _mm_srli_si128(EndClipMask, 2*4),
            _mm_srli_si128(EndClipMask, 1*4),
        };

        // Round MinX down to a multiple of 4 and mask off the pixels
        // that were pulled in on the left.
        if(FillRect.MinX & 3)
        {
            StartClipMask = StartClipMasks[FillRect.MinX & 3];
            FillRect.MinX = FillRect.MinX & ~3;
        }

        // Round MaxX up to a multiple of 4 and mask off the pixels
        // that were pulled in on the right.
        if(FillRect.MaxX & 3)
        {
            EndClipMask = EndClipMasks[FillRect.MaxX & 3];
            FillRect.MaxX = (FillRect.MaxX & ~3) + 4;
        }
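
To make the mask selection concrete, here is a worked example (mine, not from the stream). Suppose FillRect.MinX = 6 and FillRect.MaxX = 13:

        // MinX & 3 == 2: MinX is rounded down to 4, and StartClipMasks[2]
        // zeroes the first two pixels of the first chunk (x = 4 and x = 5),
        // so only x = 6 and x = 7 can be written there.
        // MaxX & 3 == 1: MaxX is rounded up to 16, and EndClipMasks[1]
        // keeps only the first pixel of the last chunk (x = 12),
        // zeroing x = 13..15.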

Here is the code that does the masking:

                __m128 U = _mm_add_ps(_mm_mul_ps(PixelPx, nXAxisx_4x), PynX);
                __m128 V = _mm_add_ps(_mm_mul_ps(PixelPx, nYAxisx_4x), PynY);

                __m128i WriteMask = _mm_castps_si128(_mm_and_ps(_mm_and_ps(_mm_cmpge_ps(U, Zero),
                                                                           _mm_cmple_ps(U, One)),
                                                                _mm_and_ps(_mm_cmpge_ps(V, Zero),
                                                                           _mm_cmple_ps(V, One))));
                WriteMask = _mm_and_si128(WriteMask, ClipMask); // HERE
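
For context, farther down the loop the combined mask is what decides, per pixel, whether the newly computed color or the existing buffer contents get stored; roughly like this (a sketch, with Out, OriginalDest, and Pixel standing in for the variables in the actual routine):

                // Keep the computed color where the mask is set,
                // the old destination pixel elsewhere.
                __m128i OriginalDest = _mm_load_si128((__m128i *)Pixel);
                __m128i MaskedOut = _mm_or_si128(_mm_and_si128(WriteMask, Out),
                                                 _mm_andnot_si128(WriteMask, OriginalDest));
                _mm_store_si128((__m128i *)Pixel, MaskedOut);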

The question is: why are we ANDing with ClipMask (the line marked HERE)?
We're already clipping to the texture boundaries with the U,V inclusion test and masking accordingly, right?!

Expanding the fill rectangle only adds pixels that fall outside the texture, and those get masked off automatically by that test.
We have done that before: we used to iterate over every pixel of the output buffer, and that just worked.

To rule out any doubt, I modified the code slightly and added an assert:

                __m128 U = _mm_add_ps(_mm_mul_ps(PixelPx, nXAxisx_4x), PynX);
                __m128 V = _mm_add_ps(_mm_mul_ps(PixelPx, nYAxisx_4x), PynY);

                __m128i WriteMask = _mm_castps_si128(_mm_and_ps(_mm_and_ps(_mm_cmpge_ps(U, Zero),
                                                                           _mm_cmple_ps(U, One)),
                                                                _mm_and_ps(_mm_cmpge_ps(V, Zero),
                                                                           _mm_cmple_ps(V, One))));
                __m128i WriteMask2 = _mm_and_si128(WriteMask, ClipMask);
                Assert(memcmp(&WriteMask, &WriteMask2, sizeof(WriteMask)) == 0);

The assert never fires.
I also checked the source code of later days to see if this changed, but it is still the same.
Am I missing something?
We are doing that because there are two cases when loading/storing memory to/from a SIMD register: the address is 16-byte aligned, or it is not. For aligned addresses you use the _mm_load_si128 and _mm_store_si128 intrinsics; for unaligned addresses you use _mm_loadu_si128 and _mm_storeu_si128.

And we want to prefer aligned addresses, because load and store are faster than loadu and storeu. This changed only relatively recently, with Intel's Sandy Bridge processors, where the performance difference between load and loadu (same with store/storeu) became very small.

Here you can see performance numbers from a Core 2 CPU: http://stackoverflow.com/a/27659526 Unaligned loads are ~2x slower.
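
As a minimal illustration (my sketch, not code from the series; Copy4Pixels is a hypothetical helper), the aligned intrinsics require 16-byte-aligned pointers, while the unaligned ones accept any address:

        #include <emmintrin.h>
        #include <stdint.h>

        static void Copy4Pixels(uint32_t *Dest, uint32_t *Source)
        {
            if((((uintptr_t)Dest | (uintptr_t)Source) & 15) == 0)
            {
                // Both addresses are 16-byte aligned:
                // the fast path on pre-Sandy Bridge CPUs.
                _mm_store_si128((__m128i *)Dest, _mm_load_si128((__m128i *)Source));
            }
            else
            {
                // The unaligned variants work at any address
                // (an aligned load/store on an unaligned address would fault).
                _mm_storeu_si128((__m128i *)Dest, _mm_loadu_si128((__m128i *)Source));
            }
        }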

Edited by Mārtiņš Možeiko
I am not asking why we are aligning; I'm asking why we are masking.
We don't need the additional masking: we are already masking via the rectangle (U,V) inclusion test, which masks off any pixel outside the texture.

We can just do this:
            FillRect.MinX = FillRect.MinX & ~3;

            // NOTE: there is also a bug here; this line should be:
            // FillRect.MaxX = ((FillRect.MaxX + 3) & ~3);
            FillRect.MaxX = (FillRect.MaxX & ~3) + 4;

And then:
                __m128 U = _mm_add_ps(_mm_mul_ps(PixelPx, nXAxisx_4x), PynX);
                __m128 V = _mm_add_ps(_mm_mul_ps(PixelPx, nYAxisx_4x), PynY);

                __m128i WriteMask = _mm_castps_si128(_mm_and_ps(_mm_and_ps(_mm_cmpge_ps(U, Zero),
                                                                           _mm_cmple_ps(U, One)),
                                                                _mm_and_ps(_mm_cmpge_ps(V, Zero),
                                                                           _mm_cmple_ps(V, One))));


Edited by Ameen Sayegh
The masking with ClipMask is there exactly for alignment purposes. No other reason.
But I think you are right - the ClipMask is not needed, because clamping UV to [0,1] should take care of it automatically.

Btw, "FillRect.MaxX = (FillRect.MaxX & ~3) + 4" is not a bug, because it is inside the "if (FillRect.MaxX & 3)" condition - it won't be executed when MaxX % 4 == 0. Of course it is easy to change it so you don't need the if statement.
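
For what it's worth, the mask tables already handle the aligned case (index 0 yields an all-ones mask), so a branchless version could look like this (my sketch, not from the thread):

        // Look up the masks first (index 0 leaves everything unmasked),
        // then round MinX down and MaxX up to a multiple of 4.
        // Both roundings are no-ops when the value is already aligned.
        StartClipMask = StartClipMasks[FillRect.MinX & 3];
        EndClipMask = EndClipMasks[FillRect.MaxX & 3];
        FillRect.MinX = FillRect.MinX & ~3;
        FillRect.MaxX = (FillRect.MaxX + 3) & ~3;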