Horizontal adds in SIMD

twelvefifteen

#20978

April 25, 2019

Hey guys. After finishing the first SIMD section of HmH (~days 115-120) I started SIMDizing some functions in an image processing codebase of mine. While converting a function that computes the average color of an image, I ran into a scenario which led me to use what I now know is called a horizontal add. I'd like to make sure I have the right idea with this technique, and whether it's even necessary at all in this case. Any help is appreciated.

Here's the code:

static f32
HorizontalAdd(__m128 PackedSingle)
{
    f32* PackedSinglePtr = (f32*)&PackedSingle;
    f32 Result = (PackedSinglePtr[0] +
                  PackedSinglePtr[1] +
                  PackedSinglePtr[2] +
                  PackedSinglePtr[3]);
    return(Result);
}

static v4
GetMeanColor(loaded_raster* Raster)
{
    __m128i MaskFF_4x = _mm_set1_epi32(0xFF);
    __m128 Inv255_4x = _mm_set1_ps(1.0f / 255.0f);
    
    __m128 Accumulator = _mm_set1_ps(0.0f);
    u32* SourceDest = (u32*)Raster->Address;
    for(s32 Y = 0;
        Y < Raster->Height;
        Y++)
    {
        for(s32 X = 0;
            X < Raster->Width;
            X += 4)
        {
            __m128i C = _mm_loadu_si128((__m128i*)SourceDest);
            
            __m128 Texelb = _mm_cvtepi32_ps(_mm_and_si128(C, MaskFF_4x));
            __m128 Texelg = _mm_cvtepi32_ps(_mm_and_si128(_mm_srli_epi32(C, 8), MaskFF_4x));
            __m128 Texelr = _mm_cvtepi32_ps(_mm_and_si128(_mm_srli_epi32(C, 16), MaskFF_4x));
            __m128 Texela = _mm_cvtepi32_ps(_mm_and_si128(_mm_srli_epi32(C, 24), MaskFF_4x));
            
            Texelb = _mm_mul_ps(Texelb, Inv255_4x);
            Texelg = _mm_mul_ps(Texelg, Inv255_4x);
            Texelr = _mm_mul_ps(Texelr, Inv255_4x);
            Texela = _mm_mul_ps(Texela, Inv255_4x);
            
            Accumulator = _mm_add_ps(Accumulator,
                                     _mm_set_ps(HorizontalAdd(Texela),
                                                HorizontalAdd(Texelb),
                                                HorizontalAdd(Texelg),
                                                HorizontalAdd(Texelr)));
            
            SourceDest += 4;
        }
    }

    __m128 InvPixelCount = _mm_set1_ps(1.0f / (Raster->Width*Raster->Height));
    Accumulator = _mm_mul_ps(Accumulator, InvPixelCount);
    
    v4 Result;
    _mm_storeu_ps((f32*)&Result, Accumulator);
    
    return(Result);
}

Edited by twelvefifteen on April 26, 2019, 3:13pm Reason: Initial post

ratchetfreak

#20979

April 25, 2019

You are halfway there.

You can keep 4 accumulators, one per channel, so you don't need to do a horizontal add until after the loop.

Accumulatora = _mm_add_ps(Accumulatora, Texela);
Accumulatorb = _mm_add_ps(Accumulatorb, Texelb);
Accumulatorr = _mm_add_ps(Accumulatorr, Texelr);
Accumulatorg = _mm_add_ps(Accumulatorg, Texelg);
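
For illustration, here is a minimal sketch of how the whole function might look with four accumulators and a single horizontal add per channel after the loop. Several things here are assumptions and not from the original code: the `mean_color` struct stands in for `v4`, a flat pixel pointer plus width and height stand in for `loaded_raster`, the function name is made up, and the pixels are assumed to be tightly packed 0xAARRGGBB with a total count that is a multiple of 4. The 1/255 scale is also folded into the final multiply, which is a change from the original loop.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical result type; the thread's v4 layout is not shown. */
typedef struct { float r, g, b, a; } mean_color;

/* Sums the four lanes of V. Called once per channel, after the loop. */
static float
HorizontalAdd(__m128 V)
{
    float Lanes[4];
    _mm_storeu_ps(Lanes, V);
    return (Lanes[0] + Lanes[1]) + (Lanes[2] + Lanes[3]);
}

/* Assumes tightly packed 0xAARRGGBB pixels and Width*Height divisible by 4. */
static mean_color
GetMeanColorFourAccumulators(const uint32_t *Pixels, int Width, int Height)
{
    int Count = Width * Height;
    assert(Count > 0 && Count % 4 == 0);

    const __m128i MaskFF_4x = _mm_set1_epi32(0xFF);
    __m128 Accumulatorb = _mm_setzero_ps();
    __m128 Accumulatorg = _mm_setzero_ps();
    __m128 Accumulatorr = _mm_setzero_ps();
    __m128 Accumulatora = _mm_setzero_ps();

    for (int I = 0; I < Count; I += 4)
    {
        __m128i C = _mm_loadu_si128((const __m128i *)(Pixels + I));

        /* Unpack each channel of the 4 pixels into 4 floats in [0, 255]. */
        __m128 Texelb = _mm_cvtepi32_ps(_mm_and_si128(C, MaskFF_4x));
        __m128 Texelg = _mm_cvtepi32_ps(_mm_and_si128(_mm_srli_epi32(C, 8), MaskFF_4x));
        __m128 Texelr = _mm_cvtepi32_ps(_mm_and_si128(_mm_srli_epi32(C, 16), MaskFF_4x));
        __m128 Texela = _mm_cvtepi32_ps(_mm_and_si128(_mm_srli_epi32(C, 24), MaskFF_4x));

        /* Each lane keeps its own partial sum; no cross-lane work in the loop. */
        Accumulatorb = _mm_add_ps(Accumulatorb, Texelb);
        Accumulatorg = _mm_add_ps(Accumulatorg, Texelg);
        Accumulatorr = _mm_add_ps(Accumulatorr, Texelr);
        Accumulatora = _mm_add_ps(Accumulatora, Texela);
    }

    /* One scale covers both the 0..255 -> 0..1 mapping and the pixel count. */
    float Scale = 1.0f / (255.0f * (float)Count);

    mean_color Result;
    Result.r = HorizontalAdd(Accumulatorr) * Scale;
    Result.g = HorizontalAdd(Accumulatorg) * Scale;
    Result.b = HorizontalAdd(Accumulatorb) * Scale;
    Result.a = HorizontalAdd(Accumulatora) * Scale;
    return Result;
}
```

With this shape the horizontal add runs four times in total instead of four times per group of pixels, so how it is implemented matters much less.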

Edited by ratchetfreak on April 25, 2019, 10:12pm

twelvefifteen

#20980

April 25, 2019

Thank you! That makes a lot of sense.

If you don't mind another question: I'm curious about how to restructure SIMD code like this to support images whose widths aren't multiples of four. My gut reaction would be to pad the ends of rows and ignore those extra bytes in the processing code, but I'm interested to see if there are more popular alternatives, perhaps ones that don't require modifying the source image.

ratchetfreak

#20981

April 26, 2019

You mean dealing with the stragglers?

Most often I see a scalar cleanup. You can also read past the end of the row (and into the next row) and then mask off the extra values so the extra memory read doesn't affect the result. This requires that you overallocate a little bit after the last row but doesn't require any padding for the other rows.
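
As a rough sketch of the scalar cleanup option: the SIMD loop runs only while a full group of 4 pixels remains in the row, and a plain scalar loop handles the 0 to 3 stragglers. The names `mean_color`, `SumLanes`, `GetMeanColorAnyWidth` and the `StrideInPixels` parameter are hypothetical, and the pixel layout is assumed to be 0xAARRGGBB as in the original code.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type; the thread's v4 layout is not shown. */
typedef struct { float r, g, b, a; } mean_color;

/* Sums the four lanes of V. */
static float
SumLanes(__m128 V)
{
    float Lanes[4];
    _mm_storeu_ps(Lanes, V);
    return (Lanes[0] + Lanes[1]) + (Lanes[2] + Lanes[3]);
}

/* StrideInPixels is the distance between row starts; it may exceed Width. */
static mean_color
GetMeanColorAnyWidth(const uint32_t *Pixels, int Width, int Height, int StrideInPixels)
{
    assert(Width > 0 && Height > 0 && StrideInPixels >= Width);

    const __m128i MaskFF_4x = _mm_set1_epi32(0xFF);
    __m128 AccB = _mm_setzero_ps();
    __m128 AccG = _mm_setzero_ps();
    __m128 AccR = _mm_setzero_ps();
    __m128 AccA = _mm_setzero_ps();
    float TailB = 0.0f, TailG = 0.0f, TailR = 0.0f, TailA = 0.0f;

    for (int Y = 0; Y < Height; ++Y)
    {
        const uint32_t *Row = Pixels + (size_t)Y * (size_t)StrideInPixels;
        int X = 0;

        /* SIMD body: runs only while a full group of 4 pixels remains. */
        for (; X + 4 <= Width; X += 4)
        {
            __m128i C = _mm_loadu_si128((const __m128i *)(Row + X));
            AccB = _mm_add_ps(AccB, _mm_cvtepi32_ps(_mm_and_si128(C, MaskFF_4x)));
            AccG = _mm_add_ps(AccG, _mm_cvtepi32_ps(_mm_and_si128(_mm_srli_epi32(C, 8), MaskFF_4x)));
            AccR = _mm_add_ps(AccR, _mm_cvtepi32_ps(_mm_and_si128(_mm_srli_epi32(C, 16), MaskFF_4x)));
            AccA = _mm_add_ps(AccA, _mm_cvtepi32_ps(_mm_and_si128(_mm_srli_epi32(C, 24), MaskFF_4x)));
        }

        /* Scalar cleanup: the 0 to 3 stragglers at the end of the row. */
        for (; X < Width; ++X)
        {
            uint32_t C = Row[X];
            TailB += (float)(C & 0xFF);
            TailG += (float)((C >> 8) & 0xFF);
            TailR += (float)((C >> 16) & 0xFF);
            TailA += (float)((C >> 24) & 0xFF);
        }
    }

    /* One scale covers both the 0..255 -> 0..1 mapping and the pixel count. */
    float Scale = 1.0f / (255.0f * (float)Width * (float)Height);

    mean_color Result;
    Result.r = (SumLanes(AccR) + TailR) * Scale;
    Result.g = (SumLanes(AccG) + TailG) * Scale;
    Result.b = (SumLanes(AccB) + TailB) * Scale;
    Result.a = (SumLanes(AccA) + TailA) * Scale;
    return Result;
}
```

Because the SIMD loop never reads past `Row + Width`, this version needs no padding or overallocation, at the cost of a few scalar iterations per row.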

twelvefifteen

#20983

April 26, 2019

Yup, that is what I meant. Scalar cleanup is definitely what I was looking for. I appreciate the help!