"Optimizing" Day 118 with AVX2 gather instruction

As Casey mentioned on stream, the newer AVX2 instruction set, available starting with Intel Haswell CPUs, includes gather instructions that can fetch memory from multiple locations with a single instruction and store all results in one SSE/AVX register. This is exactly what this loop does (the last one, operating with 4 iterations):
for(int I = 0;
    I < 4;
    ++I)
{
	int32 FetchX = Mi(FetchX_4x, I);
	int32 FetchY = Mi(FetchY_4x, I);

	Assert((FetchX >= 0) && (FetchX < Texture->Width));
	Assert((FetchY >= 0) && (FetchY < Texture->Height));

	uint8 *TexelPtr = ((uint8 *)Texture->Memory) + FetchY*Texture->Pitch + FetchX*sizeof(uint32);
	Mi(SampleA, I) = *(uint32 *)(TexelPtr);
	Mi(SampleB, I) = *(uint32 *)(TexelPtr + sizeof(uint32));
	Mi(SampleC, I) = *(uint32 *)(TexelPtr + Texture->Pitch);
	Mi(SampleD, I) = *(uint32 *)(TexelPtr + Texture->Pitch + sizeof(uint32));
}

You can replace this loop with the following code:
__m128i IndexA = _mm_add_epi32(_mm_mullo_epi32(FetchY_4x, Pitch_4x),
                               _mm_slli_epi32(FetchX_4x, 2));
__m128i IndexB = _mm_add_epi32(IndexA, Four_4x);
__m128i IndexC = _mm_add_epi32(IndexA, Pitch_4x);
__m128i IndexD = _mm_add_epi32(IndexC, Four_4x);

SampleA = _mm_i32gather_epi32((int*)Texture->Memory, IndexA, 1);
SampleB = _mm_i32gather_epi32((int*)Texture->Memory, IndexB, 1);
SampleC = _mm_i32gather_epi32((int*)Texture->Memory, IndexC, 1);
SampleD = _mm_i32gather_epi32((int*)Texture->Memory, IndexD, 1);

With this change there is no inner 4-iteration loop at all.
This code uses the AVX2 intrinsic _mm_i32gather_epi32 and the SSE4.1 intrinsic _mm_mullo_epi32. The SSE4.1 dependency could be removed by making the texture pitch always a power of two: then instead of (FetchY_4x * Pitch) you could do (FetchY_4x << bits), and shifting is possible with the SSE2 intrinsic _mm_slli_epi32.

Four_4x and Pitch_4x can be initialized at the beginning of the function, where all the other constants are initialized:
__m128i Pitch_4x = _mm_set1_epi32(Texture->Pitch);
__m128i Four_4x = _mm_set1_epi32(4);


Unfortunately, this new code runs much slower. For me, on an i7-4790K, the old loop code runs at ~42cy/h while the AVX2 gather code runs at ~190cy/h.

This is a well-known problem. Intel themselves have said somewhere that they introduced the gather instructions in AVX2 CPUs mainly so that people can start using them; they intend to make the instruction faster in newer CPUs, and software that already expects it to be available will then run faster. So we need to wait for Broadwell or Skylake; maybe it will change there.

Edited by Mārtiņš Možeiko on
It is AVX512 that has the stuff you actually want, I believe. But don't quote me on that :) AVX2 is definitely not a useful gather, as Won Chun mentioned in the Twitter discussion...

- Casey
I don't see how AVX-512 will help here.
Sure, it provides the _mm512_i32gather_epi32 intrinsic, which does pretty much the same thing but for 16 32-bit elements instead of just 4 (SSE register) or 8 (AVX register); even the arguments are the same. But the same question remains: how efficiently will Intel implement it? They might just as well improve the AVX2 gather to be more efficient at the same time.

Edited by Mārtiņš Možeiko on