As Casey mentioned on stream for newer instruction set AVX2 which is available starting with Intel Haswell CPU's there are gather instructions that can fetch memory from multiple locations with single instruction and store all results in one SSE/AVX register. This is exactly what this loop does (which is last one operating with 4 iterations):
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16 | for(int I = 0;
I < 4;
++I)
{
int32 FetchX = Mi(FetchX_4x, I);
int32 FetchY = Mi(FetchY_4x, I);
Assert((FetchX >= 0) && (FetchX < Texture->Width));
Assert((FetchY >= 0) && (FetchY < Texture->Height));
uint8 *TexelPtr = ((uint8 *)Texture->Memory) + FetchY*Texture->Pitch + FetchX*sizeof(uint32);
Mi(SampleA, I) = *(uint32 *)(TexelPtr);
Mi(SampleB, I) = *(uint32 *)(TexelPtr + sizeof(uint32));
Mi(SampleC, I) = *(uint32 *)(TexelPtr + Texture->Pitch);
Mi(SampleD, I) = *(uint32 *)(TexelPtr + Texture->Pitch + sizeof(uint32));
}
|
You can replace this loop with following code:
| __m128i IndexA = _mm_add_epi32(_mm_mullo_epi32(FetchY_4x, Pitch_4x),
_mm_slli_epi32(FetchX_4x, 2));
__m128i IndexB = _mm_add_epi32(IndexA, Four_4x);
__m128i IndexC = _mm_add_epi32(IndexA, Pitch_4x);
__m128i IndexD = _mm_add_epi32(IndexC, Four_4x);
SampleA = _mm_i32gather_epi32((int*)Texture->Memory, IndexA, 1);
SampleB = _mm_i32gather_epi32((int*)Texture->Memory, IndexB, 1);
SampleC = _mm_i32gather_epi32((int*)Texture->Memory, IndexC, 1);
SampleD = _mm_i32gather_epi32((int*)Texture->Memory, IndexD, 1);
|
Then there will be no inner for loop at all with 4 iterations.
This code uses AVX2 intrinsic _mm_i32gather_epi32 and SSE4.1 intrinsic _mm_mullo_epi32. SSE4.1 could be removed if we would made Texture Pitch always a power of two, then Instead for (FetchY_4x * Pitch) you could do (FetchY_4x << bits) and it is possible to shift with SSE2 intrinsic _mm_slli_epi32.
Four_4x and Pitch_4x can be initialized at the beginning of function where all other constants are initialized:
| __m128i Pitch_4x = _mm_set1_epi32(Texture->Pitch);
__m128i Four_4x = _mm_set1_epi32(4);
|
Unfortunately this new code runs much slower. For me on i7-4790K old loop code runs
~42cy/h. AVX2 gather code runs
~190cy/h.
This
well known problem. Intel themselves said somewhere they are only introducing new gather instructions in AVX2 CPUs so people can start using them. They (Intel) will optimize this instruction in newer CPUs to be faster - and software that expects this instruction to be available will run faster. So we need to wait for Broadwell or Skylake, maybe it will change there.