Approach for faster software sRGB gamma computation

Regarding the latest episode, where Casey thought about calling pow(x, 2.2) for gamma correction, I decided to look for a fast power series approximation. The naive approach of expanding x^a around 1 led to the problem that the result doesn't map 0 to 0. A better way of doing it seems to be writing x^2.2 as x^2 * x^0.2 (5th root of x) and only expand the second term around 1 and multiply by x^2.

The approximation then takes the form: x^2 * (1 + (x - 1) / 5 * (1 - 2 * (x - 1) / 5))
(I wish we had LaTeX for that)

Note: It was of course assumed that the range of the input values is [0,1].

The implementation looks like the following:

// NOTE: calculate pow(Value, 2.2) using a power series approach
f32 ApplyGamma(f32 Value)
{
    f32 Result;
    f32 CommonTerm = 0.2f * (Value - 1.f);
    f32 ValueSqr = Value * Value;
    Result = ValueSqr * (1.f + CommonTerm * (1.f - 2.f * CommonTerm));

    return Result;
}

Generated code (/O2 /Zi, x86):

f32 ApplyGamma(f32 Value)
{
f32 Result;
f32 CommonTerm = 0.2f * (Value - 1.f);
012A1250 fld dword ptr [esp+4]
012A1254 fld st(0)
012A1256 fld1
012A1258 fsub st(1),st
012A125A fxch st(1)
012A125C fmul qword ptr [__real@3fc99999a0000000 (12BFD10h)]
012A1262 fstp dword ptr [esp+4]
f32 ValueSqr = Value * Value;
Result = ValueSqr * (1.f + CommonTerm * (1.f - 2.f * CommonTerm));

return Result;
012A1266 fld dword ptr [esp+4]
012A126A fld st(0)
012A126C fadd st,st(1)
012A126E fsubr st,st(2)
012A1270 fmulp st(1),st
012A1272 faddp st(1),st
012A1274 fld st(1)
012A1276 fmulp st(2),st
012A1278 fxch st(1)
012A127A fstp dword ptr [esp+4]
012A127E fmul dword ptr [esp+4]
012A1282 fstp dword ptr [esp+4]
012A1286 fld dword ptr [esp+4]
}
012A128A ret

I compared this implementation to a call to the CRT implementation (VS2010) of pow. The results on my Intel Core i7 950 @3GHz:

Z:\Tests\FastGamma>gamma_fast
256 values x 1048576 samples = 268435456 calls.
avg cycles per call: 15
avg time per call: 0.005 us

Z:\Tests\FastGamma>gamma_pow
256 values x 1048576 samples = 268435456 calls.
avg cycles per call: 156
avg time per call: 0.051 us

As you can see, the special implementation is about 10x faster.

As for how accurate this approximation actually is, you can have a look at this graph.

This graph shows that, in the [0,255] range, the maximum difference between actual and approximate values is about 0.8, so I think it's fairly accurate. If it's not enough one could always add the next term of the series (at the expense of 2 muls and an add).

Of course, you'd have to convert from sRGB to linear when you're done, but I've not looked into that yet. If you have any suggestions for improvement or other ideas, please let me know!

EDIT: x64 test data
For x64:

Z:\Tests\FastGamma>gamma_fast
256 values x 1048576 samples = 268435456 calls.
avg cycles per call: 9
avg time per call: 0.003 us

Z:\Tests\FastGamma>gamma_pow
256 values x 1048576 samples = 268435456 calls.
avg cycles per call: 97
avg time per call: 0.032 us

Assembly:

f32 ApplyGamma(f32 Value)
{
f32 Result;
f32 CommonTerm = 0.2f * (Value - 1.f);
000000013F4912C0 movss xmm3,dword ptr [__real@3f800000 (13F4AED78h)]
000000013F4912C8 movaps xmm2,xmm0
000000013F4912CB movaps xmm4,xmm0
000000013F4912CE subss xmm2,xmm3
f32 ValueSqr = Value * Value;
Result = ValueSqr * (1.f + CommonTerm * (1.f - 2.f * CommonTerm));
000000013F4912D2 movaps xmm0,xmm3
000000013F4912D5 mulss xmm4,xmm4
000000013F4912D9 mulss xmm2,dword ptr [__real@3e4ccccd (13F4AED74h)]
000000013F4912E1 movaps xmm1,xmm2
000000013F4912E4 mulss xmm1,dword ptr [__real@40000000 (13F4AED70h)]
000000013F4912EC subss xmm0,xmm1
000000013F4912F0 mulss xmm0,xmm2
000000013F4912F4 addss xmm0,xmm3
000000013F4912F8 mulss xmm0,xmm4

return Result;
}
000000013F4912FC ret