The approximation then takes the form: x^2 * (1 + (x - 1) / 5 * (1 - 2 * (x - 1) / 5))

(I wish we had LaTeX for that)

*Note: It was of course assumed that the range of the input values is [0,1].*

The implementation looks like the following:

1 2 3 4 5 6 7 8 9 10 | // NOTE: calculate pow(Value, 2.2) using a power series approach f32 ApplyGamma(f32 Value) { f32 Result; f32 CommonTerm = 0.2f * (Value - 1.f); f32 ValueSqr = Value * Value; Result = ValueSqr * (1.f + CommonTerm * (1.f - 2.f * CommonTerm)); return Result; } |

Generated code (/O2 /Zi, x86):

f32 ApplyGamma(f32 Value)

{

f32 Result;

f32 CommonTerm = 0.2f * (Value - 1.f);

012A1250 fld dword ptr [esp+4]

012A1254 fld st(0)

012A1256 fld1

012A1258 fsub st(1),st

012A125A fxch st(1)

012A125C fmul qword ptr [[email protected] (12BFD10h)]

012A1262 fstp dword ptr [esp+4]

f32 ValueSqr = Value * Value;

Result = ValueSqr * (1.f + CommonTerm * (1.f - 2.f * CommonTerm));

return Result;

012A1266 fld dword ptr [esp+4]

012A126A fld st(0)

012A126C fadd st,st(1)

012A126E fsubr st,st(2)

012A1270 fmulp st(1),st

012A1272 faddp st(1),st

012A1274 fld st(1)

012A1276 fmulp st(2),st

012A1278 fxch st(1)

012A127A fstp dword ptr [esp+4]

012A127E fmul dword ptr [esp+4]

012A1282 fstp dword ptr [esp+4]

012A1286 fld dword ptr [esp+4]

}

012A128A ret

I compared this implementation to a call to the CRT implementation (VS2010) of pow. The results on my Intel Core i7 950 @3GHz:

Z:\Tests\FastGamma>gamma_fast

256 values x 1048576 samples = 268435456 calls.

avg cycles per call: 15

avg time per call: 0.005 us

Z:\Tests\FastGamma>gamma_pow

256 values x 1048576 samples = 268435456 calls.

avg cycles per call: 156

avg time per call: 0.051 us

As you can see, the special implementation is about 10x faster.

As for how accurate this approximation actually is, you can have a look at this graph.

This graph shows that, in the [0,255] range, the maximum difference between actual and approximate values is about 0.8, so I think it's fairly accurate. If it's not enough one could always add the next term of the series (at the expense of 2 muls and an add).

Of course, you'd have to convert from sRGB to linear when you're done, but I've not looked into that yet. If you have any suggestions for improvement or other ideas, please let me know!

EDIT: x64 test data

For x64:

Z:\Tests\FastGamma>gamma_fast

256 values x 1048576 samples = 268435456 calls.

avg cycles per call: 9

avg time per call: 0.003 us

Z:\Tests\FastGamma>gamma_pow

256 values x 1048576 samples = 268435456 calls.

avg cycles per call: 97

avg time per call: 0.032 us

Assembly:

f32 ApplyGamma(f32 Value)

{

f32 Result;

f32 CommonTerm = 0.2f * (Value - 1.f);

000000013F4912C0 movss xmm3,dword ptr [[email protected] (13F4AED78h)]

000000013F4912C8 movaps xmm2,xmm0

000000013F4912CB movaps xmm4,xmm0

000000013F4912CE subss xmm2,xmm3

f32 ValueSqr = Value * Value;

Result = ValueSqr * (1.f + CommonTerm * (1.f - 2.f * CommonTerm));

000000013F4912D2 movaps xmm0,xmm3

000000013F4912D5 mulss xmm4,xmm4

000000013F4912D9 mulss xmm2,dword ptr [[email protected] (13F4AED74h)]

000000013F4912E1 movaps xmm1,xmm2

000000013F4912E4 mulss xmm1,dword ptr [[email protected] (13F4AED70h)]

000000013F4912EC subss xmm0,xmm1

000000013F4912F0 mulss xmm0,xmm2

000000013F4912F4 addss xmm0,xmm3

000000013F4912F8 mulss xmm0,xmm4

return Result;

}

000000013F4912FC ret