Optimisations question

jeringa

#3256

April 8, 2015

Would it pay to optimise the v2, v3 & v4 etc math to run with SSE?
Also, there are several places where you convert from pixel to vector, which may also benefit from a tweak or two.

This may buy you enough time without having to bring in the heavy compiler optimiser club

Or is this not the tree to be barking up?

Mārtiņš Možeiko

#3257

April 8, 2015

If the operations you are performing on vX are done in a loop, then yes, it makes sense to optimize with SSE. But for simple stuff, if it is just one or two individual vX operations, then it doesn't make much sense.

Also, you should check whether the compiler you are using already performs these optimizations. Because Casey said he'll target SSE as the minimum requirement, you can turn this optimization setting on (x86_64/x64 already does so by default) and see if the code is optimized or not.

For example, let's write a simple function "test_function" to a file "test.cpp":

#define HANDMADE_INTERNAL 1
#include "handmade_platform.h"
#include "handmade_intrinsics.h"
#include "handmade_math.h"

v4 test_function(v4 Input)
{
    v4 Result;
    v4 SomeValue = { 1,2,3,4 };
    Result = Input + SomeValue;
    return Result;
}

Then I use GCC (4.9.2) to produce optimized x64 assembly:

g++ -c -S -O3 test.cpp

Then check the output file "test.s" for "test_function":

_Z13test_function2v4:
	subq	$24, %rsp
	movups	(%rdx), %xmm0
	movq	%rcx, %rax
	addps	.LC0(%rip), %xmm0
	movups	%xmm0, (%rcx)
	addq	$24, %rsp
	ret

As you can see, it is using one instruction (the SSE vector add, addps) to add 4 floats. So converting individual vector operations like this one to SSE by hand won't help, because the compiler is already doing that.

But as I said above, converting more complex loops that perform specific operations, like blitting or blending + sRGB stuff, to SSE will definitely help. Automatically optimizing that kind of code is not an easy job for compilers, and you as a developer can do better.
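To give a rough idea of what a hand-written SSE loop looks like, here is a minimal sketch of a linear blend over arrays of floats. It is not code from Handmade Hero: the name BlendRows and the plain float-array layout are assumptions made up for illustration, and a real blit would also have to deal with pixel unpacking and sRGB.

```cpp
#include <xmmintrin.h> // SSE intrinsics

// Hypothetical helper (not from handmade_math.h):
// Dest[i] = Dest[i]*(1 - Alpha) + Source[i]*Alpha for Count floats.
static inline void BlendRows(float *Dest, const float *Source, float Alpha, int Count)
{
    __m128 Alpha_4x = _mm_set1_ps(Alpha);
    __m128 InvAlpha_4x = _mm_set1_ps(1.0f - Alpha);

    int Index = 0;
    // Wide part: 4 floats per iteration, unaligned loads and stores.
    for(; Index + 4 <= Count; Index += 4)
    {
        __m128 D = _mm_loadu_ps(Dest + Index);
        __m128 S = _mm_loadu_ps(Source + Index);
        __m128 Blended = _mm_add_ps(_mm_mul_ps(D, InvAlpha_4x),
                                    _mm_mul_ps(S, Alpha_4x));
        _mm_storeu_ps(Dest + Index, Blended);
    }

    // Scalar tail for the 0 to 3 floats that are left over.
    for(; Index < Count; ++Index)
    {
        Dest[Index] = Dest[Index]*(1.0f - Alpha) + Source[Index]*Alpha;
    }
}
```

The wide loop handles four values per iteration and the scalar loop mops up whatever is left when Count is not a multiple of 4.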

Edited by Mārtiņš Možeiko on April 8, 2015, 5:42am

Casey Muratori

#3259

April 8, 2015

Unfortunately v2/v3/v4 are actually not as optimize-able on SSE as one might want, because SSE is really set up to treat "like elements", meaning that you don't really want a SIMD vector of "R G B A", you tend to want a SIMD vector of "R R R R" and "G G G G" and "B B B B" and "A A A A", if that makes sense. So while Hadamard() is a great case where you could optimize things nicely (it's just one instruction in SIMD), Inner() is awful (they've even added things post-SSE2 to try to make it less awful, but by its nature it is just not something that wide ops are happy about).
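To make the Hadamard() vs. Inner() difference concrete, here is a rough sketch on raw __m128 values. The names Hadamard_4x and Inner_4x are hypothetical and this is not how handmade_math.h defines them; it only assumes a v4 that has been loaded into a single SSE register as "X Y Z W".

```cpp
#include <xmmintrin.h> // SSE intrinsics

// Hypothetical: component-wise product, one multiply and nothing else.
static inline __m128 Hadamard_4x(__m128 A, __m128 B)
{
    return _mm_mul_ps(A, B);
}

// Hypothetical: dot product of two 4-wide vectors. The multiply is cheap,
// but summing across the lanes of one register needs extra shuffles.
static inline float Inner_4x(__m128 A, __m128 B)
{
    __m128 Product = _mm_mul_ps(A, B);                      // (p0, p1, p2, p3)
    __m128 Swapped = _mm_shuffle_ps(Product, Product,
                                    _MM_SHUFFLE(2, 3, 0, 1)); // (p1, p0, p3, p2)
    __m128 Sums = _mm_add_ps(Product, Swapped);             // (p0+p1, p0+p1, p2+p3, p2+p3)
    __m128 High = _mm_movehl_ps(Swapped, Sums);             // lane 0 is now p2+p3
    Sums = _mm_add_ss(Sums, High);                          // lane 0 = p0+p1+p2+p3
    return _mm_cvtss_f32(Sums);
}
```

Everything after the multiply in Inner_4x is overhead spent moving values sideways within one register. Later instruction sets added haddps (SSE3) and dpps (SSE4.1), which are presumably the kind of post-SSE2 additions meant here.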

So while you can do optimization here, and it might help you, really what you end up doing (and you'll see us do this on the stream) is identifying the areas where you really care about performance and _restructuring_ the data to look more like "R R R R" style so you can really go 4 times faster on those paths.
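As a sketch of what that restructuring might look like (the v3_4x type and the function names are made up for illustration, not taken from the stream), four v3s can be stored as "X X X X", "Y Y Y Y", "Z Z Z Z". Four inner products then come out of plain multiplies and adds, with no shuffling:

```cpp
#include <xmmintrin.h> // SSE intrinsics

// Hypothetical layout: four 3D vectors stored component by component.
struct v3_4x
{
    __m128 x; // x of vectors 0..3
    __m128 y; // y of vectors 0..3
    __m128 z; // z of vectors 0..3
};

// Hypothetical helper: gather four {x, y, z} float triples into the wide layout.
static inline v3_4x PackV3_4x(const float *A, const float *B,
                              const float *C, const float *D)
{
    v3_4x Result;
    Result.x = _mm_setr_ps(A[0], B[0], C[0], D[0]);
    Result.y = _mm_setr_ps(A[1], B[1], C[1], D[1]);
    Result.z = _mm_setr_ps(A[2], B[2], C[2], D[2]);
    return Result;
}

// Four inner products at once: lane i holds Inner(A_i, B_i).
static inline __m128 Inner4x(v3_4x A, v3_4x B)
{
    __m128 Result = _mm_add_ps(_mm_add_ps(_mm_mul_ps(A.x, B.x),
                                          _mm_mul_ps(A.y, B.y)),
                               _mm_mul_ps(A.z, B.z));
    return Result;
}
```

Every lane does useful work on every instruction here, which is where the "4 times faster" comes from. The packing step has its own cost, so ideally the data is kept in this layout on the hot path rather than converted each time.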

So v2/v3/v4 tend to be more for the "cruft code", which is meant to be kind of varied and not handling tons of data, and then the places where you are handling the bulk of your data processing, you handle those specially.

- Casey