The trouble with IACA markers

Has anyone else wondered how to get the IACA begin/end markers placed just right?

I found that their placement and possible reordering can cause huge changes in IACA output - in some cases I saw a jump of 40 cycles in throughput from a rather trivial-looking change to the DrawRectangleQuickly code. This seems to result from instruction reordering - the compiler can move instructions outside the markers, in which case they aren't accounted for in the analysis. And if the moved instructions would break dependency chains (loads etc.), the reported throughput can shoot up even though the analysis has fewer instructions in it, since the analysis now thinks there are more dependencies in the code than there actually are.

The IACA manual recommends the following:
while ( condition )
{
	IACA_START
	<loop body>
}
IACA_END
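
As a minimal sketch of that placement in compilable form: the two macros are stubbed out as no-ops here (an assumption, so the snippet builds without Intel's iacaMarks.h), and sumValues is a made-up example function, not anything from HMH.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: no-op stand-ins so this builds without iacaMarks.h.
// If the real header is included first, these definitions are skipped.
#ifndef IACA_START
#define IACA_START
#define IACA_END
#endif

// Hypothetical example function: sums an array, with the markers placed
// the way the manual's snippet shows.
int sumValues(const int *values, std::size_t count)
{
    int total = 0;
    std::size_t i = 0;
    while (i < count)
    {
        IACA_START          // first thing inside the loop body
        total += values[i];
        ++i;
    }
    IACA_END                // after the closing brace, not inside the loop
    return total;
}
```

With the stubs in place the markers compile to nothing, so this only shows where they go in the source; where they end up in the generated code is a separate question.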

In HMH the end marker is inside the loop - it should be moved outside. However, while this seems to produce more consistent results, it's not perfect. Ideally you would have loop code with markers look something like this in assembly:
<loop preamble>
mov byte ptr gs:[06Fh],06Fh ; begin marker ( IACA_VC64_BEGIN )

loopBody:
<loop body>

dec ecx ; update loop counter
jnz loopBody ; branch
mov byte ptr gs:[0DEh],0DEh ; end marker ( IACA_VC64_END )

However, the compiler will still reorder stuff in the beginning of the loop and also now after the loop (try it and look at the disassembly).

Now there's one thing that sounds like it would work - _ReadWriteBarrier intrinsic. This intrinsic doesn't produce any code but "limits the compiler optimizations that can reorder memory accesses across the point of the call" (see https://msdn.microsoft.com/en-us/library/f20w0x5e.aspx). Now, for the top marker it actually works - use
IACA_VC64_START;
_ReadWriteBarrier();

at the top of the loop and the marker will actually end up at the very beginning. However, no matter where I put the intrinsics, it doesn't seem to prevent reordering of the end marker relative to some code that gets executed right after the loop (!?!?!). MSDN says that these intrinsics are deprecated, so maybe that's the cause, but it doesn't really give any better alternatives (even the proposed C++ std::atomic (blargh) barriers seem to exhibit the same behaviour!).

So, does anyone have any good ideas on how to get the markers in the right places? I did devise a solution, but it is rather ugly and technical - I'd like to see if anyone else has a more elegant solution before going through the trouble of explaining mine (it involved writing your own for loop with gotos - I said it's ugly!)
Have you tried the mfence SSE2 intrinsic? According to the Intel Intrinsics Guide:

void _mm_mfence (void)

Perform a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior to this instruction. Guarantees that every memory access that precedes, in program order, the memory fence instruction is globally visible before any memory instruction which follows the fence in program order.
That will guarantee only that CPU finished memory operations. It deals with CPU caches and is useful in multi-threading. It doesn't deal with C/C++ compiler rearranging some calculations around it.

If a/b/c are integer types then
a = a + b;
_mm_mfence();
a = a + c;

Can be easily changed to
a = a + b + c;
_mm_mfence();

or
_mm_mfence();
a = a + b + c;

or
a = a + c;
_mm_mfence();
a = a + b;

because nothing gets written into memory if the a/b/c values are kept in registers.
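
To make that concrete, here's a small sketch. std::atomic_thread_fence is used as a portable stand-in for _mm_mfence (an assumption, so the snippet builds on any compiler), and the function names are made up. Unsigned integers are used so that overflow is well defined. Both functions return the same value for any inputs, and that result is all the compiler has to preserve for values that never leave registers.

```cpp
#include <atomic>
#include <cassert>

// Written order: add b, fence, add c.
unsigned addFenceBetween(unsigned a, unsigned b, unsigned c)
{
    a = a + b;
    std::atomic_thread_fence(std::memory_order_seq_cst); // stand-in for _mm_mfence()
    a = a + c;
    return a;
}

// One of the orders the compiler is free to pick instead: both adds after the fence.
unsigned addFenceFirst(unsigned a, unsigned b, unsigned c)
{
    std::atomic_thread_fence(std::memory_order_seq_cst);
    a = a + b + c;
    return a;
}
```

Since no caller can tell the two apart, the fence doesn't pin the arithmetic to either side of it.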

@owaenerberg: I don't think you will get accurate results from IACA down to 1 instruction or cycle. Because once you rearrange your loop like you say, you get different measurements than for real release code where you won't mess up the loop structure. You should simply assume that the loop runs enough times that measuring the preamble doesn't matter. If it executes only once, but you run the loop thousands of times, it doesn't matter what runs that one time (as long as it is just a few instructions).

Basically I think that compiler tries to extract out of loop any code that is invariant to loop. But if you force those values to actually be calculated inside loop in every iteration - you are measuring worse performance.

However, no matter where I put the intrinsics, it doesn't seem to prevent reordering of the end marker relative to some code that gets executed right after the loop

Can you show example of this? With and without reordering.

Edited by Mārtiņš Možeiko on
That will guarantee only that CPU finished memory operations. It deals with CPU caches and is useful in multi-threading. It doesn't deal with C/C++ compiler rearranging some calculations around it.

Oh, yeah, my bad. Thanks for pointing that out. I guess that's the same problem reported by owaenerberg in the first post.

For the particular example you wrote, I imagine we could force the writing of the intermediate value of a to memory with a clflush call (since the ordering of clflush and mfence is supposed to be guaranteed)...
a = a + b;
// _mm_clflush() call writing 'a' somewhere
_mm_mfence();
a = a + c;

but this is not a general solution and, in the context of profiling with IACA markers, those instructions would have to precede the IACA_END marker, altering the final instruction count.
@debiatan: Hmm, thanks for the suggestion, but no. _mm_mfence seems to have the same effects on reordering as _ReadWriteBarrier(). And it generates an extra instruction, which we don't really want; we just want to get the produced assembly ordered properly.

@mmozeiko: Yes, gladly. Actually it might be best that I iterate a bit more on what I said - I don't think my point came very clearly across.

The IACA tool in throughput mode looks at the code that falls between the markers and considers that the body of a loop. If I understand correctly, it just calculates how long a single iteration of the loop would take if the marked code was to repeat infinitely (sans memory stalls etc.). It also disregards all branches, even unconditional ones, so that everything that falls between the markers is taken into the throughput. Now, if you don't get the markers in the right places you can end up with results that are a bit bogus (some instructions are left out of or added to the analysis that should not be) or a lot bogus (some dependency calculation doesn't work like it should). Actually, what we have with the HMH code at the moment is a lot bogus. The loop looks like this:
for(int XI = MinX;
    XI < MaxX;
    XI += 4)
{            
//defines (mmSquare etc)
    IACA_VC64_START;

    //loop body, all the way to ClipMask = _mm_set1_epi8(-1);
    IACA_VC64_END;
}

Running the analyzer on the code as it is produces the following results:
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - handmade.dll
Binary Format - 64Bit
Architecture  - NHM
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 120.70 Cycles       Throughput Bottleneck: InterIteration
...
Total Num Of Uops: 295

Now let's change the loop to be
for(int XI = MinX;
    XI < MaxX;
    XI += 4)
{            
//defines (mmSquare etc)
    IACA_VC64_START;
    _ReadWriteBarrier();

    //loop body, all the way to ClipMask = _mm_set1_epi8(-1);
}
IACA_VC64_END;

Analyzer output:
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - handmade.dll
Binary Format - 64Bit
Architecture  - NHM
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 99.95 Cycles       Throughput Bottleneck: Port1
...
Total Num Of Uops: 311

Whoah! What happened? The thing is, in the original version the compiler can move code from the loop body outside the markers. And while that reduces the number of instructions in the analysis, some of the analyzer's dependency calculations break due to it (basically the analyzer misses some load/move that would have broken a dependency chain). I think that every result we've had that had a bottleneck of InterIteration is erroneous - there are very few inter-iteration dependencies in the actual code.
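
Here's a toy model of that effect. To be clear, this is not how IACA works internally: the Instr struct, the latencies, the issue width and the two example blocks are all made up for illustration. It estimates cycles per iteration as the larger of an issue bound (instruction count / issue width) and the per-iteration growth of the dependency critical path when the block repeats.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model only: NOT IACA's algorithm, and all latencies are invented.
struct Instr
{
    int dst;                 // register written, or -1 for none (e.g. a store)
    std::vector<int> srcs;   // registers read
    int latency;             // cycles until the result is available
};

// Estimated cycles per iteration when `block` repeats over and over.
double cyclesPerIteration(const std::vector<Instr> &block, int issueWidth)
{
    const int regCount = 8;
    const int iterations = 1000;
    std::vector<long> ready(regCount, 0);  // cycle at which each register is ready
    long latest = 0;                       // latest finish time seen so far
    long finishHalf = 0;                   // value of `latest` after half the iterations

    for (int it = 1; it <= iterations; ++it)
    {
        for (const Instr &in : block)
        {
            long start = 0;
            for (int s : in.srcs) start = std::max(start, ready[s]);
            long finish = start + in.latency;
            if (in.dst >= 0) ready[in.dst] = finish;
            latest = std::max(latest, finish);
        }
        if (it == iterations / 2) finishHalf = latest;
    }

    // Growth of the critical path per iteration over the second half of the run.
    double chainBound = double(latest - finishHalf) / (iterations / 2);
    double issueBound = double(block.size()) / issueWidth;
    return std::max(chainBound, issueBound);
}

// r0 = load; r0 = r0 * r1; store r0. The load starts a fresh chain every iteration.
const std::vector<Instr> withLoad = { {0, {}, 4}, {0, {0, 1}, 5}, {-1, {0}, 1} };
// Same block with the load left outside the markers: r0 now feeds itself.
const std::vector<Instr> loadHoisted = { {0, {0, 1}, 5}, {-1, {0}, 1} };
```

In this model, cyclesPerIteration(withLoad, 4) comes out at 0.75 and cyclesPerIteration(loadHoisted, 4) at 5.0: one instruction fewer, but the multiply now looks like it depends on the previous iteration's result, so the chain dominates.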

I'll leave it as an exercise to the reader to verify that the top of the loop works as it should - basically the generated assembly from the version with
IACA_VC64_START;
_ReadWriteBarrier();

is identical to assembly generated without any markers, just with the begin marker appended to the top. However, the end marker is trickier. Looking at the disassembly near the end of the loop (from assembly listing generated with compiler option -Fa, it's a lot cleaner than runtime disassembly):
	dec	r15
	jne	$LL3@DrawRectan ; this is the branch at the end of the loop, we'd like to get the end marker positioned just after this
	movdqa	xmm3, XMMWORD PTR StartupClipMask$1$[rsp]
	movaps	xmm4, XMMWORD PTR tv1979[rbp-256]
	movaps	xmm5, XMMWORD PTR nXAxisy_4x$1$[rbp-256]
	mov	rsi, QWORD PTR tv1987[rsp]
	mov	r13, QWORD PTR tv1985[rsp]
	movaps	xmm6, XMMWORD PTR nYAxisy_4x$1$[rbp-256]
	movaps	xmm7, XMMWORD PTR Originy_4x$1$[rbp-256]
$LN1@DrawRectan:
; Line 576
	add	r11d, 2
; Line 790
	add	rbx, r13
	mov	BYTE PTR gs:222, 222			; 000000deH but it gets shunted over here

Hmm, so some code from after the loop gets between the marker and the loop? Let's try this then
for(int XI = MinX;
    XI < MaxX;
    XI += 4)
{            
//defines (mmSquare etc)
    IACA_VC64_START;
    _ReadWriteBarrier();

    //loop body, all the way to ClipMask = _mm_set1_epi8(-1);
}
IACA_VC64_END;
_ReadWriteBarrier();

But even with that you get
	dec	r15
	jne	$LL3@DrawRectan
	movdqa	xmm3, XMMWORD PTR StartupClipMask$1$[rsp] ; what's this stuff doing here?
	movaps	xmm4, XMMWORD PTR tv1979[rbp-256]
	movaps	xmm5, XMMWORD PTR nXAxisy_4x$1$[rbp-256]
	mov	rsi, QWORD PTR tv1987[rsp]
	mov	r13, QWORD PTR tv1985[rsp]
	movaps	xmm6, XMMWORD PTR nYAxisy_4x$1$[rbp-256]
	movaps	xmm7, XMMWORD PTR Originy_4x$1$[rbp-256]
$LN1@DrawRectan:
; Line 788
	mov	BYTE PTR gs:222, 222			; 000000deH

Looking at it now, those seem to be local variables spilled on the stack, so I guess that explains why they jam themselves in there. It's still a bit annoying - I'd say that these extra instructions probably don't matter THAT much, but still, to get the perfect results from the analyzer, the marker should be right after the branch. As I said, I did find one way to circumvent this. It just has that nasty "outsmarting the compiler" feel to it... maybe Casey would like to take a look at this? Because if we actually rely on this tool it would be nice to get accurate results from it.

But um, this ended up seriously wall-of-texty, so the important point to take home is: as it now stands, the analyzer output can be not just a bit, but a lot bogus! So at least something (probably the changes outlined here) should be done.
If these instructions are really after the loop branch, then you shouldn't care about them. Dependencies for them matter only for the last pixel. And that is only one pixel. For all other pixels (which is where the majority of time will be spent) the dependency starts at the beginning of the loop - for the next pixel. So I'm not sure why you are worried about these instructions influencing the dependency chain for pixels.
If anything - you should unroll the loop two times, so the analyzer sees the instruction stream the way the CPU sees it: processing some pixel will have dependencies mostly on the instructions for the next pixel's calculation at the beginning of the loop.
But that's not how the analyzer works, if I understand correctly. See the IACA documentation, https://software.intel.com/sites/...Code_Analyzer_2.0_Users_Guide.pdf, especially 2.1 ("[throughput analysis] treats the contents of the analyzed block as an infinite loop, including considering inter-iteration dependencies between instructions within the analyzed block.") and 3.1 ("[the code analyzer] treats the analyzed code section as a single consecutive block of instructions. It does not follow branch instructions, not even unconditional branches."). Basically if the output code looks like the following
mov byte ptr gs:[06Fh],06Fh ; IACA_VC64_BEGIN
<block of code>
mov byte ptr gs:[0DEh],0DEh ; IACA_VC64_END

then throughput analysis will calculate the cycles (the reported throughput) it takes to execute a single <block of code>, disregarding branching, as if the blocks were executed as follows
<block of code>
<block of code>
<block of code>
<block of code>
<block of code>
<block of code>
repeated ad infinitum...

"under ideal front-end, out-of-order engine and memory hierarchy conditions". So if the markers include, ahem, "the whole loop and nothing but the loop", then the analyzer will be correct (as in the correct theoretical maximum). If there's something left out of / something extra in the code between the markers, then it will be left out of / included in every iteration the tool considers, and the results will be skewed (not only for the presumed last iteration, but for every one).
Now to be frank, I don't think the unspilling of stack variables after the for loop affects the output that much. I saw the greatest variations when some dependency-breaking instruction that's part of the loop proper was left outside the markers. That definitely can happen with the way the markers are set up now in HMH (see the examples in my last post) - so at least something should be done. In any case, it would be nice if someone who's well versed with the tool could weigh in - it seems kind of tricky to get the tool to measure exactly what you want, and to interpret the results correctly.
Oh, I see what you are talking about now. I didn't realize IACA works like that - by pretending that analyzed block is executed in a loop.

Then it makes sense to do what you want.
But I don't think there is an easy way to do that. Either you'll need to do a lot of trickery to achieve that (I understand you did), or just ignore those few mov instructions. I guess the compiler thinks that, for performance reasons, reloading those values from the stack into registers (because the beginning of the Y loop expects them to be there) is better done at the end of the Y loop than at the beginning.

Edited by Mārtiņš Možeiko on
Hmm, thinking about this some more, maybe there's no elegant solution to the problem. So I'll just leave my solution here, in case anyone is interested.

The latest version I presented was
for(int XI = MinX;
    XI < MaxX;
    XI += 4)
{            
//defines (mmSquare etc)
    IACA_VC64_START;
    _ReadWriteBarrier();

    //loop body, all the way to ClipMask = _mm_set1_epi8(-1);
}
IACA_VC64_END;
_ReadWriteBarrier();

However that version still had the unspilling of some local variables at the end before the marker. So what can we do? Basically just implement the for loop ourselves... I'll just dump the code here, with comments.
// no loop construct - the braces are just for scoping
{
    // jump to the end if the loop would be run 0 times
    if ( MaxX <= MinX ) {
        goto loopend;
    }
    // this is an optimization Visual Studio does when the loop is
    // expressed as a for loop - however we need to do it by hand.
    // It turns the loop counter to count to zero in decrements of one, nice because
    // it saves one register (the loop end condition, now implicit zero)
    uint32 XI = ( ( ( uint32 )( MaxX - MinX ) - 1 ) >> 2 ) + 1;
loopstart:
    IACA_VC64_START;
    // this ensures that the start marker is right at the beginning of the loop
    _ReadWriteBarrier();

    //loop body, all the way to ClipMask = _mm_set1_epi8(-1);

    // decrement loop index, jump to start if > 0
    --XI;
    if ( XI ) {
        goto loopstart;
    }

    IACA_VC64_END;
    _ReadWriteBarrier(); // still needed
// the closing brace - the unspilling happens here
}
loopend:

Looking at the assembly listing, we now get
beginning:
$loopstart$115:
; Line 601
	mov	BYTE PTR gs:111, 111			; 0000006fH

end:
	dec	r10d
	jne	$loopstart$115
; Line 989
	mov	BYTE PTR gs:222, 222			; 000000deH

Just as we wanted: the beginning marker is right at the beginning of the loop, and the end marker right after the conditional.
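
For anyone wanting to convince themselves that the hand-rolled counter runs the body the same number of times as the original for loop, here's a small check. It's a sketch with the markers and the real loop body left out; countForLoop and countGotoLoop are made-up names, and std::uint32_t stands in for the uint32 typedef.

```cpp
#include <cassert>
#include <cstdint>

// Iterations of the original loop: for(XI = MinX; XI < MaxX; XI += 4)
int countForLoop(int MinX, int MaxX)
{
    int iterations = 0;
    for (int XI = MinX; XI < MaxX; XI += 4)
    {
        ++iterations;
    }
    return iterations;
}

// Same trip count using the goto structure with a count-down counter.
int countGotoLoop(int MinX, int MaxX)
{
    int iterations = 0;
    std::uint32_t XI = 0;
    if (MaxX <= MinX)
    {
        goto loopend;       // loop would run 0 times
    }
    // round (MaxX - MinX) / 4 up to get the number of iterations
    XI = (((std::uint32_t)(MaxX - MinX) - 1) >> 2) + 1;
loopstart:
    ++iterations;           // stands in for the loop body
    --XI;
    if (XI)
    {
        goto loopstart;
    }
loopend:
    return iterations;
}
```

For example, with MinX = 0 and MaxX = 9 both versions run the body 3 times (XI = 0, 4, 8 in the for loop).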

Lastly, I'll remark that there is one other way of solving this problem: you could just look at the assembly listing generated from code with no markers at all, take the exact portion of code you want to measure, and then jam it in an assembler and put the markers around it. That way, you get to measure the code exactly as it is, without any interference on code generation from markers - but that's just unwieldy, especially if you want to rapidly test different versions.