@debiatan: Hmm, thanks for the suggestion, but no. _mm_mfence seems to have the same effect on reordering as _ReadWriteBarrier(), and it generates an extra instruction, which we don't really want. We just want to get the produced assembly ordered properly.
@mmozeiko: Yes, gladly. Actually it might be best that I elaborate a bit more on what I said - I don't think my point came across very clearly.
The IACA tool in throughput mode looks at the code that falls between the markers and treats it as the body of a loop. If I understand correctly, it just calculates how long a single iteration of the loop would take if the marked code were to repeat infinitely (sans memory stalls etc.). It also disregards all branches, even unconditional ones, so everything that falls between the markers is counted toward the throughput. Now, if you don't get the markers in the right places you can end up with results that are a bit bogus (some instructions are left out of or added to the analysis that should not be) or a lot bogus (some dependency calculation doesn't work like it should). Actually, what we have with the HMH code at the moment is a lot bogus. The loop looks like this:
|  | for(int XI = MinX;
    XI < MaxX;
    XI += 4)
{            
//defines (mmSquare etc)
    IACA_VC64_START;
    //loop body, all the way to ClipMask = _mm_set1_epi8(-1);
    IACA_VC64_END;
}
 | 
Running the analyzer on the code as it is produces the following results:
|  | Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - handmade.dll
Binary Format - 64Bit
Architecture  - NHM
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 120.70 Cycles       Throughput Bottleneck: InterIteration
...
Total Num Of Uops: 295
 | 
Now let's change the loop to be
|  | for(int XI = MinX;
    XI < MaxX;
    XI += 4)
{            
//defines (mmSquare etc)
    IACA_VC64_START;
    _ReadWriteBarrier();
    //loop body, all the way to ClipMask = _mm_set1_epi8(-1);
}
IACA_VC64_END;
 | 
Analyzer output:
|  | Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - handmade.dll
Binary Format - 64Bit
Architecture  - NHM
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 99.95 Cycles       Throughput Bottleneck: Port1
...
Total Num Of Uops: 311
 | 
Whoa! What happened? The thing is, in the original version the compiler can move code from the loop body outside the markers. And while that reduces the number of instructions in the analysis, it also breaks some of the analyzer's dependency calculations (basically the analyzer misses some load/move that would have broken a dependency chain). I think that every result we've had with a bottleneck of InterIteration is erroneous - there are very few inter-iteration dependencies in the actual code.
I'll leave it as an exercise to the reader to verify that the top of the loop works as it should - basically the generated assembly from the version with
|  | IACA_VC64_START;
_ReadWriteBarrier();
 | 
is identical to the assembly generated without any markers, just with the begin marker added at the top. However, the end marker is trickier. Looking at the disassembly near the end of the loop (from the assembly listing generated with compiler option -Fa, it's a lot cleaner than runtime disassembly):
|  | 	dec	r15
	jne	$LL3@DrawRectan ; this is the branch at the end of the loop, we'd like to get the end marker positioned just after this
	movdqa	xmm3, XMMWORD PTR StartupClipMask$1$[rsp]
	movaps	xmm4, XMMWORD PTR tv1979[rbp-256]
	movaps	xmm5, XMMWORD PTR nXAxisy_4x$1$[rbp-256]
	mov	rsi, QWORD PTR tv1987[rsp]
	mov	r13, QWORD PTR tv1985[rsp]
	movaps	xmm6, XMMWORD PTR nYAxisy_4x$1$[rbp-256]
	movaps	xmm7, XMMWORD PTR Originy_4x$1$[rbp-256]
$LN1@DrawRectan:
; Line 576
	add	r11d, 2
; Line 790
	add	rbx, r13
	mov	BYTE PTR gs:222, 222			; 000000deH, but the end marker gets shunted over here
 | 
Hmm, so some code from after the loop gets between the marker and the loop? Let's try this then:
|  | for(int XI = MinX;
    XI < MaxX;
    XI += 4)
{            
//defines (mmSquare etc)
    IACA_VC64_START;
    _ReadWriteBarrier();
    //loop body, all the way to ClipMask = _mm_set1_epi8(-1);
}
IACA_VC64_END;
_ReadWriteBarrier();
 | 
But even with that you get
|  | 	dec	r15
	jne	$LL3@DrawRectan
	movdqa	xmm3, XMMWORD PTR StartupClipMask$1$[rsp] ; what's this stuff doing here?
	movaps	xmm4, XMMWORD PTR tv1979[rbp-256]
	movaps	xmm5, XMMWORD PTR nXAxisy_4x$1$[rbp-256]
	mov	rsi, QWORD PTR tv1987[rsp]
	mov	r13, QWORD PTR tv1985[rsp]
	movaps	xmm6, XMMWORD PTR nYAxisy_4x$1$[rbp-256]
	movaps	xmm7, XMMWORD PTR Originy_4x$1$[rbp-256]
$LN1@DrawRectan:
; Line 788
	mov	BYTE PTR gs:222, 222			; 000000deH
 | 
Looking at it now, those seem to be local variables spilled on the stack, so I guess that explains why they jam themselves in there. It's still a bit annoying - I'd say that these extra instructions probably don't matter THAT much, but still, to get perfect results from the analyzer, the marker should be right after the branch. As I said, I did find one way to circumvent this. It just has that nasty "outsmarting the compiler" feel to it... maybe Casey would like to take a look at this? Because if we actually rely on this tool it would be nice to get accurate results from it.
But um, this ended up seriously wall-of-texty, so the important point to take home is: as it now stands the analyzer output can be not just a bit, but a lot bogus! So at least something (probably the changes outlined here) should be done.
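For reference, the marker placement described in this post could be sketched roughly as below. This is only an illustration under assumptions, not the actual HMH code: `COMPILER_BARRIER` and `SumEveryFourth` are hypothetical names made up for the example, the loop body is a trivial stand-in for the real one, and the IACA macros are stubbed out as no-ops so the snippet builds without iacaMarks.h (in a real build the header's own definitions would be used instead).

```c
#include <assert.h>

/* Compiler-only barrier: it emits no instruction of its own, it only asks
   the compiler not to move memory accesses across this point.
   COMPILER_BARRIER is a hypothetical name for this sketch. */
#if defined(_MSC_VER)
  #include <intrin.h>
  #define COMPILER_BARRIER() _ReadWriteBarrier()
#else
  #define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")
#endif

/* No-op stand-ins so the sketch builds anywhere. A real build would
   include iacaMarks.h, which provides these macros. */
#ifndef IACA_VC64_START
  #define IACA_VC64_START ((void)0)
  #define IACA_VC64_END   ((void)0)
#endif

/* Hypothetical stand-in for the real loop: sums every fourth element. */
static int SumEveryFourth(const int *Values, int MinX, int MaxX)
{
    int Sum = 0;

    assert(Values);

    for(int XI = MinX;
        XI < MaxX;
        XI += 4)
    {
        /* Start marker at the very top of the body, followed by a barrier
           so loop-body memory accesses are not hoisted above the marker. */
        IACA_VC64_START;
        COMPILER_BARRIER();

        Sum += Values[XI];
    }
    /* End marker after the loop, so the loop branch falls inside the
       marked region. As the listing above shows, the trailing barrier did
       not stop the spill reloads from landing before the marker. */
    IACA_VC64_END;
    COMPILER_BARRIER();

    return Sum;
}
```

Calling `SumEveryFourth(Values, 0, 8)` on an eight-element array visits indices 0 and 4 only. Whatever the compiler does with it, the placement should still be checked against the -Fa listing, since that is the only way to see where the markers actually ended up.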