RDTSC can be executed out of order

The RDTSC instruction can be executed out of order. This means the CPU can execute some of the instructions that come after RDTSC before RDTSC itself. Or the other way around: it can execute RDTSC before code that precedes it. So measurements can be inaccurate.

It is recommended to first execute an instruction that forces the CPU to synchronize its pipeline.
Intel recommends executing the CPUID instruction before RDTSC: https://www-ssl.intel.com/content...enchmark-code-execution-paper.pdf
The solution is to call a serializing instruction before calling the RDTSC one. A serializing instruction is an instruction that forces the CPU to complete every
preceding instruction of the C code before continuing the program execution. By
doing so we guarantee that only the code that is under measurement will be
executed in between the RDTSC calls and that no part of that code will be
executed outside the calls.

The complete list of available serializing instructions on IA64 and IA32 can be
found in the Intel® 64 and IA-32 Architectures Software Developer’s Manual
Volume 3A [4]. Reading this manual, we find that “CPUID can be executed at any
privilege level to serialize instruction execution with no effect on program flow,
except that the EAX, EBX, ECX and EDX registers are modified”. Accordingly, the natural choice to avoid out of order execution would be to call CPUID just before
both RDTSC calls.

So every place that uses __rdtsc() should look like this:
int Unused[4];
__cpuid(Unused, 0);
Clock = __rdtsc();
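
To show what "CPUID just before both RDTSC calls" looks like around a whole timed region, here is a minimal sketch. The names SerializedRdtsc, SampleWork and MeasureTsc are made up for illustration, and the non-MSVC branch (GCC/Clang inline asm) is my own assumption so the snippet also builds outside Visual Studio; it is not taken from the Intel paper.

```c
#include <stdint.h>
#if defined(_MSC_VER)
#include <intrin.h>      /* MSVC: __cpuid, __rdtsc */
#else
#include <x86intrin.h>   /* GCC/Clang: __rdtsc */
#endif

/* Execute CPUID to serialize, then read the time stamp counter. */
static uint64_t SerializedRdtsc(void)
{
#if defined(_MSC_VER)
    int Unused[4];
    __cpuid(Unused, 0);
#else
    unsigned int A, B, C, D;
    /* volatile so the compiler keeps CPUID even though its results are unused */
    __asm__ __volatile__("cpuid"
                         : "=a"(A), "=b"(B), "=c"(C), "=d"(D)
                         : "a"(0), "c"(0)
                         : "memory");
    (void)A; (void)B; (void)C; (void)D;
#endif
    return __rdtsc();
}

/* Hypothetical stand-in for the code under measurement. */
static void SampleWork(void)
{
    volatile int Sink = 0;
    for (int i = 0; i < 1000; i++) Sink += i;
}

/* Returns elapsed TSC ticks for one call to Work, CPUID overhead included. */
static uint64_t MeasureTsc(void (*Work)(void))
{
    uint64_t Start = SerializedRdtsc();
    Work();
    uint64_t End = SerializedRdtsc();
    return End - Start;
}
```

Calling MeasureTsc(SampleWork) gives the tick count for one run. Note that the cost of the second CPUID lands inside the measured interval, so small regions will be dominated by it.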


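Alternatively, switch to the RDTSCP instruction, which waits until all previous instructions have executed before reading the counter. It is not a fully serializing instruction, though: later instructions can still begin executing before the read happens.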
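
A possible way to use RDTSCP for a timed region, sketched under some assumptions: the LFENCE placement follows the notes in Intel's manual for RDTSC and RDTSCP, it was not proposed in this thread, and BeginTickCount / EndTickCount are hypothetical names. The x86intrin.h branch is there only so it also compiles with GCC/Clang.

```c
#include <stdint.h>
#if defined(_MSC_VER)
#include <intrin.h>      /* MSVC: __rdtsc, __rdtscp, _mm_lfence */
#else
#include <x86intrin.h>   /* GCC/Clang: same intrinsics */
#endif

/* Start of a timed region: LFENCE keeps RDTSC from executing
   before earlier instructions have completed. */
static uint64_t BeginTickCount(void)
{
    _mm_lfence();
    return __rdtsc();
}

/* End of a timed region: RDTSCP waits for earlier instructions on its own;
   the trailing LFENCE keeps later instructions from starting before the read. */
static uint64_t EndTickCount(void)
{
    unsigned int Aux;    /* receives IA32_TSC_AUX, unused here */
    uint64_t Result = __rdtscp(&Aux);
    _mm_lfence();
    return Result;
}
```

Usage would be Start = BeginTickCount(); ...code under test...; Elapsed = EndTickCount() - Start. This avoids CPUID entirely, at the price of requiring a CPU that supports RDTSCP.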
I don't do this because, in the old days, it significantly affected the performance by doing a cpuid, so yes, the measurement was "more accurate" but it was more accurately measuring _the wrong performance_. I don't know if nowadays it is better.

- Casey
I was wondering about rdtsc being special in the sense that it creates a sort of ordering boundary for the compiler and CPU, but I forgot to ask/search for it.

I was wondering whether that could have been part of the big measured performance boost when we unwrapped the functions during the initial phase of DrawRectangle, as if the compiler was "scared" of actually looking into the functions and just let them execute before doing rdtsc, whereas after the unwrapping it could reorder things without worries and call the intrinsic earlier.
Unwrapping did _actually_ increase performance, but maybe part of it was bogus*?

(*It's the first time I use the word bogus, did I catch its meaning right? XD)
Right, adding __cpuid makes it significantly slower - more than 4 times.

Here's a benchmark on my machine (Haswell i7-4790K):
__rdtsc = 1.56 sec
__rdtscp = 2.05 sec
__cpuid + __rdtsc = 7.06 sec
QueryPerformanceCounter = 2.01 sec
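
To put those totals in per-call terms, here is a small sketch of the arithmetic. It assumes the loop structure of the benchmark source in this post (OP() repeats its argument 2^10 = 1024 times, and the loop runs COUNT = 1024 * 256 times); UNROLL and NanosecondsPerCall are names I made up for illustration.

```c
#include <stdint.h>

/* Same shape as the benchmark: 1024 unrolled reads per loop iteration,
   1024 * 256 iterations. */
#define UNROLL 1024
#define COUNT (1024 * 256)

/* Average cost of one timer read, in nanoseconds. */
static double NanosecondsPerCall(double TotalSeconds)
{
    uint64_t Calls = (uint64_t)UNROLL * COUNT;   /* 268,435,456 reads */
    return TotalSeconds * 1e9 / (double)Calls;
}
```

With the numbers above this works out to roughly 5.8 ns per __rdtsc, 7.6 ns per __rdtscp, 26.3 ns per __cpuid + __rdtsc, and 7.5 ns per QueryPerformanceCounter call.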


Source:
#include <windows.h>
#include <stdio.h>
#include <stdint.h>
#include <intrin.h>

// OP(x) repeats x 1024 times: OP9 emits it twice and each level above doubles that.
#define OP9(x) x; x;
#define OP8(x) OP9(x) OP9(x)
#define OP7(x) OP8(x) OP8(x)
#define OP6(x) OP7(x) OP7(x)
#define OP5(x) OP6(x) OP6(x)
#define OP4(x) OP5(x) OP5(x)
#define OP3(x) OP4(x) OP4(x)
#define OP2(x) OP3(x) OP3(x)
#define OP1(x) OP2(x) OP2(x)
#define OP(x) OP1(x) OP1(x)

#define COUNT (1024 * 256)

int main()
{
    LARGE_INTEGER f, c1, c2;
    QueryPerformanceFrequency(&f);

    {
        uint64_t total = 0;
        QueryPerformanceCounter(&c1);
        for (int i=0; i<COUNT; i++)
        {
            OP(total += __rdtsc())
        }
        QueryPerformanceCounter(&c2);
        // Store the sum to a volatile so the compiler cannot discard it as unused.
        volatile uint64_t temp;
        temp = total;

        printf("__rdtsc = %.2f sec\n", (double)(c2.QuadPart - c1.QuadPart) / f.QuadPart);
    }

    {
        uint64_t total = 0;
        QueryPerformanceCounter(&c1);
        for (int i=0; i<COUNT; i++)
        {
            unsigned int x;
            OP(total += __rdtscp(&x))
        }
        QueryPerformanceCounter(&c2);
        volatile uint64_t temp;
        temp = total;

        printf("__rdtscp = %.2f sec\n", (double)(c2.QuadPart - c1.QuadPart) / f.QuadPart);
    }

    {
        uint64_t total = 0;
        QueryPerformanceCounter(&c1);
        for (int i=0; i<COUNT; i++)
        {
            int arr[4];
            OP(__cpuid(arr, 0); total += __rdtsc())
        }
        QueryPerformanceCounter(&c2);
        volatile uint64_t temp;
        temp = total;

        printf("__cpuid + __rdtsc = %.2f sec\n", (double)(c2.QuadPart - c1.QuadPart) / f.QuadPart);
    }

    {
        uint64_t total = 0;
        QueryPerformanceCounter(&c1);
        for (int i=0; i<COUNT; i++)
        {
            LARGE_INTEGER c;
            OP(QueryPerformanceCounter(&c); total += c.QuadPart)
        }
        QueryPerformanceCounter(&c2);
        volatile uint64_t temp;
        temp = total;

        printf("QueryPerformanceCounter = %.2f sec\n", (double)(c2.QuadPart - c1.QuadPart) / f.QuadPart);
    }
}

Edited by Mārtiņš Možeiko on
That was my assumption. So I do not think we want to do anything like that in our actual code. We are not trying to get 100% accurate cycle counts (because what would that even mean on today's processors anyway??), we are trying to get a good idea of the distribution of the cycles among our various codepaths with the minimal impact on the performance while doing so!

- Casey