The RDTSC instruction can be executed out of order. The CPU may execute instructions that come after RDTSC in the code before RDTSC itself, or it may execute RDTSC before instructions that precede it in the code. Either way, the measurement can be inaccurate.
It is recommended to first execute an instruction that forces the CPU to serialize the pipeline.
Intel recommends executing the CPUID instruction before RDTSC:
https://www-ssl.intel.com/content...enchmark-code-execution-paper.pdf
The solution is to call a serializing instruction before calling the RDTSC one. A serializing instruction is an instruction that forces the CPU to complete every preceding instruction of the C code before continuing the program execution. By doing so we guarantee that only the code that is under measurement will be executed in between the RDTSC calls and that no part of that code will be executed outside the calls.
The complete list of available serializing instructions on IA64 and IA32 can be found in the Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3A [4]. Reading this manual, we find that “CPUID can be executed at any privilege level to serialize instruction execution with no effect on program flow, except that the EAX, EBX, ECX and EDX registers are modified”. Accordingly, the natural choice to avoid out of order execution would be to call CPUID just before both RDTSC calls.
So every place where __rdtsc() is called should look like this:
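    int Unused[4];          // receives EAX, EBX, ECX, EDX; the values are ignored
    __cpuid(Unused, 0);     // serializing: all earlier instructions complete first
    Clock = __rdtsc();      // read the time-stamp counter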
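For code that also has to build with GCC or Clang, the same idea can be wrapped in a small helper. This is only a sketch, assuming an x86 or x86-64 target; the names serialized_rdtsc and measure_loop_cycles are made up for illustration, and the loop is just a stand-in for the code under measurement. The non-MSVC branch issues CPUID through inline assembly marked volatile so the compiler cannot drop it as unused.

```c
#include <stdint.h>

#if defined(_MSC_VER)
#include <intrin.h>      /* __cpuid, __rdtsc */
#else
#include <x86intrin.h>   /* __rdtsc */
#endif

/* Hypothetical helper: execute CPUID as a serializing barrier, then read the TSC. */
static inline uint64_t serialized_rdtsc(void)
{
#if defined(_MSC_VER)
    int regs[4];
    __cpuid(regs, 0);
#else
    /* CPUID leaf 0; the register outputs are ignored. "volatile" and the
       "memory" clobber keep the compiler from removing or reordering it. */
    unsigned int eax = 0, ebx, ecx = 0, edx;
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
#endif
    return __rdtsc();
}

/* Hypothetical example: count TSC ticks spent in a simple loop. */
static inline uint64_t measure_loop_cycles(unsigned iterations)
{
    volatile unsigned sink = 0;   /* volatile so the loop is not optimized away */
    uint64_t start = serialized_rdtsc();
    for (unsigned i = 0; i < iterations; ++i)
        sink += i;
    uint64_t stop = serialized_rdtsc();
    return stop - start;
}
```

Keep in mind that the second CPUID runs before the second counter read, so its cost is part of the result. For short code sections it helps to measure an empty section the same way and subtract that overhead.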
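Alternatively, switch to the RDTSCP instruction, which waits until all preceding instructions have executed before it reads the counter. Note that RDTSCP is not a fully serializing instruction: instructions that follow it can still begin executing before the read is done.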
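A rough sketch of the RDTSCP variant, assuming an x86-64 target and a CPU that supports RDTSCP. The helper name rdtscp_fenced is made up for illustration. The LFENCE after the read follows the Intel manual's suggestion for keeping later instructions from starting before RDTSCP has executed.

```c
#include <stdint.h>

#if defined(_MSC_VER)
#include <intrin.h>      /* __rdtscp, _mm_lfence */
#else
#include <x86intrin.h>   /* __rdtscp, _mm_lfence */
#endif

/* Hypothetical helper: read the TSC with RDTSCP, then fence.
   RDTSCP waits for earlier instructions to execute; the LFENCE keeps
   later instructions from starting before the counter has been read.
   *aux receives the IA32_TSC_AUX value (set by the OS, often a CPU id). */
static inline uint64_t rdtscp_fenced(unsigned int *aux)
{
    uint64_t ticks = __rdtscp(aux);
    _mm_lfence();
    return ticks;
}
```

Usage is the same as with __rdtsc(): take one reading before and one after the code under measurement and subtract. If the aux values of the two readings differ, the thread probably moved to another core in between, and the result should be treated with suspicion.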