QueryPerformanceFrequency returning 10mhz bug

When using QueryPerformanceFrequency on Windows 10, it always seems to return 10,000,000 (10 MHz) clock ticks per second, regardless of which computer/CPU you're using. After searching some forums, it seems this might be a bug in currently updated Windows systems? If so, is there a workaround to get more accurate timing on Windows?

Edited by Jason on Reason: Initial post
What other value did you expect it to return?

QPC never had any guarantees about the frequency it uses. The frequency is a combination of whatever the CPU, motherboard, BIOS and Windows decide or allow.

If you want something more granular, try using RDTSC.

People on the internet say that if your BIOS has an "Enable High Precision Event Timer" option, you can enable it to get a higher frequency, but then the QPC function can be slower. I've never tried this myself.

After enabling that in the BIOS, you can run "bcdedit /set useplatformclock true" to make Windows use HPET after a reboot, and "bcdedit /deletevalue useplatformclock" to disable it.

This seems to be a good article explaining the timer & HPET situation on Windows: https://www.anandtech.com/print/1...amining-amd-2nd-gen-ryzen-results
It seems that enabling HPET affects new AMD CPUs minimally, but on Intel enabling HPET leads to worse performance.

Edited by Mārtiņš Možeiko on
Thanks for the article. Had some interesting things in it I never knew.

mmozeiko
What other value did you expect it to return?


Well, I didn't expect it to return the same 10 MHz on 2 vastly different machines (cheap Intel laptop vs expensive Ryzen desktop). This makes it seem like there is no point in calling QueryPerformanceFrequency if it is always just going to return the same value. From searching around the internet, it seems people typically see values like 2.4 MHz or 3.6 MHz depending on the CPU.
And then the next generation of AMD and Intel CPUs comes out with a more accurate timer, and suddenly the value will be 100 MHz, and a hardcoded 10 MHz value won't work. There's no reason not to call QueryPerformanceFrequency to get the actual value if you need to convert to time. It's a similar thing on Linux with clock_gettime & clock_getres.

That kind of decision to hardcode a timer frequency was the reason early PCs had a turbo button that slowed down the CPU: otherwise older applications that assumed the old, slow fixed speed ran too fast on newer CPUs.

Edited by Mārtiņš Možeiko on
According to msdn (https://docs.microsoft.com/en-us/...iring-high-resolution-time-stamps), it's not a bug but a documented behaviour. Not sure what to make of this paragraph, especially since the links to hypervisor only redirect to https://docs.microsoft.com/fr-fr/ :

Great find!
And if I understand correctly, if you enable WSL2, your whole Windows (host) runs under a hypervisor, which explains the 10 MHz. Nowadays there are probably more cases than just WSL2 where this happens.

Edited by Mārtiņš Možeiko on
Guntha
According to msdn (https://docs.microsoft.com/en-us/...iring-high-resolution-time-stamps), it's not a bug but a documented behaviour. Not sure what to make of this paragraph, especially since the links to hypervisor only redirect to https://docs.microsoft.com/fr-fr/ :


Interesting. Well, I think maybe my mental model of what QueryPerformanceFrequency was supposed to return was off. I watched a video where it returned 2.6 MHz, and the author of the video said something about it maybe updating the tick count every 1000 cycles, and that this is why, on his 2.6 GHz processor, it was returning 2.6 MHz. So in my mind QueryPerformanceFrequency was supposed to return different frequencies depending on the speed of your processor.
I've got an old Windows 7, Core 2 Duo 2.4 GHz laptop where it returns 2337939, close to what you describe. On all the Windows 10 machines I tested recently it returns 10 MHz. I'm pretty sure it hasn't always returned that, but I don't know at what point it changed.
Guntha
I'm pretty sure it hasn't always returned that but I don't know at what point it changed.


Ya, that's what I thought, so that's why I wanted to make sure this 10 MHz was an intended change and not a bug. If I tried dividing by QueryPerformanceFrequency's return value like:

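LARGE_INTEGER clock_ticks_per_second; /* QueryPerformanceFrequency takes a LARGE_INTEGER* */
QueryPerformanceFrequency(&clock_ticks_per_second);
/* Integer division: the result is truncated to whole seconds. */
u64 time_elapsed_in_seconds = elapsed_ticks / (u64)clock_ticks_per_second.QuadPart;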

time_elapsed_in_seconds would obviously be different if you are dividing by 2.6 MHz vs 10 MHz (if the 10 MHz was indeed a bug). But I'm pretty new to profiling and thinking about this stuff, so I might be a bit off with how I'm visualizing how the QPC and QPF functions work.
boagz57
time_elapsed_in_seconds would obviously be different if you are dividing by 2.6 MHz vs 10 MHz


The result of the two computations would be pretty close, because the granularity of elapsed_ticks would also be different, in the same proportion as the frequency. It's somewhat similar to dividing 4 by 2 vs dividing 8 by 4.
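
As a small illustration of that proportionality (a sketch only; the helper name ticks_to_seconds and the tick counts are made up for the example), the same 2.5 second interval measured by a 2.6 MHz counter and by a 10 MHz counter converts to the same number of seconds, as long as each tick delta is divided by the frequency of the counter that produced it:

```c
#include <assert.h>
#include <stdint.h>

/* Convert a tick delta to seconds using the frequency of the same counter.
   Hypothetical helper, not part of any Windows API. */
static double ticks_to_seconds(uint64_t elapsed_ticks, uint64_t ticks_per_second) {
    assert(ticks_per_second > 0);
    return (double)elapsed_ticks / (double)ticks_per_second;
}
```

With a 2.6 MHz counter, 2.5 seconds is 6,500,000 ticks; with a 10 MHz counter it is 25,000,000 ticks. Both calls return 2.5.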

Edited by Simon Anciaux on
mrmixer
The result of the two computations would be pretty close, because the granularity of elapsed_ticks would also be different, in the same proportion as the frequency. It's somewhat similar to dividing 4 by 2 vs dividing 8 by 4.


Ya, that makes sense. Thanks.
boagz57
I watched a video where it returned 2.6 MHz, and the author of the video said something about it maybe updating the tick count every 1000 cycles, and that this is why, on his 2.6 GHz processor, it was returning 2.6 MHz.


The problem with thinking this way is that modern CPUs change frequency dynamically, for power savings or for "turbo boost" reasons. So if QueryPerformanceCounter were tied to the CPU frequency, QueryPerformanceFrequency would not be constant and it would be hard to time anything.

When measuring with QPC, you typically go one of two routes:

1) convert everything to nanoseconds - query QPC and QPF, do the uint64 nsec = QPC * 1e9 / QPF calculation, then keep the result and do whatever you want with it

2) measure intervals and divide by QPF to get the length - query QPC twice - qpc1 and qpc2. Then calculate double diff = (double)(qpc2 - qpc1) / QPF.

Directly dividing QPC by QPF is not a good idea, as that would lose precision.
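
The two routes can be sketched roughly as follows. This is only an illustration: the function names are hypothetical, and splitting the counter into whole seconds plus a remainder is an addition of this sketch (not something stated above) so that a pure 64-bit integer version of QPC * 1e9 does not overflow after long uptimes.

```c
#include <assert.h>
#include <stdint.h>

/* Route 1: absolute counter value -> nanoseconds.
   The value is split into whole seconds and a remainder so the
   multiplication by 1e9 stays within 64 bits (assumes qpf is below ~1.8e10). */
static uint64_t qpc_to_ns(uint64_t qpc, uint64_t qpf) {
    assert(qpf > 0);
    uint64_t whole = qpc / qpf;   /* full seconds */
    uint64_t part  = qpc % qpf;   /* leftover ticks, always < qpf */
    return whole * 1000000000ull + part * 1000000000ull / qpf;
}

/* Route 2: difference of two counter reads -> seconds as a double.
   Subtract first, then divide, so the small delta keeps its precision. */
static double qpc_interval_seconds(uint64_t qpc1, uint64_t qpc2, uint64_t qpf) {
    assert(qpf > 0);
    return (double)(qpc2 - qpc1) / (double)qpf;
}
```

On Windows the inputs would come from the QuadPart field of the LARGE_INTEGER values filled in by QueryPerformanceCounter and QueryPerformanceFrequency.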

Edited by Mārtiņš Možeiko on
After reading the articles linked in this thread, I wanted to try to use only rdtsc to get time information.

I had in the past looked at the disassembly of QueryPerformanceCounter and knew it used rdtsc and modified the value before returning it. I know it's not a good idea to rely on rdtsc directly, as QueryPerformanceCounter will do different things based on the hardware, BIOS and OS version. So my goal was just to try to make it work on my machine (at least at first). My reason for wanting to do that is that in profiling code I use both QPC and rdtsc, because two sequential calls of QPC can return the same value and I wanted a way to represent events with more granularity, which I can somewhat do with rdtsc (as it never returns the same value). Also, if I could rely only on the rdtsc value, it would save 8 bytes in my events (which are 16 bytes at the moment).

I failed at doing what I wanted (using rdtsc and doing the transformation that QPC does myself), but I wanted to know if anybody knows how to achieve it?

#include <windows.h>

int main( int argc, char** argv ) {
    
    LARGE_INTEGER t;
    QueryPerformanceCounter( &t );
    
    return 0;
}


So the assembly for QPC is:
sub   rsp, 0x28
test  byte ptr [0x7ffe02ed], 0x1
mov   r9, rcx
jz    0x7719b930
mov   r8, qword ptr [0x7ffe03b8]
rdtsc
movzx ecx, byte ptr [0x7ffe02ed]
shl   rdx, 0x20
or    rax, rdx
shr   ecx, 0x2
add   rax, r8
shr   rax, cl
mov   qword ptr [r9], rax
mov   eax, 0x1
add   rsp, 0x28
ret


The jump on line 4 (the jz) was never taken on my machine.
The code reads rdtsc, adds the 64-bit value from the address 0x7ffe03b8, and then divides by 1024 (shr rax, cl where cl is 10).
The value at 0x7ffe03b8 is 353662723 ( 0x15147703 ). I'm not sure if the value is the same after rebooting my machine.
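
The arithmetic described above can be written out as a short sketch. The function names are hypothetical, and the idea that the reported frequency is simply the TSC rate shifted by the same amount is an inference from the disassembly, not something documented.

```c
#include <assert.h>
#include <stdint.h>

/* What the fast path in the disassembly appears to compute:
   (tsc + bias) >> shift, with bias and shift read from shared memory. */
static uint64_t qpc_from_tsc(uint64_t tsc, uint64_t bias, unsigned shift) {
    assert(shift < 64);
    return (tsc + bias) >> shift;
}

/* Assumption: the matching counter frequency is the TSC rate divided
   by the same power of two. */
static uint64_t qpf_from_tsc_hz(uint64_t tsc_hz, unsigned shift) {
    assert(shift < 64);
    return tsc_hz >> shift;
}
```

For example, with a shift of 10 a TSC running at 2,394,049,536 Hz would give a frequency of 2,337,939 ticks per second. That happens to be the value reported earlier in the thread for a 2.4 GHz Windows 7 laptop, although whether that machine uses the same shift is only a guess.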

My first question is: where does the 0x7ffexxxx memory range come from? The debugger doesn't list any module that contains that in its memory range. How can I find what that is? I tried stepping to the very first instruction of the program, and the values at those addresses are already set.

Since I couldn't figure out where those values came from, I tried to read the Intel documentation.
- Time stamp counter, in Volume 3, chapter 17.17;
- Counting clocks, in Volume 3, chapter 18.7;
- CPUID instruction, in Volume 2, page 300 (3-198);

It seems that using CPUID with eax set to 0x15 I could compute the TSC frequency, but my CPU (i7 860) doesn't support CPUID leaves above 0xb. Chapter 18.7.3.2 specifies that Nehalem-based processors should use MSR_PLATFORM_INFO to get the value, but using rdmsr to read an MSR requires the application to run in kernel mode. I think I could write a kernel driver to do that, but I don't want to do it at the moment (I was thinking about doing that to be able to query cache misses, but I don't actually know what writing a kernel driver implies or how to use one).
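
For CPUs that do support leaf 0x15, the computation could look roughly like this. It is a sketch: the function names are hypothetical, and it relies on the leaf reporting the TSC/crystal ratio as EBX/EAX and the crystal frequency in Hz in ECX. Some processors report 0 in ECX, in which case this approach gives no answer and the sketch returns 0.

```c
#include <stdint.h>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>   /* __cpuid */
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>    /* __get_cpuid */
#endif

/* CPUID leaf 0x15: EAX = ratio denominator, EBX = ratio numerator,
   ECX = crystal clock in Hz. Returns 0 if any of them is not enumerated. */
static uint64_t tsc_hz_from_cpuid15(uint32_t eax, uint32_t ebx, uint32_t ecx) {
    if (eax == 0 || ebx == 0 || ecx == 0) return 0;
    return (uint64_t)ecx * ebx / eax;
}

/* Returns 0 when leaf 0x15 is not available (as on the i7 860 mentioned
   above) or when not building for x86. */
static uint64_t query_tsc_hz(void) {
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    int r[4];
    __cpuid(r, 0);                 /* r[0] = highest basic leaf */
    if (r[0] < 0x15) return 0;
    __cpuid(r, 0x15);
    return tsc_hz_from_cpuid15((uint32_t)r[0], (uint32_t)r[1], (uint32_t)r[2]);
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(0x15, &eax, &ebx, &ecx, &edx)) return 0; /* leaf unsupported */
    return tsc_hz_from_cpuid15(eax, ebx, ecx);
#else
    return 0;
#endif
}
```

As a worked case, a 24 MHz crystal with a ratio of 250/2 gives 24,000,000 * 250 / 2 = 3,000,000,000 Hz.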

So does anyone know how to get the "transformation" necessary to convert rdtsc values to either seconds or QPC-compatible values?
You can do simple tests with reading/writing MSR registers with the WinRing0 project: https://github.com/QCute/WinRing0
It provides a prebuilt & signed driver that exposes functions like Rdmsr(DWORD index, DWORD* eax, DWORD* edx);
I have used it, and it works great for simple stuff (not for high performance, as it does a syscall on every function - no batching/pipelining).

To transform rdtsc to seconds, you simply measure how much real time passes between two values. So like rdtsc + sleep(1sec) + rdtsc, and then the difference between them will be the frequency. Do multiple measurements to get a more precise value.
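
A rough sketch of that calibration follows. One variation is an assumption of this sketch rather than part of the suggestion above: instead of trusting the sleep to last exactly one second, the interval is measured with QPC, so the sleep length only affects how long the measurement takes. The function names are hypothetical, and the Windows part is only compiled on x86 Windows builds.

```c
#include <assert.h>
#include <stdint.h>

#if defined(_WIN32) && (defined(_M_X64) || defined(_M_IX86) || defined(__x86_64__) || defined(__i386__))
#define HAVE_WIN_TSC 1
#include <windows.h>
#include <intrin.h>   /* __rdtsc */
#endif

/* TSC ticks per second, given how many TSC ticks and how many QPC ticks
   elapsed over the same interval. */
static double estimate_tsc_hz(uint64_t tsc_delta, uint64_t qpc_delta, uint64_t qpf) {
    assert(qpc_delta > 0 && qpf > 0);
    return (double)tsc_delta * (double)qpf / (double)qpc_delta;
}

#ifdef HAVE_WIN_TSC
/* One calibration sample: rdtsc + Sleep + rdtsc, timed with QPC. */
static double measure_tsc_hz(DWORD sleep_ms) {
    LARGE_INTEGER qpf, q0, q1;
    QueryPerformanceFrequency(&qpf);
    QueryPerformanceCounter(&q0);
    uint64_t t0 = __rdtsc();
    Sleep(sleep_ms);
    QueryPerformanceCounter(&q1);
    uint64_t t1 = __rdtsc();
    return estimate_tsc_hz(t1 - t0,
                           (uint64_t)(q1.QuadPart - q0.QuadPart),
                           (uint64_t)qpf.QuadPart);
}
#endif
```

Taking several samples and keeping the median (or averaging them) would follow the advice to do multiple measurements.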

You can rely on the rdtsc value for timing on modern PCs. Only old Intel CPUs had the issue where this value was not constant, due to turbo boost or because of switching cores. CPUID has a flag for constant/invariant TSC. If that is set, you can reliably use rdtsc for timing.
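
Checking that flag could look like the sketch below. The invariant TSC bit is reported in CPUID leaf 0x80000007, EDX bit 8; the function names are hypothetical, and the query falls back to 0 on non-x86 builds.

```c
#include <stdint.h>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>   /* __cpuid */
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>    /* __get_cpuid */
#endif

/* Decode the invariant TSC bit from the EDX value of leaf 0x80000007. */
static int edx_reports_invariant_tsc(uint32_t edx) {
    return (int)((edx >> 8) & 1u);
}

/* Returns 1 if the CPU reports an invariant TSC, 0 otherwise
   (including when the leaf is missing or the build is not x86). */
static int cpu_has_invariant_tsc(void) {
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    int r[4];
    __cpuid(r, (int)0x80000000u);              /* r[0] = highest extended leaf */
    if ((unsigned)r[0] < 0x80000007u) return 0;
    __cpuid(r, (int)0x80000007u);
    return edx_reports_invariant_tsc((uint32_t)r[3]);
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000007u, &eax, &ebx, &ecx, &edx)) return 0;
    return edx_reports_invariant_tsc(edx);
#else
    return 0;
#endif
}
```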

The 0x7ffe... memory could be a shared memory segment with the kernel. Windows & Linux share segments with the kernel for time-related functionality (clock_gettime, gettimeofday) for performance reasons: to avoid doing a syscall when the user queries the time. On Linux this is known as vDSO: https://man7.org/linux/man-pages/man7/vdso.7.html

Edited by Mārtiņš Možeiko on
Thanks for WinRing0, I'll have a look at that at some point.

mmozeiko
To transform rdtsc to seconds, you simply measure how much real time passes between two values. So like rdtsc + sleep(1sec) + rdtsc, and then the difference between them will be the frequency. Do multiple measurements to get a more precise value.

You can rely on the rdtsc value for timing on modern PCs. Only old Intel CPUs had the issue where this value was not constant, due to turbo boost or because of switching cores. CPUID has a flag for constant/invariant TSC. If that is set, you can reliably use rdtsc for timing.


I knew about the invariant TSC after reading the articles. My concern is more about the conversion back to seconds not being "precise" because my intention at first was to try to do the same thing as QPC (the profiler would just record rdtsc and would do the transformation only when displaying events).

I haven't tried the rdtsc + sleep option because I don't want to have to wait at application startup to know the value (I could use less than a second...), and since Sleep granularity is 1 ms I don't know if it would be precise enough. I should try though.

You were right about the 0x7ffe0000 address range. I tried to debug using Windbg, hoping that it might give more information, and while I didn't do well using Windbg, I managed to step into the disassembly of QPC and, in there, the addresses were displayed as SharedUserData+0x2ed. Here is the assembly from Windbg:

ntdll!RtlQueryPerformanceCounter:
sub     rsp,28h
test    byte ptr [SharedUserData+0x2ed (00000000`7ffe02ed)],1
mov     r9,rcx
je      ntdll! ?? ::FNODOBFM::`string'+0x149c0 (00000000`77b7b930)
mov     r8,qword ptr [SharedUserData+0x3b8 (00000000`7ffe03b8)]
rdtsc
movzx   ecx,byte ptr [SharedUserData+0x2ed (00000000`7ffe02ed)]
shl     rdx,20h
or      rax,rdx
shr     ecx,2
add     rax,r8
shr     rax,cl
mov     qword ptr [r9],rax
mov     eax,1
add     rsp,28h
ret


After searching what SharedUserData was, I found this page: KUSER_SHARED_DATA (and this not useful msdn page).

So, assuming this page is correct, SharedUserData is loaded with every executable at the fixed address 0x7ffe0000, and you can access the different fields using the offsets from the site. The "problem" is that different kernel versions use the space differently.

I managed to make it work on Windows 7, but the Windows 10 assembly for QPC is quite different (and I don't have a dev machine running windows 10).

#include <stdint.h>
#include <assert.h>
#include <windows.h>
#include <intrin.h> /* __rdtsc */

#if 0
/* Windows 7 */
0x02ed
union {
    UCHAR TscQpcData;
    struct {
        UCHAR TscQpcEnabled : 1;        // 0x01
        UCHAR TscQpcSpareFlag : 1;      // 0x02
        UCHAR TscQpcShift : 6;          // 0xFC
    };
};

0x03C6
/* Windows 8 */
union {
    USHORT TscQpcData;
    struct {
        BOOLEAN volatile TscQpcEnabled;
        UCHAR TscQpcShift;
    };
};

/* Windows 8.1 */
union {
    USHORT QpcData;
    struct {
        BOOLEAN volatile QpcBypassEnabled;
        UCHAR QpcShift;
    };
};

/* Windows 10 */
union {
    USHORT QpcData;
    struct {
        UCHAR volatile QpcBypassEnabled;
        UCHAR QpcShift;
    };
};

0x03b8
/* Windows 7 and 8 */
ULONGLONG volatile TscQpcBias;
/* Windows 8.1 and up */
ULONGLONG volatile QpcBias;
#endif



uint8_t* SharedUserData = ( uint8_t* ) 0x7ffe0000;

uint64_t qpc_offset;
uint8_t qpc_shift;

void custom_qpc_init( ) {
    uint8_t win7_byte = *( SharedUserData + 0x02ed );
    assert( win7_byte & 0x1 ); /* If not set QPC calls NtQueryPerformanceCounter. */
    qpc_shift = ( win7_byte >> 2 );
    /* Bias lives at offset 0x3b8, matching the [0x7ffe03b8] load in the disassembly. */
    qpc_offset = *( volatile uint64_t* ) ( SharedUserData + 0x03b8 );
}

void custom_qpc( uint64_t* time ) {
    *time = __rdtsc( );
    *time += qpc_offset;
    *time >>= qpc_shift;
}

int main( int argc, char** argv ) {
    
    uint64_t s1, s2, e1, e2;
    QueryPerformanceCounter( ( LARGE_INTEGER* ) &s1 );
    Sleep( 1000 );
    QueryPerformanceCounter( ( LARGE_INTEGER* ) &e1 );
    
    custom_qpc_init( );
    
    custom_qpc( &s2 );
    Sleep( 1000 );
    custom_qpc( &e2 );
    
    uint64_t r1 = e1 - s1;
    uint64_t r2 = e2 - s2;
    
    uint64_t frequency;
    QueryPerformanceFrequency( ( LARGE_INTEGER* ) &frequency );
    
    return 0;
}


I don't know if I should continue on that path, as I will probably never test Windows 8 and 8.1, Windows 10 seems to handle several cases, and I don't think there is documentation about that.

Also I don't know if I should use rdtscp instead of rdtsc. I don't remember where I read that it limits out-of-order execution, since it waits for earlier instructions to finish before reading the counter (it is not a fully serializing instruction, though). Windows 10 QPC uses it.

Edited by Simon Anciaux on Reason: Code error