Miles
130 posts / 4 projects
QueryPerformanceFrequency returning 10mhz bug
An interesting note is that the function does at least have well-defined upper and lower bounds. From this documentation: https://docs.microsoft.com/en-us/...iring-high-resolution-time-stamps

"How often does QPC roll over? Not less than 100 years from the most recent system boot"

Given that QueryPerformanceCounter returns a 64-bit signed integer (i.e. 63 usable bits), this implies that the value returned by QueryPerformanceFrequency should never be more than around 2^63/(100*365*24*60*60) = 2,924,712,086, or about 2.9 GHz. The timer resolution is stated to be at least one microsecond, so the frequency will always be at least 1 MHz. I have no idea if this information will be useful to anyone, but it was fun to calculate.
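The upper bound above can be checked with a few lines of C. This is only a sketch of the arithmetic; the function name max_qpc_frequency is made up for the example, and it assumes 365-day years as in the calculation above.

```c
#include <stdint.h>

/* Largest frequency (ticks per second) for which a signed 64-bit counter
   (63 usable bits) does not roll over within `years` years of 365 days.
   Hypothetical helper, not part of any Windows API. */
static uint64_t max_qpc_frequency(uint64_t years)
{
    const uint64_t seconds   = years * 365u * 24u * 60u * 60u;
    const uint64_t max_ticks = (uint64_t)INT64_MAX + 1u; /* 2^63 */
    return max_ticks / seconds;
}
```

Calling max_qpc_frequency(100) reproduces the figure of roughly 2.9 GHz quoted above.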
Simon Anciaux
1274 posts
I spent more time on this and I've written functions to "emulate" what QueryPerformanceCounter does. I have tested them only on Windows 7 and two versions of Windows 10 (1909, 1903). If anybody has Windows 8, Windows 8.1 or a Windows 10 version prior to the Creators Update and is willing to test, I would appreciate the effort. More specifically, I'm interested in the following versions of Windows 10, as they may have introduced changes pertinent to the problem:
• Version 1607 (Anniversary update) 2016 => build 14393
• Version 1703 (Creators update) 2017 => build 15063
• Version 1709 (Fall Creators Update) 2017 => build 16299
• Version 1803 (April 2018 update) => build 17134

If you compile and run the program below, it should run for a few seconds and display a few lines. The expected result is for all min/max values to be in the same range and for the "best" values to be somewhat similar (around 33). Here is a link to the compiled exe + pdb and source. There are some asserts that could trigger if a feature isn't supported by the OS version (which is why it needs some testing).

The source contains notes and some findings.

#if 0

/* NOTE build:
cl main.c -Feqpc.exe -nologo -O2 -Zi
*/

/*
The following code is the result of trying to figure out what transformation
QueryPerformanceCounter (QPC) applies to the result of the rdtsc (or rdtscp)
instruction, based on the assembly of QPC.

This was only tested on Windows 7, Windows 10 version 1909 and Windows 10 version 1903.
Windows 8, 8.1 and earlier versions of Windows 10 are a complete guess. If you test on
those, let me know how it went.

This is not a replacement for QueryPerformanceCounter; you should still use that to
query timestamps. The goal was to be able to use only __rdtsc to capture event
timestamps and see if it was possible to transform them afterward (or "offline") into
values compatible with QueryPerformanceCounter, which is possible if you save some
additional constant values to do the transformation.

QPC uses 2 shared memory pages (inspecting the assembly using windbg "reveals" the
names of those pages): SharedUserData and RtlpHypervisorSharedUserVa.

# SharedUserData

I didn't find official documentation on that except for the header file containing the
definition of the structure (KUSER_SHARED_DATA) in the Windows Driver Kit (WDK). It also
contains some useful comments about the meaning of some of the fields (starting at line
8219).
C:\Program Files (x86)\Windows Kits\10\Include\10.0.19041.0\km\ntddk.h

There is a Microsoft documentation page but it only contains the structure definition:
https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/ntddk/ns-ntddk-kuser_shared_data

The following page contains more information, a description and history of
SharedUserData and the offsets of the different fields:
http://geoffchappell.com/studies/windows/km/ntoskrnl/structs/kuser_shared_data/index.htm

SharedUserData is always loaded in memory at the 0x7ffe0000 address.

# RtlpHypervisorSharedUserVa

When using QPC on Windows 10 (presumably only on versions that came out after the
"Anniversary update" v1607, build number 14393) there is another page used by QPC called
RtlpHypervisorSharedUserVa. Similarly, I didn't find official documentation about it.
The following tweet says that its location in memory isn't always the same but should be
near 0x7ffe8000. I tested on two machines: on one it was always at 0x7ffe8000 and on
the other it was always at 0x7ffed000. There is a way to query the location at runtime
using NtQuerySystemInformation and passing 0xc5 in the SystemInformationClass parameter.

Tweet mentioning the RtlpHypervisorSharedUserVa and
_SYSTEM_HYPERVISOR_SHARED_PAGE_INFORMATION:
https://twitter.com/aionescu/status/963584812412997632

System information class unofficial documentation:
http://geoffchappell.com/studies/windows/km/ntoskrnl/api/ex/sysinfo/class.htm?tx=181

This page also lists the name for the SystemInformationClass value
(SystemHypervisorSharedPageInformation on line 1447), contains a comment saying that the
query was added in Windows 10 redstone 4 (v1803, April 2018 update, build 17134), and
has a definition for the struct returned by NtQuerySystemInformation, which is just a
void pointer (on line 3523).
*/

#if 0
typedef struct _SYSTEM_HYPERVISOR_SHARED_PAGE_INFORMATION {
    PVOID HypervisorSharedUserVa;
} SYSTEM_HYPERVISOR_SHARED_PAGE_INFORMATION, *PSYSTEM_HYPERVISOR_SHARED_PAGE_INFORMATION;
#endif

/*
https://github.com/processhacker/processhacker/blob/master/phnt/include/ntexapi.h

This tweet mentions that the page is only used by 2 functions,
RtlQueryPerformanceCounter and RtlGetMultiTimePrecise.
https://twitter.com/AmarSaar/status/995794185398534147

I still don't know exactly what the page contains. What I observed was:
- The first 4 bytes (+0x0) read "HalT" in ascii, which is 0x546c6148 (tested on two
  machines).
- The next 4 bytes (+0x4) don't seem to be used.
- The next 8 bytes (+0x8) contain a big value that is used to do a 128-bit multiply with
  the result of rdtsc. It seems to be constant on a machine (always the same on the same
  computer), but differs between machines. I observed 0xea39330e641ff4 on my first
  machine and 0xc7b5ac275f1df on the second.
- The next 8 bytes (+0x10) contain a value that is added to the 8 upper bytes of the
  result of the multiply. On both my test machines this was always 0x0.

The first 4 bytes value (usually "HalT") seems to be used for 2 things:
- If the value is zero, QPC should use the NtQueryPerformanceCounter syscall instead of
  continuing with rdtscp.
- It seems that the value could change during the call, and if it changed, QPC would
  redo the rdtsc and multiply before continuing.

# Windows versions:

For reference here are the different versions of Windows with their kernel number:
- Windows 7: kernel 6.1
- Windows 8: kernel 6.2
- Windows 8.1: kernel 6.3
- Windows 10: kernel 10.0

Furthermore here are the different versions of Windows 10:
- Version 1507 (July) 2015 => build 10240
- Version 1511 (November update) 2015 => build 10586
- Version 1607 (Anniversary update) 2016 => build 14393
- Version 1703 (Creators update) 2017 => build 15063
- Version 1709 (Fall Creators Update) 2017 => build 16299
- Version 1803 (April 2018 update) => build 17134
- Version 1809 (October 2018 update) => build 17763
- Version 1903 (May 2019 update) => build 18362
- Version 1909 (November 2019 update) => build 18363
- Version 2004 (May 2020 update) => build 19041

# Pertinent Offsets in SharedUserData structure

## 0x0260
Kernel 10.0 and up */
ULONG NtBuildNumber;

/*
## 0x026C
Kernel 4.0 and up */
ULONG NtMajorVersion;

/*
## 0x0270
Kernel 4.0 and up */
ULONG NtMinorVersion;

/*
## 0x02ed
Kernel 6.1 only (Windows 7) */
union {
    UCHAR TscQpcData;
    struct {
        UCHAR TscQpcEnabled : 1;   // 0x01
        UCHAR TscQpcSpareFlag : 1; // 0x02
        UCHAR TscQpcShift : 6;     // 0xFC
    };
};

/*
## 0x0300
Kernel 6.2 and up (Windows 8 and up). Other meaning in previous versions.
QueryPerformanceFrequency returns this value on Windows 10 (and I suppose 8 and 8.1).
Windows 7 does a system call instead. */
LONGLONG QpcFrequency;

/*
## 0x03b8
Kernel 6.1 (Windows 7) and 6.2 (Windows 8) */
ULONGLONG volatile TscQpcBias;
/* Kernel 6.3 and up (Windows 8.1 and up) */
ULONGLONG volatile QpcBias;

/*
## 0x03C6
Kernel 6.1 (Windows 7) */
USHORT Reserved4;

/* Kernel 6.2 only (Windows 8)
This is very similar to 0x02ed but for Windows 8.
*/
union {
    USHORT TscQpcData;
    struct {
        BOOLEAN volatile TscQpcEnabled;
        UCHAR TscQpcShift;
    };
};

/* Kernel 6.3 (Windows 8.1), and kernel 10 up to version 1607 (Windows 10 anniversary
update)
Bypass here means bypassing a system call to retrieve the counter (based on the comments
in the WDK header, see above). */
union {
    USHORT QpcData;
    struct {
        BOOLEAN volatile QpcBypassEnabled;
        UCHAR QpcShift;
    };
};

/* Kernel 10 starting with version 1709 (Windows 10 fall creators update) and up.

What about version 1703 (creators update)? Assuming previous versions only set the
boolean to 0 or 1, it shouldn't matter, as the value needs to have the second bit set
(0x2) to take the hypervisor path. In theory any version starting with Windows 8 could
use the Windows 10 function below and should still work.

From the unofficial doc (see above):

Version 1709 changes QpcBypassEnabled from a UCHAR that is intended to be either TRUE or
FALSE to one whose meaning is taken in bits. Microsoft's C-language definition in the
contemporaneous WDK defines:
0x01 as SHARED_GLOBAL_FLAGS_QPC_BYPASS_ENABLED;
0x10 as SHARED_GLOBAL_FLAGS_QPC_BYPASS_USE_MFENCE;
0x20 as SHARED_GLOBAL_FLAGS_QPC_BYPASS_USE_LFENCE;
0x40 as SHARED_GLOBAL_FLAGS_QPC_BYPASS_A73_ERRATA;
0x80 as SHARED_GLOBAL_FLAGS_QPC_BYPASS_USE_RDTSCP.

From the header of the WDK (see above):

//
// Define flags for QPC bypass information. None of these flags may be set
// unless bypass is enabled. This is for compat with existing code which
// compares this value to zero to detect bypass enablement.
//

#define SHARED_GLOBAL_FLAGS_QPC_BYPASS_ENABLED (0x01)
#define SHARED_GLOBAL_FLAGS_QPC_BYPASS_USE_HV_PAGE (0x02)
#define SHARED_GLOBAL_FLAGS_QPC_BYPASS_DISABLE_32BIT (0x04)
#define SHARED_GLOBAL_FLAGS_QPC_BYPASS_USE_MFENCE (0x10)
#define SHARED_GLOBAL_FLAGS_QPC_BYPASS_USE_LFENCE (0x20)
#define SHARED_GLOBAL_FLAGS_QPC_BYPASS_A73_ERRATA (0x40)
#define SHARED_GLOBAL_FLAGS_QPC_BYPASS_USE_RDTSCP (0x80)
*/

/* NOTE This definition comes from the WDK header, not from the unofficial doc as the
rest of the definitions. */
union {
    USHORT QpcData;
    struct {

        //
        // A boolean indicating whether performance counter queries
        // can read the counter directly (bypassing the system call).
        //

        volatile UCHAR QpcBypassEnabled;

        //
        // Shift applied to the raw counter value to derive the
        // QPC count.
        //

        UCHAR QpcShift;
    };
};

#endif

/* NOTE The header names of the four includes were lost when the page was extracted.
The four below are inferred from what the code uses; add <intrin.h> if the intrinsics
are not declared by <windows.h> with your SDK. */
#include <windows.h>
#include <stdint.h>
#include <stdio.h>
#include <assert.h>

#if 0

/* NOTE Actual definitions for reference. */

/* NOTE from winternl.h */
NTSTATUS NTAPI NtQuerySystemInformation(
    IN SYSTEM_INFORMATION_CLASS SystemInformationClass,
    OUT PVOID SystemInformation,
    IN ULONG SystemInformationLength,
    OUT PULONG ReturnLength OPTIONAL
);

/* NOTE from msdn */
NTSTATUS NtQueryPerformanceCounter(
    _Out_ PLARGE_INTEGER PerformanceCounter,
    _Out_opt_ PLARGE_INTEGER PerformanceFrequency
);

#endif

typedef int32_t __stdcall NtQuerySystemInformation_t(
    int32_t SystemInformationClass,
    void* SystemInformation,
    uint32_t SystemInformationLength,
    uint32_t* ReturnLength
);

typedef int32_t NtQueryPerformanceCounter_t(
    uint64_t* PerformanceCounter,
    uint64_t* PerformanceFrequency
);

NtQueryPerformanceCounter_t* NtQueryPerformanceCounter = 0;

uint8_t* SharedUserData = ( uint8_t* ) 0x7ffe0000;
volatile uint8_t* RtlpHypervisorSharedUserVa = 0;
/* NOTE volatile because I think the content could be changed by the kernel.
*/

void qpc_win_7( uint64_t* time ) {

    uint8_t tsc_qpc_data = *( SharedUserData + 0x02ed );
    uint8_t bypass_syscall = tsc_qpc_data & 0x1;

    if ( bypass_syscall ) {

        uint8_t qpc_shift = ( tsc_qpc_data >> 2 );
        uint64_t qpc_bias = *( uint64_t* ) ( SharedUserData + 0x03b8 );

        *time = __rdtsc( );
        *time += qpc_bias;
        *time >>= qpc_shift;

    } else {
        int32_t result = NtQueryPerformanceCounter( time, 0 );
        assert( result >= 0 );
    }
}

/* NOTE This function hasn't been tested. I don't have the windows 8/10 assembly, it's
a complete guess. */
void qpc_win_8_to_10_v1067( uint64_t* time ) {

    uint8_t bypass_syscall = *( SharedUserData + 0x03c6 );

    if ( bypass_syscall ) {

        uint8_t qpc_shift = *( SharedUserData + 0x03c7 );
        uint64_t qpc_bias = *( uint64_t* ) ( SharedUserData + 0x03b8 );

        *time = __rdtsc( );
        *time += qpc_bias;
        *time >>= qpc_shift;

    } else {
        int32_t result = NtQueryPerformanceCounter( time, 0 );
        assert( result >= 0 );
    }
}

void qpc_win_10( uint64_t* time ) {

    uint64_t tsc = 0;
    uint8_t flags = *( SharedUserData + 0x3c6 );
    uint8_t bypass_syscall = flags & 0x1;

    if ( bypass_syscall ) {

        uint8_t use_hypervisor_page = flags & 0x2;

        if ( use_hypervisor_page ) {

            /* NOTE If RtlpHypervisorSharedUserVa is 0, we should use
            NtQueryPerformanceCounter to get the result of the whole function (not done
            here to keep it simple).
            */
            assert( RtlpHypervisorSharedUserVa );

            while ( 1 ) {

                /* NOTE This value is "HalT" in ascii on my machine (0x546c6148) */
                uint32_t some_value_that_should_not_be_zero = *( uint32_t* ) RtlpHypervisorSharedUserVa;

                /* NOTE If this value is 0, we should use NtQueryPerformanceCounter to
                get the result of the whole function (not done here to keep it simple). */
                assert( some_value_that_should_not_be_zero );

                uint8_t use_rdtscp = flags & 0x80;

                if ( use_rdtscp ) {
                    uint32_t x;
                    tsc = __rdtscp( &x );
                } else {

                    uint8_t lfence = flags & 0x20;
                    uint8_t mfence = flags & 0x10;

                    if ( lfence ) {
                        _mm_lfence( );
                    } else if ( mfence ) {
                        _mm_mfence( );
                    }

                    tsc = __rdtsc( );
                }

                uint64_t value_1 = *( uint64_t* ) ( RtlpHypervisorSharedUserVa + 0x08 ); /* NOTE Always 0xea39330e641ff4 on my machine. */
                uint64_t value_2 = *( uint64_t* ) ( RtlpHypervisorSharedUserVa + 0x10 ); /* NOTE Always 0x0 on my machine. */

                uint64_t high, low;
                low = _umul128( tsc, value_1, &high ); /* NOTE Could use __umulh as the low bytes are discarded. */
                high += value_2;
                tsc = high;

                low = *( uint32_t* ) RtlpHypervisorSharedUserVa;

                /* NOTE If the value "HalT" was changed since we read it, redo the work
                (possibly could make the path go through NtQueryPerformanceCounter). */
                if ( low == some_value_that_should_not_be_zero ) {
                    break;
                }
            }

            uint8_t qpc_shift = *( SharedUserData + 0x03c7 );
            /* NOTE qpc_shift is always 0 on my machine on windows 10.
            */
            uint64_t qpc_bias = *( uint64_t* ) ( SharedUserData + 0x03b8 );

            tsc = tsc + qpc_bias;
            tsc >>= qpc_shift;

            *time = tsc;

        } else {
            qpc_win_8_to_10_v1067( time );
        }

    } else {
        int32_t result = NtQueryPerformanceCounter( time, 0 );
        assert( result >= 0 );
    }
}

uint64_t tsc_to_qpc_win_7( uint64_t tsc ) {
    uint8_t tsc_qpc_data = *( SharedUserData + 0x02ed );
    uint8_t qpc_shift = ( tsc_qpc_data >> 2 );
    uint64_t qpc_bias = *( uint64_t* ) ( SharedUserData + 0x03b8 );
    uint64_t result = tsc + qpc_bias;
    result >>= qpc_shift;
    return result;
}

uint64_t tsc_to_qpc_win_8_to_10_v1067( uint64_t tsc ) {
    uint8_t qpc_shift = *( SharedUserData + 0x03c7 );
    uint64_t qpc_bias = *( uint64_t* ) ( SharedUserData + 0x03b8 );
    uint64_t result = tsc + qpc_bias;
    result >>= qpc_shift;
    return result;
}

uint64_t tsc_to_qpc_win_10( uint64_t tsc ) {
    uint64_t value_1 = *( uint64_t* ) ( RtlpHypervisorSharedUserVa + 0x08 );
    uint64_t value_2 = *( uint64_t* ) ( RtlpHypervisorSharedUserVa + 0x10 );
    tsc = __umulh( tsc, value_1 );
    tsc += value_2;
    uint8_t qpc_shift = *( SharedUserData + 0x03c7 );
    uint64_t qpc_bias = *( uint64_t* ) ( SharedUserData + 0x03b8 );
    uint64_t result = tsc + qpc_bias;
    result >>= qpc_shift;
    return result;
}

typedef void qpc_t( uint64_t* time );
qpc_t* custom_qpc = 0;
typedef uint64_t tsc_to_qpc_t( uint64_t );
tsc_to_qpc_t* tsc_to_qpc = 0;

void custom_qpc_init( ) {

    HANDLE ntdll = LoadLibrary( "ntdll.dll" );
    NtQuerySystemInformation_t* NtQuerySystemInformation = 0;

    if ( ntdll ) {
        NtQueryPerformanceCounter = ( NtQueryPerformanceCounter_t* ) GetProcAddress( ntdll, "NtQueryPerformanceCounter" );
        assert( NtQueryPerformanceCounter );
        NtQuerySystemInformation = ( NtQuerySystemInformation_t* ) GetProcAddress( ntdll, "NtQuerySystemInformation" );
        assert( NtQuerySystemInformation );
        FreeLibrary( ntdll );
    }

    uint32_t kernel_major = *( uint32_t* ) ( SharedUserData + 0x026c );
    uint32_t kernel_minor = *( uint32_t* ) ( SharedUserData + 0x0270 );

    if ( kernel_major == 6 && kernel_minor == 1 ) {

        custom_qpc = qpc_win_7;
        tsc_to_qpc = tsc_to_qpc_win_7;

    } else if ( kernel_major == 6 && ( kernel_minor == 2 || kernel_minor == 3 ) ) {

        custom_qpc = qpc_win_8_to_10_v1067;
        tsc_to_qpc = tsc_to_qpc_win_8_to_10_v1067;

    } else if ( kernel_major == 10 ) {

        uint32_t win_10_build_number = *( uint32_t* ) ( SharedUserData + 0x0260 );

        if ( win_10_build_number > 14393 ) {

            /* NOTE Build after anniversary update. This number might need to be bumped
            to 16299. */
            uint64_t system_information;
            uint32_t out_size;
            int32_t SystemHypervisorSharedPageInformation = 0xc5;
            int32_t result = NtQuerySystemInformation( SystemHypervisorSharedPageInformation, &system_information, sizeof( system_information ), &out_size );
            assert( out_size == sizeof( system_information ) );

            if ( result >= 0 ) {
                RtlpHypervisorSharedUserVa = ( uint8_t* ) system_information;
            }

            custom_qpc = qpc_win_10;
            tsc_to_qpc = tsc_to_qpc_win_10;

        } else {
            custom_qpc = qpc_win_8_to_10_v1067;
            tsc_to_qpc = tsc_to_qpc_win_8_to_10_v1067;
        }

    } else {
        assert( !"Not supported" );
    }
}

int main( int argc, char** argv ) {

    custom_qpc_init( );

    uint64_t mins[ 3 ] = { 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff };
    uint64_t maxs[ 3 ] = { 0 };
    uint64_t bests[ 3 ] = { 0 };

    uint32_t sleep_duration = 100;
    #define iteration_count 100
    uint64_t qpc_results[ iteration_count ] = { 0 };
    uint64_t custom_qpc_results[ iteration_count ] = { 0 };
    uint64_t rdtscp_starts[ iteration_count ] = { 0 };
    uint64_t rdtscp_ends[ iteration_count ] = { 0 };

    for ( uint32_t index = 0; index < iteration_count; index++ ) {
        uint64_t s, e;
        QueryPerformanceCounter( ( LARGE_INTEGER* ) &s );
        Sleep( sleep_duration );
        QueryPerformanceCounter( ( LARGE_INTEGER* ) &e );
        qpc_results[ index ] = e - s;
    }

    for ( uint32_t index = 0; index < iteration_count; index++ ) {
        uint64_t s, e;
        custom_qpc( &s );
        Sleep( sleep_duration );
        custom_qpc( &e );
        custom_qpc_results[ index ] = e - s;
    }

    for ( uint32_t index = 0; index < iteration_count; index++ ) {
        uint32_t x = 0; /* NOTE __rdtscp expects a pointer to an unsigned int. */
        rdtscp_starts[ index ] = __rdtscp( &x );
        Sleep( sleep_duration );
        rdtscp_ends[ index ] = __rdtscp( &x );
    }

    for ( uint32_t index = 0; index < iteration_count; index++ ) {

        uint64_t rdtscp_result = tsc_to_qpc( rdtscp_ends[ index ] ) - tsc_to_qpc( rdtscp_starts[ index ] );

        if ( qpc_results[ index ] < mins[ 0 ] ) {
            mins[ 0 ] = qpc_results[ index ];
        }

        if ( qpc_results[ index ] > maxs[ 0 ] ) {
            maxs[ 0 ] = qpc_results[ index ];
        }

        if ( custom_qpc_results[ index ] < mins[ 1 ] ) {
            mins[ 1 ] = custom_qpc_results[ index ];
        }

        if ( custom_qpc_results[ index ] > maxs[ 1 ] ) {
            maxs[ 1 ] = custom_qpc_results[ index ];
        }

        if ( rdtscp_result < mins[ 2 ] ) {
            mins[ 2 ] = rdtscp_result;
        }

        if ( rdtscp_result > maxs[ 2 ] ) {
            maxs[ 2 ] = rdtscp_result;
        }

        if ( qpc_results[ index ] < custom_qpc_results[ index ] && qpc_results[ index ] < rdtscp_result ) {
            bests[ 0 ]++;
        } else if ( custom_qpc_results[ index ] < rdtscp_result ) {
            bests[ 1 ]++;
        } else {
            bests[ 2 ]++;
        }
    }

    /* NOTE Best on each version should be roughly the same. */
    printf( "qpc\n\tmin: %llu\n\tmax: %llu\n\tbest: %llu\n", mins[ 0 ], maxs[ 0 ], bests[ 0 ] );
    printf( "custom qpc\n\tmin: %llu\n\tmax: %llu\n\tbest: %llu\n", mins[ 1 ], maxs[ 1 ], bests[ 1 ] );
    printf( "rdtscp\n\tmin: %llu\n\tmax: %llu\n\tbest: %llu\n", mins[ 2 ], maxs[ 2 ], bests[ 2 ] );

    return 0;
}

#if 0
/*
I used the following assembly from Windows 10 QPC. I'm not sure which version, but it's
prior to 1909 (I got an update while working on this). The 1909 assembly is a bit
different but not by much.

ntdll!RtlQueryPerformanceCounter:
00007ffb7e4aca70 48895c2408        mov qword ptr [rsp+8], rbx ss:000000645075fb20=0000000000000000
00007ffb7e4aca75 57                push rdi
00007ffb7e4aca76 4883ec20          sub rsp, 20h
00007ffb7e4aca7a 448a0c25c603fe7f  mov r9b, byte ptr [SharedUserData+0x3c6 (000000007ffe03c6)]
00007ffb7e4aca82 488bd9            mov rbx, rcx
00007ffb7e4aca85 41f6c101          test r9b, 1
00007ffb7e4aca89 7470              je ntdll!RtlQueryPerformanceCounter+0x8b (00007ffb7e4acafb)
00007ffb7e4aca8b 4c8b1c25b803fe7f  mov r11, qword ptr [SharedUserData+0x3b8 (000000007ffe03b8)]
00007ffb7e4aca93 41f6c102          test r9b, 2
00007ffb7e4aca97 0f84b61d0600      je ntdll!RtlQueryPerformanceCounter+0x61de3 (00007ffb7e50e853)
00007ffb7e4aca9d 4c8b0584731000    mov r8, qword ptr [ntdll!RtlpHypervisorSharedUserVa (00007ffb7e5b3e28)]
00007ffb7e4acaa4 4d85c0            test r8, r8
00007ffb7e4acaa7 7452              je ntdll!RtlQueryPerformanceCounter+0x8b (00007ffb7e4acafb)
00007ffb7e4acaa9 458b10            mov r10d, dword ptr [r8]
00007ffb7e4acaac 4585d2            test r10d, r10d
00007ffb7e4acaaf 744a              je ntdll!RtlQueryPerformanceCounter+0x8b (00007ffb7e4acafb)
00007ffb7e4acab1 4584c9            test r9b, r9b
00007ffb7e4acab4 0f897e1d0600      jns ntdll!RtlQueryPerformanceCounter+0x61dc8 (00007ffb7e50e838)
00007ffb7e4acaba 0f01f9            rdtscp
00007ffb7e4acabd 48c1e220          shl rdx, 20h
00007ffb7e4acac1 480bd0            or rdx, rax
00007ffb7e4acac4 498b4008          mov rax, qword ptr [r8+8]
00007ffb7e4acac8 498b4810          mov rcx, qword ptr [r8+10h]
00007ffb7e4acacc 48f7e2            mul rax, rdx
00007ffb7e4acacf 418b00            mov eax, dword ptr [r8]
00007ffb7e4acad2 4803d1            add rdx, rcx
00007ffb7e4acad5 413bc2            cmp eax, r10d
00007ffb7e4acad8 75cf              jne ntdll!RtlQueryPerformanceCounter+0x39 (00007ffb7e4acaa9)
00007ffb7e4acada 8a0c25c703fe7f    mov cl, byte ptr [SharedUserData+0x3c7 (000000007ffe03c7)]
00007ffb7e4acae1 4a8d041a          lea rax, [rdx+r11]
00007ffb7e4acae5 48d3e8            shr rax, cl
00007ffb7e4acae8 488903            mov qword ptr [rbx], rax
00007ffb7e4acaeb b801000000        mov eax, 1
00007ffb7e4acaf0 488b5c2430        mov rbx, qword ptr [rsp+30h]
00007ffb7e4acaf5 4883c420          add rsp, 20h
00007ffb7e4acaf9 5f                pop rdi
00007ffb7e4acafa c3                ret
00007ffb7e4acafb 33d2              xor edx, edx
00007ffb7e4acafd 488d4c2440        lea rcx, [rsp+40h]
00007ffb7e4acb02 e869320400        call ntdll!NtQueryPerformanceCounter (00007ffb7e4efd70)
00007ffb7e4acb07 488b442440        mov rax, qword ptr [rsp+40h]
00007ffb7e4acb0c ebda              jmp ntdll!RtlQueryPerformanceCounter+0x78 (00007ffb7e4acae8)
00007ffb7e4acb0e cc                int 3

Some jumps lead here.

00007ffb7e50e838 41f6c120          test r9b, 20h
00007ffb7e50e83c 7405              je ntdll!RtlQueryPerformanceCounter+0x61dd3 (00007ffb7e50e843)
00007ffb7e50e83e 0faee8            lfence
00007ffb7e50e841 eb09              jmp ntdll!RtlQueryPerformanceCounter+0x61ddc (00007ffb7e50e84c)
00007ffb7e50e843 41f6c110          test r9b, 10h
00007ffb7e50e847 7403              je ntdll!RtlQueryPerformanceCounter+0x61ddc (00007ffb7e50e84c)
00007ffb7e50e849 0faef0            mfence
00007ffb7e50e84c 0f31              rdtsc
00007ffb7e50e84e e96ae2f9ff        jmp ntdll!RtlQueryPerformanceCounter+0x4d (00007ffb7e4acabd)
00007ffb7e50e853 4584c9            test r9b, r9b
00007ffb7e50e856 7905              jns ntdll!RtlQueryPerformanceCounter+0x61ded (00007ffb7e50e85d)
00007ffb7e50e858 0f01f9            rdtscp
00007ffb7e50e85b eb16              jmp ntdll!RtlQueryPerformanceCounter+0x61e03 (00007ffb7e50e873)
00007ffb7e50e85d 41f6c120          test r9b, 20h
00007ffb7e50e861 7405              je ntdll!RtlQueryPerformanceCounter+0x61df8 (00007ffb7e50e868)
00007ffb7e50e863 0faee8            lfence
00007ffb7e50e866 eb09              jmp ntdll!RtlQueryPerformanceCounter+0x61e01 (00007ffb7e50e871)
00007ffb7e50e868 41f6c110          test r9b, 10h
00007ffb7e50e86c 7403              je ntdll!RtlQueryPerformanceCounter+0x61e01 (00007ffb7e50e871)
00007ffb7e50e86e 0faef0            mfence
00007ffb7e50e871 0f31              rdtsc
00007ffb7e50e873 48c1e220          shl rdx, 20h
00007ffb7e50e877 480bd0            or rdx, rax
00007ffb7e50e87a e95be2f9ff        jmp ntdll!RtlQueryPerformanceCounter+0x6a (00007ffb7e4acada)
00007ffb7e50e87f cc                int 3
*/
#endif
Mārtiņš Možeiko
2453 posts / 2 projects
Edited by Mārtiņš Možeiko on

In the assembly, r10d is loaded from the RtlpHypervisorSharedUserVa address, and that load happens inside the while loop, not outside it. It is basically your some_value_that_should_not_be_zero value, and that is the one you should use for qpc_bias (r10d). The assembly does not repeatedly load this value into "low" as you are doing just before the comparison.
Simon Anciaux
1274 posts
I believe the code is correct.

- When using the hypervisor page path, there are 2 adds. The first one uses the value at hypervisor page + 0x10 (loaded in rcx; that value is always 0 on my machine) and is inside the loop. The second one is the qpc bias (r11) and is outside the loop.
- When not using the hypervisor page, there is only the qpc bias (r11) add.
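To make the two adds easier to follow, here is a minimal sketch of the same transformation as a pure function that works on saved constants (the "offline" use case). The struct and function names are made up for this example, and the high multiply is built from 32-bit halves so it does not depend on __umulh; it is an illustration of the arithmetic described above, not the ntdll implementation.

```c
#include <stdint.h>

/* High 64 bits of the 128-bit product a * b, built from 32-bit halves. */
static uint64_t mul_high_u64(uint64_t a, uint64_t b)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t lo_lo = a_lo * b_lo;
    uint64_t hi_lo = a_hi * b_lo;
    uint64_t lo_hi = a_lo * b_hi;
    uint64_t hi_hi = a_hi * b_hi;

    /* Sum of the middle terms; cannot overflow 64 bits. */
    uint64_t cross = (lo_lo >> 32) + (uint32_t)hi_lo + lo_hi;
    return hi_hi + (hi_lo >> 32) + (cross >> 32);
}

/* Constants captured once at runtime (hypothetical struct for this sketch). */
typedef struct {
    uint64_t hv_scale;  /* hypervisor page + 0x08 */
    uint64_t hv_offset; /* hypervisor page + 0x10, first add (inside the loop) */
    uint64_t qpc_bias;  /* SharedUserData + 0x03b8, second add (outside the loop) */
    uint8_t  qpc_shift; /* SharedUserData + 0x03c7 */
} qpc_constants;

static uint64_t tsc_to_qpc_offline(uint64_t tsc, const qpc_constants *c)
{
    uint64_t scaled = mul_high_u64(tsc, c->hv_scale) + c->hv_offset;
    return (scaled + c->qpc_bias) >> c->qpc_shift;
}
```

As a rough check, the 0xea39330e641ff4 value mentioned earlier divided by 2^64 is about 0.0036, which would scale a TSC running at roughly 2.8 GHz down to about 10 MHz.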

Here is the assembly commented.

; Bypass enabled ?
test r9b, 1
; NtQueryPerformanceCounter syscall
je ntdll!RtlQueryPerformanceCounter+0x8b (00007ffb7e4acafb)
; Loading qpc_bias in r11
mov r11, qword ptr [SharedUserData+0x3b8 (000000007ffe03b8)]
; Use hypervisor page ?
test r9b, 2
; Second part of assembly (rdtsc/p + maybe lfence or mfence)
je ntdll!RtlQueryPerformanceCounter+0x61de3 (00007ffb7e50e853)
; Loading address of RtlpHypervisorSharedUserVa into r8 (r8 = 0x7ffe8000)
mov r8, qword ptr [ntdll!RtlpHypervisorSharedUserVa (00007ffb7e5b3e28)]
; Is RtlpHypervisorSharedUserVa present ?
test r8, r8
; NtQueryPerformanceCounter syscall
je ntdll!RtlQueryPerformanceCounter+0x8b (00007ffb7e4acafb)

; Loading first 4 bytes of RtlpHypervisorSharedUserVa. r10d is 'HalT'
; This is the first instruction of the loop
mov r10d, dword ptr [r8]
; r10d != 0
test r10d, r10d
; NtQueryPerformanceCounter syscall
je ntdll!RtlQueryPerformanceCounter+0x8b (00007ffb7e4acafb)
; Use rdtscp ?
test r9b, r9b
; Second part of assembly (use rdtsc + maybe lfence or mfence)
jns ntdll!RtlQueryPerformanceCounter+0x61dc8 (00007ffb7e50e838)
; rdtscp and combine the result in rdx
rdtscp
; This is where the path that uses the hypervisor page but not rdtscp returns to
; (the jump target in the listing is +0x4d, which is this shl)
shl rdx, 20h
or rdx, rax
; Loading RtlpHypervisorSharedUserVa + 0x08 into rax
; This is the value that is used in the 128-bit multiply
mov rax, qword ptr [r8+8]
; Loading RtlpHypervisorSharedUserVa + 0x10 into rcx
; This is the value that is added to the high bytes of the 128-bit multiply (but was always 0x0 in my tests)
mov rcx, qword ptr [r8+10h]
; 128-bit multiply. rdx contains the result's high bytes, rax contains the result's low bytes
mul rax, rdx
; Load RtlpHypervisorSharedUserVa + 0x0 'HalT' into eax
; This "discards" the low bytes of the 128-bit multiply
mov eax, dword ptr [r8]
; rcx is the content of RtlpHypervisorSharedUserVa + 0x10
add rdx, rcx
; Is eax == r10d => do both values contain 'HalT' ?
cmp eax, r10d
; Loop jump
jne ntdll!RtlQueryPerformanceCounter+0x39 (00007ffb7e4acaa9)

; Load qpc_shift into cl (always 0x0 in my tests)
; This is where the path that doesn't use the hypervisor page returns to.
mov cl, byte ptr [SharedUserData+0x3c7 (000000007ffe03c7)]
; Add qpc_bias (r11) to rdx (high bytes of the 128-bit multiply) and store in rax
lea rax, [rdx+r11]
; Do qpc_shift
shr rax, cl
; Put the result at the output memory location
mov qword ptr [rbx], rax
; Return 1
mov eax, 1
Simon Anciaux
1274 posts
Edited by Simon Anciaux on Reason: updated the code
I looked at how to do a similar thing on Linux (convert a TSC result to seconds). This is based on the source of kernel version 5.9 and was tested on Manjaro Linux 64-bit in a VirtualBox virtual machine running on Windows 7 (i7 860). I downloaded the source from git.kernel.org but I will link to elixir.bootlin.com as this should keep pointing to the right version.

clock_gettime uses vdso.

clock_gettime is an alias for __vdso_clock_gettime, which in turn calls __cvdso_clock_gettime, which calls __cvdso_clock_gettime_data.

__cvdso_clock_gettime retrieves the address of a data block shared by the kernel with the user process by calling __arch_get_vdso_data which returns a global variable called __vdso_data. It's a bit complicated how this variable can be accessed.

It's defined in arch/x86/include/asm/vdso/gettimeofday.h using the VVAR macro. When processed the actual name is vvar__vdso_data.

At the bottom of the vvar.h file there is a declaration:
DECLARE_VVAR(128, struct vdso_data, _vdso_data)

When processed, the macro expands to:
extern struct vdso_data vvar__vdso_data[CS_BASES] __attribute__((visibility("hidden")));

Note that the offset (128) isn't used at all in that code. CS_BASES, whose value is 2, is defined in datapage.h, as is the struct vdso_data. The definition of the struct can change between kernel versions, but the first 6 fields don't seem to change and those are the ones that we are interested in (we'll see why later).

The problem here is that I would like to avoid including headers, especially if those headers require the kernel source. In principle, what we want is the location of the data so that we can just read the bytes we want. To find that location, I stepped through the assembly and identified the address of the vdso_data. Its offset from the vdso image you get with getauxval seems to always be the same, about 16 KiB. In fact it was 16 KiB - 128 bytes, which is 4 memory pages minus the offset from the DECLARE_VVAR macro. I don't remember how I found out, but there is a file arch/x86/entry/vdso/vdso-layout.lds.S that says that vvar_start = -4 * PAGE_SIZE. So I think it's OK to assume a constant offset from the vdso image. If someone knows a safer way to retrieve the location of vvar__vdso_data I would like to know.
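The address computation described above can be sketched as follows. This assumes the layout observed on kernel 5.9 (vvar area starting 4 pages before the vdso image, data at offset 128); the function names are made up for the example, and the result is only a guess at an address that should be validated before being read.

```c
#include <stdint.h>
#include <unistd.h>
#include <sys/auxv.h>

/* Pure arithmetic: the vvar area starts `vvar_pages` pages before the vdso
   image and the data sits `vvar_offset` bytes into it. */
static uintptr_t vdso_data_address(uintptr_t vdso_base, uintptr_t page_size,
                                   uintptr_t vvar_pages, uintptr_t vvar_offset)
{
    return vdso_base - vvar_pages * page_size + vvar_offset;
}

/* Returns 0 if the process has no vdso mapping.
   The constants 4 and 128 are assumptions taken from the 5.9 observations. */
static uintptr_t guess_vdso_data_address(void)
{
    uintptr_t vdso_base = (uintptr_t)getauxval(AT_SYSINFO_EHDR);
    if (vdso_base == 0) {
        return 0;
    }
    uintptr_t page_size = (uintptr_t)sysconf(_SC_PAGESIZE);
    return vdso_data_address(vdso_base, page_size, 4, 128);
}
```

Keeping the page count and offset as parameters makes it easy to try a different layout on another kernel version.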

Note that the offset can change between kernel versions. For example, on kernel version 4.20 the offset is -3 * PAGE_SIZE.

__cvdso_clock_gettime_data calls __cvdso_clock_gettime_common, or calls a fallback function if gettime_common fails. The fallback function is a clock_gettime system call. Note that even if a system doesn't support the vdso clock_gettime, it still goes through the vdso functions, fails and then makes a syscall (I observed that on a 32-bit Intel Atom processor running Debian 10 with the 4.19 kernel).

My understanding of how clock_gettime (with CLOCK_MONOTONIC) works at a high level is:
- Periodically the kernel updates the vdso_data values.
- I measured an interval (if my understanding of what's going on is correct) of 4 697 428 nanoseconds (about 4.7 ms) on my setup, so the update interval is in the range of 10 milliseconds. I'll explain how I measured that at the end.
- When you call gettime, it reads the TSC, measures how much it has changed since the last kernel update and returns the kernel time plus that change. I believe this is done to keep more precision and to keep everything in a 64-bit integer range.

__cvdso_clock_gettime_common first checks that we requested a valid clock, then, based on the clock, calls either do_hres or do_coarse. If you request CLOCK_MONOTONIC_RAW it will use the second element of the vdso_data array instead of the first one.

do_coarse will simply return the last value from the kernel update, not reading the current TSC (not adding the difference).

do_hres:
- Most of the function is in a loop. I believe this is to make sure that the values read from the kernel don't change between the different reads. It's non-blocking and will loop until it succeeds. If you try to step through this assembly you won't be able to step out of the loop, as the values will most likely change while you step.
- There is another loop that checks for time namespaces. I'm not familiar with time namespaces but I'm confident we don't care about them in our case, so we can skip this loop.
- The code then calls __arch_get_hw_counter: in our case this calls rdtscp, or rdtsc with memory fences. Even PVCLOCK and HVCLOCK will at some point use rdtscp and adjust its value. In practice, while stepping through the assembly, this results in rdtscp being called.
- The code then computes the difference with the kernel's last TSC value, converts it to nanoseconds and adds it to the last kernel time value.
--- vdso_calc_delta verifies that the new TSC value is greater than the one from the kernel (I believe because the Intel spec says that TSC values can be a little off if you read the TSC of two cores at the same time).
--- It then computes the difference between the two and multiplies it by the mult field from the vdso_data structure.
--- It adds that value to the kernel nanosecond value and shifts the result right by the shift field from the vdso_data structure.
- If the values weren't updated in the meantime, the loop ends.
- The nanosecond count is converted to seconds, which are added to the result seconds, and the remainder is stored as nanoseconds.
- That's it.
- That's it.
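The steps above can be sketched in C roughly as follows. This is a simplified model, not the kernel code: the struct is a made-up stand-in for the first fields of vdso_data, the TSC value is passed in as a parameter instead of being read inside the loop, and the memory barriers and the time namespace check are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the fields of vdso_data used by do_hres. */
typedef struct {
    volatile uint32_t seq; /* odd while the kernel is in the middle of an update */
    uint64_t cycle_last;   /* TSC value at the last kernel update */
    uint64_t mask;
    uint32_t mult;
    uint32_t shift;
    uint64_t base_sec;     /* kernel time at the last update, seconds */
    uint64_t base_nsec;    /* kernel time at the last update, nanoseconds << shift */
} fake_vdso_data_t;

typedef struct { uint64_t sec; uint64_t nsec; } fake_timespec_t;

fake_timespec_t sketch_do_hres(const fake_vdso_data_t *vd, uint64_t tsc_now) {
    uint32_t seq;
    uint64_t sec, ns;
    do {
        /* Wait until no update is in progress, then remember the sequence number. */
        while ((seq = vd->seq) & 1) { }
        /* Cycles elapsed since the last kernel update (0 if the TSC appears to be behind). */
        uint64_t delta = tsc_now > vd->cycle_last ? (tsc_now - vd->cycle_last) & vd->mask : 0;
        /* Scale to shifted nanoseconds, add the kernel base, then shift down. */
        ns = (vd->base_nsec + delta * vd->mult) >> vd->shift;
        sec = vd->base_sec;
        /* Retry if the kernel changed the data while we were reading it. */
    } while (vd->seq != seq);
    /* Carry whole seconds out of the nanosecond count. */
    sec += ns / 1000000000ull;
    ns %= 1000000000ull;
    return (fake_timespec_t){ sec, ns };
}
```

For example, with cycle_last = 1000, mult = 1 << 24, shift = 24 and a base of 10 s + 999 999 000 ns, a TSC value of 3000 yields 11 s + 1000 ns.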

This means that to convert a TSC value to a time, you multiply it by vdso_data.mult and shift it right by vdso_data.shift, which gives you an integer representing the timestamp in nanoseconds (clock_getres seems to only ever return 1 nanosecond).

- One small issue is that the mult field can change, but it only changes by 1 (oscillating between 0x5b7e10 and 0x5b7e0f for example), so I don't think it's a big issue. The shift value never changed in my tests.
- A bigger issue is that the result of the multiply will quickly take more than 64 bits. We could either use doubles for that, or do a 128-bit integer multiply and hope that the shift brings the value back into 64-bit range. I haven't tested the 128-bit multiply as I couldn't figure out whether there are intrinsics like _umul128 on Linux. If anyone knows, I'm all ears.

One thing that is off is that if I convert rdtscp to ns using mult and shift, the result is not the same as what I get from clock_gettime. In my tests there was a difference of about 30 seconds. I'm not sure, but that seems to be the time it takes for the system to boot up. Maybe using CLOCK_MONOTONIC_RAW would give a closer result, but I haven't tested it.

How I measured the interval for the kernel update:
- This is to get an idea of the range, not a precise measurement;
- I set a breakpoint in __vdso_clock_gettime in gdb;
- I stepped until I reached the rdtscp instruction;
- A little bit after that there is code that looks like this:
    mov 0x8(%r10), %rcx
    mov 0x28(%r11), %rax
    mov 0x18(%r10), %esi

- I added a breakpoint on the second instruction, and used "continue" to go through another iteration of the loop;
- In that code rcx is cycle_last from the vdso_data structure (r10 is the address of the vdso_data structure);
- I noted the value of rcx (0xadac82ca89c);
- Don't step, as rax contains the current TSC value (0xadac8f51ca1) (also stored in rdx at that time);
- I subtracted them, multiplied the result by vdso_data.mult and shifted it right by vdso_data.shift;
- 0xadac8f51ca1 - 0xadac82ca89c = 0xc87405, and (0xc87405 * mult) >> 0x18 = 0x47ad54 = 4 697 428 ns.

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>
    #include <sys/auxv.h>
    #include <sys/utsname.h>

    /* Local copy of the beginning of the kernel's struct vdso_data. */
    typedef struct vdso_data_t {
        uint32_t seq;

        int32_t clock_mode;
        uint64_t cycle_last;
        uint64_t mask;
        uint32_t mult;
        uint32_t shift;

        struct {
            uint64_t sec;
            uint64_t nsec;
        } basetime[ 12 ];
        /*
        union {
            struct vdso_timestamp basetime[VDSO_BASES];
            struct timens_offset offset[VDSO_BASES];
        };
        */

        int32_t tz_minuteswest;
        int32_t tz_dsttime;
        uint32_t hrtimer_res;
        uint32_t __unused;

        /* Empty struct on x86
        struct arch_vdso_data arch_data; */
    } vdso_data_t;

    typedef unsigned __int128 uint128_t;

    /* Parses "major.minor" from the kernel release string; defaults to 5.9. */
    void get_kernel_version( uint32_t* major_out, uint32_t* minor_out ) {

        /* uname -r */
        *major_out = 5;
        *minor_out = 9;

        struct utsname info;

        if ( uname( &info ) >= 0 ) {

            uint32_t major = 0;
            uint32_t minor = 0;
            char* version = info.release;

            while ( *version && *version != '.' ) {
                char c = *version;
                if ( c >= '0' && c <= '9' ) {
                    major *= 10;
                    major += ( c - '0' );
                }
                version++;
            }

            version++;

            while ( *version && *version != '.' ) {
                char c = *version;
                if ( c >= '0' && c <= '9' ) {
                    minor *= 10;
                    minor += ( c - '0' );
                }
                version++;
            }

            if ( major ) {
                *major_out = major;
                *minor_out = minor;
            }
        }
    }

    /* Offset of vvar__vdso_data relative to the vdso image, chosen by kernel version. */
    intptr_t get_vdso_data_offset( void ) {

        uint32_t major, minor;
        get_kernel_version( &major, &minor );

        intptr_t page_size = 1 << 12;
        intptr_t offset = -4 * page_size;

        if ( major == 5 ) {
            if ( minor > 5 ) {
                offset = -4 * page_size;
            } else {
                offset = -3 * page_size;
            }
        } else if ( major == 4 ) {
            if ( minor > 11 ) {
                offset = -3 * page_size;
            } else if ( minor > 6 ) {
                offset = -2 * page_size;
            } else if ( minor > 4 ) {
                offset = -3 * page_size;
            } else if ( minor > 1 ) {
                offset = -2 * page_size;
            } else {
                assert( !"Unsupported" );
            }
        }

        /* The 128 from DECLARE_VVAR. */
        offset += 128;

        return offset;
    }

    int main( int argc, char** argv ) {

        intptr_t vdso_data_offset = get_vdso_data_offset( );
        uint8_t* vdso = ( uint8_t* ) getauxval( AT_SYSINFO_EHDR );
        vdso_data_t* data = ( vdso_data_t* ) ( vdso + vdso_data_offset );
        uint32_t mul = data[ 0 ].mult;
        uint32_t shift = data[ 0 ].shift;

        struct timespec res;
        clock_getres( CLOCK_MONOTONIC, &res );
        double frequency = res.tv_nsec * 1000000000;

        uint32_t x = 0;
        uint64_t rdtsc = __builtin_ia32_rdtscp( &x );

        struct timespec spec;
        clock_gettime( CLOCK_MONOTONIC, &spec );

        /* 128-bit integer version of ( rdtsc * mult ) >> shift. */
        uint128_t temp = ( uint128_t ) rdtsc * mul;
        temp >>= shift;
        assert( ( temp & 0xffffffffffffffff ) == temp );
        uint64_t result = ( uint64_t ) ( temp );
        double r1 = ( double ) result / frequency;
        printf( "integer: %5.9f\n", r1 );

        /* Same conversion using doubles. */
        double divisor = 1;

        for ( uint32_t i = 0; i < shift; i++ ) {
            divisor *= 2;
        }

        double r1b = ( ( double ) rdtsc * ( double ) mul ) / divisor;
        r1b /= frequency;
        printf( "double : %5.9f\n", r1b );

        double r2 = ( double ) spec.tv_sec + ( ( double ) spec.tv_nsec / frequency );
        printf( "gettime: %5.9f\n", r2 );

        printf( "---\n" );
        printf("integer diff: %5.9f\n", r1 - r2 );
        printf("double diff: %5.9f\n", r1b - r2 );

        return 0;
    }

Mārtiņš Možeiko
mrmixer
- A bigger issue is that the result of the multiply will quickly take more than 64 bits. We could either use doubles for that, or do a 128-bit integer multiply and hope that the shift brings the value back into 64-bit range. I haven't tested the 128-bit multiply as I couldn't figure out whether there are intrinsics like _umul128 on Linux. If anyone knows, I'm all ears.

Intrinsics like _umul128 are not OS specific. They are compiler specific. On gcc/clang you can use the __int128 type instead (even on Windows):
    uint64_t a = ..., b = ...;
    unsigned __int128 big = (unsigned __int128)a * b;
    uint64_t higher_64_bits = (uint64_t)(big >> 64);

The compiler will optimize this correctly into a single two-register mul, and in this example the shift is "free", as I simply take the upper 64 bits.
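A self-contained version of the snippet above, as a minimal sketch for gcc/clang on a 64-bit target. The function names are made up for illustration; the second helper shows the same multiply with an arbitrary shift, which is the shape needed for the mult/shift conversion discussed earlier.

```c
#include <assert.h>
#include <stdint.h>

/* Upper 64 bits of the full 128-bit product a * b. */
uint64_t mul_high_u64(uint64_t a, uint64_t b) {
    unsigned __int128 big = (unsigned __int128)a * b;
    return (uint64_t)(big >> 64);
}

/* Full 128-bit product shifted right by `shift` bits, truncated to 64 bits.
   The caller has to make sure the shifted result actually fits in 64 bits. */
uint64_t mul_shift_u64(uint64_t a, uint64_t b, unsigned shift) {
    assert(shift < 128);
    unsigned __int128 big = (unsigned __int128)a * b;
    return (uint64_t)(big >> shift);
}
```

For instance, mul_shift_u64(1ull << 60, 1u << 24, 24) returns 1ull << 60, while the same product computed in plain 64-bit arithmetic wraps around to 0 before the shift.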

The alternative is inline asm. As you are writing architecture-specific code, the asm only needs to exist in one variant. For a 64x64 mul it is a trivial one line of inline asm with the "mul" instruction ("imul" being the signed variant).
Simon Anciaux
Thanks. I updated the code with the 128-bit multiply and some code that tries to choose the offset based on the kernel version (only tested on kernel 5.9, so we shouldn't assume it's working).
Tim
Hello, thank you for the in-depth info about QPC, a very interesting read.

I just wanted to add that QueryPerformanceFrequency always returned 10 MHz for me as well when the invariant TSC was used, and always 14.32 MHz when HPET was used.

I came across this thread while searching for the best way to profile my (VBA) code with the QueryPerformanceCounter function. I found that QPF always returned 10 MHz as well and, after reading this thread and a bit more, that the HPET wasn't being used: it was enabled in the BIOS, but Windows 10 used the invariant TSC (I found that out with a piece of profiling software). Surprisingly, when I turned HPET on, QPF also always returned the same value, now 14,318,180. That held after a normal restart and also after a restart where I cut the power for half a minute: still 14.32 MHz. Info about my desktop: Windows 10, version 2004, build 19041.985, i7-2600K (3.4 GHz, not overclocked), Hyper-V enabled in the BIOS (I did not test the effect of turning Hyper-V off).

And, also not sure why and if interesting, when looping over the QPC function 5000 times and storing its value in an array (calculations only done after the loop completed), with the ITSC the difference between QPC calls would be about 28 ticks and the total 5000 loops took about 14-15 milliseconds, but with HPET on it registered about 55 ticks per loop and total calculated time was 18-20 ms.
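As a rough cross-check of those numbers, the sketch below converts a tick count to nanoseconds for a given QPF value. The helper name is made up for this example, and it assumes the frequency is below about 18 GHz so the intermediate product fits in 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Converts `ticks` counted at `freq` ticks per second to nanoseconds.
   Whole seconds and the remainder are handled separately to avoid overflow. */
uint64_t ticks_to_ns(uint64_t ticks, uint64_t freq) {
    assert(freq > 0);
    uint64_t sec = ticks / freq;
    uint64_t rem = ticks % freq;
    return sec * 1000000000ull + rem * 1000000000ull / freq;
}
```

With these inputs, 28 ticks at 10,000,000 Hz is 2800 ns and 55 ticks at 14,318,180 Hz is 3841 ns per call. Over 5000 calls that is about 14 ms and about 19 ms, which lines up with the totals measured above.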
Simon Anciaux
TimerTimmyyy
Surprisingly, when I turned HPET on, QPF also always returned the same value, now 14,318,180. That held after a normal restart and also after a restart where I cut the power for half a minute: still 14.32 MHz.

I assume the surprising part is that it's always the same value. But HPET is a hardware timer so it is expected to always have the same frequency.

TimerTimmyyy
And, also not sure why and if interesting, when looping over the QPC function 5000 times and storing its value in an array (calculations only done after the loop completed), with the ITSC the difference between QPC calls would be about 28 ticks and the total 5000 loops took about 14-15 milliseconds, but with HPET on it registered about 55 ticks per loop and total calculated time was 18-20 ms.

I did a quick test, and QPC with HPET on most likely takes more time because it's a syscall instruction, meaning the program will ask the Windows kernel to do something (I haven't looked into exactly why a syscall takes more time though). When HPET is off, QPC is a RDTSCP instruction with a few more instructions around it.
Tim
Thanks for the response!

QPC with HPET on most likely takes more time because it's a syscall instruction, meaning the program will ask the Windows kernel to do something

At first I thought it was because of the different time scaling with HPET on/off (10 vs 14.32 MHz), but this actually makes more sense! Can you please tell me a little more about:

When HPET is off, QPC is a RDTSCP instruction with a few more instructions around it.

Can that last part stand on its own? Specifically:

QPC is a RDTSCP instruction with a few more instructions around it.

I'm quite a noob and the texts above (with assembly code) go way over my head, but I remember reading/learning from this thread that the QPC function actually has a loop in it. I also noticed this when calling QPC repeatedly (for example in a loop of 5000 iterations). Usually the number of ticks between two QPC calls was around 25. But sometimes a QPC call would take about 160 ticks. It didn't seem to matter where in the loop this occurred: it occurred as often between calls 1 and 500 as between calls 4000 and 4500. As explained above, it is probably because of that loop inside QPC itself (I hope I understand this correctly).

I spent about a full week reading up on this, and I kept having the feeling that RDTSCP would be a better option than QPC. I don't care how many nanoseconds the result is off by, I just want to output how one piece of code compares to another piece of code. What I think I need for that is core CPU cycles. So here comes my question, finally:

QPC is a RDTSCP instruction with a few more instructions around it.

Does this mean that it actually does not matter which function you use (QPC or RDTSCP) because they actually return the same thing (just QPC having a little more overhead)? Both returning the amount of clock cycles since boot.

It is what I concluded before (I mean, even Google Benchmark uses the rdtsc command), but when I read another comment yesterday I got confused again, making it sound like QPC (or also (RD)TSC) is 'just a wall clock time'...

"The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter."

I thought QPC returned the core clock cycles, but according to this quote... it doesn't: TSC (and thus QPC and RDTSC) return a 'reference clock cycle'?
Mārtiņš Možeiko
It is called that because it runs at the same frequency regardless of how fast the core actually runs. Modern CPU cores have turbo scaling and can boost their frequency, i.e. how many cycles they execute per second. But the counter read by the rdtsc instruction is invariant: it advances by the same number of "reference cycles" per second.

> Does this mean that it actually does not matter which function you use (QPC or RDTSCP) because they actually return the same thing (just QPC having a little more overhead)?

No, you should not rely on QPC returning the same value as rdtsc. Many different things can affect that: Windows settings, the hypervisor, hardware/BIOS configuration.

You should use rdtsc when you want a low-latency counter, but you won't know the actual wall-clock time.
And you should use QPC when you need actual time.

If you want a low-latency counter but still need to convert to real time, you can use rdtsc and from time to time (like every 100 msec or once a second) call QPC and synchronize the rdtsc readings with these QPC values, linearly interpolating between them.
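A minimal sketch of that interpolation, assuming you have already recorded two synchronization points (an rdtsc reading paired with a QPC reading taken at nearly the same moment). The type and function names are made up for this example, and it needs gcc/clang for __int128.

```c
#include <assert.h>
#include <stdint.h>

/* One synchronization sample: both counters read back to back. */
typedef struct {
    uint64_t tsc; /* raw counter value, e.g. from rdtsc */
    uint64_t qpc; /* QPC value sampled at (nearly) the same moment */
} sync_point_t;

/* Maps a raw tsc reading onto the QPC scale using the line through two sync points.
   Readings after `newer` are extrapolated along the same line. */
uint64_t tsc_to_qpc(sync_point_t older, sync_point_t newer, uint64_t tsc) {
    assert(newer.tsc > older.tsc);
    assert(newer.qpc >= older.qpc);
    assert(tsc >= older.tsc);
    /* 128-bit intermediate so the multiplication cannot overflow. */
    unsigned __int128 scaled =
        (unsigned __int128)(tsc - older.tsc) * (newer.qpc - older.qpc);
    return older.qpc + (uint64_t)(scaled / (newer.tsc - older.tsc));
}
```

The result is in QPC ticks, so it can be divided by QueryPerformanceFrequency like any other QPC value. How often to refresh the sync points is a trade-off between overhead and drift.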
Simon Anciaux
mrmixer
QPC is a RDTSCP instruction with a few more instructions around it.

What I meant was that after the RDTSCP instruction there are some instructions that convert the resulting value into one that you can divide by the result of QueryPerformanceFrequency to get a time in seconds. Otherwise the result of RDTSCP can't be "easily" converted to seconds. Those few instructions are a 128-bit multiply, 2 adds and a bit shift (on Windows 10). Note that this is based on my observations and I don't guarantee that it's correct or that it will continue to work in the future.

Here is the code I use in a profiler to convert the result of RDTSCP to a QPF compatible value.

    uint64_t profiler_cycles_to_time( profiler_tool_t* tool, uint64_t cycles ) {

        uint64_t time = cycles;
        profiler_platform_data_t* platform = &tool->platform_data;

        if ( tool->platform == profiler_platform_windows ) {

            if ( platform->windows.version == profiler_windows_version_10 ) {

    #if defined( PROFILER_MSVC )
                time = __umulh( cycles, platform->windows.mul128 );
    #elif defined( PROFILER_CLANG ) || defined( PROFILER_GCC )
                unsigned __int128 big = ( unsigned __int128 ) cycles * platform->windows.mul128;
                time = ( uint64_t ) ( big >> 64 );
    #else
    # error Unsupported compiler.
    #endif
                time += platform->windows.add;
            }

            time += platform->windows.qpc_bias;
            time >>= platform->windows.qpc_shift;

        } else if ( tool->platform == profiler_platform_linux ) {

    #if defined( PROFILER_MSVC )
            uint64_t high = 0;
            uint64_t low = _umul128( cycles, platform->linux_.mult, &high );
            profiler_assert( platform->linux_.shift <= 0xff );
            time = __shiftright128( low, high, ( unsigned char ) platform->linux_.shift );
    #elif defined( PROFILER_CLANG ) || defined( PROFILER_GCC )
            unsigned __int128 big = ( unsigned __int128 ) time * tool->platform_data.linux_.mult;
            big >>= tool->platform_data.linux_.shift;
            time = ( uint64_t ) big;
    #else
    # error Unsupported compiler.
    #endif

        } else {
            profiler_assert( tool->platform == profiler_platform_fallback );
            time = profiler_cycles_to_time_fallback( tool, cycles );
        }

        return time;
    }

TimerTimmyyy

I'm quite a noob and the texts above (with assembly code) go way over my head, but I remember reading/learning from this thread that the QPC function actually has a loop in it. I also noticed this when calling QPC repeatedly (for example in a loop of 5000 iterations). Usually the number of ticks between two QPC calls was around 25. But sometimes a QPC call would take about 160 ticks. It didn't seem to matter where in the loop this occurred: it occurred as often between calls 1 and 500 as between calls 4000 and 4500. As explained above, it is probably because of that loop inside QPC itself (I hope I understand this correctly).

As I don't have the code of the QPC function, I can only guess what it does. My understanding of the loop in QPC is that it will almost never run more than once. I haven't measured that, but in my tests I don't think it ever did. The reason for the loop is (once again it's a guess) to make sure some information in the hypervisor memory page doesn't change during the call to QPC.

The 160 ticks you saw might be related to that, but it could also be Windows not giving processor time to your application for some other reason.

TimerTimmyyy
Does this mean that it actually does not matter which function you use (QPC or RDTSCP) because they actually return the same thing (just QPC having a little more overhead)? Both returning the amount of clock cycles since boot.

As mmozeiko said, no. As far as I know QPC will never return the same value as RDTSCP. It returns a value derived from the RDTSCP value, so that you can use QPF to convert the timestamp to seconds. If you don't care about the absolute time and only want to compare different runs of a piece of code, then you can use RDTSCP.

TimerTimmyyy
"The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter."

As mmozeiko said, the invariant time stamp counter (the value returned by RDTSCP and RDTSC) doesn't count actual core clock cycles: the processor speed changes in real time, so a second can take X core cycles at one point and Y cycles at another. The invariant TSC increases its value at a constant rate, so that a second always takes X cycles, until you restart your computer, at which point X might be different, but it will still be X cycles per second until the next restart.