#if 0
/* NOTE build: cl main.c -Feqpc.exe -nologo -O2 -Zi */

/*
The following code is the result of trying to figure out what transformation
QueryPerformanceCounter (QPC) applies to the result of the rdtsc (or rdtscp)
instruction, based on the assembly of QPC.

This was only tested on Windows 7, Windows 10 version 1903 and Windows 10 version 1909.
Windows 8, 8.1 and prior versions of Windows 10 are a complete guess. If you test on
those, let me know how it went.

This is not a replacement for QueryPerformanceCounter, you should still use that to
query timestamps. The goal was to be able to use only __rdtsc to capture event
timestamps and see if it was possible to transform them afterward (or "offline") into
values compatible with QueryPerformanceCounter. This is possible if you save some
additional constant values to do the transformation.

QPC uses 2 shared memory pages (inspecting the assembly using windbg "reveals" the
names of those pages): SharedUserData and RtlpHypervisorSharedUserVa.

# SharedUserData

I didn't find official documentation on that except for the header file containing the
definition of the structure (KUSER_SHARED_DATA) in the Windows Driver Kit (WDK). It
also contains some useful comments about the meaning of some of the fields (starting
at line 8219).
C:\Program Files (x86)\Windows Kits\10\Include\10.0.19041.0\km\ntddk.h

There is a Microsoft documentation page but it only contains the structure definition:
https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/ntddk/ns-ntddk-kuser_shared_data

The following page contains more information, a description and history of
SharedUserData and the offsets of the different fields:
http://geoffchappell.com/studies/windows/km/ntoskrnl/structs/kuser_shared_data/index.htm

SharedUserData is always loaded in memory at the 0x7ffe0000 address.

# RtlpHypervisorSharedUserVa

When using QPC on Windows 10 (presumably only on versions that came out after the
"Anniversary update" v1607, build number 14393) there is another page used by QPC
called RtlpHypervisorSharedUserVa. Similarly, I didn't find official documentation
about it.

The following tweet says that its location in memory isn't always the same but should
be near 0x7ffe8000. I tested on two machines: on one it was always at 0x7ffe8000 and
on the other it was always at 0x7ffed000. There is a way to query the location at
runtime using NtQuerySystemInformation and passing 0xc5 in the SystemInformationClass
parameter.

Tweet mentioning RtlpHypervisorSharedUserVa and _SYSTEM_HYPERVISOR_SHARED_PAGE_INFORMATION:
https://twitter.com/aionescu/status/963584812412997632

System information class unofficial documentation:
http://geoffchappell.com/studies/windows/km/ntoskrnl/api/ex/sysinfo/class.htm?tx=181

This page also lists the name for the SystemInformationClass value
(SystemHypervisorSharedPageInformation on line 1447), contains a comment saying that
the query was added in Windows 10 redstone 4 (v1803, April 2018 update, build 17134),
and has a definition for the struct returned by NtQuerySystemInformation, which is
just a void pointer (on line 3523).
*/

#if 0
typedef struct _SYSTEM_HYPERVISOR_SHARED_PAGE_INFORMATION
{
    PVOID HypervisorSharedUserVa;
} SYSTEM_HYPERVISOR_SHARED_PAGE_INFORMATION, *PSYSTEM_HYPERVISOR_SHARED_PAGE_INFORMATION;
#endif

/*
https://github.com/processhacker/processhacker/blob/master/phnt/include/ntexapi.h

This tweet mentions that the page is only used by 2 functions,
RtlQueryPerformanceCounter and RtlGetMultiTimePrecise.
https://twitter.com/AmarSaar/status/995794185398534147

I still don't know exactly what the page contains. What I observed was:
- The first 4 bytes (+0x0) read "HalT" in ascii, which is 0x546c6148 (tested on two machines).
- The next 4 bytes (+0x4) don't seem to be used.
- The next 8 bytes (+0x8) contain a big value that is used in a 128-bit multiply with
  the result of rdtsc. It seems to be constant on a machine (always the same on the
  same computer), but differs between machines. I observed 0xea39330e641ff4 on my first
  machine and 0xc7b5ac275f1df on the second.
- The next 8 bytes (+0x10) contain a value that is added to the upper 8 bytes of the
  result of the multiply. On both my test machines this was always 0x0.

The first 4 bytes (usually "HalT") seem to be used for 2 things:
- If the value is zero, QPC should use the NtQueryPerformanceCounter syscall instead of
  continuing with rdtscp.
- It seems that the value could change during the call, and if it changed, QPC would
  redo the rdtsc and multiply before continuing.

# Windows versions:

For reference here are the different versions of Windows with their kernel number:
- Windows 7: kernel 6.1
- Windows 8: kernel 6.2
- Windows 8.1: kernel 6.3
- Windows 10: kernel 10.0

Furthermore here are the different versions of Windows 10:
- Version 1507 (July) 2015 => build 10240
- Version 1511 (November update) 2015 => build 10586
- Version 1607 (Anniversary update) 2016 => build 14393
- Version 1703 (Creators update) 2017 => build 15063
- Version 1709 (Fall Creators Update) 2017 => build 16299
- Version 1803 (April 2018 update) => build 17134
- Version 1809 (October 2018 update) => build 17763
- Version 1903 (May 2019 update) => build 18362
- Version 1909 (November 2019 update) => build 18363
- Version 2004 (May 2020 update) => build 19041

# Pertinent offsets in the SharedUserData structure

## 0x0260
Kernel 10.0 and up
*/
ULONG NtBuildNumber;

/*
## 0x026C
Kernel 4.0 and up
*/
ULONG NtMajorVersion;

/*
## 0x0270
Kernel 4.0 and up
*/
ULONG NtMinorVersion;

/*
## 0x02ed
Kernel 6.1 only (Windows 7)
*/
union {
    UCHAR TscQpcData;
    struct {
        UCHAR TscQpcEnabled : 1; // 0x01
        UCHAR TscQpcSpareFlag : 1; // 0x02
        UCHAR TscQpcShift : 6; // 0xFC
    };
};

/*
## 0x0300
Kernel 6.2 and up (Windows 8 and up). Other meaning in previous versions.
QueryPerformanceFrequency returns this value on Windows 10 (and I suppose 8 and 8.1).
Windows 7 does a system call instead.
*/
LONGLONG QpcFrequency;

/*
## 0x03b8
Kernel 6.1 (Windows 7) and 6.2 (Windows 8)
*/
ULONGLONG volatile TscQpcBias;
/* Kernel 6.3 and up (Windows 8.1 and up) */
ULONGLONG volatile QpcBias;

/*
## 0x03C6
Kernel 6.1 (Windows 7)
*/
USHORT Reserved4;

/*
Kernel 6.2 only (Windows 8)
This is very similar to 0x02ed but for Windows 8.
*/
union {
    USHORT TscQpcData;
    struct {
        BOOLEAN volatile TscQpcEnabled;
        UCHAR TscQpcShift;
    };
};

/*
Kernel 6.3 (Windows 8.1), and kernel 10 up to version 1607 (Windows 10 anniversary update)
Bypass here means bypassing a system call to retrieve the counter (based on the
comments in the WDK header, see above).
*/
union {
    USHORT QpcData;
    struct {
        BOOLEAN volatile QpcBypassEnabled;
        UCHAR QpcShift;
    };
};

/*
Kernel 10 starting with version 1709 (Windows 10 fall creators update) and up.

What about version 1703 (creators update)? Assuming previous versions only set the
boolean to 0 or 1, it shouldn't matter, as the value needs to have the second bit set
(0x2) to take the hypervisor path.

In theory any version starting with Windows 8 could use the Windows 10 function below
and should still work.

From the unofficial doc (see above):

Version 1709 changes QpcBypassEnabled from a UCHAR that is intended to be either TRUE
or FALSE to one whose meaning is taken in bits. Microsoft's C-language definition in
the contemporaneous WDK defines:
    0x01 as SHARED_GLOBAL_FLAGS_QPC_BYPASS_ENABLED;
    0x10 as SHARED_GLOBAL_FLAGS_QPC_BYPASS_USE_MFENCE;
    0x20 as SHARED_GLOBAL_FLAGS_QPC_BYPASS_USE_LFENCE;
    0x40 as SHARED_GLOBAL_FLAGS_QPC_BYPASS_A73_ERRATA;
    0x80 as SHARED_GLOBAL_FLAGS_QPC_BYPASS_USE_RDTSCP.

From the header of the WDK (see above):

//
// Define flags for QPC bypass information. None of these flags may be set
// unless bypass is enabled. This is for compat with existing code which
// compares this value to zero to detect bypass enablement.
//

#define SHARED_GLOBAL_FLAGS_QPC_BYPASS_ENABLED (0x01)
#define SHARED_GLOBAL_FLAGS_QPC_BYPASS_USE_HV_PAGE (0x02)
#define SHARED_GLOBAL_FLAGS_QPC_BYPASS_DISABLE_32BIT (0x04)
#define SHARED_GLOBAL_FLAGS_QPC_BYPASS_USE_MFENCE (0x10)
#define SHARED_GLOBAL_FLAGS_QPC_BYPASS_USE_LFENCE (0x20)
#define SHARED_GLOBAL_FLAGS_QPC_BYPASS_A73_ERRATA (0x40)
#define SHARED_GLOBAL_FLAGS_QPC_BYPASS_USE_RDTSCP (0x80)
*/

/* NOTE This definition comes from the WDK header, not from the unofficial doc as the
rest of the definitions. */
union {
    USHORT QpcData;
    struct {

        //
        // A boolean indicating whether performance counter queries
        // can read the counter directly (bypassing the system call).
        //

        volatile UCHAR QpcBypassEnabled;

        //
        // Shift applied to the raw counter value to derive the
        // QPC count.
        //

        UCHAR QpcShift;
    };
};
#endif

#include <stdint.h>
#include <assert.h>
#include <windows.h>
#include <intrin.h> /* __rdtsc, __rdtscp, _umul128, __umulh, _mm_lfence, _mm_mfence */
#include <stdio.h>

#if 0
/* NOTE Actual definitions for reference. */

/* NOTE from winternl.h */
NTSTATUS NTAPI NtQuerySystemInformation(
    IN SYSTEM_INFORMATION_CLASS SystemInformationClass,
    OUT PVOID SystemInformation,
    IN ULONG SystemInformationLength,
    OUT PULONG ReturnLength OPTIONAL
);

/* NOTE from msdn */
NTSTATUS NtQueryPerformanceCounter(
    _Out_ PLARGE_INTEGER PerformanceCounter,
    _Out_opt_ PLARGE_INTEGER PerformanceFrequency
);
#endif

typedef int32_t __stdcall NtQuerySystemInformation_t(
    int32_t SystemInformationClass,
    void* SystemInformation,
    uint32_t SystemInformationLength,
    uint32_t* ReturnLength );

typedef int32_t NtQueryPerformanceCounter_t(
    uint64_t* PerformanceCounter,
    uint64_t* PerformanceFrequency );

NtQueryPerformanceCounter_t* NtQueryPerformanceCounter = 0;

uint8_t* SharedUserData = ( uint8_t* ) 0x7ffe0000;
volatile uint8_t* RtlpHypervisorSharedUserVa = 0;
/* NOTE volatile because I think the content could be changed by the kernel.
*/

void qpc_win_7( uint64_t* time ) {

    uint8_t tsc_qpc_data = *( SharedUserData + 0x02ed );
    uint8_t bypass_syscall = tsc_qpc_data & 0x1;

    if ( bypass_syscall ) {
        uint8_t qpc_shift = ( tsc_qpc_data >> 2 );
        uint64_t qpc_bias = *( uint64_t* ) ( SharedUserData + 0x03b8 );
        *time = __rdtsc( );
        *time += qpc_bias;
        *time >>= qpc_shift;
    } else {
        int32_t result = NtQueryPerformanceCounter( time, 0 );
        assert( result >= 0 );
    }
}

/* NOTE This function hasn't been tested. I don't have the windows 8/10 assembly, it's
a complete guess. */
void qpc_win_8_to_10_v1067( uint64_t* time ) {

    uint8_t bypass_syscall = *( SharedUserData + 0x03c6 );

    if ( bypass_syscall ) {
        uint8_t qpc_shift = *( SharedUserData + 0x03c7 );
        uint64_t qpc_bias = *( uint64_t* ) ( SharedUserData + 0x03b8 );
        *time = __rdtsc( );
        *time += qpc_bias;
        *time >>= qpc_shift;
    } else {
        int32_t result = NtQueryPerformanceCounter( time, 0 );
        assert( result >= 0 );
    }
}

void qpc_win_10( uint64_t* time ) {

    uint64_t tsc = 0;
    uint8_t flags = *( SharedUserData + 0x3c6 );
    uint8_t bypass_syscall = flags & 0x1;

    if ( bypass_syscall ) {

        uint8_t use_hypervisor_page = flags & 0x2;

        if ( use_hypervisor_page ) {

            /* NOTE If RtlpHypervisorSharedUserVa is 0, we should use
            NtQueryPerformanceCounter to get the result of the whole function
            (not done here to keep it simple).
            */
            assert( RtlpHypervisorSharedUserVa );

            while ( 1 ) {

                /* NOTE This value is "HalT" in ascii on my machine (0x546c6148) */
                uint32_t some_value_that_should_not_be_zero = *( uint32_t* ) RtlpHypervisorSharedUserVa;

                /* NOTE If this value is 0, we should use NtQueryPerformanceCounter to
                get the result of the whole function (not done here to keep it simple). */
                assert( some_value_that_should_not_be_zero );

                uint8_t use_rdtscp = flags & 0x80;

                if ( use_rdtscp ) {
                    uint32_t x;
                    tsc = __rdtscp( &x );
                } else {

                    uint8_t lfence = flags & 0x20;
                    uint8_t mfence = flags & 0x10;

                    if ( lfence ) {
                        _mm_lfence( );
                    } else if ( mfence ) {
                        _mm_mfence( );
                    }

                    tsc = __rdtsc( );
                }

                uint64_t value_1 = *( uint64_t* ) ( RtlpHypervisorSharedUserVa + 0x08 ); /* NOTE Always 0xea39330e641ff4 on my machine. */
                uint64_t value_2 = *( uint64_t* ) ( RtlpHypervisorSharedUserVa + 0x10 ); /* NOTE Always 0x0 on my machine. */

                uint64_t high, low;
                low = _umul128( tsc, value_1, &high ); /* NOTE Could use __umulh as the low bytes are discarded. */
                high += value_2;
                tsc = high;

                low = *( uint32_t* ) RtlpHypervisorSharedUserVa;

                /* NOTE If the value "HalT" was changed since we read it, redo the work
                (possibly could make the path go through NtQueryPerformanceCounter). */
                if ( low == some_value_that_should_not_be_zero ) {
                    break;
                }
            }

            uint8_t qpc_shift = *( SharedUserData + 0x03c7 );
            /* NOTE qpc_shift is always 0 on my machine on windows 10.
            */
            uint64_t qpc_bias = *( uint64_t* ) ( SharedUserData + 0x03b8 );
            tsc = tsc + qpc_bias;
            tsc >>= qpc_shift;
            *time = tsc;

        } else {
            qpc_win_8_to_10_v1067( time );
        }

    } else {
        int32_t result = NtQueryPerformanceCounter( time, 0 );
        assert( result >= 0 );
    }
}

uint64_t tsc_to_qpc_win_7( uint64_t tsc ) {
    uint8_t tsc_qpc_data = *( SharedUserData + 0x02ed );
    uint8_t qpc_shift = ( tsc_qpc_data >> 2 );
    uint64_t qpc_bias = *( uint64_t* ) ( SharedUserData + 0x03b8 );
    uint64_t result = tsc + qpc_bias;
    result >>= qpc_shift;
    return result;
}

uint64_t tsc_to_qpc_win_8_to_10_v1067( uint64_t tsc ) {
    uint8_t qpc_shift = *( SharedUserData + 0x03c7 );
    uint64_t qpc_bias = *( uint64_t* ) ( SharedUserData + 0x03b8 );
    uint64_t result = tsc + qpc_bias;
    result >>= qpc_shift;
    return result;
}

uint64_t tsc_to_qpc_win_10( uint64_t tsc ) {
    uint64_t value_1 = *( uint64_t* ) ( RtlpHypervisorSharedUserVa + 0x08 );
    uint64_t value_2 = *( uint64_t* ) ( RtlpHypervisorSharedUserVa + 0x10 );
    tsc = __umulh( tsc, value_1 );
    tsc += value_2;
    uint8_t qpc_shift = *( SharedUserData + 0x03c7 );
    uint64_t qpc_bias = *( uint64_t* ) ( SharedUserData + 0x03b8 );
    uint64_t result = tsc + qpc_bias;
    result >>= qpc_shift;
    return result;
}

typedef void qpc_t( uint64_t* time );
qpc_t* custom_qpc = 0;

typedef uint64_t tsc_to_qpc_t( uint64_t );
tsc_to_qpc_t* tsc_to_qpc = 0;

void custom_qpc_init( ) {

    HANDLE ntdll = LoadLibrary( "ntdll.dll" );
    NtQuerySystemInformation_t* NtQuerySystemInformation = 0;

    if ( ntdll ) {
        NtQueryPerformanceCounter = ( NtQueryPerformanceCounter_t* ) GetProcAddress( ntdll, "NtQueryPerformanceCounter" );
        assert( NtQueryPerformanceCounter );
        NtQuerySystemInformation = ( NtQuerySystemInformation_t* ) GetProcAddress( ntdll, "NtQuerySystemInformation" );
        assert( NtQuerySystemInformation );
        FreeLibrary( ntdll );
    }

    uint32_t kernel_major = *( uint32_t* ) ( SharedUserData + 0x026c );
    uint32_t kernel_minor = *( uint32_t* ) ( SharedUserData + 0x0270 );

    if ( kernel_major == 6 && kernel_minor == 1 ) {
        custom_qpc = qpc_win_7;
        tsc_to_qpc = tsc_to_qpc_win_7;
    } else if ( kernel_major == 6 && ( kernel_minor == 2 || kernel_minor == 3 ) ) {
        custom_qpc = qpc_win_8_to_10_v1067;
        tsc_to_qpc = tsc_to_qpc_win_8_to_10_v1067;
    } else if ( kernel_major == 10 ) {

        uint32_t win_10_build_number = *( uint32_t* ) ( SharedUserData + 0x0260 );

        if ( win_10_build_number > 14393 ) {
            /* NOTE Build after anniversary update. This number might need to be bumped to 16299. */
            uint64_t system_information = 0;
            uint32_t out_size = 0;
            int32_t SystemHypervisorSharedPageInformation = 0xc5;
            int32_t result = NtQuerySystemInformation( SystemHypervisorSharedPageInformation, &system_information, sizeof( system_information ), &out_size );

            /* NOTE out_size is only meaningful when the query succeeded. */
            if ( result >= 0 ) {
                assert( out_size == sizeof( system_information ) );
                RtlpHypervisorSharedUserVa = ( uint8_t* ) system_information;
            }

            custom_qpc = qpc_win_10;
            tsc_to_qpc = tsc_to_qpc_win_10;

        } else {
            custom_qpc = qpc_win_8_to_10_v1067;
            tsc_to_qpc = tsc_to_qpc_win_8_to_10_v1067;
        }

    } else {
        assert( !"Not supported" );
    }
}

int main( int argc, char** argv ) {

    custom_qpc_init( );

    uint64_t mins[ 3 ] = { 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff };
    uint64_t maxs[ 3 ] = { 0 };
    uint64_t bests[ 3 ] = { 0 };

    uint32_t sleep_duration = 100;

    #define iteration_count 100

    uint64_t qpc_results[ iteration_count ] = { 0 };
    uint64_t custom_qpc_results[ iteration_count ] = { 0 };
    uint64_t rdtscp_starts[ iteration_count ] = { 0 };
    uint64_t rdtscp_ends[ iteration_count ] = { 0 };

    for ( uint32_t index = 0; index < iteration_count; index++ ) {
        uint64_t s, e;
        QueryPerformanceCounter( ( LARGE_INTEGER* ) &s );
        Sleep( sleep_duration );
        QueryPerformanceCounter( ( LARGE_INTEGER* ) &e );
        qpc_results[ index ] = e - s;
    }

    for ( uint32_t index = 0; index < iteration_count; index++ ) {
        uint64_t s, e;
        custom_qpc( &s );
        Sleep( sleep_duration );
        custom_qpc( &e );
        custom_qpc_results[ index ] = e - s;
    }

    for ( uint32_t index = 0; index < iteration_count; index++ ) {
        uint32_t x = 0; /* NOTE __rdtscp takes an unsigned int pointer. */
        rdtscp_starts[ index ] = __rdtscp( &x );
        Sleep( sleep_duration );
        rdtscp_ends[ index ] = __rdtscp( &x );
    }

    for ( uint32_t index = 0; index < iteration_count; index++ ) {

        uint64_t rdtscp_result = tsc_to_qpc( rdtscp_ends[ index ] ) - tsc_to_qpc( rdtscp_starts[ index ] );

        if ( qpc_results[ index ] < mins[ 0 ] ) {
            mins[ 0 ] = qpc_results[ index ];
        }

        if ( qpc_results[ index ] > maxs[ 0 ] ) {
            maxs[ 0 ] = qpc_results[ index ];
        }

        if ( custom_qpc_results[ index ] < mins[ 1 ] ) {
            mins[ 1 ] = custom_qpc_results[ index ];
        }

        if ( custom_qpc_results[ index ] > maxs[ 1 ] ) {
            maxs[ 1 ] = custom_qpc_results[ index ];
        }

        if ( rdtscp_result < mins[ 2 ] ) {
            mins[ 2 ] = rdtscp_result;
        }

        if ( rdtscp_result > maxs[ 2 ] ) {
            maxs[ 2 ] = rdtscp_result;
        }

        if ( qpc_results[ index ] < custom_qpc_results[ index ] && qpc_results[ index ] < rdtscp_result ) {
            bests[ 0 ]++;
        } else if ( custom_qpc_results[ index ] < rdtscp_result ) {
            bests[ 1 ]++;
        } else {
            bests[ 2 ]++;
        }
    }

    /* NOTE Best on each version should be roughly the same. */
    printf( "qpc\n\tmin: %llu\n\tmax: %llu\n\tbest: %llu\n", mins[ 0 ], maxs[ 0 ], bests[ 0 ] );
    printf( "custom qpc\n\tmin: %llu\n\tmax: %llu\n\tbest: %llu\n", mins[ 1 ], maxs[ 1 ], bests[ 1 ] );
    printf( "rdtscp\n\tmin: %llu\n\tmax: %llu\n\tbest: %llu\n", mins[ 2 ], maxs[ 2 ], bests[ 2 ] );

    return 0;
}

#if 0
/*
I used the following assembly from Windows 10 QPC. I'm not sure which version, but it's
prior to 1909 (I got an update while working on this). The 1909 assembly is a bit
different but not by much.

ntdll!RtlQueryPerformanceCounter:
00007ffb`7e4aca70 48895c2408       mov qword ptr [rsp+8], rbx ss:00000064`5075fb20=0000000000000000
00007ffb`7e4aca75 57               push rdi
00007ffb`7e4aca76 4883ec20         sub rsp, 20h
00007ffb`7e4aca7a 448a0c25c603fe7f mov r9b, byte ptr [SharedUserData+0x3c6 (00000000`7ffe03c6)]
00007ffb`7e4aca82 488bd9           mov rbx, rcx
00007ffb`7e4aca85 41f6c101         test r9b, 1
00007ffb`7e4aca89 7470             je ntdll!RtlQueryPerformanceCounter+0x8b (00007ffb`7e4acafb)
00007ffb`7e4aca8b 4c8b1c25b803fe7f mov r11, qword ptr [SharedUserData+0x3b8 (00000000`7ffe03b8)]
00007ffb`7e4aca93 41f6c102         test r9b, 2
00007ffb`7e4aca97 0f84b61d0600     je ntdll!RtlQueryPerformanceCounter+0x61de3 (00007ffb`7e50e853)
00007ffb`7e4aca9d 4c8b0584731000   mov r8, qword ptr [ntdll!RtlpHypervisorSharedUserVa (00007ffb`7e5b3e28)]
00007ffb`7e4acaa4 4d85c0           test r8, r8
00007ffb`7e4acaa7 7452             je ntdll!RtlQueryPerformanceCounter+0x8b (00007ffb`7e4acafb)
00007ffb`7e4acaa9 458b10           mov r10d, dword ptr [r8]
00007ffb`7e4acaac 4585d2           test r10d, r10d
00007ffb`7e4acaaf 744a             je ntdll!RtlQueryPerformanceCounter+0x8b (00007ffb`7e4acafb)
00007ffb`7e4acab1 4584c9           test r9b, r9b
00007ffb`7e4acab4 0f897e1d0600     jns ntdll!RtlQueryPerformanceCounter+0x61dc8 (00007ffb`7e50e838)
00007ffb`7e4acaba 0f01f9           rdtscp
00007ffb`7e4acabd 48c1e220         shl rdx, 20h
00007ffb`7e4acac1 480bd0           or rdx, rax
00007ffb`7e4acac4 498b4008         mov rax, qword ptr [r8+8]
00007ffb`7e4acac8 498b4810         mov rcx, qword ptr [r8+10h]
00007ffb`7e4acacc 48f7e2           mul rax, rdx
00007ffb`7e4acacf 418b00           mov eax, dword ptr [r8]
00007ffb`7e4acad2 4803d1           add rdx, rcx
00007ffb`7e4acad5 413bc2           cmp eax, r10d
00007ffb`7e4acad8 75cf             jne ntdll!RtlQueryPerformanceCounter+0x39 (00007ffb`7e4acaa9)
00007ffb`7e4acada 8a0c25c703fe7f   mov cl, byte ptr [SharedUserData+0x3c7 (00000000`7ffe03c7)]
00007ffb`7e4acae1 4a8d041a         lea rax, [rdx+r11]
00007ffb`7e4acae5 48d3e8           shr rax, cl
00007ffb`7e4acae8 488903           mov qword ptr [rbx], rax
00007ffb`7e4acaeb b801000000       mov eax, 1
00007ffb`7e4acaf0 488b5c2430       mov rbx, qword ptr [rsp+30h]
00007ffb`7e4acaf5 4883c420         add rsp, 20h
00007ffb`7e4acaf9 5f               pop rdi
00007ffb`7e4acafa c3               ret
00007ffb`7e4acafb 33d2             xor edx, edx
00007ffb`7e4acafd 488d4c2440       lea rcx, [rsp+40h]
00007ffb`7e4acb02 e869320400       call ntdll!NtQueryPerformanceCounter (00007ffb`7e4efd70)
00007ffb`7e4acb07 488b442440       mov rax, qword ptr [rsp+40h]
00007ffb`7e4acb0c ebda             jmp ntdll!RtlQueryPerformanceCounter+0x78 (00007ffb`7e4acae8)
00007ffb`7e4acb0e cc               int 3

Some jumps lead here.

00007ffb`7e50e838 41f6c120         test r9b, 20h
00007ffb`7e50e83c 7405             je ntdll!RtlQueryPerformanceCounter+0x61dd3 (00007ffb`7e50e843)
00007ffb`7e50e83e 0faee8           lfence
00007ffb`7e50e841 eb09             jmp ntdll!RtlQueryPerformanceCounter+0x61ddc (00007ffb`7e50e84c)
00007ffb`7e50e843 41f6c110         test r9b, 10h
00007ffb`7e50e847 7403             je ntdll!RtlQueryPerformanceCounter+0x61ddc (00007ffb`7e50e84c)
00007ffb`7e50e849 0faef0           mfence
00007ffb`7e50e84c 0f31             rdtsc
00007ffb`7e50e84e e96ae2f9ff       jmp ntdll!RtlQueryPerformanceCounter+0x4d (00007ffb`7e4acabd)
00007ffb`7e50e853 4584c9           test r9b, r9b
00007ffb`7e50e856 7905             jns ntdll!RtlQueryPerformanceCounter+0x61ded (00007ffb`7e50e85d)
00007ffb`7e50e858 0f01f9           rdtscp
00007ffb`7e50e85b eb16             jmp ntdll!RtlQueryPerformanceCounter+0x61e03 (00007ffb`7e50e873)
00007ffb`7e50e85d 41f6c120         test r9b, 20h
00007ffb`7e50e861 7405             je ntdll!RtlQueryPerformanceCounter+0x61df8 (00007ffb`7e50e868)
00007ffb`7e50e863 0faee8           lfence
00007ffb`7e50e866 eb09             jmp ntdll!RtlQueryPerformanceCounter+0x61e01 (00007ffb`7e50e871)
00007ffb`7e50e868 41f6c110         test r9b, 10h
00007ffb`7e50e86c 7403             je ntdll!RtlQueryPerformanceCounter+0x61e01 (00007ffb`7e50e871)
00007ffb`7e50e86e 0faef0           mfence
00007ffb`7e50e871 0f31             rdtsc
00007ffb`7e50e873 48c1e220         shl rdx, 20h
00007ffb`7e50e877 480bd0           or rdx, rax
00007ffb`7e50e87a e95be2f9ff       jmp ntdll!RtlQueryPerformanceCounter+0x6a (00007ffb`7e4acada)
00007ffb`7e50e87f cc               int 3
*/
#endif
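Every branch in qpc_win_10 is driven by the flag byte at SharedUserData + 0x3c6. As a reading aid, here is a minimal sketch that decodes such a byte into named fields, using the SHARED_GLOBAL_FLAGS_QPC_BYPASS_* bit values quoted from the WDK header in the listing. The names `qpc_flags_t` and `decode_qpc_flags` are made up for this illustration, and the byte is passed in as an argument so the sketch never touches the shared page.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit values as quoted from the WDK header in the listing above. */
#define QPC_BYPASS_ENABLED     0x01
#define QPC_BYPASS_USE_HV_PAGE 0x02
#define QPC_BYPASS_USE_MFENCE  0x10
#define QPC_BYPASS_USE_LFENCE  0x20
#define QPC_BYPASS_USE_RDTSCP  0x80

/* Hypothetical helper type, not part of any Windows header. */
typedef struct qpc_flags_t {
    bool bypass_enabled; /* false: QPC falls back to the NtQueryPerformanceCounter syscall */
    bool use_hv_page;    /* true: multiply by the value at RtlpHypervisorSharedUserVa + 0x08 */
    bool use_mfence;     /* only checked when use_rdtscp and use_lfence are both false */
    bool use_lfence;     /* only checked when use_rdtscp is false */
    bool use_rdtscp;     /* true: rdtscp is used and no fence is issued */
} qpc_flags_t;

qpc_flags_t decode_qpc_flags( uint8_t flags ) {
    qpc_flags_t result;
    result.bypass_enabled = ( flags & QPC_BYPASS_ENABLED ) != 0;
    result.use_hv_page = ( flags & QPC_BYPASS_USE_HV_PAGE ) != 0;
    result.use_mfence = ( flags & QPC_BYPASS_USE_MFENCE ) != 0;
    result.use_lfence = ( flags & QPC_BYPASS_USE_LFENCE ) != 0;
    result.use_rdtscp = ( flags & QPC_BYPASS_USE_RDTSCP ) != 0;
    return result;
}
```

For example, a byte of 0x83 (a made-up value, not one reported in the post) would decode to bypass enabled, hypervisor page in use and rdtscp selected, which is the straight-line path through the assembly.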
; Bypass enabled ?
test r9b, 1
; NtQueryPerformanceCounter syscall
je ntdll!RtlQueryPerformanceCounter+0x8b (00007ffb`7e4acafb)

; Loading qpc_bias in r11
mov r11, qword ptr [SharedUserData+0x3b8 (00000000`7ffe03b8)]

; Use hypervisor page ?
test r9b, 2
; Second part of assembly (rdtsc/p + maybe lfence or mfence)
je ntdll!RtlQueryPerformanceCounter+0x61de3 (00007ffb`7e50e853)

; Loading address of RtlpHypervisorSharedUserVa into r8 (r8 = 0x7ffe8000)
mov r8, qword ptr [ntdll!RtlpHypervisorSharedUserVa (00007ffb`7e5b3e28)]
; Is RtlpHypervisorSharedUserVa present ?
test r8, r8
; NtQueryPerformanceCounter syscall
je ntdll!RtlQueryPerformanceCounter+0x8b (00007ffb`7e4acafb)

; Loading first 4 bytes of RtlpHypervisorSharedUserVa. r10d is 'HalT'
; This is the first instruction of the loop
mov r10d, dword ptr [r8]
; r10d != 0
test r10d, r10d
; NtQueryPerformanceCounter syscall
je ntdll!RtlQueryPerformanceCounter+0x8b (00007ffb`7e4acafb)

; Use rdtscp ?
test r9b, r9b
; Second part of assembly (use rdtsc + maybe lfence or mfence)
jns ntdll!RtlQueryPerformanceCounter+0x61dc8 (00007ffb`7e50e838)

; rdtscp and combine the result in rdx
rdtscp
; This is where the path that uses the hypervisor page but not rdtscp returns to (+0x4d)
shl rdx, 20h
or rdx, rax

; Loading RtlpHypervisorSharedUserVa + 0x08 into rax
; This is the value that is used in the 128-bit multiply
mov rax, qword ptr [r8+8]
; Loading RtlpHypervisorSharedUserVa + 0x10 into rcx
; This is the value that is added to the high bytes of the 128-bit multiply (but was always 0x0 in my tests)
mov rcx, qword ptr [r8+10h]
; 128-bit multiply. rdx contains the result's high bytes, rax contains the result's low bytes
mul rax, rdx

; Load RtlpHypervisorSharedUserVa + 0x0 'HalT' into eax
; This "discards" the low bytes of the 128-bit multiply
mov eax, dword ptr [r8]
; rcx is the content of RtlpHypervisorSharedUserVa + 0x10
add rdx, rcx
; Is eax == r10d => do both values contain 'HalT' ?
cmp eax, r10d
; Loop jump
jne ntdll!RtlQueryPerformanceCounter+0x39 (00007ffb`7e4acaa9)

; Load qpc_shift into cl (always 0x0 in my tests)
; This is where the path that doesn't use the hypervisor page returns to.
mov cl, byte ptr [SharedUserData+0x3c7 (00000000`7ffe03c7)]
; Add qpc_bias (r11) to rdx (high bytes of the 128-bit multiply) and store in rax
lea rax, [rdx+r11]
; Do qpc_shift
shr rax, cl
; Put the result at the output memory location
mov qword ptr [rbx], rax
; Return 1
mov eax, 1
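The arithmetic in the annotated assembly boils down to one expression: take the high 64 bits of tsc times the value at +0x08, add the value at +0x10, add qpc_bias, then shift right by qpc_shift. Below is a minimal sketch of that as a pure function over saved constants, which is what an "offline" conversion would need. The struct `qpc_constants_t` and the function `tsc_to_qpc_offline` are hypothetical names, and the sketch assumes a GCC or Clang style `unsigned __int128` rather than the MSVC intrinsics used in the listing.

```c
#include <stdint.h>

/* Hypothetical container for constants captured once at record time. */
typedef struct qpc_constants_t {
    uint64_t hv_scale;  /* value read at RtlpHypervisorSharedUserVa + 0x08 */
    uint64_t hv_offset; /* value read at RtlpHypervisorSharedUserVa + 0x10 */
    uint64_t qpc_bias;  /* value read at SharedUserData + 0x3b8 */
    uint8_t qpc_shift;  /* value read at SharedUserData + 0x3c7 */
} qpc_constants_t;

/* Mirrors the hypervisor-page path: mul, add rdx/rcx, lea rax,[rdx+r11], shr rax,cl. */
uint64_t tsc_to_qpc_offline( uint64_t tsc, const qpc_constants_t* c ) {
    unsigned __int128 product = ( unsigned __int128 ) tsc * c->hv_scale;
    uint64_t high = ( uint64_t ) ( product >> 64 ); /* low 64 bits are discarded */
    return ( high + c->hv_offset + c->qpc_bias ) >> c->qpc_shift;
}
```

With the scale 0xea39330e641ff4 reported for the first test machine and the other constants at zero, a TSC value of 2^32 maps to 0xea3933, since multiplying by 2^32 and keeping the high 64 bits is the same as shifting the scale right by 32.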
DECLARE_VVAR(128, struct vdso_data, _vdso_data)
extern struct vdso_data vvar__vdso_data[CS_BASES] __attribute__((visibility("hidden")));
mov 0x8(%r10), %rcx
mov 0x28(%r11), %rax
mov 0x18(%r10), %esi
#include <time.h>
#include <sys/auxv.h>
#include <stdint.h>
#include <stdio.h>
#include <assert.h>
#include <sys/utsname.h>

typedef struct vdso_data_t {
    uint32_t seq;

    int32_t clock_mode;
    uint64_t cycle_last;
    uint64_t mask;
    uint32_t mult;
    uint32_t shift;

    struct {
        uint64_t sec;
        uint64_t nsec;
    } basetime[ 12 ];

    /*
    union {
        struct vdso_timestamp basetime[VDSO_BASES];
        struct timens_offset offset[VDSO_BASES];
    };
    */

    int32_t tz_minuteswest;
    int32_t tz_dsttime;
    uint32_t hrtimer_res;
    uint32_t __unused;

    /* Empty struct on x86
    struct arch_vdso_data arch_data;
    */
} vdso_data_t;

typedef unsigned __int128 uint128_t;

/* Parses "major.minor" out of the release string (same as `uname -r`).
   Falls back to 5.9 if uname fails or the string can't be parsed. */
void get_kernel_version( uint32_t* major_out, uint32_t* minor_out ) {

    /* uname -r */
    *major_out = 5;
    *minor_out = 9;

    struct utsname info;

    if ( uname( &info ) >= 0 ) {

        uint32_t major = 0;
        uint32_t minor = 0;

        char* version = info.release;

        while ( *version && *version != '.' ) {
            char c = *version;
            if ( c >= '0' && c <= '9' ) {
                major *= 10;
                major += ( c - '0' );
            }
            version++;
        }

        version++;

        while ( *version && *version != '.' ) {
            char c = *version;
            if ( c >= '0' && c <= '9' ) {
                minor *= 10;
                minor += ( c - '0' );
            }
            version++;
        }

        if ( major ) {
            *major_out = major;
            *minor_out = minor;
        }
    }
}

/* Offset of the vdso data relative to the vdso base address, which depends on
   the kernel version. */
intptr_t get_vdso_data_offset( void ) {

    uint32_t major, minor;
    get_kernel_version( &major, &minor );

    intptr_t page_size = 1 << 12;
    intptr_t offset = -4 * page_size;

    if ( major == 5 ) {
        if ( minor > 5 ) {
            offset = -4 * page_size;
        } else {
            offset = -3 * page_size;
        }
    } else if ( major == 4 ) {
        if ( minor > 11 ) {
            offset = -3 * page_size;
        } else if ( minor > 6 ) {
            offset = -2 * page_size;
        } else if ( minor > 4 ) {
            offset = -3 * page_size;
        } else if ( minor > 1 ) {
            offset = -2 * page_size;
        } else {
            assert( !"Unsupported" );
        }
    }

    /* 128 is the offset passed to DECLARE_VVAR for _vdso_data. */
    offset += 128;

    return offset;
}

int main( int argc, char** argv ) {

    intptr_t vdso_data_offset = get_vdso_data_offset( );

    uint8_t* vdso = ( uint8_t* ) getauxval( AT_SYSINFO_EHDR );
    vdso_data_t* data = ( vdso_data_t* ) ( vdso + vdso_data_offset );

    uint32_t mul = data[ 0 ].mult;
    uint32_t shift = data[ 0 ].shift;

    struct timespec res;
    clock_getres( CLOCK_MONOTONIC, &res );
    /* With a 1 ns clock resolution this is 1e9, the number of nanoseconds per second. */
    double frequency = ( double ) res.tv_nsec * 1000000000.0;

    uint32_t x = 0;
    uint64_t rdtsc = __builtin_ia32_rdtscp( &x );
    struct timespec spec;
    clock_gettime( CLOCK_MONOTONIC, &spec );

    uint128_t temp = ( uint128_t ) rdtsc * mul;
    temp >>= shift;
    assert( ( temp & 0xffffffffffffffff ) == temp );
    uint64_t result = ( uint64_t ) ( temp );
    double r1 = ( double ) result / frequency;
    printf( "integer: %5.9f\n", r1 );

    double divisor = 1;

    for ( uint32_t i = 0; i < shift; i++ ) {
        divisor *= 2;
    }

    double r1b = ( ( double ) rdtsc * ( double ) mul ) / divisor;
    r1b /= frequency;
    printf( "double : %5.9f\n", r1b );

    double r2 = ( double ) spec.tv_sec + ( ( double ) spec.tv_nsec / frequency );
    printf( "gettime: %5.9f\n", r2 );

    printf( "---\n" );
    printf( "integer diff: %5.9f\n", r1 - r2 );
    printf( "double diff: %5.9f\n", r1b - r2 );

    return 0;
}
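Since the listing converts cycles to nanoseconds as (cycles * mult) >> shift, the pair (mult, shift) encodes the nanoseconds per cycle as mult / 2^shift, and the counter frequency can be estimated by inverting that. The sketch below shows both directions; `vdso_cycles_to_ns` and `vdso_estimate_tsc_hz` are hypothetical helper names, and the sample values used afterwards are made up rather than read from a real machine.

```c
#include <assert.h>
#include <stdint.h>

/* (cycles * mult) >> shift, using a 128-bit intermediate so the product can't overflow. */
uint64_t vdso_cycles_to_ns( uint64_t cycles, uint32_t mult, uint32_t shift ) {
    assert( shift < 64 );
    unsigned __int128 wide = ( unsigned __int128 ) cycles * mult;
    return ( uint64_t ) ( wide >> shift );
}

/* ns per cycle is mult / 2^shift, so cycles per second is 1e9 * 2^shift / mult. */
double vdso_estimate_tsc_hz( uint32_t mult, uint32_t shift ) {
    assert( mult != 0 );
    assert( shift < 64 );
    return 1e9 * ( double ) ( 1ull << shift ) / ( double ) mult;
}
```

For instance, mult = 5592405 with shift = 24 corresponds to roughly a 3 GHz counter, and 3,000,000,000 cycles then convert to just under one second worth of nanoseconds because mult is rounded down.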
mrmixer
- A bigger issue is that the result of the multiply will quickly take more than 64 bits. We could either use doubles for that or do a 128-bit integer multiply, and maybe the shift will bring the value back into a 64-bit range? I didn't test the 128-bit multiply as I couldn't figure out if there were intrinsics like _umul128 on Linux. If anyone knows, I'm all ears.
uint64_t a = ..., b = ...;
unsigned __int128 big = (unsigned __int128)a * b;
uint64_t higher_64_bits = (uint64_t)(big >> 64);
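For compilers that offer neither `unsigned __int128` nor an intrinsic like `__umulh`, the high 64 bits can still be computed with plain 64-bit arithmetic by splitting each operand into 32-bit halves. This is a generic schoolbook-multiplication sketch, not something taken from the thread, and `mul_high_u64` is a made-up name.

```c
#include <stdint.h>

/* Returns the upper 64 bits of the 128-bit product a * b. */
uint64_t mul_high_u64( uint64_t a, uint64_t b ) {
    uint64_t a_lo = a & 0xffffffffu;
    uint64_t a_hi = a >> 32;
    uint64_t b_lo = b & 0xffffffffu;
    uint64_t b_hi = b >> 32;

    /* Four 32x32 -> 64 bit partial products. */
    uint64_t lo_lo = a_lo * b_lo;
    uint64_t hi_lo = a_hi * b_lo;
    uint64_t lo_hi = a_lo * b_hi;
    uint64_t hi_hi = a_hi * b_hi;

    /* Sum of everything that lands in bits 32..95; this cannot overflow 64 bits. */
    uint64_t cross = ( lo_lo >> 32 ) + ( hi_lo & 0xffffffffu ) + lo_hi;

    return hi_hi + ( hi_lo >> 32 ) + ( cross >> 32 );
}
```

On GCC or Clang the result can be checked against the `unsigned __int128` version above for a few inputs; the two should agree for every pair of operands.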
TimerTimmyyy
And surprisingly, when I turned HPET on, QPF also always returned the same value, now 14.318.180. That held after a restart, and also after a restart where I cut off the power for half a minute. Still 14.32 MHz.
TimerTimmyyy
And, also not sure why and if interesting, when looping over the QPC function 5000 times and storing its value in an array (calculations only done after the loop completed), with the ITSC the difference between QPC calls would be about 28 ticks and the total 5000 loops took about 14-15 milliseconds, but with HPET on it registered about 55 ticks per loop and total calculated time was 18-20 ms.
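To compare numbers like these, it helps to convert tick counts to time using the frequency QPF reports. Here is a minimal sketch of that conversion; `ticks_to_ns` is a hypothetical helper, and the 10 MHz figure used afterwards for the ITSC case is an assumption (a common QPF value on Windows 10), not something stated in the post.

```c
#include <assert.h>
#include <stdint.h>

/* Converts a tick count to nanoseconds for a counter running at `frequency` Hz.
   Whole seconds and the remainder are handled separately so the multiplication
   by 1e9 doesn't overflow for large tick counts. Assumes frequency is below
   roughly 1.8e10 Hz, which holds for the QPC frequencies discussed here. */
uint64_t ticks_to_ns( uint64_t ticks, uint64_t frequency ) {
    assert( frequency > 0 );
    uint64_t whole_seconds = ticks / frequency;
    uint64_t remainder = ticks % frequency;
    return whole_seconds * 1000000000ull + ( remainder * 1000000000ull ) / frequency;
}
```

Under that assumption, 5000 calls at 28 ticks each (140,000 ticks at 10 MHz) come to 14 ms, and 5000 calls at 55 ticks each (275,000 ticks at 14,318,180 Hz) come to about 19.2 ms, which is consistent with the 14-15 ms and 18-20 ms totals reported above.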
QPC with HPET on most likely takes more time because it goes through a syscall instruction, meaning the program asks the Windows kernel to do the work.
When HPET is off, QPC is a RDTSCP instruction with a few more instructions around it.
mrmixer
QPC is a RDTSCP instruction with a few more instructions around it.
uint64_t profiler_cycles_to_time( profiler_tool_t* tool, uint64_t cycles ) {

    uint64_t time = cycles;
    profiler_platform_data_t* platform = &tool->platform_data;

    if ( tool->platform == profiler_platform_windows ) {

        if ( platform->windows.version == profiler_windows_version_10 ) {

#if defined( PROFILER_MSVC )

            time = __umulh( cycles, platform->windows.mul128 );

#elif defined( PROFILER_CLANG ) || defined( PROFILER_GCC )

            unsigned __int128 big = ( unsigned __int128 ) cycles * platform->windows.mul128;
            time = ( uint64_t ) ( big >> 64 );

#else
# error Unsupported compiler.
#endif

            time += platform->windows.add;
        }

        time += platform->windows.qpc_bias;
        time >>= platform->windows.qpc_shift;

    } else if ( tool->platform == profiler_platform_linux ) {

#if defined( PROFILER_MSVC )

        uint64_t high = 0;
        uint64_t low = _umul128( cycles, platform->linux_.mult, &high );
        profiler_assert( platform->linux_.shift <= 0xff );
        time = __shiftright128( low, high, ( unsigned char ) platform->linux_.shift );

#elif defined( PROFILER_CLANG ) || defined( PROFILER_GCC )

        unsigned __int128 big = ( unsigned __int128 ) time * tool->platform_data.linux_.mult;
        big >>= tool->platform_data.linux_.shift;
        time = ( uint64_t ) big;

#else
# error Unsupported compiler.
#endif

    } else {
        profiler_assert( tool->platform == profiler_platform_fallback );
        time = profiler_cycles_to_time_fallback( tool, cycles );
    }

    return time;
}
TimerTimmyyy
I'm quite a noob and the texts above (with assembly code) go way over my head, but I remember reading/learning from this thread that the QPC function actually has a loop in it. I also noticed this when calling QPC repeatedly (for example in a loop of 5000 iterations). Usually the number of ticks between two QPC calls was around 25. But sometimes a QPC call would take about 160 ticks. It didn't seem to matter where in the loop this occurred: it happened as often between calls 1 and 500 as between calls 4000 and 4500. As explained above, it is probably because of that loop inside QPC itself (hope I understand this correctly).
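One way to look at spikes like these is to store the raw timestamps first and compute the differences afterward, as described earlier in the thread, then count how many gaps exceed a threshold. The sketch below does only the post-processing step; `delta_stats_t` and `summarize_deltas` are hypothetical names, and the timestamps can come from any counter (QPC, rdtscp, ...). It says nothing about what causes a spike.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of the gaps between consecutive timestamps. */
typedef struct delta_stats_t {
    uint64_t min;
    uint64_t max;
    size_t outliers; /* number of gaps strictly above the threshold */
} delta_stats_t;

delta_stats_t summarize_deltas( const uint64_t* stamps, size_t count, uint64_t outlier_threshold ) {

    delta_stats_t stats = { UINT64_MAX, 0, 0 };

    for ( size_t i = 1; i < count; i++ ) {
        uint64_t delta = stamps[ i ] - stamps[ i - 1 ];
        if ( delta < stats.min ) { stats.min = delta; }
        if ( delta > stats.max ) { stats.max = delta; }
        if ( delta > outlier_threshold ) { stats.outliers++; }
    }

    /* Fewer than two timestamps means there is no gap to measure. */
    if ( count < 2 ) {
        stats.min = 0;
    }

    return stats;
}
```

Fed with the 5000 stored QPC values and a threshold of, say, 100 ticks, this would report how often the roughly 160-tick calls occur relative to the usual 25 or so.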
TimerTimmyyy
Does this mean that it actually does not matter which function you use (QPC or RDTSCP) because they actually return the same thing (just QPC having a little more overhead)? Both returning the amount of clock cycles since boot.
TimeTimmyyy
"The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter."