QueryPerformanceFrequency returning 10 MHz bug

An interesting note is that the function does at least have well-defined upper and lower bounds. From this documentation: https://docs.microsoft.com/en-us/...iring-high-resolution-time-stamps

"How often does QPC roll over? Not less than 100 years from the most recent system boot"

Given that QueryPerformanceCounter returns a 64-bit signed integer (i.e. 63 usable bits), this implies that the value returned by QueryPerformanceFrequency should never be more than around 2^63/(100*365*24*60*60) = 2,924,712,086, or about 2.9 GHz. And the timer resolution is stated to be at least one microsecond, so the frequency will always be at least 1 MHz. I have no idea if this information will be useful to anyone, but it was fun to calculate.
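For anyone who wants to check the arithmetic, here is a minimal C sketch of the upper bound. The helper name `max_qpc_frequency_hz` is made up for this post (it is not a Windows API), and it uses 365-day years like the calculation above.

```c
#include <stdint.h>

/* Hypothetical helper, not a Windows API: largest counter frequency (in Hz)
   for which a counter with 63 usable bits does not roll over within `years`
   365-day years. Returns 0 when `years` is 0. */
uint64_t max_qpc_frequency_hz( uint64_t years ) {
    
    uint64_t seconds = years * 365u * 24u * 60u * 60u;
    uint64_t usable_range = ( uint64_t ) 1 << 63; /* 2^63 counter values. */
    
    if ( seconds == 0 ) {
        return 0;
    }
    
    return usable_range / seconds;
}
```

Calling `max_qpc_frequency_hz( 100 )` gives 2924712086, which matches the figure above.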
I spent more time on this and wrote functions to "emulate" what QueryPerformanceCounter does. I have only tested them on Windows 7 and two versions of Windows 10 (1903 and 1909). If anybody has Windows 8, Windows 8.1 or a Windows 10 version prior to the Creators Update and is willing to test it, I would appreciate the effort. More specifically, I'm interested in the following versions of Windows 10, as they may have introduced changes pertinent to the problem:
  • Version 1607 (Anniversary update) 2016 => build 14393
  • Version 1703 (Creators update) 2017 => build 15063
  • Version 1709 (Fall Creators Update) 2017 => build 16299
  • Version 1803 (April 2018 update) => build 17134


If you compile and run the program below, it should run for about 30 seconds (three loops of 100 iterations, each sleeping 100 ms) and display a few lines. The expected result is for all min/max values to be in the same range and the "best" values to be somewhat similar (around 33). Here is a link to the compiled exe + pdb and source. There are some asserts that could trigger if a feature isn't supported by the OS version (which is why it needs some testing).

The source contains notes and some findings.

#if 0

/* NOTE build: cl main.c -Feqpc.exe -nologo -O2 -Zi */

/* The following code is the result of trying to figure out what transformation
QueryPerformanceCounter (QPC) applies to the result of the rdtsc (or rdtscp) instruction, based on
the assembly of QPC. This was only tested on Windows 7, Windows 10 version 1903 and Windows 10
version 1909.

Windows 8, 8.1 and prior versions of Windows 10 are a complete guess. If you test on those, let me
know how it went.

This is not a replacement for QueryPerformanceCounter, you should still use that to query timestamps.

The goal was to be able to use only __rdtsc to capture event timestamps and see if it was possible
to transform them afterward (or "offline") into values compatible with QueryPerformanceCounter.
This is possible if you save some additional constant values to do the transformation.

QPC uses 2 shared memory pages (inspecting the assembly using windbg "reveals" the names of those
pages): SharedUserData and RtlpHypervisorSharedUserVa.



# SharedUserData

I didn't find official documentation on that except for the header file containing the definition
of the structure (KUSER_SHARED_DATA) in the Windows Driver Kit (WDK). It also contains some useful
comments about the meaning of some of the fields (starting at line 8219).

C:\Program Files (x86)\Windows Kits\10\Include\10.0.19041.0\km\ntddk.h

There is a Microsoft documentation page but it only contains the structure definition:

https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/ntddk/ns-ntddk-kuser_shared_data

The following page contains more information, a description and history of SharedUserData and the offsets of the
different fields:

http://geoffchappell.com/studies/windows/km/ntoskrnl/structs/kuser_shared_data/index.htm

SharedUserData is always loaded in memory at the 0x7ffe0000 address.



# RtlpHypervisorSharedUserVa

When using QPC on Windows 10 (presumably only on versions that came out after the
"Anniversary update" v1607, build number 14393) there is another page used by QPC called
RtlpHypervisorSharedUserVa. Similarly, I didn't find official documentation about it.

The following tweet says that its location in memory isn't always the same but should be near
0x7ffe8000. I tested on two machines: on one it was always at 0x7ffe8000 and on the other it
was always at 0x7ffed000. There is a way to query the location at runtime using
NtQuerySystemInformation and passing 0xc5 in the SystemInformationClass parameter.

Tweet mentioning RtlpHypervisorSharedUserVa and _SYSTEM_HYPERVISOR_SHARED_PAGE_INFORMATION:
https://twitter.com/aionescu/status/963584812412997632

System information class unofficial documentation:

http://geoffchappell.com/studies/windows/km/ntoskrnl/api/ex/sysinfo/class.htm?tx=181

This page also lists the name for the SystemInformationClass value
(SystemHypervisorSharedPageInformation on line 1447), contains a comment saying that the query was
added in Windows 10 Redstone 4 (v1803, April 2018 update, build 17134), and has a definition for
the struct returned by NtQuerySystemInformation, which is just a void pointer (on line 3523).
*/

#if 0
typedef struct _SYSTEM_HYPERVISOR_SHARED_PAGE_INFORMATION
{
    PVOID HypervisorSharedUserVa;
} SYSTEM_HYPERVISOR_SHARED_PAGE_INFORMATION, *PSYSTEM_HYPERVISOR_SHARED_PAGE_INFORMATION;
#endif

/*
https://github.com/processhacker/processhacker/blob/master/phnt/include/ntexapi.h

This tweet mentions that the page is only used by 2 functions, RtlQueryPerformanceCounter
and RtlGetMultiTimePrecise.

https://twitter.com/AmarSaar/status/995794185398534147

I still don't know exactly what the page contains. What I observed was:
- The first 4 bytes (+0x0) read "HalT" in ASCII, which is 0x546c6148 (tested on two machines).
- The next 4 bytes (+0x4) don't seem to be used.
- The next 8 bytes (+0x8) contain a big value that is used in a 128-bit multiply with the result of rdtsc.
  It seems to be constant on a machine (always the same on the same computer), but differs between
  machines. I observed 0xea39330e641ff4 on my first machine and 0xc7b5ac275f1df on the second.
- The next 8 bytes (+0x10) contain a value that is added to the upper 8 bytes of the result of the multiply.
  On both my test machines this was always 0x0.

The first 4 bytes value (usually "HalT") seems to be used for 2 things:
- If the value is zero, QPC should use the NtQueryPerformanceCounter syscall instead of continuing with rdtscp.
- It seems that the value could change during the call, and if it changed, QPC would redo the rdtsc and multiply before continuing.




# Windows versions:

For reference, here are the different versions of Windows with their kernel number:

- Windows 7: kernel 6.1
- Windows 8: kernel 6.2
- Windows 8.1: kernel 6.3
- Windows 10: kernel 10.0

Furthermore, here are the different versions of Windows 10:
- Version 1507 (July) 2015 => build 10240
- Version 1511 (November update) 2015 => build 10586
- Version 1607 (Anniversary update) 2016 => build 14393
- Version 1703 (Creators update) 2017 => build 15063
- Version 1709 (Fall Creators Update) 2017 => build 16299
- Version 1803 (April 2018 update) => build 17134
- Version 1809 (October 2018 update) => build 17763
- Version 1903 (May 2019 update) => build 18362
- Version 1909 (November 2019 update) => build 18363
- Version 2004 (May 2020 update) => build 19041

# Pertinent Offsets in SharedUserData structure

## 0x0260
Kernel 10.0 and up
*/
ULONG NtBuildNumber;

/*
## 0x026C
Kernel 4.0 and up
*/
ULONG NtMajorVersion;

/*
## 0x0270
Kernel 4.0 and up
*/
ULONG NtMinorVersion;

/*
## 0x02ed
Kernel 6.1 only (Windows 7)
*/
union {
    UCHAR TscQpcData;
    struct {
        UCHAR TscQpcEnabled : 1;        // 0x01
        UCHAR TscQpcSpareFlag : 1;      // 0x02
        UCHAR TscQpcShift : 6;          // 0xFC
    };
};

/*
## 0x0300
Kernel 6.2 and up (Windows 8 and up).
Other meaning in previous version.
QueryPerformanceFrequency returns this value on Windows 10 (and I suppose 8 and 8.1).
Windows 7 does a system call instead. */
LONGLONG QpcFrequency;

/*
## 0x03b8
Kernel 6.1 (Windows 7) and 6.2 (Windows 8) */
ULONGLONG volatile TscQpcBias;
/* Kernel 6.3 and up (Windows 8.1 and up) */
ULONGLONG volatile QpcBias;

/*
## 0x03C6
Kernel 6.1 (Windows 7)
*/
USHORT Reserved4;

/*
Kernel 6.2 only (Windows 8)
This is very similar to 0x02ed but for windows 8.
*/
union {
    USHORT TscQpcData;
    struct {
        BOOLEAN volatile TscQpcEnabled;
        UCHAR TscQpcShift;
    };
};

/*
Kernel 6.3 (Windows 8.1), and kernel 10 up to version 1607 (Windows 10 Anniversary update)
Bypass here means bypassing a system call to retrieve the counter (based on the comments in the WDK
header, see above). */
union {
    USHORT QpcData;
    struct {
        BOOLEAN volatile QpcBypassEnabled;
        UCHAR QpcShift;
    };
};

/*
Kernel 10 starting with version 1709 (Windows 10 fall creators update) and up. What about version 1703 (creator update) ?
Assuming previous version only set the boolean to 0 or 1, it shouldn't matter as the value needs to
have the second bit set (0x2) to take the hypervisor path. In theory any version starting with
windows 8 could use the windows 10 function below and should still work.

From the unofficial doc (see above):
Version 1709 changes QpcBypassEnabled from a UCHAR that is intended to be either TRUE or FALSE to one whose meaning is taken in bits. Microsoft's C-language definition in the contemporaneous WDK defines:

0x01 as SHARED_GLOBAL_FLAGS_QPC_BYPASS_ENABLED;
0x10 as SHARED_GLOBAL_FLAGS_QPC_BYPASS_USE_MFENCE;
0x20 as SHARED_GLOBAL_FLAGS_QPC_BYPASS_USE_LFENCE;
0x40 as SHARED_GLOBAL_FLAGS_QPC_BYPASS_A73_ERRATA;
0x80 as SHARED_GLOBAL_FLAGS_QPC_BYPASS_USE_RDTSCP.

From the header of the WDK (see above):
//
// Define flags for QPC bypass information. None of these flags may be set
// unless bypass is enabled. This is for compat with existing code which
// compares this value to zero to detect bypass enablement.
//

#define SHARED_GLOBAL_FLAGS_QPC_BYPASS_ENABLED (0x01)
#define SHARED_GLOBAL_FLAGS_QPC_BYPASS_USE_HV_PAGE (0x02)
#define SHARED_GLOBAL_FLAGS_QPC_BYPASS_DISABLE_32BIT (0x04)
#define SHARED_GLOBAL_FLAGS_QPC_BYPASS_USE_MFENCE (0x10)
#define SHARED_GLOBAL_FLAGS_QPC_BYPASS_USE_LFENCE (0x20)
#define SHARED_GLOBAL_FLAGS_QPC_BYPASS_A73_ERRATA (0x40)
#define SHARED_GLOBAL_FLAGS_QPC_BYPASS_USE_RDTSCP (0x80)
*/

/* NOTE This definition comes from the WDK header, not from the unofficial doc as the rest of the definitions. */
union {
    USHORT QpcData;
    struct {
        //
        // A boolean indicating whether performance counter queries
        // can read the counter directly (bypassing the system call).
        //
        
        volatile UCHAR QpcBypassEnabled;
        
        //
        // Shift applied to the raw counter value to derive the
        // QPC count.
        //
        
        UCHAR QpcShift;
    };
};

#endif

#include <stdint.h>
#include <assert.h>
#include <windows.h>
#include <intrin.h> /* NOTE __rdtsc, __rdtscp, _umul128, __umulh, _mm_lfence, _mm_mfence */
#include <stdio.h>

#if 0
/* NOTE Actual definitions for reference. */
/* NOTE from winternl.h */
NTSTATUS NTAPI NtQuerySystemInformation( IN SYSTEM_INFORMATION_CLASS SystemInformationClass, OUT PVOID SystemInformation, IN ULONG SystemInformationLength, OUT PULONG ReturnLength OPTIONAL );

/* NOTE from msdn */
NTSTATUS NtQueryPerformanceCounter( _Out_ PLARGE_INTEGER PerformanceCounter, _Out_opt_ PLARGE_INTEGER PerformanceFrequency );
#endif

typedef int32_t __stdcall NtQuerySystemInformation_t( int32_t SystemInformationClass, void* SystemInformation, uint32_t SystemInformationLength, uint32_t* ReturnLength );
/* NOTE Both functions are NTAPI (__stdcall), which matters for 32 bit builds. */
typedef int32_t __stdcall NtQueryPerformanceCounter_t( uint64_t* PerformanceCounter, uint64_t* PerformanceFrequency );

NtQueryPerformanceCounter_t* NtQueryPerformanceCounter = 0;

uint8_t* SharedUserData = ( uint8_t* ) 0x7ffe0000;
volatile uint8_t* RtlpHypervisorSharedUserVa = 0; /* NOTE volatile because I think the content could be changed by the kernel. */

void qpc_win_7( uint64_t* time ) {
    
    uint8_t tsc_qpc_data = *( SharedUserData + 0x02ed );
    uint8_t bypass_syscall = tsc_qpc_data & 0x1;
    
    if ( bypass_syscall ) {
        uint8_t qpc_shift = ( tsc_qpc_data >> 2 );
        uint64_t qpc_bias = *( uint64_t* ) ( SharedUserData + 0x03b8 );
        *time = __rdtsc( );
        *time += qpc_bias;
        *time >>= qpc_shift;
    } else {
        int32_t result = NtQueryPerformanceCounter( time, 0 );
        assert( result >= 0 );
    }
}

/* NOTE This function hasn't been tested. I don't have the windows 8/10 assembly, it's a complete guess. */
void qpc_win_8_to_10_v1067( uint64_t* time ) {
    
    uint8_t bypass_syscall = *( SharedUserData + 0x03c6 );
    
    if ( bypass_syscall ) {
        
        uint8_t qpc_shift = *( SharedUserData + 0x03c7 );
        uint64_t qpc_bias = *( uint64_t* ) ( SharedUserData + 0x03b8 );
        *time = __rdtsc( );
        *time += qpc_bias;
        *time >>= qpc_shift;
        
    } else {
        int32_t result = NtQueryPerformanceCounter( time, 0 );
        assert( result >= 0 );
    }
}

void qpc_win_10( uint64_t* time ) {
    
    uint64_t tsc = 0;
    uint8_t flags = *( SharedUserData + 0x3c6 );
    uint8_t bypass_syscall = flags & 0x1;
    
    if ( bypass_syscall ) {
        
        uint8_t use_hypervisor_page = flags & 0x2;
        
        if ( use_hypervisor_page ) {
            
            /* NOTE If RtlpHypervisorSharedUserVa is 0, we should use NtQueryPerformanceCounter
            to get the result of the whole function (not done here to keep it simple). */
            assert( RtlpHypervisorSharedUserVa );
            
            while ( 1 ) {
                
                /* NOTE This value is "HalT" in ascii on my machine (0x546c6148) */
                uint32_t some_value_that_should_not_be_zero = *( uint32_t* ) RtlpHypervisorSharedUserVa;
                /* NOTE If this value is 0, we should use NtQueryPerformanceCounter
                to get the result of the whole function (not done here to keep it simple).*/
                assert( some_value_that_should_not_be_zero );
                
                uint8_t use_rdtscp = flags & 0x80;
                
                if ( use_rdtscp ) {
                    uint32_t x;
                    tsc = __rdtscp( &x );
                } else {
                    uint8_t lfence = flags & 0x20;
                    uint8_t mfence = flags & 0x10;
                    
                    if ( lfence ) {
                        _mm_lfence( );
                    } else if ( mfence ) {
                        _mm_mfence( );
                    }
                    
                    tsc = __rdtsc( );
                }
                
                uint64_t value_1 = *( uint64_t* ) ( RtlpHypervisorSharedUserVa + 0x08 ); /* NOTE Always 0xea39330e641ff4 on my machine. */
                uint64_t value_2 = *( uint64_t* ) ( RtlpHypervisorSharedUserVa + 0x10 ); /* NOTE Always 0x0 on my machine. */
                uint64_t high, low;
                low = _umul128( tsc, value_1, &high ); /* NOTE Could use __umulh as the low bytes are discarded. */
                high += value_2;
                tsc = high;
                
                low = *( uint32_t* ) RtlpHypervisorSharedUserVa;
                
                /* NOTE If the value "HalT" was changed since we read it, redo the work
                (possibly could make the path go through NtQueryPerformanceCounter). */
                if ( low == some_value_that_should_not_be_zero ) {
                    break;
                }
            }
            
            uint8_t qpc_shift = *( SharedUserData + 0x03c7 ); /* NOTE qpc_shift is always 0 on my machine on windows 10. */
            uint64_t qpc_bias = *( uint64_t* ) ( SharedUserData + 0x03b8 );
            tsc = tsc + qpc_bias;
            tsc >>= qpc_shift;
            
            *time = tsc;
            
        } else {
            qpc_win_8_to_10_v1067( time );
        }
        
    } else {
        int32_t result = NtQueryPerformanceCounter( time, 0 );
        assert( result >= 0 );
    }
}

uint64_t tsc_to_qpc_win_7( uint64_t tsc ) {
    
    uint8_t tsc_qpc_data = *( SharedUserData + 0x02ed );
    uint8_t qpc_shift = ( tsc_qpc_data >> 2 );
    uint64_t qpc_bias = *( uint64_t* ) ( SharedUserData + 0x03b8 );
    
    uint64_t result = tsc + qpc_bias;
    result >>= qpc_shift;
    
    return result;
}

uint64_t tsc_to_qpc_win_8_to_10_v1067( uint64_t tsc ) {
    
    uint8_t qpc_shift = *( SharedUserData + 0x03c7 );
    uint64_t qpc_bias = *( uint64_t* ) ( SharedUserData + 0x03b8 );
    
    uint64_t result = tsc + qpc_bias;
    result >>= qpc_shift;
    
    return result;
}

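uint64_t tsc_to_qpc_win_10( uint64_t tsc ) {
    
    uint8_t flags = *( SharedUserData + 0x03c6 );
    
    /* NOTE Only scale with the hypervisor page values when QPC itself takes that path
    (flag 0x2 set and the page is mapped). Otherwise the page pointer may be 0 and the
    transformation is only the bias and shift. */
    if ( ( flags & 0x2 ) && RtlpHypervisorSharedUserVa ) {
        
        uint64_t value_1 = *( uint64_t* ) ( RtlpHypervisorSharedUserVa + 0x08 );
        uint64_t value_2 = *( uint64_t* ) ( RtlpHypervisorSharedUserVa + 0x10 );
        
        tsc = __umulh( tsc, value_1 );
        tsc += value_2;
    }
    
    uint8_t qpc_shift = *( SharedUserData + 0x03c7 );
    uint64_t qpc_bias = *( uint64_t* ) ( SharedUserData + 0x03b8 );
    
    uint64_t result = tsc + qpc_bias;
    result >>= qpc_shift;
    
    return result;
}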

typedef void qpc_t( uint64_t* time );
qpc_t* custom_qpc = 0;

typedef uint64_t tsc_to_qpc_t( uint64_t );
tsc_to_qpc_t* tsc_to_qpc = 0;

void custom_qpc_init( ) {
    
    HMODULE ntdll = LoadLibraryA( "ntdll.dll" );
    NtQuerySystemInformation_t* NtQuerySystemInformation = 0;
    
    if ( ntdll ) {
        
        NtQueryPerformanceCounter = ( NtQueryPerformanceCounter_t* ) GetProcAddress( ntdll, "NtQueryPerformanceCounter" );
        assert( NtQueryPerformanceCounter );
        
        NtQuerySystemInformation = ( NtQuerySystemInformation_t* ) GetProcAddress( ntdll, "NtQuerySystemInformation" );
        assert( NtQuerySystemInformation );
        
        FreeLibrary( ntdll );
    }
    
    uint32_t kernel_major = *( uint32_t* ) ( SharedUserData + 0x026c );
    uint32_t kernel_minor = *( uint32_t* ) ( SharedUserData + 0x0270 );
    
    if ( kernel_major == 6 && kernel_minor == 1 ) {
        custom_qpc = qpc_win_7;
        tsc_to_qpc = tsc_to_qpc_win_7;
    } else if ( kernel_major == 6 && ( kernel_minor == 2 || kernel_minor == 3 ) ) {
        custom_qpc = qpc_win_8_to_10_v1067;
        tsc_to_qpc = tsc_to_qpc_win_8_to_10_v1067;
    } else if ( kernel_major == 10 ) {
        
        uint32_t win_10_build_number = *( uint32_t* ) ( SharedUserData + 0x0260 );
        
        if ( win_10_build_number > 14393 ) {
            
            /* NOTE Build after anniversary update. This number might need to be bumped to 16299. */
            
            uint64_t system_information = 0;
            uint32_t out_size = 0;
            int32_t SystemHypervisorSharedPageInformation = 0xc5;
            int32_t result = NtQuerySystemInformation( SystemHypervisorSharedPageInformation, &system_information, sizeof( system_information ), &out_size );
            
            /* NOTE out_size is only meaningful when the query succeeded. */
            if ( result >= 0 ) {
                assert( out_size == sizeof( system_information ) );
                RtlpHypervisorSharedUserVa = ( uint8_t* ) system_information;
            }
            
            custom_qpc = qpc_win_10;
            tsc_to_qpc = tsc_to_qpc_win_10;
            
        } else {
            custom_qpc = qpc_win_8_to_10_v1067;
            tsc_to_qpc = tsc_to_qpc_win_8_to_10_v1067;
        }
    } else {
        assert( !"Not supported" );
    }
}

int main( int argc, char** argv ) {
    
    custom_qpc_init( );
    
    uint64_t mins[ 3 ] = { 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff };
    uint64_t maxs[ 3 ] = { 0 };
    uint64_t bests[ 3 ] = { 0 };
    
    uint32_t sleep_duration = 100;
#define iteration_count 100
    
    uint64_t qpc_results[ iteration_count ] = { 0 };
    uint64_t custom_qpc_results[ iteration_count ] = { 0 };
    uint64_t rdtscp_starts[ iteration_count ] = { 0 };
    uint64_t rdtscp_ends[ iteration_count ] = { 0 };
    
    for ( uint32_t index = 0; index < iteration_count; index++ ) {
        
        uint64_t s, e;
        QueryPerformanceCounter( ( LARGE_INTEGER* ) &s );
        Sleep( sleep_duration );
        QueryPerformanceCounter( ( LARGE_INTEGER* ) &e );
        qpc_results[ index ] = e - s;
    }
    
    for ( uint32_t index = 0; index < iteration_count; index++ ) {
        
        uint64_t s, e;
        custom_qpc( &s );
        Sleep( sleep_duration );
        custom_qpc( &e );
        custom_qpc_results[ index ] = e - s;
    }
    
    for ( uint32_t index = 0; index < iteration_count; index++ ) {
        
        uint32_t x = 0; /* NOTE __rdtscp takes an unsigned int pointer. */
        rdtscp_starts[ index ] = __rdtscp( &x );
        Sleep( sleep_duration );
        rdtscp_ends[ index ] = __rdtscp( &x );
    }
    
    for ( uint32_t index = 0; index < iteration_count; index++ ) {
        
        uint64_t rdtscp_result = tsc_to_qpc( rdtscp_ends[ index ] ) - tsc_to_qpc( rdtscp_starts[ index ] );
        
        if ( qpc_results[ index ] < mins[ 0 ] ) {
            mins[ 0 ] = qpc_results[ index ];
        }
        
        if ( qpc_results[ index ] > maxs[ 0 ] ) {
            maxs[ 0 ] = qpc_results[ index ];
        }
        
        if ( custom_qpc_results[ index ] < mins[ 1 ] ) {
            mins[ 1 ] = custom_qpc_results[ index ];
        }
        
        if ( custom_qpc_results[ index ] > maxs[ 1 ] ) {
            maxs[ 1 ] = custom_qpc_results[ index ];
        }
        
        if ( rdtscp_result < mins[ 2 ] ) {
            mins[ 2 ] = rdtscp_result;
        }
        
        if ( rdtscp_result > maxs[ 2 ] ) {
            maxs[ 2 ] = rdtscp_result;
        }
        
        if ( qpc_results[ index ] < custom_qpc_results[ index ] && qpc_results[ index ] < rdtscp_result ) {
            bests[ 0 ]++;
        } else if ( custom_qpc_results[ index ] < rdtscp_result ) {
            bests[ 1 ]++;
        } else {
            bests[ 2 ]++;
        }
    }
    
    /* NOTE Best on each version should be roughly the same. */
    printf( "qpc\n\tmin: %llu\n\tmax: %llu\n\tbest: %llu\n", mins[ 0 ], maxs[ 0 ], bests[ 0 ] );
    printf( "custom qpc\n\tmin: %llu\n\tmax: %llu\n\tbest: %llu\n", mins[ 1 ], maxs[ 1 ], bests[ 1 ] );
    printf( "rdtscp\n\tmin: %llu\n\tmax: %llu\n\tbest: %llu\n", mins[ 2 ], maxs[ 2 ], bests[ 2 ] );
    
    return 0;
}

#if 0

/* I used the following assembly from Windows 10 QPC. I'm not sure which version, but it's prior to
1909 (I got an update while working on this). The 1909 assembly is a bit different but not by much.

ntdll!RtlQueryPerformanceCounter:
00007ffb`7e4aca70 48895c2408       mov     qword ptr [rsp+8], rbx ss:00000064`5075fb20=0000000000000000
00007ffb`7e4aca75 57               push    rdi
00007ffb`7e4aca76 4883ec20         sub     rsp, 20h
00007ffb`7e4aca7a 448a0c25c603fe7f mov     r9b, byte ptr [SharedUserData+0x3c6 (00000000`7ffe03c6)]
00007ffb`7e4aca82 488bd9           mov     rbx, rcx
00007ffb`7e4aca85 41f6c101         test    r9b, 1
00007ffb`7e4aca89 7470             je      ntdll!RtlQueryPerformanceCounter+0x8b (00007ffb`7e4acafb)
00007ffb`7e4aca8b 4c8b1c25b803fe7f mov     r11, qword ptr [SharedUserData+0x3b8 (00000000`7ffe03b8)]
00007ffb`7e4aca93 41f6c102         test    r9b, 2
00007ffb`7e4aca97 0f84b61d0600     je      ntdll!RtlQueryPerformanceCounter+0x61de3 (00007ffb`7e50e853)
00007ffb`7e4aca9d 4c8b0584731000   mov     r8, qword ptr [ntdll!RtlpHypervisorSharedUserVa (00007ffb`7e5b3e28)]
00007ffb`7e4acaa4 4d85c0           test    r8, r8
00007ffb`7e4acaa7 7452             je      ntdll!RtlQueryPerformanceCounter+0x8b (00007ffb`7e4acafb)
00007ffb`7e4acaa9 458b10           mov     r10d, dword ptr [r8]
00007ffb`7e4acaac 4585d2           test    r10d, r10d
00007ffb`7e4acaaf 744a             je      ntdll!RtlQueryPerformanceCounter+0x8b (00007ffb`7e4acafb)
00007ffb`7e4acab1 4584c9           test    r9b, r9b
00007ffb`7e4acab4 0f897e1d0600     jns     ntdll!RtlQueryPerformanceCounter+0x61dc8 (00007ffb`7e50e838)
00007ffb`7e4acaba 0f01f9           rdtscp
00007ffb`7e4acabd 48c1e220         shl     rdx, 20h
00007ffb`7e4acac1 480bd0           or      rdx, rax
00007ffb`7e4acac4 498b4008         mov     rax, qword ptr [r8+8]
00007ffb`7e4acac8 498b4810         mov     rcx, qword ptr [r8+10h]
00007ffb`7e4acacc 48f7e2           mul     rax, rdx
00007ffb`7e4acacf 418b00           mov     eax, dword ptr [r8]
00007ffb`7e4acad2 4803d1           add     rdx, rcx
00007ffb`7e4acad5 413bc2           cmp     eax, r10d
00007ffb`7e4acad8 75cf             jne     ntdll!RtlQueryPerformanceCounter+0x39 (00007ffb`7e4acaa9)
00007ffb`7e4acada 8a0c25c703fe7f   mov     cl, byte ptr [SharedUserData+0x3c7 (00000000`7ffe03c7)]
00007ffb`7e4acae1 4a8d041a         lea     rax, [rdx+r11]
00007ffb`7e4acae5 48d3e8           shr     rax, cl
00007ffb`7e4acae8 488903           mov     qword ptr [rbx], rax
00007ffb`7e4acaeb b801000000       mov     eax, 1
00007ffb`7e4acaf0 488b5c2430       mov     rbx, qword ptr [rsp+30h]
00007ffb`7e4acaf5 4883c420         add     rsp, 20h
00007ffb`7e4acaf9 5f               pop     rdi
00007ffb`7e4acafa c3               ret
00007ffb`7e4acafb 33d2             xor     edx, edx
00007ffb`7e4acafd 488d4c2440       lea     rcx, [rsp+40h]
00007ffb`7e4acb02 e869320400       call    ntdll!NtQueryPerformanceCounter (00007ffb`7e4efd70)
00007ffb`7e4acb07 488b442440       mov     rax, qword ptr [rsp+40h]
00007ffb`7e4acb0c ebda             jmp     ntdll!RtlQueryPerformanceCounter+0x78 (00007ffb`7e4acae8)
00007ffb`7e4acb0e cc               int     3

Some jumps lead here.

00007ffb`7e50e838 41f6c120             test    r9b, 20h
00007ffb`7e50e83c 7405                 je      ntdll!RtlQueryPerformanceCounter+0x61dd3 (00007ffb`7e50e843)
00007ffb`7e50e83e 0faee8               lfence
00007ffb`7e50e841 eb09                 jmp     ntdll!RtlQueryPerformanceCounter+0x61ddc (00007ffb`7e50e84c)
00007ffb`7e50e843 41f6c110             test    r9b, 10h
00007ffb`7e50e847 7403                 je      ntdll!RtlQueryPerformanceCounter+0x61ddc (00007ffb`7e50e84c)
00007ffb`7e50e849 0faef0               mfence
00007ffb`7e50e84c 0f31                 rdtsc
00007ffb`7e50e84e e96ae2f9ff           jmp     ntdll!RtlQueryPerformanceCounter+0x4d (00007ffb`7e4acabd)
00007ffb`7e50e853 4584c9               test    r9b, r9b
00007ffb`7e50e856 7905                 jns     ntdll!RtlQueryPerformanceCounter+0x61ded (00007ffb`7e50e85d)
00007ffb`7e50e858 0f01f9               rdtscp
00007ffb`7e50e85b eb16                 jmp     ntdll!RtlQueryPerformanceCounter+0x61e03 (00007ffb`7e50e873)
00007ffb`7e50e85d 41f6c120             test    r9b, 20h
00007ffb`7e50e861 7405                 je      ntdll!RtlQueryPerformanceCounter+0x61df8 (00007ffb`7e50e868)
00007ffb`7e50e863 0faee8               lfence
00007ffb`7e50e866 eb09                 jmp     ntdll!RtlQueryPerformanceCounter+0x61e01 (00007ffb`7e50e871)
00007ffb`7e50e868 41f6c110             test    r9b, 10h
00007ffb`7e50e86c 7403                 je      ntdll!RtlQueryPerformanceCounter+0x61e01 (00007ffb`7e50e871)
00007ffb`7e50e86e 0faef0               mfence
00007ffb`7e50e871 0f31                 rdtsc
00007ffb`7e50e873 48c1e220             shl     rdx, 20h
00007ffb`7e50e877 480bd0               or      rdx, rax
00007ffb`7e50e87a e95be2f9ff           jmp     ntdll!RtlQueryPerformanceCounter+0x6a (00007ffb`7e4acada)
00007ffb`7e50e87f cc                   int     3
*/

#endif
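To make the flag checks in qpc_win_10 easier to follow, here is a small sketch that decodes the QpcBypassEnabled byte using the SHARED_GLOBAL_FLAGS_QPC_BYPASS_* values quoted from the WDK header in the notes. The `qpc_bypass_t` struct and the `qpc_bypass_decode` function are names made up for illustration, and the byte is passed in as a parameter so the sketch does not depend on reading SharedUserData. The priority between rdtscp, lfence and mfence follows the branch order in the assembly above.

```c
#include <stdint.h>

/* Values as quoted from the WDK header in the notes. */
#define SHARED_GLOBAL_FLAGS_QPC_BYPASS_ENABLED     0x01
#define SHARED_GLOBAL_FLAGS_QPC_BYPASS_USE_HV_PAGE 0x02
#define SHARED_GLOBAL_FLAGS_QPC_BYPASS_USE_MFENCE  0x10
#define SHARED_GLOBAL_FLAGS_QPC_BYPASS_USE_LFENCE  0x20
#define SHARED_GLOBAL_FLAGS_QPC_BYPASS_USE_RDTSCP  0x80

/* Hypothetical struct for this example: which path QPC would take. */
typedef struct qpc_bypass_t {
    int enabled;     /* 0 means QPC falls back to the syscall. */
    int use_hv_page; /* Scale the tsc with the hypervisor page values. */
    int use_rdtscp;  /* Use rdtscp instead of rdtsc. */
    int use_lfence;  /* lfence before rdtsc (only without rdtscp). */
    int use_mfence;  /* mfence before rdtsc (only without rdtscp and lfence). */
} qpc_bypass_t;

qpc_bypass_t qpc_bypass_decode( uint8_t flags ) {
    
    qpc_bypass_t result = { 0 };
    
    result.enabled = ( flags & SHARED_GLOBAL_FLAGS_QPC_BYPASS_ENABLED ) != 0;
    
    if ( !result.enabled ) {
        /* None of the other flags mean anything without the bypass. */
        return result;
    }
    
    result.use_hv_page = ( flags & SHARED_GLOBAL_FLAGS_QPC_BYPASS_USE_HV_PAGE ) != 0;
    result.use_rdtscp = ( flags & SHARED_GLOBAL_FLAGS_QPC_BYPASS_USE_RDTSCP ) != 0;
    
    if ( !result.use_rdtscp ) {
        /* Same order as the assembly: lfence is tested first, then mfence. */
        result.use_lfence = ( flags & SHARED_GLOBAL_FLAGS_QPC_BYPASS_USE_LFENCE ) != 0;
        result.use_mfence = !result.use_lfence && ( flags & SHARED_GLOBAL_FLAGS_QPC_BYPASS_USE_MFENCE ) != 0;
    }
    
    return result;
}
```

For example, a flags byte of 0x83 decodes to "bypass enabled, use the hypervisor page, use rdtscp".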
I think your load of qpc_bias from SharedUserData in qpc_win_10 is wrong.

In the assembly, r10d is loaded from the RtlpHypervisorSharedUserVa address, and that load happens inside the while loop, not outside. It is basically your some_value_that_should_not_be_zero, and that (r10d) is the value you should use for qpc_bias. The assembly does not repeatedly load this value into "low" as you do just before the comparison.

Edited by Mārtiņš Možeiko on
I believe the code is correct.

- When using the hypervisor page path, there are 2 adds: the first one comes from the hypervisor page + 0x10 (loaded in rcx, always 0 on my machine) and is inside the loop; the second one is the qpc bias (r11) and is outside the loop.
- When not using the hypervisor page, there is only the qpc bias (r11) add.
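Put as plain arithmetic, the two paths look like the sketch below. It takes every input as a parameter instead of reading the shared pages, so the function and parameter names (`qpc_from_tsc_hv`, `qpc_from_tsc`, `scale`, `offset`) are made up for this post. It also assumes the `unsigned __int128` extension from GCC/Clang; with MSVC you would use `__umulh` as in the program above.

```c
#include <stdint.h>

/* High 64 bits of a 64x64 -> 128 bit multiply (GCC/Clang extension). */
static uint64_t mul_high_u64( uint64_t a, uint64_t b ) {
    return ( uint64_t ) ( ( ( unsigned __int128 ) a * b ) >> 64 );
}

/* Hypervisor page path: the first add (offset, page + 0x10) is inside the
   loop, the second add (qpc bias, r11) happens after the loop. */
uint64_t qpc_from_tsc_hv( uint64_t tsc, uint64_t scale, uint64_t offset, uint64_t bias, uint8_t shift ) {
    
    uint64_t scaled = mul_high_u64( tsc, scale ) + offset;
    
    /* shr with cl only uses the low 6 bits for a 64 bit operand. */
    return ( scaled + bias ) >> ( shift & 63 );
}

/* Path without the hypervisor page: only the qpc bias add. */
uint64_t qpc_from_tsc( uint64_t tsc, uint64_t bias, uint8_t shift ) {
    return ( tsc + bias ) >> ( shift & 63 );
}
```

With a scale of 2^63 the multiply halves the tsc, so `qpc_from_tsc_hv( 1000, ( uint64_t ) 1 << 63, 5, 10, 0 )` is 500 + 5 + 10 = 515.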

Here is the assembly commented.

; Bypass enabled ?
test    r9b, 1
; NtQueryPerformanceCounter syscall
je      ntdll!RtlQueryPerformanceCounter+0x8b (00007ffb`7e4acafb)
; Loading qpc_bias in r11
mov     r11, qword ptr [SharedUserData+0x3b8 (00000000`7ffe03b8)]
; Use hypervisor page ?
test    r9b, 2
; Second part of assembly (rdtsc/p + maybe lfence or mfence)
je      ntdll!RtlQueryPerformanceCounter+0x61de3 (00007ffb`7e50e853)
; Loading address of RtlpHypervisorSharedUserVa into r8 (r8 = 0x7ffe8000)
mov     r8, qword ptr [ntdll!RtlpHypervisorSharedUserVa (00007ffb`7e5b3e28)]
; Is RtlpHypervisorSharedUserVa present ?
test    r8, r8
; NtQueryPerformanceCounter syscall
je      ntdll!RtlQueryPerformanceCounter+0x8b (00007ffb`7e4acafb)
; Loading first 4 byte of RtlpHypervisorSharedUserVa. r10d is 'HalT'
; This is the first instruction of the loop
mov     r10d, dword ptr [r8]
; r10d != 0
test    r10d, r10d
; NtQueryPerformanceCounter syscall
je      ntdll!RtlQueryPerformanceCounter+0x8b (00007ffb`7e4acafb)
; Use rdtscp ?
test    r9b, r9b
; Second part of assembly (use rdtsc + maybe lfence or mfence)
jns     ntdll!RtlQueryPerformanceCounter+0x61dc8 (00007ffb`7e50e838)
; rdtscp and combine the result in rdx
rdtscp
shl     rdx, 20h
or      rdx, rax
; Loading RtlpHypervisorSharedUserVa + 0x08 into rax
; This is the value that is used in the 128 bit multiply
; This is where the path that uses the hypervisor but not rdtscp returns to
mov     rax, qword ptr [r8+8]
; Loading RtlpHypervisorSharedUserVa + 0x10 into rcx
; This is the value that is added to the high bytes of the 128 multiply (but was always 0x0 in my tests)
mov     rcx, qword ptr [r8+10h]
; 128 bit multiply. rdx contains the result's high bytes, rax contains the result's low bytes
mul     rax, rdx
; Load RtlpHypervisorSharedUserVa + 0x0 'HalT' into eax
; This "discards" the low bytes of the 128 bit multiply
mov     eax, dword ptr [r8]
; rcx is the content of RtlpHypervisorSharedUserVa + 0x10
add     rdx, rcx
; Is eax == r10d => do both values contain 'HalT' ?
cmp     eax, r10d
; Loop jump
jne     ntdll!RtlQueryPerformanceCounter+0x39 (00007ffb`7e4acaa9)
; Load qpc_shift into cl (always 0x0 in my tests)
; This is where the path that doesn't use the hypervisor returns to.
mov     cl, byte ptr [SharedUserData+0x3c7 (00000000`7ffe03c7)]
; Add qpc_bias (r11) to rdx (high bytes of the 128 bit multiply) and store in rax
lea     rax, [rdx+r11]
; Do qpc_shift
shr     rax, cl
; Put the result at the output memory location
mov     qword ptr [rbx], rax
; Return 1
mov     eax, 1
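
To make the control flow of the listing above easier to follow, here is a rough C sketch of the hypervisor page path only. It is an interpretation of the disassembly, not Microsoft's code: the struct name, field names and function name are all made up for illustration, the layout only mirrors the offsets the listing reads (+0x0, +0x8, +0x10), and the TSC value is passed in as a parameter (the real code executes rdtscp inside the loop) so the sketch can be checked without the actual page.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the first 0x18 bytes of the hypervisor page, named
   after how the disassembly uses them. This is not an official layout. */
typedef struct hv_page_sketch_t {
    volatile uint32_t sequence; /* +0x00, must be non-zero ('HalT' in the listing) */
    uint32_t padding;           /* +0x04, not read by the listing */
    volatile uint64_t scale;    /* +0x08, operand of the 128 bit multiply */
    volatile uint64_t offset;   /* +0x10, added to the high 64 bits (rcx) */
} hv_page_sketch_t;

/* Returns 1 and writes a QPC-compatible value to *out, or returns 0 in the
   cases where the listing jumps to the NtQueryPerformanceCounter syscall. */
static int qpc_from_tsc_sketch( const hv_page_sketch_t* page, uint64_t tsc,
                                uint64_t qpc_bias, uint8_t qpc_shift,
                                uint64_t* out ) {
    uint32_t sequence;
    uint64_t high;

    assert( qpc_shift < 64 );

    if ( !page ) {
        return 0; /* RtlpHypervisorSharedUserVa is null. */
    }

    do {
        sequence = page->sequence;

        if ( sequence == 0 ) {
            return 0; /* First dword is zero: fall back to the syscall. */
        }

        /* Keep only the high 64 bits of the 128 bit product (rdx). */
        high = ( uint64_t ) ( ( ( unsigned __int128 ) tsc * page->scale ) >> 64 );
        high += page->offset;

        /* Retry if the first dword changed while we were reading. */
    } while ( page->sequence != sequence );

    /* Outside the loop: add qpc_bias (r11), then apply qpc_shift (cl). */
    *out = ( high + qpc_bias ) >> qpc_shift;
    return 1;
}
```

With a scale of 2^63 the high half of the product is simply tsc / 2, which makes the arithmetic easy to check by hand.
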
I looked at how to do a similar thing on Linux (convert the TSC result to seconds). This is based on the source of kernel version 5.9 and was tested on Manjaro Linux 64-bit in a VirtualBox virtual machine hosted on Windows 7 (i7 860). I downloaded the source from git.kernel.org, but I will link to elixir.bootlin.com as this should keep pointing to the right version.

clock_gettime uses vdso.

clock_gettime is an alias for __vdso_clock_gettime, which in turn calls __cvdso_clock_gettime, which calls __cvdso_clock_gettime_data.

__cvdso_clock_gettime retrieves the address of a data block shared by the kernel with the user process by calling __arch_get_vdso_data which returns a global variable called __vdso_data. It's a bit complicated how this variable can be accessed.

It's defined in arch/x86/include/asm/vdso/gettimeofday.h using the VVAR macro. When processed the actual name is vvar__vdso_data.

At the bottom of the vvar.h file there is a declaration:
DECLARE_VVAR(128, struct vdso_data, _vdso_data)


When processed, the macro expands to:
extern struct vdso_data vvar__vdso_data[CS_BASES] __attribute__((visibility("hidden")));


Note that the offset (128) isn't used at all in that code. CS_BASES, whose value is 2, is defined in datapage.h, as is struct vdso_data. The definition of the struct can change between kernel versions, but the first 6 fields don't seem to change and those are the ones that we are interested in (we'll see why later).

The problem here is that I would like to avoid including headers, especially if those headers require the kernel source. In principle, what we want is the location of the data so we can just read the bytes we need. After stepping through the assembly I could identify the address of vdso_data, and its offset from the vdso image address you get with getauxval seems to always be the same, about 16 KiB. In fact it was 16 KiB - 128 bytes before the image, which is 4 memory pages minus the offset from the DECLARE_VVAR macro. I don't remember how I found out, but the file arch/x86/entry/vdso/vdso-layout.lds.S says that vvar_start = -4 * PAGE_SIZE. So I think it's OK to assume a constant offset from the vdso image. If someone knows a safer way to retrieve the location of vvar__vdso_data I would like to know.

Note that the offset can change between kernel versions. For example, on kernel version 4.20 the offset is -3 * PAGE_SIZE.


__cvdso_clock_gettime_data calls __cvdso_clock_gettime_common, or calls a fallback function if gettime_common fails. The fallback function is a clock_gettime system call. Note that even if a system doesn't support the vdso clock_gettime, it still goes through the vdso functions, fails and then makes a syscall (I observed that on a 32-bit Intel Atom processor running Debian 10 with the 4.19 kernel).

My understanding of how clock_gettime (with CLOCK_MONOTONIC) works at a high level is:
- periodically the kernel updates the vdso_data values.
- I measured an interval (if my understanding of what's going on is correct) of 4 697 428 nanoseconds (about 4.7 ms) on my setup, so the update interval is in the range of 10 milliseconds. I'll explain how I measured that at the end.
- When you call gettime, it reads the TSC, measures how much it has changed since the last kernel update and returns the kernel time + the change. I believe this is done to keep more precision and to keep everything in a 64-bit integer range.

__cvdso_clock_gettime_common first checks that we requested a valid clock, then, based on the clock, calls either do_hres or do_coarse. If you request CLOCK_MONOTONIC_RAW it will use the second element of the vdso_data array instead of the first one.

do_coarse will simply return the last value from the kernel update, not reading the current TSC (not adding the difference).

do_hres:
- Most of the function is in a loop. I believe this is to make sure that the values read from the kernel don't change between the different reads. It's non blocking and will loop until it succeeds. If you try to step in this assembly you'll not be able to step out of this loop as the value will most likely change while you step.
- There is another loop that checks for time namespaces. I'm not familiar with time namespaces but I'm confident we don't care about that in our case, so we can skip this loop.
- The code then calls __arch_get_hw_counter: in our case this calls rdtscp or rdtsc with memory fences. Even PVCLOCK and HVCLOCK will at some point use rdtscp and adjust its value. In practice, while stepping through the assembly, this results in rdtscp being called.
- The code then computes the difference between the current TSC value and the last kernel TSC value, converts it to nanoseconds and adds it to the last kernel time value.
--- vdso_calc_delta verifies that the new TSC value is greater than the one from the kernel (I believe because the Intel spec says that TSC values can be a little off if you read the TSC of two cores at the same time).
--- It then computes the difference between the two and multiplies it by the mult field from the vdso_data structure.
--- It adds that value to the kernel nanosecond value and shifts the result right by the shift field from the vdso_data structure.
- If the values weren't updated during that, the loop ends.
- The nanosecond count is converted to seconds and added to the result seconds, and the remainder is stored as nanoseconds.
- That's it.
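
The do_hres steps above can be sketched roughly as follows. This is only an illustration of my reading of the kernel source, not the kernel's code: the struct and function names are invented, the struct is a trimmed-down stand-in for struct vdso_data, the memory barriers and the time namespace handling are left out, and the TSC value is passed in as a parameter instead of being read with rdtscp inside the loop.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ull

/* Hypothetical, trimmed-down stand-in for struct vdso_data. */
typedef struct vdso_sketch_t {
    volatile uint32_t seq;        /* odd while the kernel is updating the data */
    volatile uint64_t cycle_last; /* TSC value at the last kernel update */
    volatile uint32_t mult;
    volatile uint32_t shift;
    volatile uint64_t base_sec;   /* kernel time at the last update */
    volatile uint64_t base_nsec;  /* stored already shifted left by 'shift' */
} vdso_sketch_t;

static void hres_sketch( const vdso_sketch_t* vd, uint64_t tsc,
                         uint64_t* sec_out, uint64_t* nsec_out ) {
    uint32_t seq;
    uint64_t sec, ns;

    assert( vd->shift < 64 );

    do {
        /* Wait for an even sequence number (no update in progress). */
        do {
            seq = vd->seq;
        } while ( seq & 1 );

        /* vdso_calc_delta: ignore a TSC value older than cycle_last. */
        uint64_t last = vd->cycle_last;
        uint64_t delta = ( tsc > last ) ? ( tsc - last ) * vd->mult : 0;

        /* Add to the (pre-shifted) kernel nanoseconds, then shift right. */
        ns = ( vd->base_nsec + delta ) >> vd->shift;
        sec = vd->base_sec;

        /* Retry if the kernel updated the data while we were reading it. */
    } while ( vd->seq != seq );

    /* Fold whole seconds out of the nanosecond count. */
    *sec_out = sec + ns / NSEC_PER_SEC;
    *nsec_out = ns % NSEC_PER_SEC;
}
```

Note that the multiply only covers the cycles elapsed since the last kernel update, which is why it fits in 64 bits here, unlike a multiply of the full TSC value.
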

This means that to convert a TSC value to a time, you multiply it by vdso_data.mult and shift it right by vdso_data.shift, which gives an integer representing the timestamp in nanoseconds (clock_getres seems to always return 1 nanosecond).

- One small issue is that the mult field can change, but it only changes by 1 (oscillating between 0x5b7e10 and 0x5b7e0f for example), so I don't think it's a big issue. The shift value never changed in my tests.
- A bigger issue is that the result of the multiply will quickly need more than 64 bits. We could either use doubles or do a 128-bit integer multiply, and maybe the shift will bring the value back into a 64-bit range? I didn't test the 128-bit multiply as I couldn't figure out whether there are intrinsics like _umul128 on Linux. If anyone knows, I'm all ears.


One thing that is off is that if I convert the rdtscp value to ns using mult and shift, the result is not the same as what I get from clock_gettime. In my tests there was a difference of about 30 seconds. I'm not sure, but it seems to be the time it takes for the system to boot up. Maybe using CLOCK_MONOTONIC_RAW would give a closer result, but I didn't test it.


How I measured the interval for the kernel update:
- This is to get an idea of the range, not a precise measurement;
- I set a breakpoint in __vdso_clock_gettime in gdb;
- I stepped until I reached the rdtscp instruction;
- A little bit after that there is code that looks like this:
mov 0x8(%r10), %rcx
mov 0x28(%r11), %rax
mov 0x18(%r10), %esi

- I added a breakpoint on the second instruction, and used "continue" to take another iteration of the loop;
- In that code rcx is cycle_last in the vdso_data structure (r10 is the address of the vdso_data structure);
- I noted the value of rcx (0xadac82ca89c);
- There is no need to step, as rax already contains the current TSC value (0xadac8f51ca1) at that point (it is also stored in rdx);
- I subtracted them, multiplied the result by vdso_data.mult and shifted it right by vdso_data.shift:
- 0xadac8f51ca1 - 0xadac82ca89c = 0xc87405
- ( 0xc87405 * vdso_data.mult ) >> 0x18 = 0x47ad54 = 4 697 428 ns

#include <time.h>
#include <sys/auxv.h>
#include <stdint.h>
#include <stdio.h>
#include <assert.h>
#include <sys/utsname.h>

typedef struct vdso_data_t {
	uint32_t seq;
    
	int32_t clock_mode;
	uint64_t cycle_last;
	uint64_t mask;
	uint32_t mult;
	uint32_t shift;
    
    struct {
        uint64_t sec;
        uint64_t nsec;
    } basetime[ 12 ];
    
    /*
        union {
            struct vdso_timestamp	basetime[VDSO_BASES];
            struct timens_offset	offset[VDSO_BASES];
        };
    */
    
	int32_t tz_minuteswest;
	int32_t tz_dsttime;
	uint32_t hrtimer_res;
	uint32_t __unused;
    
    /* Empty struct on x86
        struct arch_vdso_data	arch_data;
    */
} vdso_data_t;

typedef unsigned __int128 uint128_t;

void get_kernel_version( uint32_t* major_out,  uint32_t* minor_out ) {
    
    /* uname -r */
    
    *major_out = 5;
    *minor_out = 9;
    struct utsname info;
    
    if ( uname( &info ) >= 0 ) {
        
        uint32_t major = 0;
        uint32_t minor = 0;
        char* version = info.release;
        
        while ( *version && *version != '.' ) {
            
            char c = *version;
            
            if ( c >= '0' && c <= '9' ) {
                major *= 10;
                major += ( c - '0' );
            }
            
            version++;
        }
        
        version++;
        
        while ( *version && *version != '.' ) {
            
            char c = *version;
            
            if ( c >= '0' && c <= '9' ) {
                minor *= 10;
                minor += ( c - '0' );
            }
            
            version++;
        }
        
        if ( major ) {
            *major_out = major;
            *minor_out = minor;
        }
    }
}

intptr_t get_vdso_data_offset( void ) {
    
    uint32_t major, minor;
    get_kernel_version( &major, &minor );
    
    intptr_t page_size = 1 << 12;
    intptr_t offset = -4 * page_size;
    
    if ( major == 5 ) {
        
        if ( minor > 5 ) {
            offset = -4 * page_size;
        } else {
            offset = -3 * page_size;
        }
    } else if ( major == 4 ) {
        
        if ( minor > 11 ) {
            offset = -3 * page_size;
        } else if ( minor > 6 ) {
            offset = -2 * page_size;
        } else if ( minor > 4 ) {
            offset = -3 * page_size;
        } else if ( minor > 1 ) {
            offset = -2 * page_size;
        } else {
            assert( !"Unsupported" );
        }
    }
    
    offset += 128;
    
    return offset;
}

int main( int argc, char** argv ) {
    
    intptr_t vdso_data_offset = get_vdso_data_offset( );
    
    uint8_t* vdso = ( uint8_t* ) getauxval( AT_SYSINFO_EHDR );
    vdso_data_t* data = ( vdso_data_t* ) ( vdso + vdso_data_offset );
    
    uint32_t mul = data[ 0 ].mult;
    uint32_t shift = data[ 0 ].shift;
    
    struct timespec res;
    clock_getres( CLOCK_MONOTONIC, &res );
    /* Ticks per second: 1e9 ns divided by the clock resolution (1 ns in my tests). */
    double frequency = 1000000000.0 / ( double ) res.tv_nsec;
    
    uint32_t x = 0;
    uint64_t rdtsc = __builtin_ia32_rdtscp( &x );
    struct timespec spec;
    clock_gettime( CLOCK_MONOTONIC, &spec );
    
    uint128_t temp = ( uint128_t ) rdtsc * mul;
    temp >>= shift;
    assert( ( temp & 0xffffffffffffffff ) == temp );
    uint64_t result = ( uint64_t ) ( temp );
    
    double r1 = ( double ) result / frequency;
    printf( "integer: %5.9f\n", r1 );
    
    double divisor = 1;
    
    for ( uint32_t i = 0; i < shift; i++ ) {
        divisor *= 2;
    }
    
    double r1b = ( ( double ) rdtsc * ( double ) mul ) / divisor;
    r1b /= frequency;
    printf( "double : %5.9f\n", r1b );
    
    double r2 = ( double ) spec.tv_sec + ( ( double ) spec.tv_nsec / frequency );
    printf( "gettime: %5.9f\n", r2 );
    printf( "---\n" );
    printf("integer diff: %5.9f\n", r1 - r2 );
    printf("double  diff: %5.9f\n", r1b - r2 );
    
    return 0;
}


Edited by Simon Anciaux on Reason: updated the code
mrmixer
- A bigger issue is that the result of the multiply will quickly take more than 64bit. We could either use doubles for that or doing a 128bit integer multiply and maybe the shift will bring the value in a 64bit range ? I didn't tested the 128bit multiply as I couldn't figure out if there were instrinsics like _umul128 on linux. If anyone knows, I'm all ears.


Intrinsics like _umul128 are not OS specific. They are compiler specific. On gcc/clang you can use __int128 type instead (even on Windows):
uint64_t a = ..., b = ...;
unsigned __int128 big = (unsigned __int128)a * b;
uint64_t higher_64_bits = (uint64_t)(big >> 64);

The compiler will correctly optimize it to a single register mul, and in this example the shift is "free", as I simply take the upper 64 bits.

An alternative is inline asm. As you are writing architecture-specific code, the asm only needs to exist in one variant. For an unsigned 64x64 multiply it is a trivial one-liner of inline asm with the "mul" instruction.
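
For illustration, here is a minimal sketch of what such an inline asm one-liner could look like with gcc/clang on x86-64, next to the __int128 version. The function names are made up for this example. It uses the one-operand `mul` instruction, the unsigned counterpart of `imul`, which leaves the 128-bit product in rdx:rax; on other targets the sketch simply falls back to __int128.

```c
#include <assert.h>
#include <stdint.h>

/* Reference version: let the compiler pick the instruction. */
static uint64_t mul_high_reference( uint64_t a, uint64_t b ) {
    return ( uint64_t ) ( ( ( unsigned __int128 ) a * b ) >> 64 );
}

/* Inline asm version (x86-64, gcc/clang syntax only). */
static uint64_t mul_high_asm( uint64_t a, uint64_t b ) {
#if defined( __x86_64__ ) && ( defined( __GNUC__ ) || defined( __clang__ ) )
    uint64_t low, high;
    /* One-operand mul: rdx:rax = rax * operand (unsigned). */
    __asm__( "mulq %3"
             : "=a"( low ), "=d"( high )
             : "a"( a ), "rm"( b )
             : "cc" );
    ( void ) low; /* Only the upper 64 bits are needed here. */
    return high;
#else
    return mul_high_reference( a, b );
#endif
}

/* Small self-check: both variants must agree. */
static void mul_high_self_check( void ) {
    assert( mul_high_asm( UINT64_MAX, 2 ) == mul_high_reference( UINT64_MAX, 2 ) );
    assert( mul_high_asm( 1ull << 63, 100 ) == mul_high_reference( 1ull << 63, 100 ) );
    assert( mul_high_asm( 3, 4 ) == mul_high_reference( 3, 4 ) );
}
```

In practice the __int128 version is usually the better choice, since the compiler can schedule and combine it with surrounding code, while the asm block is opaque to the optimizer.
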
Thanks. I updated the code with the 128-bit multiply and some code to try to choose the offset based on the kernel version (only tested on kernel 5.9, so we shouldn't assume it's working).
Hello, thank you for the in-depth info about QPC, very interesting read.

I just wanted to add that QueryPerformanceFrequency always returned 10 MHz for me as well when the invariant TSC was used, and always 14.32 MHz when HPET was used.

I came across this while searching for the best method to profile my (VBA) code with the QueryPerformanceCounter function. I found that QPF always returned 10 MHz as well and, after reading this thread and a bit more, that the HPET timer wasn't used: it was on in the BIOS, but Windows 10 used the invariant TSC (I found that out with a piece of profiling software). And surprisingly, when I turned HPET on, QPF also always returned the same value, now 14.318.180. This held after a restart, and also after a restart where I cut off the power for half a minute: still 14.32 MHz. Info about my desktop: Windows 10, version 2004, build 19041.985, i7-2600k (3.4 GHz, not overclocked), Hyper-V enabled in the BIOS (I did not test the effect of turning Hyper-V off).

And, also not sure why and if interesting, when looping over the QPC function 5000 times and storing its value in an array (calculations only done after the loop completed), with the ITSC the difference between QPC calls would be about 28 ticks and the total 5000 loops took about 14-15 milliseconds, but with HPET on it registered about 55 ticks per loop and total calculated time was 18-20 ms.
TimerTimmyyy
And surprisingly, when I turned HPET on the QPF also always returned the same value, now being 14.318.180. After restart but also after a restart where I cut off the power for half a minute. Still 14.32mhz.


I assume the surprising part is that it's always the same value. But HPET is a hardware timer so it is expected to always have the same frequency.

TimerTimmyyy
And, also not sure why and if interesting, when looping over the QPC function 5000 times and storing its value in an array (calculations only done after the loop completed), with the ITSC the difference between QPC calls would be about 28 ticks and the total 5000 loops took about 14-15 milliseconds, but with HPET on it registered about 55 ticks per loop and total calculated time was 18-20 ms.


I did a quick test, and QPC with HPET on most likely takes more time because it's a syscall instruction, meaning the program will ask the Windows kernel to do something (I haven't looked exactly why a syscall takes more time though). When HPET is off, QPC is a RDTSCP instruction with a few more instructions around it.
Thanks for the response!

QPC with HPET on most likely takes more time because it's a syscall instruction, meaning the program will ask the Windows kernel to do something


At first I thought it was because of the different time scaling with HPET on/off (10 MHz vs 14.32 MHz), but this actually makes more sense! Can you please tell me a little more about:

When HPET is off, QPC is a RDTSCP instruction with a few more instructions around it.


Can that last part stand on its own? Specifically --->

QPC is a RDTSCP instruction with a few more instructions around it.


I'm quite a noob and the texts above (with assembly code) go way over my head, but I remember learning from this thread that the QPC function actually has a loop in it. I also noticed this when calling QPC repeatedly (for example in a loop of 5000 iterations). Usually the number of ticks between two QPC calls was around 25, but sometimes a QPC call would take about 160 ticks. It didn't seem to matter where in the loop this occurred: it happened as often between calls 1 to 500 as between calls 4000 to 4500. As explained above, it is probably because of that loop inside QPC itself (I hope I understand this correctly).

I spent about a full week reading on this, and I kept having the feeling that RDTSCP would be a better option than QPC. I don't care how many nanoseconds the result is off; I just want to output how one piece of code compares to another piece of code. What I think I need for that is core CPU cycles. So here comes my question, finally:

QPC is a RDTSCP instruction with a few more instructions around it.


Does this mean that it actually does not matter which function you use (QPC or RDTSCP) because they actually return the same thing (just QPC having a little more overhead)? Both returning the amount of clock cycles since boot.

It is what I concluded before (I mean, even Google Benchmark uses the rdtsc command), but when I read another comment yesterday I got confused again, making it sound like QPC (or also (RD)TSC) is 'just a wall clock time'...

"The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter."

- quote from the 3rd paragraph (the part about latency) in this answer

I thought QPC returned the core clock cycles, but according to this quote... it doesn't: TSC (and thus QPC and RDTSC) return a 'reference clock cycle'?

Edited by Tim on
It is called that because it runs at the same frequency regardless of how fast the core actually runs. Modern CPU cores have turbo scaling and can boost their frequency, which changes how many cycles execute per second. But the rdtsc counter is invariant: it returns the same number of "reference cycles" per second.

> Does this mean that it actually does not matter which function you use (QPC or RDTSCP) because they actually return the same thing (just QPC having a little more overhead)?

No, you should not rely on QPC returning same value as rdtsc. There can be many different things affecting that - windows settings, hypervisor, hardware/bios configuration.

You should use rdtsc when you want a low-latency counter, but you won't know the actual wall-clock time.
And you should use QPC when you need the actual time.

If you want a low-latency counter but still need to convert to real time, you can use rdtsc and from time to time (like every 100 msec or once a second) call QPC and synchronize the rdtsc readings with these QPC values, linearly interpolating between them.
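
A minimal sketch of that synchronization idea, assuming you keep the two most recent (rdtsc, QPC) pairs around. The struct and function names are invented for this example, and the actual rdtsc and QueryPerformanceCounter calls are left to the caller so the math stays platform independent.

```c
#include <assert.h>
#include <stdint.h>

/* Two synchronization points, each an rdtsc reading paired with the QPC
   value taken at (nearly) the same moment. Hypothetical helper type. */
typedef struct tsc_sync_t {
    uint64_t tsc0, qpc0; /* older synchronization point */
    uint64_t tsc1, qpc1; /* most recent synchronization point */
} tsc_sync_t;

/* Record a new pair; the previous newest pair becomes the older one.
   Call this every 100 msec or so with fresh rdtsc and QPC readings. */
static void tsc_sync_update( tsc_sync_t* sync, uint64_t tsc, uint64_t qpc ) {
    sync->tsc0 = sync->tsc1;
    sync->qpc0 = sync->qpc1;
    sync->tsc1 = tsc;
    sync->qpc1 = qpc;
}

/* Map an rdtsc reading to QPC ticks on the line through the two points.
   Readings newer than the last point are extrapolated along that line. */
static uint64_t tsc_to_qpc( const tsc_sync_t* sync, uint64_t tsc ) {
    /* Needs two distinct points and a reading not older than the first. */
    assert( sync->tsc1 > sync->tsc0 );
    assert( tsc >= sync->tsc0 );

    uint64_t tsc_span = sync->tsc1 - sync->tsc0;
    uint64_t qpc_span = sync->qpc1 - sync->qpc0;

    /* 128-bit intermediate so the product cannot overflow. */
    unsigned __int128 scaled = ( unsigned __int128 ) ( tsc - sync->tsc0 ) * qpc_span;

    return sync->qpc0 + ( uint64_t ) ( scaled / tsc_span );
}
```

The result is in QPC ticks, so it can be divided by the QueryPerformanceFrequency value to get seconds, as with a regular QPC reading.
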

Edited by Mārtiņš Možeiko on
mrmixer
QPC is a RDTSCP instruction with a few more instructions around it.


What I meant was that after the RDTSCP instruction there are some instructions that convert the resulting value to a value that you can divide by the result of QueryPerformanceFrequency to get a time in seconds. Otherwise the result of RDTSCP can't be "easily" converted to seconds. The few instructions are a 128-bit multiply, 2 adds and a bit shift (on Windows 10). Note that this is based on my observations and I don't guarantee that it's correct or that it will continue to work in the future.

Here is the code I use in a profiler to convert the result of RDTSCP to a QPF compatible value.

uint64_t profiler_cycles_to_time( profiler_tool_t* tool, uint64_t cycles ) {

    uint64_t time = cycles;
    profiler_platform_data_t* platform = &tool->platform_data;

    if ( tool->platform == profiler_platform_windows ) {

        if ( platform->windows.version == profiler_windows_version_10 ) {

#if defined( PROFILER_MSVC )

            time = __umulh( cycles, platform->windows.mul128 );

#elif defined( PROFILER_CLANG ) || defined( PROFILER_GCC )

            unsigned __int128 big = ( unsigned __int128 ) cycles * platform->windows.mul128;
            time = ( uint64_t ) ( big >> 64 );

#else
# error Unsupported compiler.
#endif
            time += platform->windows.add;
        }

        time += platform->windows.qpc_bias;
        time >>= platform->windows.qpc_shift;

    } else if ( tool->platform == profiler_platform_linux ) {

#if defined( PROFILER_MSVC )

        uint64_t high = 0;
        uint64_t low = _umul128( cycles, platform->linux_.mult, &high );
        profiler_assert( platform->linux_.shift <= 0xff );
        time = __shiftright128( low, high, ( unsigned char ) platform->linux_.shift );

#elif defined( PROFILER_CLANG ) || defined( PROFILER_GCC )

        unsigned __int128 big = ( unsigned __int128 ) time * tool->platform_data.linux_.mult;
        big >>= tool->platform_data.linux_.shift;
        time = ( uint64_t ) big;

#else
# error Unsupported compiler.
#endif

    } else {

        profiler_assert( tool->platform == profiler_platform_fallback );
        time = profiler_cycles_to_time_fallback( tool, cycles );
    }

    return time;
}


TimerTimmyyy

I'm quite a noob and above texts (with assembly code) go way over my head, but I remember reading/learning from this thread that the QPC function actually has a loop in it. I also noticed this when calling the QPC repeatedly (for example loop of 5000 times). Usually the amount of tics in between two QPC calls was around 25 tics. But sometimes a QPC call would take about 160 tics. There didn't seem to be a difference where in the loop this occurred: it occurred as often in between call 1 to 500 as in between call 4000-4500. As explained above, it is probably because of that loop inside the QPC itself (hope I understand this correctly).


As I don't have the code of the QPC function, I can only guess what it does. My understanding of the loop in QPC is that it will almost never run more than once. I haven't measured that, but in my tests I don't think it ever did. The reason for the loop is (once again, it's a guess) to make sure some information in the hypervisor memory page doesn't change during the call to QPC.

The 160 ticks you saw might be related to that, but it could also be Windows not giving processor time to your application for some other reason.

TimerTimmyyy
Does this mean that it actually does not matter which function you use (QPC or RDTSCP) because they actually return the same thing (just QPC having a little more overhead)? Both returning the amount of clock cycles since boot.


As mmozeiko said, no. As far as I know QPC will never return the same value as RDTSCP. It will return a value derived from the RDTSCP value, so that you can use QPF to convert the timestamp to seconds. If you don't care about the absolute time, and only want to compare different runs of a piece of code, then you can use RDTSCP.

TimerTimmyyy
"The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter."


As mmozeiko said, the invariant time stamp counter (the value returned by RDTSCP and RDTSC) does not count actual core clock cycles: the processor speed changes in real time, so a second can take X core cycles at one point and Y cycles at another. The invariant TSC increases its value at a constant rate, so that a second always takes X cycles; until you restart your computer, at which point X might be different, but it will still be X cycles per second until the next restart.