I looked at how to do a similar thing on Linux (convert a TSC result to seconds). This is based on the source of kernel version 5.9 and was tested on 64-bit Manjaro Linux in a VirtualBox virtual machine hosted on Windows 7 (i7 860). I downloaded the source from git.kernel.org, but I will link to elixir.bootlin.com as this should keep pointing to the right version.
clock_gettime uses vdso.
clock_gettime is an alias for __vdso_clock_gettime, which in turn calls __cvdso_clock_gettime, which calls __cvdso_clock_gettime_data.
__cvdso_clock_gettime retrieves the address of a data block shared by the kernel with the user process by calling __arch_get_vdso_data, which returns a global variable called __vdso_data. How this variable is accessed is a bit convoluted.
It's defined in arch/x86/include/asm/vdso/gettimeofday.h using the VVAR macro. When processed, the actual name is vvar__vdso_data.
At the bottom of the vvar.h file there is a declaration:
| DECLARE_VVAR(128, struct vdso_data, _vdso_data)
|
When processed, the macro expands to:
| extern struct vdso_data vvar__vdso_data[CS_BASES] __attribute__((visibility("hidden")));
|
Note that the offset (128) isn't used at all in that code.
CS_BASES, whose value is 2, is defined in datapage.h, as is struct vdso_data. The definition of the struct can change between kernel versions, but the first 6 fields don't seem to change, and those are the ones we are interested in (we'll see why later).
The problem here is that I would like to avoid including headers, especially if those headers require the kernel source. In principle, all we need is the location of the data so that we can read just the bytes we want. To find that location, I stepped through the assembly and identified the address of vdso_data. Its offset from the vdso image you get with getauxval seems to always be the same, about 16 KiB. In fact vdso_data sits 16 KiB - 128 bytes before the image, which is 4 memory pages minus the offset from the DECLARE_VVAR macro. I don't remember how I found it, but there is a file, arch/x86/entry/vdso/vdso-layout.lds.S, that says that vvar_start = -4 * PAGE_SIZE. So I think it's OK to assume a constant offset from the vdso image. If someone knows a safer way to retrieve the location of vvar__vdso_data, I would like to know.
Note that the offset can change between kernel versions. For example, on kernel version 4.20 the offset is -3 * PAGE_SIZE.
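The address computation described above can be sketched as follows. This is only a sketch under the assumptions stated in the text (x86-64, vvar pages a fixed number of pages below the vdso image, vdso_data 128 bytes into them). The helper names vdso_base, looks_like_elf and guess_vdso_data_address are made up for illustration and are not kernel or libc API. The computed address is a guess and is not dereferenced here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/auxv.h>
#include <unistd.h>

/* Hypothetical helper: base address of the vdso image, or 0 if the kernel
   did not provide one in the auxiliary vector. */
static uintptr_t vdso_base(void) {
    return (uintptr_t) getauxval(AT_SYSINFO_EHDR);
}

/* Hypothetical helper: sanity check that the image starts with the ELF magic. */
static int looks_like_elf(uintptr_t base) {
    return base != 0 && memcmp((const void *) base, "\177ELF", 4) == 0;
}

/* Assumption: the vvar pages sit `pages_below` pages under the image
   (4 on 5.9, 3 on 4.20 according to the text) and vdso_data starts
   128 bytes into them (the DECLARE_VVAR offset). */
static uintptr_t guess_vdso_data_address(uintptr_t base, long pages_below) {
    long page_size = sysconf(_SC_PAGESIZE);
    assert(page_size > 0 && pages_below > 0);
    return base - (uintptr_t) (pages_below * page_size) + 128;
}
```

Checking the ELF magic first is a cheap way to notice that getauxval returned something unexpected before trusting any offset derived from it.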
__cvdso_clock_gettime_data calls __cvdso_clock_gettime_common, or calls a fallback function if gettime_common fails. The fallback function is a clock_gettime system call. Note that even if a system doesn't support the vdso clock_gettime, it still goes through the vdso functions, fails and then makes a syscall (I observed that on a 32-bit Intel Atom processor running Debian 10 with the 4.19 kernel).
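To compare against the fallback path, the system call can also be issued directly from user code, bypassing the vdso. A minimal sketch, assuming glibc's generic syscall() wrapper; the function name clock_gettime_syscall is made up for illustration:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Enters the kernel directly instead of going through the vdso, which is
   what the fallback ends up doing. Returns 0 on success, -1 on error. */
static int clock_gettime_syscall(clockid_t clock, struct timespec *out) {
    assert(out != NULL);
    return (int) syscall(SYS_clock_gettime, clock, out);
}
```

Timing a loop of clock_gettime_syscall( CLOCK_MONOTONIC, &ts ) against the regular clock_gettime is a simple way to check whether the vdso fast path is actually being used on a given machine.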
My understanding of how clock_gettime (with CLOCK_MONOTONIC) works at a high level is:
- Periodically the kernel updates the vdso_data values.
- I measured an interval (if my understanding of what's going on is correct) of 4 697 428 nanoseconds (about 4.7 ms) on my setup. So the update interval is on the order of milliseconds (under 10 ms in this measurement). I'll explain how I measured that at the end.
- When you call gettime, it reads the TSC, measures how much it has changed since the last kernel update and returns the kernel time plus that change. I believe this is done to preserve precision while keeping everything within a 64-bit integer range.
__cvdso_clock_gettime_common first checks that we requested a valid clock, then, based on the clock, calls either do_hres or do_coarse. If you request CLOCK_MONOTONIC_RAW it uses the second element of the vdso_data array instead of the first one.
do_coarse simply returns the last value from the kernel update, without reading the current TSC (so without adding the difference).
do_hres:
- Most of the function is in a loop. I believe this is to make sure that the values read from the kernel don't change between the different reads. It's non-blocking and loops until it succeeds. If you try to step through this assembly you won't be able to step out of the loop, as the values will most likely change while you step.
- There is another loop that checks for time namespaces. I'm not familiar with time namespaces, but I'm confident we don't care about them in our case, so we can skip this loop.
- The code then calls __arch_get_hw_counter: in our case this calls rdtscp, or rdtsc with memory fences. Even PVCLOCK and HVCLOCK will at some point use rdtscp and adjust its value. In practice, while stepping through the assembly, this results in rdtscp being called.
- The code then computes the difference from the kernel's last TSC value, converts it to nanoseconds and adds it to the last kernel time value:
--- vdso_calc_delta verifies that the new TSC value is greater than the one from the kernel (I believe because the Intel spec says that TSC values can be slightly off if you read the TSC of two cores at the same time).
--- It then computes the difference between the two and multiplies it by the mult field from the vdso_data structure.
--- It adds that value to the kernel nanosecond value and shifts the result right by the shift field from the vdso_data structure.
- If the values weren't updated in the meantime, the loop ends.
- The nanosecond count is converted to seconds and added to the resulting seconds, and the remainder is stored as nanoseconds.
- That's it.
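The steps above can be sketched in plain C. This is not the kernel code: the struct is a simplified, hypothetical stand-in for the fields do_hres reads, the time namespace loop is left out, and the TSC value is passed in as a parameter (the real function reads it inside the loop) so that the arithmetic can be checked with fixed numbers. It also assumes, as the description implies, that the kernel nanosecond value is stored already shifted left by shift.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the fields do_hres reads. */
typedef struct {
    volatile uint32_t seq; /* odd while an update is in progress */
    uint64_t cycle_last;   /* TSC value at the last kernel update */
    uint32_t mult;
    uint32_t shift;
    uint64_t base_sec;     /* kernel time at the last update */
    uint64_t base_nsec;    /* assumed to be stored left-shifted by `shift` */
} sketch_vdso_data;

typedef struct {
    uint64_t sec;
    uint64_t nsec;
} sketch_timespec;

static sketch_timespec hres_sketch(const sketch_vdso_data *vd, uint64_t tsc_now) {
    uint32_t seq;
    uint64_t sec, ns;
    assert(vd != NULL);
    do {
        /* Spin while the sequence number is odd (update in progress). */
        while ((seq = vd->seq) & 1u) {
        }
        __sync_synchronize(); /* full barrier; stronger than strictly needed */
        /* Same guard as vdso_calc_delta: never go backwards. */
        uint64_t delta = 0;
        if (tsc_now > vd->cycle_last) {
            delta = (tsc_now - vd->cycle_last) * vd->mult;
        }
        ns = (vd->base_nsec + delta) >> vd->shift;
        sec = vd->base_sec;
        __sync_synchronize();
    } while (seq != vd->seq); /* retry if the data changed while we read it */

    /* Carry whole seconds out of the nanosecond count. */
    sketch_timespec out = { sec + ns / 1000000000u, ns % 1000000000u };
    return out;
}
```

The retry loop is why stepping through the real code in a debugger never terminates: by the time the second read of seq happens, the kernel has almost certainly published a new update.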
This means that to convert a TSC value to seconds, you need to multiply it by vdso_data.mult and shift it right by vdso_data.shift, which gives you an integer representing the timestamp in nanoseconds (clock_getres seems to only ever return 1 nanosecond).
- One small issue is that the mult field can change, but it only changes by 1 (oscillating between 0x5b7e10 and 0x5b7e0f for example), so I don't think it's a big issue. The shift value never changed in my tests.
- A bigger issue is that the result of the multiplication will quickly need more than 64 bits. We could either use doubles for that, or do a 128-bit integer multiply and hope the shift brings the value back into the 64-bit range. I didn't test the 128-bit multiply at the time as I couldn't figure out whether there are intrinsics like _umul128 on Linux. If anyone knows, I'm all ears.
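For the 128-bit route, GCC and Clang provide an unsigned __int128 type on 64-bit targets, so no intrinsic is needed. A minimal sketch, assuming one of those compilers on x86-64; the function name tsc_to_ns is made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Converts a raw TSC value to nanoseconds as (tsc * mult) >> shift, keeping
   the intermediate product in 128 bits so it cannot overflow.
   Relies on the GCC/Clang `unsigned __int128` extension. */
static uint64_t tsc_to_ns(uint64_t tsc, uint32_t mult, uint32_t shift) {
    assert(shift < 128);
    unsigned __int128 product = (unsigned __int128) tsc * mult;
    /* The final cast truncates if the shifted value still exceeds 64 bits. */
    return (uint64_t) (product >> shift);
}
```

With a shift of 24 and a mult that fits in 23 bits, as in the values quoted above, the shifted result is never larger than the input TSC value, so it fits in 64 bits.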
One thing that is off is that if I convert the rdtscp value to ns using mult and shift, the result is not the same as what I get from clock_gettime. In my tests there was a difference of about 30 seconds. I'm not sure, but it seems to be the time it takes for the system to boot up. Maybe using CLOCK_MONOTONIC_RAW would give a closer result, but I didn't test it.
How I measured the interval for the kernel update:
- This is to get an idea of the range, not a precise measurement;
- I set a breakpoint in __vdso_clock_gettime in gdb;
- I stepped until I reached the rdtscp instruction;
- A little bit after that there is code that looks like this:
| mov 0x8(%r10), %rcx
mov 0x28(%r11), %rax
mov 0x18(%r10), %esi
|
- I added a breakpoint on the second instruction and used "continue" to go through another iteration of the loop;
- In that code rcx is cycle_last in the vdso_data structure (r10 is the address of the vdso_data structure);
- I noted the value of rcx (0xadac82ca89c);
- Don't step any further: at this point rax contains the current TSC value (0xadac8f51ca1), which is also stored in rdx;
- I subtracted the two, multiplied the result by vdso_data.mult and shifted it right by vdso_data.shift:
- 0xadac8f51ca1 - 0xadac82ca89c = 0xc87405
- (0xc87405 * vdso_data.mult) >> 0x18 = 0x47ad54 = 4 697 428 ns
| #include <time.h>
#include <sys/auxv.h>
#include <stdint.h>
#include <stdio.h>
#include <assert.h>
#include <sys/utsname.h>

/* Mirror of the kernel's struct vdso_data (5.9 layout). Only the first
   fields (seq .. shift) are relied upon. */
typedef struct vdso_data_t {
    uint32_t seq;
    int32_t clock_mode;
    uint64_t cycle_last;
    uint64_t mask;
    uint32_t mult;
    uint32_t shift;
    struct {
        uint64_t sec;
        uint64_t nsec;
    } basetime[ 12 ];
    /*
    union {
        struct vdso_timestamp basetime[VDSO_BASES];
        struct timens_offset offset[VDSO_BASES];
    };
    */
    int32_t tz_minuteswest;
    int32_t tz_dsttime;
    uint32_t hrtimer_res;
    uint32_t __unused;
    /* Empty struct on x86
    struct arch_vdso_data arch_data;
    */
} vdso_data_t;

typedef unsigned __int128 uint128_t;

/* Parses "major.minor" out of the kernel release string (uname -r).
   Falls back to 5.9 if the string can't be read. */
void get_kernel_version( uint32_t* major_out, uint32_t* minor_out ) {
    *major_out = 5;
    *minor_out = 9;
    struct utsname info;
    if ( uname( &info ) >= 0 ) {
        uint32_t major = 0;
        uint32_t minor = 0;
        char* version = info.release;
        while ( *version && *version != '.' ) {
            char c = *version;
            if ( c >= '0' && c <= '9' ) {
                major *= 10;
                major += ( c - '0' );
            }
            version++;
        }
        /* Skip the '.' separator, but never step past the terminator. */
        if ( *version == '.' ) {
            version++;
        }
        while ( *version && *version != '.' ) {
            char c = *version;
            if ( c >= '0' && c <= '9' ) {
                minor *= 10;
                minor += ( c - '0' );
            }
            version++;
        }
        if ( major ) {
            *major_out = major;
            *minor_out = minor;
        }
    }
}

/* Offset of vdso_data relative to the vdso image, which depends on the
   kernel version (see vdso-layout.lds.S). */
intptr_t get_vdso_data_offset( void ) {
    uint32_t major, minor;
    get_kernel_version( &major, &minor );
    intptr_t page_size = 1 << 12;
    intptr_t offset = -4 * page_size;
    if ( major == 5 ) {
        if ( minor > 5 ) {
            offset = -4 * page_size;
        } else {
            offset = -3 * page_size;
        }
    } else if ( major == 4 ) {
        if ( minor > 11 ) {
            offset = -3 * page_size;
        } else if ( minor > 6 ) {
            offset = -2 * page_size;
        } else if ( minor > 4 ) {
            offset = -3 * page_size;
        } else if ( minor > 1 ) {
            offset = -2 * page_size;
        } else {
            assert( !"Unsupported" );
        }
    }
    /* Offset from the DECLARE_VVAR macro. */
    offset += 128;
    return offset;
}

int main( void ) {
    intptr_t vdso_data_offset = get_vdso_data_offset( );
    uint8_t* vdso = ( uint8_t* ) getauxval( AT_SYSINFO_EHDR );
    vdso_data_t* data = ( vdso_data_t* ) ( vdso + vdso_data_offset );
    uint32_t mul = data[ 0 ].mult;
    uint32_t shift = data[ 0 ].shift;

    /* Assumes the 1 ns resolution that clock_getres reported in my tests. */
    struct timespec res;
    clock_getres( CLOCK_MONOTONIC, &res );
    double frequency = res.tv_nsec * 1000000000;

    uint32_t x = 0;
    uint64_t rdtsc = __builtin_ia32_rdtscp( &x );
    struct timespec spec;
    clock_gettime( CLOCK_MONOTONIC, &spec );

    /* Integer conversion with a 128-bit intermediate product. */
    uint128_t temp = ( uint128_t ) rdtsc * mul;
    temp >>= shift;
    assert( ( temp & 0xffffffffffffffff ) == temp );
    uint64_t result = ( uint64_t ) ( temp );
    double r1 = ( double ) result / frequency;
    printf( "integer: %5.9f\n", r1 );

    /* Same conversion using doubles. */
    double divisor = 1;
    for ( uint32_t i = 0; i < shift; i++ ) {
        divisor *= 2;
    }
    double r1b = ( ( double ) rdtsc * ( double ) mul ) / divisor;
    r1b /= frequency;
    printf( "double : %5.9f\n", r1b );

    double r2 = ( double ) spec.tv_sec + ( ( double ) spec.tv_nsec / frequency );
    printf( "gettime: %5.9f\n", r2 );
    printf( "---\n" );
    printf( "integer diff: %5.9f\n", r1 - r2 );
    printf( "double diff: %5.9f\n", r1b - r2 );
    return 0;
}
|