Wow, thanks for the quick replies! (And yeah, I'm not very good at this sort of thing, as is apparent!)
An acquaintance of mine, Marco Giordano, put me on a better path as well regarding the benchmark:
#include <xmmintrin.h>
#include <cmath>
#include <cstdio>
#include <chrono>
#include <iostream>
// SSE version: broadcast val to all four lanes, take a packed square root,
// and return lane 0. (_mm_set_ps needs four arguments, so the broadcast
// form _mm_set_ps1 is used here.)
inline float squareRoot(const float val)
{
    return _mm_cvtss_f32(_mm_sqrt_ps(_mm_set_ps1(val)));
}

// Library version.
inline float squareRootSTL(const float val)
{
    return sqrt(val);
}

int main()
{
    float accum = 0.0f;
    unsigned int counter = 1000000000;
    float testVal, testVal2;

    // Time the library version; accum is printed so the loop cannot be discarded.
    auto begin = std::chrono::high_resolution_clock::now();
    for (auto i = 0; i < counter; ++i) {
        testVal2 = 49.48491f + i;
        accum += squareRootSTL(testVal2);
    }
    auto end1 = std::chrono::high_resolution_clock::now();
    std::cout << "accum " << accum << std::endl;

    // Time the intrinsics version.
    accum = 0.0f;
    auto start2 = std::chrono::high_resolution_clock::now();
    for (auto i = 0; i < counter; ++i) {
        testVal = 49.48491f + i;
        accum += squareRoot(testVal);
    }
    auto end2 = std::chrono::high_resolution_clock::now();

    // testVal2 belongs to the STL loop and testVal to the intrinsics loop.
    std::cout << testVal2 << ": " << std::chrono::duration_cast<std::chrono::nanoseconds>(end1 - begin).count() << " STL version" << std::endl;
    std::cout << testVal << ": " << std::chrono::duration_cast<std::chrono::nanoseconds>(end2 - start2).count() << " intrinsics version" << std::endl;
    std::cout << "accum " << accum << std::endl;
    return 0;
}
Compiling and running this now gives (times in nanoseconds):
1e+09: 3643982319 STL version
1e+09: 1969619375 intrinsics version
Oddly enough, if I switch from _mm_sqrt_ps to _mm_sqrt_ss, the timing is almost the same as the CRT version, and sometimes slower. Why is this? Going by the descriptions, the ps version calculates the square root of all four elements, while the ss version calculates it for only a single element, so why would the ss version be the slower one?
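To make the comparison concrete, here is a minimal sketch of the two variants side by side. The names squareRootPacked and squareRootScalar are just for illustration (they are not in the benchmark above), and I'm assuming _mm_set_ss is the natural way to load the single value for the ss case:

```cpp
#include <xmmintrin.h>

// Packed variant (hypothetical name): broadcasts val into all four lanes,
// takes four square roots with _mm_sqrt_ps, and returns lane 0.
inline float squareRootPacked(const float val)
{
    return _mm_cvtss_f32(_mm_sqrt_ps(_mm_set_ps1(val)));
}

// Scalar variant (hypothetical name): places val in lane 0 with the upper
// three lanes zeroed, takes the square root of lane 0 only with
// _mm_sqrt_ss, and returns lane 0.
inline float squareRootScalar(const float val)
{
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(val)));
}
```

Both should return the same value for the same input, so swapping one for the other in the second loop changes only the instruction being timed, not the result.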
Additionally, if the compiler could determine that the CRT version's result was never used until the final value printed by the cout statement, why couldn't it do the same for the intrinsics version?