I was poking around the HMH code and was curious about the performance of using an SSE intrinsic for square root versus the standard library's sqrt (the "STL version" below). I did a little test here:
```cpp
#include "xmmintrin.h"
#include <cmath>
#include <cstdio>
#include <chrono>
#include <iostream>

// Scalar square root via the SSE sqrtss intrinsic.
inline float squareRoot(const float val)
{
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(val)));
}

// Square root via the standard library.
inline float squareRootSTL(const float val)
{
    return sqrt(val);
}

int main()
{
    unsigned int counter = 100000;
    float testVal, testVal2;

    // Time the standard library version.
    auto begin = std::chrono::high_resolution_clock::now();
    for (auto i = 0; i < counter; ++i)
    {
        testVal2 = 49.48491 + i;
        testVal2 = squareRootSTL(testVal2);
    }
    auto end1 = std::chrono::high_resolution_clock::now();

    // Time the intrinsics version.
    auto start2 = std::chrono::high_resolution_clock::now();
    for (auto i = 0; i < counter; ++i)
    {
        testVal = 49.48491 + i;
        testVal = squareRoot(testVal);
    }
    auto end2 = std::chrono::high_resolution_clock::now();

    // Print each result next to its own timing (testVal2 is the STL result, testVal the intrinsics result).
    std::cout << testVal2 << ": " << std::chrono::duration_cast<std::chrono::nanoseconds>(end1 - begin).count() << "STL version" << std::endl;
    std::cout << testVal << " : " << std::chrono::duration_cast<std::chrono::nanoseconds>(end2 - start2).count() << "intrinsics version" << std::endl;
    return 0;
}
```
The results on my machine with /Ox and /fp:precise are (times in nanoseconds):
316.304: 34334STL version
316.304 : 186730intrinsics version
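A caveat on these numbers: each loop only keeps its last result, so the optimizer may not be treating the two loops the same way, and that could account for some of the gap. Below is a minimal sketch of a variant that sums every result so neither loop can be thrown away as dead code. The names `timeSqrt`, `BenchResult`, `squareRootSSE` and `squareRootStd` are made up for illustration and are not from the HMH code. It does not stop the compiler from vectorizing either loop.

```cpp
#include <chrono>
#include <cmath>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAS_SSE_SQRT 1
#else
#define HAS_SSE_SQRT 0
#endif

// Scalar SSE square root, same expression as squareRoot in the test above.
// Falls back to std::sqrt where SSE is unavailable so the sketch still compiles.
inline float squareRootSSE(float val)
{
#if HAS_SSE_SQRT
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(val)));
#else
    return std::sqrt(val);
#endif
}

// Standard library square root (float overload).
inline float squareRootStd(float val)
{
    return std::sqrt(val);
}

// Hypothetical result type: the sum is returned so every root is actually used.
struct BenchResult
{
    double sum;       // accumulated square roots
    long long nanos;  // elapsed time in nanoseconds
};

// Hypothetical helper: times fn over count inputs and sums every result.
template <typename Fn>
BenchResult timeSqrt(Fn fn, unsigned int count)
{
    double sum = 0.0;
    const auto start = std::chrono::steady_clock::now();
    for (unsigned int i = 0; i < count; ++i)
    {
        // Same inputs as the original test: 49.48491 + i, narrowed to float.
        sum += fn(static_cast<float>(49.48491 + i));
    }
    const auto stop = std::chrono::steady_clock::now();
    const auto elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
    return { sum, static_cast<long long>(elapsed.count()) };
}
```

Calling `timeSqrt(squareRootStd, 100000)` and `timeSqrt(squareRootSSE, 100000)` from `main` and printing both `sum` and `nanos` for each should give a fairer comparison. The two sums should match, since both paths compute a correctly rounded single-precision square root.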
Looking at the asm generated, it also seems like the intrinsic version generates a lot more instructions than the STL version:
https://godbolt.org/g/sje3w9
Looking at the codebase and where SquareRoot is used, I'm unclear as to how it would be faster than using the std::sqrt function in <cmath>. Could someone kindly explain the reasoning behind it? I'm genuinely curious to know under what circumstances it would perform faster, given an operation on a single float.
Thanks!