Siew Yi Liang
7 posts
Question about the intrinsics in SquareRoot
(I'm pretty green when it comes to intrinsics and assembly in general)

I was poking around the HMH code and was curious about the performance of using intrinsic functions for square root versus normal STL functions. I did a little test here:

    #include "xmmintrin.h"

    #include <chrono>
    #include <cmath>
    #include <iostream>

    inline float squareRoot(const float val)
    {
        return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(val)));
    }

    inline float squareRootSTL(const float val)
    {
        return sqrt(val);
    }

    int main()
    {
        unsigned int counter = 100000;
        float testVal, testVal2;

        auto begin = std::chrono::high_resolution_clock::now();
        for (auto i = 0; i < counter; ++i) {
            testVal2 = 49.48491 + i;
            testVal2 = squareRootSTL(testVal2);
        }
        auto end1 = std::chrono::high_resolution_clock::now();

        auto start2 = std::chrono::high_resolution_clock::now();
        for (auto i = 0; i < counter; ++i) {
            testVal = 49.48491 + i;
            testVal = squareRoot(testVal);
        }
        auto end2 = std::chrono::high_resolution_clock::now();

        // Times are in nanoseconds.
        std::cout << testVal << ": " << std::chrono::duration_cast<std::chrono::nanoseconds>(end1 - begin).count() << "STL version" << std::endl;
        std::cout << testVal2 << " : " << std::chrono::duration_cast<std::chrono::nanoseconds>(end2 - start2).count() << "intrinsics version" << std::endl;
        return 0;
    }

The results on my machine with /Ox and /fp:precise are:
316.304: 34334STL version
316.304 : 186730intrinsics version

Looking at the asm generated, it also seems like the intrinsic version generates a lot more instructions than the STL version:

https://godbolt.org/g/sje3w9

Looking at the codebase and where SquareRoot is used, I'm unclear as to how it would be faster than using the std::sqrt function in <cmath>. Could someone kindly explain the reasoning behind it? I'm genuinely curious to know under what circumstances it would perform faster, given an operation on a single float.

Thanks!
ratchetfreak
504 posts
    squareRootSTL, COMDAT PROC
        jmp     sqrtf

This is literally a jump to the actual implementation of sqrtf which lives elsewhere in the executable. jmp does no actual math.

So your argument about the amount of assembly code is kinda invalid.

Mārtiņš Možeiko
2383 posts / 2 projects
This is not how you do microbenchmarks. The compiler noticed that you are not using the results of sqrt (except the last one) and simply skipped a lot of them.

Here's the generated assembly for sqrt loop:
        xor     ebp, ebp
        mov     edi, ebp
    [email protected]:
    ; Line 24
        lea     eax, DWORD PTR [rdi+9]
        movd    xmm0, eax
        cvtdq2pd xmm0, xmm0
        addsd   xmm0, xmm7
        cvtpd2ps xmm0, xmm0
    ; Line 25
        call    sqrtf
        add     edi, 10
        movaps  xmm8, xmm0
        cmp     edi, 100000                     ; 000186a0H
        jb      SHORT [email protected]

As you can see, the compiler decided to increment the loop counter by 10 ("add edi, 10"), not by 1. So it actually does 10x fewer sqrt calculations than the intrinsic version (which increments the counter by 1). Why does it only do every 10th sqrt? No idea. But that's why you are seeing sqrt being "faster" than the intrinsic version.
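
A minimal sketch of a loop where every result is consumed, so the compiler has no dead sqrt calls to skip. The names LoopResult, timeSqrtLoop, sqrtCrt and sqrtIntrinsic are made up for illustration and are not from the benchmark above; steady_clock and the float running sum are assumptions as well.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstdint>
#include <xmmintrin.h>

// Hypothetical result type: elapsed time plus the running sum.
struct LoopResult {
    std::int64_t nanoseconds; // elapsed time for the whole loop
    float sum;                // returned so that no sqrt result is dead
};

// Scalar SSE square root, same intrinsic chain as in the question.
inline float sqrtIntrinsic(float val)
{
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(val)));
}

// C runtime square root, float overload (no float/double round trip).
inline float sqrtCrt(float val)
{
    return std::sqrt(val);
}

// Hypothetical helper: runs `count` square roots and adds every result
// into a sum that is handed back to the caller.
template <typename SqrtFn>
LoopResult timeSqrtLoop(SqrtFn fn, unsigned int count)
{
    assert(count > 0);
    float sum = 0.0f;
    const auto start = std::chrono::steady_clock::now();
    for (unsigned int i = 0; i < count; ++i) {
        sum += fn(49.48491f + static_cast<float>(i));
    }
    const auto stop = std::chrono::steady_clock::now();
    const auto ns =
        std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
    return LoopResult{static_cast<std::int64_t>(ns), sum};
}
```

Calling timeSqrtLoop(sqrtCrt, n) and timeSqrtLoop(sqrtIntrinsic, n) and printing both sums keeps each loop's work observable. It does not make the timing itself trustworthy; checking the generated assembly is still the way to confirm what was measured.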

Also, there is a huge overhead in this benchmark: converting between doubles and floats. Just use floats only, for the literal constant and for the sqrt function too.
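
To make the conversion point concrete, here is a small compile-time sketch of the types involved. The helper name floatOnlyInput is hypothetical; the checks rely only on standard C++ conversion rules and the std::sqrt overloads in <cmath>.

```cpp
#include <cmath>
#include <type_traits>

// 49.48491 is a double literal, so adding an int to it yields a double,
// which then has to be converted back to float on assignment.
static_assert(std::is_same<decltype(49.48491 + 1), double>::value,
              "double literal + int is double");

// With the f suffix the whole expression stays in float.
static_assert(std::is_same<decltype(49.48491f + 1), float>::value,
              "float literal + int is float");

// std::sqrt is overloaded: a float argument gives a float result,
// a double argument gives a double result.
static_assert(std::is_same<decltype(std::sqrt(1.0f)), float>::value,
              "std::sqrt(float) returns float");
static_assert(std::is_same<decltype(std::sqrt(1.0)), double>::value,
              "std::sqrt(double) returns double");

// Hypothetical helper: builds the benchmark input without touching double.
inline float floatOnlyInput(int i)
{
    return 49.48491f + static_cast<float>(i);
}
```

If this compiles, the static_asserts hold, which is the whole point of the sketch: the f suffix is what keeps cvtdq2pd/cvtpd2ps style conversions out of the loop.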

And sqrt is not STL. It's in the C runtime, not C++.
Siew Yi Liang
7 posts
Wow, thanks for the quick replies! (And yea, I'm not very good at this sort of thing, as is apparent!)

An acquaintance of mine, Marco Giordano, put me on a better path as well regarding the benchmark:

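    #include "xmmintrin.h"

    #include <chrono>
    #include <cmath>
    #include <iostream>

    inline float squareRoot(const float val)
    {
        // _mm_set_ps takes four arguments; the broadcast _mm_set1_ps is assumed here.
        return _mm_cvtss_f32(_mm_sqrt_ps(_mm_set1_ps(val)));
    }

    inline float squareRootSTL(const float val)
    {
        return sqrt(val);
    }

    int main()
    {
        float accum = 0.0f;
        unsigned int counter = 1000000000;
        float testVal, testVal2;

        auto begin = std::chrono::high_resolution_clock::now();
        for (auto i = 0; i < counter; ++i) {
            testVal2 = 49.48491f + i;
            accum += squareRootSTL(testVal2);
        }
        auto end1 = std::chrono::high_resolution_clock::now();
        std::cout << "accum " << accum << std::endl;

        // Reconstructed: this part of the listing was garbled; the second
        // loop is assumed to mirror the first.
        auto start2 = std::chrono::high_resolution_clock::now();
        for (auto i = 0; i < counter; ++i) {
            testVal = 49.48491f + i;
            accum += squareRoot(testVal);
        }
        auto end2 = std::chrono::high_resolution_clock::now();

        std::cout << testVal << ": " << std::chrono::duration_cast<std::chrono::nanoseconds>(end1 - begin).count() << " STL version" << std::endl;
        std::cout << testVal2 << ": " << std::chrono::duration_cast<std::chrono::nanoseconds>(end2 - start2).count() << " intrinsics version" << std::endl;
        std::cout << "accum " << accum << std::endl;
        return 0;
    }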

Compiling now gives:
1e+09: 3643982319 STL version
1e+09: 1969619375 intrinsics version

Oddly enough, if I switch from _mm_sqrt_ps to _mm_sqrt_ss, it turns out to be almost the same as the CRT version, or sometimes slower; why is this? Based on the description of what the ps version (calculate for all elements) does compared to the ss version (which calculates the sqrt for only a single element), why is the ss version slower?

Additionally, if the compiler could determine that the value was never used for the CRT version until the final value for the cout statement, why couldn't it do the same for the intrinsic version as well?

ratchetfreak
504 posts
The ss (scalar, single precision) version needs to preserve the upper 3 values of the 4-float vector that each xmm register holds, so its result also depends on whatever was in the destination register before. The ps (packed, single precision) version overwrites all four values, so it does not need the result register cleared and doesn't depend on anything but the input operand.
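
A small sketch of the lane behaviour described above, at the intrinsic level. The helper names lanes, sqrtScalar and sqrtPacked are made up for illustration. Note that the intrinsic _mm_sqrt_ss takes the upper lanes from its argument, while the register-level dependency depends on how the compiler allocates registers.

```cpp
#include <array>
#include <xmmintrin.h>

// Copies the four float lanes of v into an array, lane 0 first.
inline std::array<float, 4> lanes(__m128 v)
{
    std::array<float, 4> out;
    _mm_storeu_ps(out.data(), v);
    return out;
}

// Scalar form: lane 0 becomes sqrt(a[0]); lanes 1..3 are passed through
// from a unchanged.
inline std::array<float, 4> sqrtScalar(__m128 a)
{
    return lanes(_mm_sqrt_ss(a));
}

// Packed form: every lane is replaced by its own square root, so nothing
// from the old register contents survives.
inline std::array<float, 4> sqrtPacked(__m128 a)
{
    return lanes(_mm_sqrt_ps(a));
}
```

With the input _mm_set_ps(25.0f, 9.0f, 4.0f, 16.0f), which puts 16 in lane 0, sqrtScalar gives {4, 4, 9, 25} and sqrtPacked gives {4, 2, 3, 5}.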
Mārtiņš Možeiko
2383 posts / 2 projects
It prints out almost the same times for me for ss or ps variant.
Siew Yi Liang
7 posts
Interesting. I just tried it at work on Linux (which has a newer i7-6850K CPU, but an older GCC 4.8.5) using GCC and Clang (6.0.0), and the results were markedly different from what I had at home. I guess the performance of the extended registers really depends on the hardware as well? (And of course a newer compiler helps, I imagine.)

Clang, using SS:
accum 5.49756e+11
1e+09: 1181598806 STL version
1e+09: 1058039552 intrinsics version
accum 5.49756e+11

GCC, using PS:
accum 5.49756e+11
1e+09: 3575760757 STL version
1e+09: 1679399898 intrinsics version
accum 5.49756e+11

GCC, using SS:
accum 5.49756e+11
1e+09: 3604408999 STL version
1e+09: 1875016342 intrinsics version
accum 5.49756e+11

I'll try again at home using GCC/Clang and see what the times are; that's something I didn't consider.

This is super informative, thanks guys!