Question about the intrinsics in SquareRoot

(I'm pretty green when it comes to intrinsics and assembly in general)

I was poking around the HMH code and was curious about the performance of using intrinsic functions for square root versus normal STL functions. I did a little test here:

#include "xmmintrin.h"
#include <cmath>
#include <cstdio>
#include <chrono>
#include <iostream>

inline float squareRoot(const float val)
{
	return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(val)));
}

inline float squareRootSTL(const float val)
{
	return sqrt(val);
}


int main()
{
	unsigned int counter = 100000;
    float testVal, testVal2;
	auto begin = std::chrono::high_resolution_clock::now();
	for (auto i = 0; i < counter; ++i) {
		testVal2 = 49.48491 + i;
		testVal2 = squareRootSTL(testVal2);
	}
	auto end1= std::chrono::high_resolution_clock::now();

	auto start2= std::chrono::high_resolution_clock::now();
	for (auto i = 0; i < counter; ++i) {
		testVal = 49.48491 + i;
		testVal = squareRoot(testVal);
	}
	auto end2= std::chrono::high_resolution_clock::now();
	std::cout << testVal << ": " <<std::chrono::duration_cast<std::chrono::nanoseconds>(end1 - begin).count() << "STL version" << std::endl;
	std::cout << testVal2 << " : " << std::chrono::duration_cast<std::chrono::nanoseconds>(end2 - start2).count() << "intrinsics version" << std::endl;
	return 0;
}


The results on my machine with /Ox and /fp:precise are:
316.304: 34334STL version
316.304 : 186730intrinsics version

Looking at the asm generated, it also seems like the intrinsic version generates a lot more instructions than the STL version:

https://godbolt.org/g/sje3w9

Looking at the codebase and where SquareRoot is used, I'm unclear on how it would be faster than using the std::sqrt function in <cmath>. Could someone kindly explain the reasoning behind it? I'm genuinely curious to know under what circumstances it would perform faster, given an operation on a single float.

Thanks!

Edited by Siew Yi Liang on
squareRootSTL, COMDAT PROC
        jmp      sqrtf


This is literally a jump to the actual implementation of sqrtf, which lives elsewhere in the executable. jmp does no actual math.

So the "amount of assembly code" argument is kinda invalid.


Edited by ratchetfreak on
This is not how you do microbenchmarks. The compiler noticed that you are not using the result of sqrt (except the last one), and simply skipped a lot of them.

Here's the generated assembly for sqrt loop:
        xor     ebp, ebp
        mov     edi, ebp
$LL127@main:
; Line 24
        lea     eax, DWORD PTR [rdi+9]
        movd    xmm0, eax
        cvtdq2pd xmm0, xmm0
        addsd   xmm0, xmm7
        cvtpd2ps xmm0, xmm0
; Line 25
        call    sqrtf
        add     edi, 10
        movaps  xmm8, xmm0
        cmp     edi, 100000                             ; 000186a0H
        jb      SHORT $LL127@main

As you can see, the compiler decided to increment the loop counter by 10 ("add edi, 10"), not by 1. So it actually does 10x fewer sqrt calculations than the intrinsic function (which increments its counter by 1). Why does it do only every 10th sqrt? No idea. But that's why you are seeing sqrt being "faster" than the intrinsic version.

Also, you have a lot of overhead in this benchmark: converting between doubles and floats. Just use floats only, both for the literal constant and for the sqrt function.

And sqrt is not STL. It's in the C runtime, not C++.

Edited by Mārtiņš Možeiko on
Wow, thanks for the quick replies! (And yea, I'm not very good at this sort of thing, as is apparent!)

An acquaintance of mine, Marco Giordano, put me on a better path as well regarding the benchmark:

#include "xmmintrin.h"
#include <cmath>
#include <cstdio>
#include <chrono>
#include <iostream>

inline float squareRoot(const float val)
{
	return _mm_cvtss_f32(_mm_sqrt_ps(_mm_set1_ps(val)));
}

inline float squareRootSTL(const float val)
{
	return sqrt(val);
}


int main()
{
	float accum = 0.0f;
	unsigned int counter = 1000000000;
	float testVal, testVal2;
	auto begin = std::chrono::high_resolution_clock::now();
	for (auto i = 0; i < counter; ++i) {
		testVal2 = 49.48491f + i;
		accum += squareRootSTL(testVal2);
	}
	auto end1 = std::chrono::high_resolution_clock::now();
	std::cout << "accum " <<accum<< std::endl; 

	accum = 0.0f;
	auto start2 = std::chrono::high_resolution_clock::now();
	for (auto i = 0; i < counter; ++i) {
		testVal = 49.48491f + i;
		accum += squareRoot(testVal);
	}
	auto end2 = std::chrono::high_resolution_clock::now();
	std::cout << testVal << ": " << std::chrono::duration_cast<std::chrono::nanoseconds>(end1 - begin).count() << " STL version" << std::endl;
	std::cout << testVal2 << ": " << std::chrono::duration_cast<std::chrono::nanoseconds>(end2 - start2).count() << " intrinsics version" << std::endl;
	std::cout << "accum " <<accum<< std::endl; 
	return 0;
}


Compiling now gives:
1e+09: 3643982319 STL version
1e+09: 1969619375 intrinsics version

Oddly enough, if I switch from _mm_sqrt_ps to _mm_sqrt_ss, it turns out to be almost the same as the CRT version, or sometimes slower. Why is this? Given the descriptions of the two (the ps version calculates the square root for all four elements, while the ss version calculates it for only a single element), why is the ss version the slower one?

Additionally, if the compiler could determine that the value was never used in the CRT version until the final cout statement, why couldn't it do the same for the intrinsic version as well?



Edited by Siew Yi Liang on
The ss (scalar, single precision) version needs to preserve the upper 3 lanes of the 4-lane XMM register the value lives in, so sqrtss depends on the previous contents of the destination register. The ps (packed, single precision) version overwrites all four lanes, so it doesn't need the destination cleared and depends on nothing but the input operand.
It prints out almost the same times for me for the ss and ps variants.
Interesting, I just tried it at work on Linux (a newer i7-6850K CPU, but an older GCC, 4.8.5) with both GCC and Clang (6.0.0), and the results were markedly different from what I had at home. I guess the performance of the XMM registers really depends on the hardware as well? (And of course a newer compiler helps, I imagine.)

Clang, using SS:
accum 5.49756e+11
1e+09: 1181598806 STL version
1e+09: 1058039552 intrinsics version
accum 5.49756e+11

GCC, using PS:
accum 5.49756e+11
1e+09: 3575760757 STL version
1e+09: 1679399898 intrinsics version
accum 5.49756e+11

GCC, using SS:
accum 5.49756e+11
1e+09: 3604408999 STL version
1e+09: 1875016342 intrinsics version
accum 5.49756e+11

I'll try again at home using GCC/Clang and see what the times are; that's something I didn't consider.

This is super informative, thanks guys!



Edited by Siew Yi Liang on