_mm_sqrt_ps optimization for Day 119

How about replacing the expensive _mm_sqrt_ps with an approximation based on the inverse square root?

Basically replace:
_mm_sqrt_ps(a)


with:
_mm_mul_ps(a, _mm_rsqrt_ps(a))


For me, on Haswell, this reduces 40cy/h to 38cy/h, so it is not a big improvement. I have no older CPUs to test this on, but I would expect it to make a bigger difference where sqrt is more expensive.
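One caveat with this replacement: _mm_rsqrt_ps is only an approximation (Intel documents a maximum relative error of 1.5 * 2^-12), and for a == 0 it returns +inf, so a * rsqrt(a) comes out as NaN instead of 0. Below is a minimal sketch of one way to handle the zero case. The helper name approx_sqrt_ps is hypothetical, not something from the Handmade Hero code, and whether zero inputs can actually occur at that call site is an assumption to check.

```c
#include <xmmintrin.h>  /* SSE: _mm_rsqrt_ps, _mm_mul_ps, _mm_cmpneq_ps, ... */

/* Hypothetical helper: approximates sqrt(a) per lane as a * rsqrt(a).
   Accuracy is limited by _mm_rsqrt_ps (roughly 11-12 bits). */
static inline __m128 approx_sqrt_ps(__m128 a)
{
    /* a * (1 / sqrt(a)) == sqrt(a) for a > 0 */
    __m128 r = _mm_mul_ps(a, _mm_rsqrt_ps(a));

    /* For a == 0, rsqrt gives +inf and 0 * inf is NaN.
       Build a mask that is all ones where a != 0 and all zeros where a == 0,
       then AND it in so the zero lanes become 0.0f. */
    __m128 nonzero = _mm_cmpneq_ps(a, _mm_setzero_ps());
    return _mm_and_ps(r, nonzero);
}
```

The extra compare and AND cost a couple of instructions, so they may eat part of the small gain measured above. If the input is known to be strictly positive, the plain _mm_mul_ps(a, _mm_rsqrt_ps(a)) is enough.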

Edited by Mārtiņš Možeiko on
Fabian Giesen posted this Twitter thread, which could also be helpful:

https://twitter.com/rygorous/status/598795742145224704