How about replacing the expensive _mm_sqrt_ps with an approximation based on the inverse square root?
Basically replace:
| _mm_sqrt_ps(a)
with:
| _mm_mul_ps(a, _mm_rsqrt_ps(a))
For me on Haswell this reduces the cost from 40cy/h to 38cy/h, so it is not a big improvement. I have no older CPUs to test this on, but I would expect it to make a bigger difference where sqrt is more expensive.
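
To make the trade-off concrete, here is a minimal sketch of the replacement together with a check of how far it drifts from the exact result. The helper names sqrt_approx_ps and max_rel_error are made up for illustration and are not part of the code under discussion; only the intrinsics are real.

```c
#include <xmmintrin.h>  /* SSE: _mm_sqrt_ps, _mm_rsqrt_ps, _mm_mul_ps */

/* Hypothetical helper: approximate sqrt(a) as a * (1 / sqrt(a)).
 * _mm_rsqrt_ps is only an estimate (roughly 12 bits of precision),
 * so the result is less accurate than _mm_sqrt_ps. */
static __m128 sqrt_approx_ps(__m128 a) {
    return _mm_mul_ps(a, _mm_rsqrt_ps(a));
}

/* Hypothetical helper: largest relative error of the approximation
 * against _mm_sqrt_ps across the four lanes. Assumes positive inputs. */
static float max_rel_error(__m128 a) {
    float exact[4], approx[4];
    _mm_storeu_ps(exact, _mm_sqrt_ps(a));
    _mm_storeu_ps(approx, sqrt_approx_ps(a));

    float worst = 0.0f;
    for (int i = 0; i < 4; i++) {
        float err = (approx[i] - exact[i]) / exact[i];
        if (err < 0.0f) err = -err;
        if (err > worst) worst = err;
    }
    return worst;
}
```

Two things to keep in mind if this is adopted: the result carries the reduced precision of the rsqrt estimate, and an input of exactly 0 produces NaN (0 * inf) instead of 0, so it only fits if zero inputs cannot occur or are handled separately. Whether that precision is enough depends on how the value is used downstream.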