Basically replace:
    _mm_sqrt_ps(a)
with:
    _mm_mul_ps(a, _mm_rsqrt_ps(a))
For me, on Haswell, this reduces the cost from 40cy/h to 38cy/h, so it is not a big improvement. I have no older CPUs to test this on, but I would expect it to make a bigger difference where sqrt is more expensive.
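One thing to check before making the swap: `_mm_rsqrt_ps` is only an approximation (Intel documents a relative error of at most 1.5 * 2^-12), so `a * rsqrt(a)` is not bit-identical to `_mm_sqrt_ps(a)`. Here is a minimal sketch for measuring the difference on your own inputs. The helper names `sqrt_exact`, `sqrt_approx` and `max_rel_error4` are made up for this example, and it assumes an x86 target with SSE and strictly positive inputs.

```c
#include <math.h>
#include <xmmintrin.h>  /* SSE: _mm_sqrt_ps, _mm_rsqrt_ps, _mm_mul_ps */

/* Full-precision packed square root (the original code). */
static inline __m128 sqrt_exact(__m128 a) {
    return _mm_sqrt_ps(a);
}

/* Approximate packed square root (the replacement):
   a * (1 / sqrt(a)) == sqrt(a), with 1 / sqrt(a) taken from the
   fast reciprocal-square-root approximation. */
static inline __m128 sqrt_approx(__m128 a) {
    return _mm_mul_ps(a, _mm_rsqrt_ps(a));
}

/* Hypothetical helper: largest relative error of sqrt_approx versus
   sqrt_exact over four lanes. Assumes every input is > 0. */
static float max_rel_error4(const float in[4]) {
    float exact[4], approx[4];
    __m128 a = _mm_loadu_ps(in);

    _mm_storeu_ps(exact, sqrt_exact(a));
    _mm_storeu_ps(approx, sqrt_approx(a));

    float worst = 0.0f;
    for (int i = 0; i < 4; i++) {
        float err = fabsf(approx[i] - exact[i]) / exact[i];
        if (err > worst) {
            worst = err;
        }
    }
    return worst;
}
```

If the reported error is acceptable for your use case, the replacement should be safe. Note also that for a lane equal to 0 the approximation computes 0 * inf, which gives NaN rather than 0, so inputs that can be zero need separate handling.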