How about replacing the expensive _mm_sqrt_ps with an approximation based on the inverse square root?
Basically replace:

    _mm_sqrt_ps(a)

with:

    _mm_mul_ps(a, _mm_rsqrt_ps(a))

For me on Haswell this reduces 40cy/h to 38cy/h, so not a big improvement. I have no older CPUs to test this on, but I would expect it to make a bigger difference where sqrt is more expensive.
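
For reference, here is a minimal sketch of the replacement as a standalone helper. The name `approx_sqrt_ps` is hypothetical and not part of the code under discussion. Two caveats are worth keeping in mind: `_mm_rsqrt_ps` is only accurate to about 12 bits (relative error at most 1.5 * 2^-12), and for a zero input it returns +inf, so `0 * inf` yields NaN unless those lanes are masked out. Whether the reduced precision is acceptable here is an assumption that would need checking against the actual use.

```c
#include <immintrin.h>

/* Hypothetical helper: approximate sqrt(a) as a * rsqrt(a).
 * Assumes all lanes of `a` are non-negative and finite. */
static inline __m128 approx_sqrt_ps(__m128 a)
{
    /* rsqrt(a) is about 1/sqrt(a), so a * rsqrt(a) is about sqrt(a). */
    __m128 r = _mm_mul_ps(a, _mm_rsqrt_ps(a));

    /* For a == 0, rsqrt gives +inf and 0 * inf is NaN.
     * Build an all-ones mask for non-zero lanes and AND it in,
     * which forces the zero lanes back to 0.0f. */
    __m128 nonzero = _mm_cmpneq_ps(a, _mm_setzero_ps());
    return _mm_and_ps(r, nonzero);
}
```

The extra compare and AND cost a little, so if the inputs are known to be strictly positive the mask can be dropped and the single multiply from above is enough.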