Thanks for posting this!
A few random comments:
Using __builtin_memcpy instead of unions or pointer hackery is an interesting approach. It's obviously not completely portable as written (MSVC doesn't have that), but it's certainly worth knowing about on platforms that support it. I didn't know that GCC optimised this.
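For anyone following along, here is a minimal sketch of the same idea using plain memcpy from <string.h>, which should be portable to MSVC as well. The function names are made up for illustration, and whether your compiler turns the fixed-size copy into a single register move is worth checking in the disassembly rather than taking on faith.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reinterpret the bits of a float as a 32-bit integer. */
static uint32_t float_bits(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u); /* fixed-size copy, no aliasing violation */
    return u;
}

/* Hypothetical helper: the inverse, integer bits back to a float. */
static float bits_float(uint32_t u)
{
    float x;
    memcpy(&x, &u, sizeof x);
    return x;
}
```

For example, float_bits(1.0f) gives 0x3F800000 on any platform with IEEE 754 single-precision floats.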
Of course, the __builtin_memcpy approach is a temporary hack anyway. For SIMD registers, on most compilers you can just do this:
__m128i xi = (__m128i)x;
The exception is MSVC, where you would use _mm_castps_si128 and _mm_castsi128_ps (these intrinsics also work on GCC and Clang). Moving data between the floating point and integer register files is a potential bottleneck if you're not careful, so you'd like to keep your bit hacking code in SIMD registers as much as possible. (Fun fact: the CVTPS2PI instruction is still slightly cheaper than CVTSS2SI, one of the few operations where the "packed" version is cheaper than the "single" version.)
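As a rough sketch of what "keeping the bit hacking in SIMD registers" looks like with the cast intrinsics, here is an absolute value that clears the sign bits of four floats. It assumes an x86 target with SSE2; abs_ps is a made-up name, not something from the posted code.

```c
#include <emmintrin.h> /* SSE2 intrinsics */

/* Hypothetical helper: clear the sign bit of four floats at once.
   The casts generate no instructions; they only change the type. */
static __m128 abs_ps(__m128 x)
{
    __m128i xi   = _mm_castps_si128(x);
    __m128i mask = _mm_set1_epi32(0x7FFFFFFF);
    return _mm_castsi128_ps(_mm_and_si128(xi, mask));
}
```

The value never leaves the XMM registers, so there is no round trip through the general-purpose registers or memory.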
Also for this reason, you should try the RoundToInteger trick instead of casting to see if that has a measurable performance impact.
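To be concrete about what I mean, here is a sketch of one common form of the trick: add a "magic" constant of 1.5 * 2^23 so the FPU's own rounding snaps the value to an integer, then read the integer straight out of the low mantissa bits. I'm assuming SSE2, the default round-to-nearest mode, |x| < 2^22, and no fast-math style reassociation; the function name is made up.

```c
#include <emmintrin.h> /* SSE2 intrinsics */

/* Hypothetical helper: round four floats to the nearest integer (ties to
   even) and also return the integers, without any CVT instruction.
   Only valid for |x| < 2^22 in the default rounding mode. */
static __m128 round_to_integer_ps(__m128 x, __m128i *as_int)
{
    const __m128 magic = _mm_set1_ps(12582912.0f); /* 1.5 * 2^23 */
    __m128 biased = _mm_add_ps(x, magic);          /* ulp is now 1.0 */

    /* The bit patterns of biased and magic differ by exactly the integer. */
    *as_int = _mm_sub_epi32(_mm_castps_si128(biased),
                            _mm_castps_si128(magic));
    return _mm_sub_ps(biased, magic);              /* rounded value as float */
}
```

Whether this actually beats a conversion instruction is exactly the thing to measure.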
The four-stage Cody-Waite reduction here gives you 12 bits of precision to play with, which means this works up to about sin(6400) or so. That's probably overkill for HMH (HMH is never going to call sine and cosine with an angle more than, say, 4pi), but it's good to see a worked example.
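For anyone who hasn't seen Cody-Waite before, here is a smaller sketch of the idea. Note that it is a three-stage reduction by pi/4, not the four-stage one from the post, and the constants are the ones I remember from the Cephes single-precision sine, so verify them before relying on this. The point is that each piece of pi/4 has few enough mantissa bits that multiplying by a modest k is exact. It assumes x >= 0 and plain single-precision arithmetic.

```c
/* Hypothetical helper: reduce x >= 0 to r = x - k*(pi/4), with k the
   nearest integer. pi/4 is split into three pieces; the leading pieces
   have short mantissas, so k*DP1 and k*DP2 are exact for moderate k. */
static float reduce_pi_over_4(float x, int *k_out)
{
    const float FOPI = 1.27323954473516f;         /* 4/pi */
    const float DP1  = 0.78515625f;               /* 8 significant bits */
    const float DP2  = 2.4187564849853515625e-4f; /* 11 significant bits */
    const float DP3  = 3.77489497744594108e-8f;   /* the remainder */

    int   k  = (int)(x * FOPI + 0.5f);
    float kf = (float)k;

    /* Subtract the largest piece first so the cancellation is exact. */
    float r = ((x - kf * DP1) - kf * DP2) - kf * DP3;

    *k_out = k;
    return r;
}
```

More stages, or pieces with shorter mantissas, buy a larger range of k at the cost of more multiplies and subtracts.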
An earlier version of the code mentioned Estrin's method, which I alluded to on the stream (see around the 1:09:00 mark). We do need to talk about that at some point.
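Until we get to it properly, here is a quick sketch of Estrin's scheme next to Horner's rule for a degree-7 polynomial. The coefficient array is just a placeholder, not the sine coefficients from the post. Horner is one long dependency chain, while Estrin evaluates independent pairs that the CPU can overlap, at the cost of a couple of extra multiplies.

```c
/* Horner's rule: 7 multiply-adds, each waiting on the previous one. */
static float horner7(const float c[8], float x)
{
    float p = c[7];
    for (int i = 6; i >= 0; --i)
        p = p * x + c[i];
    return p;
}

/* Estrin's scheme: pair up the terms, then combine with x^2 and x^4.
   The four inner pairs have no dependencies on each other. */
static float estrin7(const float c[8], float x)
{
    float x2 = x * x;
    float x4 = x2 * x2;

    float p01 = c[0] + c[1] * x;
    float p23 = c[2] + c[3] * x;
    float p45 = c[4] + c[5] * x;
    float p67 = c[6] + c[7] * x;

    float lo = p01 + p23 * x2;
    float hi = p45 + p67 * x2;
    return lo + hi * x4;
}
```

The two can differ in the last bit or so for general inputs, because the rounding happens in a different order.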
I like the idea of conditionally swapping sine and cosine after calculating them both, rather than swapping the arguments before calculating. This could save some register pressure. I'll have to think a bit more about this.
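Here is one way the swap-after-the-fact idea could look without a branch, as a sketch assuming SSE2 and that the swap is wanted in odd quadrants. The function name and the quadrant convention are my assumptions, not taken from the posted code.

```c
#include <emmintrin.h> /* SSE2 intrinsics */

/* Hypothetical helper: swap s and c in every lane whose quadrant index
   is odd, using a masked XOR swap instead of a branch. */
static void swap_sin_cos_if_odd(__m128i quadrant, __m128 *s, __m128 *c)
{
    __m128i one  = _mm_set1_epi32(1);
    __m128i odd  = _mm_cmpeq_epi32(_mm_and_si128(quadrant, one), one);
    __m128  mask = _mm_castsi128_ps(odd); /* all ones where we swap */

    /* diff holds s^c in the lanes to swap and zero elsewhere. */
    __m128 diff = _mm_and_ps(_mm_xor_ps(*s, *c), mask);
    *s = _mm_xor_ps(*s, diff);
    *c = _mm_xor_ps(*c, diff);
}
```

This kind of masked select is also the usual way to get rid of per-lane branches in general.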
There are still a lot of branches here.