Handmade Hero Day 440 - Introduction to Function Approximation with Andrew Bromage

As promised, here is a reasonably straightforward sincos implementation that will just work. It has at most 2 ULP (units in the last place) of error within the range [-39000, +39000].

I tried to remove most of the C++-isms so that this will get more use. The bit_cast really is the most portable way to do this: you need to call memcpy to be safe; unions, reinterpret_cast, etc. don't cut it. The memcpy will compile away (I've checked on every major compiler version that people usually care about). This is hacked together from my personal codebase, so if there are any problems let me know and I'll do a more thorough verification.

----------------------------------------------------------------------------------------------------

#include <cstdint>
typedef int32_t int32;  // the code below assumes a 32-bit signed int type

// Bit casting is the safe, portable way to reinterpret bits; this generates exactly what you want. (It's free.)
template <int BYTES>
inline void Copy(void* out, const void* in) {
  __builtin_memcpy(out, in, BYTES);
}

template <typename TO, typename FROM>
inline TO bit_cast(FROM x) {
  static_assert(sizeof(FROM) == sizeof(TO), "sizeof(FROM) != sizeof(TO)");
  TO y;
  Copy<sizeof(y)>(&y, &x);
  return y;
}

inline int32 Round(float x) { return int32(x + ((x < 0.0f) ? -0.5f : +0.5f)); }

inline float MultiplyAdd(float x, float y, float z) {
  return ((x * y) + z);
}

const float kPi = 3.141592653589793238462643383279502884;

void SineCosine(float x, float* sc) {
  const int32 kNotSignBit(0x7FFFFFFF);
  const int32 kRangeTiny(0x39800000);  // bit_cast<int32>(~0.000244141)

  // Small angle approximation, handles signed 0's and denormals, skip this if you don't care.
  int32 i(bit_cast<int32>(x) & kNotSignBit);  // |x| as int32
  if (i < kRangeTiny) {                       // |x| < ~0.000244141
    sc[0] = x;
    sc[1] = 1.0f;
    return;
  }

  // Extended precision modular arithmetic.
  int32 q(Round(x * (2.0f / kPi)));
  float qf(q);
  x = MultiplyAdd(qf, -(3.140625f * 0.5f), x);
  x = MultiplyAdd(qf, -(0.0009670257568359375f * 0.5f), x);
  x = MultiplyAdd(qf, -(6.2771141529083251953e-7f * 0.5f), x);
  x = MultiplyAdd(qf, -(1.2154201256553420762e-10f * 0.5f), x);

  // Note: This can be solved faster in the non-vectorized case by
  // computing both polynomials at the same time, when vectorized
  // it is faster to just evaluate them like this.

  float y(x * x);

  float s = +2.6083159809786593541503e-6f;
  s = MultiplyAdd(s, y, -1.981069071916863322258e-4f);
  s = MultiplyAdd(s, y, +8.33307858556509017944336e-3f);
  s = MultiplyAdd(s, y, -1.66666597127914428710938e-1f);
  s = MultiplyAdd(s, (y * x), x);

  float c = -2.718118424e-7f;
  c = MultiplyAdd(c, y, +2.479904470e-5f);
  c = MultiplyAdd(c, y, -1.388887875e-3f);
  c = MultiplyAdd(c, y, +4.166666418e-2f);
  c = MultiplyAdd(c, y, -0.5f);
  c = MultiplyAdd(c, y, +1.0f);

  if (q & 1) { // maybe swap sine and cosine
    float swap = c;
    c = s;
    s = swap;
  }
  if (q & 2) { // maybe negate sine
    s = -s;
  }
  if ((q + 1) & 2) { // maybe negate cosine
    c = -c;
  }

  sc[0] = s;
  sc[1] = c;
}

[ code ] bbtags please.
Fixed.
Thanks for posting this!

A few random comments:

Using __builtin_memcpy instead of unions or pointer hackery is an interesting approach. It's obviously not completely portable as written (MSVC doesn't have that), but it's certainly worth knowing about on platforms that support it. I didn't know that GCC optimised this.

Of course, this is a temporary hack anyway. For SIMD registers, on most compilers you can just do this:

__m128 x;
__m128i xi = (__m128i)x;


Except on MSVC, where you would use _mm_castps_si128 and _mm_castsi128_ps. Moving data between the floating-point and integer register files is a potential bottleneck if you're not careful, so you'd like to keep your bit-hacking code in SIMD registers as much as possible. (Fun fact: the CVTPS2PI instruction is still slightly cheaper than CVTSS2SI, one of the few operations where the "packed" version is cheaper than the "single" version.)

Also for this reason, you should try the RoundToInteger trick instead of casting, to see whether it has a measurable performance impact.

The four-stage Cody-Waite reduction here gives you 12 bits of precision to play with, which means this works up to about sin(6400) or so. That's probably overkill for HMH (HMH is never going to call sine and cosine with an angle of more than, say, 4π), but it's good to see a worked example.

An earlier version of the code mentioned Estrin's method, which I alluded to on the stream (see around the 1:09:00 mark). We do need to talk about that at some point.

I like the idea of conditionally swapping sine and cosine after calculating them both, rather than swapping the arguments before calculating. This could save some register pressure. I'll have to think a bit more about this.

There are still a lot of branches here.

Edited by Andrew Bromage on
Unless we're talking about 32-bit code, 64-bit code uses SIMD registers for float ops. So reinterpreting them as ints doesn't cost much, because I believe the compiler can do int operations on the same SIMD registers. But yeah, using float add/sub to round without leaving the floating-point type is a nice trick. I use it for floor/ceil functions in SSE2.

Also, are the branches so bad? Most of them can be SIMD'ified very easily. For example:
  if (q & 2) { // maybe negate sine
    s = -s;
  }

Can be done in 3 simd instructions:
tmp1 = q & 2
tmp2 = tmp1 << 30
s = s ^ tmp2

If shift is expensive, then 4 instructions:
tmp1 = q & 2
tmp2 = tmp1 != 0
tmp3 = tmp2 & 0x80000000
s = s ^ tmp3


To avoid branch for handling denormals you can simply "flush" value to 0. So instead of this:
  if (i < kRangeTiny) {                       // |x| < ~0.000244141
    sc[0] = x;
    sc[1] = 1.0f;
    return;
  }
do this:
  if (i < small_number) {
    x = 0.f;
  }

which can be done pretty reasonably with SSE.

Btw, if somebody was not aware, there is a nice zlib-licensed header file that implements SSE- and NEON-based sin/cos (and other functions) with very good precision: http://gruntthepeon.free.fr/ssemath/

The author claims its max error is 2^-24 on the [-8192, +8192] interval. The code is pretty short and branch-free: it does bitwise trickery to figure out which quadrant the argument is in and adjusts the input or output to get the correct result. To get good precision it evaluates two polynomials per quadrant, each more precise in a specific part of the function, and then chooses one or the other.

The code is based on the scalar implementation from the Cephes library, which is in the public domain: http://www.netlib.org/cephes/ (look for the single.tgz file).

Edited by Mārtiņš Možeiko on
For the record, I'm not saying it can't be branch-free, I'm just saying it's not (yet) branch-free.
I posted this version because it is very straightforward to SIMDify for anyone familiar with doing so. The branches at the end really aren't branches in the normal sense; they can all be handled with bit math. The denormal and signed-zero handling was just for completeness and can be completely ignored if you don't care. __builtin_memcpy was just a carryover from my codebase; plain memcpy with a constant number of bytes is sufficient, works everywhere, and will compile to the correct thing. There are definitely faster/cheaper approximations; this code isn't meant to be that sort of approximation. It is meant as a replacement for std::sin and std::cos with extremely tight error bounds. The code is written in MultiplyAdd form so that readers can see that the whole thing can be turned into fused multiply-adds when available. Hopefully that helps clarify some things.
Was the source code for this episode ever made available?

I only just watched this episode and would like to dive into the math more deeply.
It was not uploaded.