Handmade Hero » Forums » Game » Handmade Hero Day 440 - Introduction to Function Approximation with Andrew Bromage
0
4 posts
#14816 Handmade Hero Day 440 - Introduction to Function Approximation with Andrew Bromage
3 months, 2 weeks ago Edited by 0 on April 4, 2018, 6:26 p.m.

As promised, here is a reasonably straightforward sincos implementation that will just work. The has less <= 2 ULP(units of last place) of error within the range [-39000, +39000].

I tried to remove most of the C++ isms, so that this will get more use. The bit_cast really is the most portable way to do this, you need to call memcpy to be safe, unions, reinterpret_cast, etc.. don't cut it. The memcpy will compile away (I've checked on every major version of compiler that people usually care about). This is hacked together from my personal codebase, so if there are any problems let me know and I'll do a more thorough verification.

----------------------------------------------------------------------------------------------------

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
// Bit casting is the safe portable way to reinterpret bits, this generates exactly what you want it to. (It's free)
template <int BYTES>
inline void Copy(void* out, const void* in) {
  __builtin_memcpy(out, in, BYTES);
}

template <typename TO, typename FROM>
inline TO bit_cast(FROM x) {
  static_assert(sizeof(FROM) == sizeof(TO), "sizeof(FROM) != sizeof(TO)");
  TO y;
  Copy<sizeof(y)>(&y, &x);
  return y;
}

inline int32 Round(float x) { return int32(x + ((x < 0.0f) ? -0.5f : +0.5f)); }

inline float MultiplyAdd(float x, float y, float z) {
  return ((x * y) + z);
}

const float kPi = 3.141592653589793238462643383279502884;

void SineCosine(float x, float* sc) {
  const int32 kNotSignBit(0x7FFFFFFF);
  const int32 kRangeTiny(0x39800000);  // bit_cast<int32>(~0.000244141)

  // Small angle approximation, handles signed 0's and denormals, skip this if you don't care.
  int32 i(bit_cast<int32>(x) & kNotSignBit);  // |x| as int32
  if (i < kRangeTiny) {                       // |x| < ~0.000244141
    sc[0] = x;
    sc[1] = 1.0f;
    return;
  }

  // Extended precision modular arithmetic.
  int32 q(Round(x * (2.0f / kPi)));
  float qf(q);
  x = MultiplyAdd(qf, -(3.140625f * 0.5f), x);
  x = MultiplyAdd(qf, -(0.0009670257568359375f * 0.5f), x);
  x = MultiplyAdd(qf, -(6.2771141529083251953e-7f * 0.5f), x);
  x = MultiplyAdd(qf, -(1.2154201256553420762e-10 * 0.5f), x);

  // Note: This can be solved faster in the non-vectorized case by
  // computing both polynomials at the same time, when vectorized
  // it is faster to just evaluate them like this.

  float y(x * x);

  float s = +2.6083159809786593541503e-6f;
  s = MultiplyAdd(s, y, -1.981069071916863322258e-4f);
  s = MultiplyAdd(s, y, +8.33307858556509017944336e-3f);
  s = MultiplyAdd(s, y, -1.66666597127914428710938e-1f);
  s = MultiplyAdd(s, (y * x), x);

  float c = -2.718118424e-7f;
  c = MultiplyAdd(c, y, +2.479904470e-5f);
  c = MultiplyAdd(c, y, -1.388887875e-3f);
  c = MultiplyAdd(c, y, +4.166666418e-2f);
  c = MultiplyAdd(c, y, -0.5f);
  c = MultiplyAdd(c, y, +1.0f);

  if (q & 1) { // maybe swap sine and cosine
    float swap = c;
    c = s;
    s = swap;
  }
  if (q & 2) { // maybe negate sine
    s = -s;
  }
  if ((q + 1) & 2) { // maybe negate cosine
    c = -c;
  }

  sc[0] = s;
  sc[1] = c;
}
mmozeiko
Mārtiņš Možeiko
1716 posts / 1 project
#14817 Day 440 - Handmade Hero Day 440 - Introduction to Function Approximation with Andrew Bromage
3 months, 2 weeks ago

[ code ] bbtags please.
0
4 posts
#14818 Day 440 - Handmade Hero Day 440 - Introduction to Function Approximation with Andrew Bromage
3 months, 2 weeks ago

Fixed.
d7samurai
69 posts

innovation through innovation™

#14837 Handmade Hero Day 440 - Introduction to Function Approximation with Andrew Bromage
3 months, 2 weeks ago Edited by d7samurai on April 5, 2018, 8:50 a.m.

You removed the simple version? I was curious about how that compares to the quick'n'dirty one I've been using:
1
2
3
4
5
float _sin(float x) // x in range -PI, PI
{
    float s = ((4.0f / PI) + (4.0f / (PI * PI)) * (x < 0.0f ? x : -x)) * x;
    return 0.22400815411541156f * ((s < 0.0f ? -s : s) * s - s) + s;
}

software development should be a creative exercise, not a technical ordeal
0
4 posts
#14838 Handmade Hero Day 440 - Introduction to Function Approximation with Andrew Bromage
3 months, 2 weeks ago Edited by 0 on April 5, 2018, 9:38 a.m.

Your version will work fine as an approximation and that's often all people need, but its error is way worse. The code I suggested was meant to be a replacement for more general purpose use. It will only be different in the last bit by at most 2 and the average is more like 0.01 ULPS, I removed all the template code I had before to make it easier to understand. Using Horner form for evaluating polynomials is not the fastest way, but I didn't want that to confuse what I was trying to get across.
d7samurai
69 posts

innovation through innovation™

#14839 Handmade Hero Day 440 - Introduction to Function Approximation with Andrew Bromage
3 months, 2 weeks ago Edited by d7samurai on April 7, 2018, 8:11 p.m.

There are no divides in this code (the divides you see are all compile time). EDIT: I just noticed you removed the comment on the divides, so disregard.

It compiles down to this:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
movss xmm3, DWORD PTR __real@3fa2f983
movaps xmm1, xmm0
andps xmm1, DWORD PTR __xmm@7fffffff7fffffff7fffffff7fffffff
mulss xmm1, DWORD PTR __real@3ecf817a
subss xmm3, xmm1
mulss xmm3, xmm0
movaps xmm0, xmm3
andps xmm0, DWORD PTR __xmm@7fffffff7fffffff7fffffff7fffffff
mulss xmm0, xmm3
subss xmm0, xmm3
mulss xmm0, DWORD PTR __real@3e656265
addss xmm0, xmm3

Max error is < 0.001 vs canonical sine, which is good enough for a surprisingly large range of uses (3D graphics, sound synthesis..).

Anyway, I know yours is intended to be a proper replacement for the crt sin/cos. I had just noticed a "SinePolySimple" function in your originally posted code and at the time I took that to be a standalone approximative alternative in the same vein as mine (hence curious). Looking back I guess that part was just a[n interchangeable] subcomponent of the full-on algorithm (and the other options since got pruned for clarity, as you said). It's a nice general purpose implementation..

software development should be a creative exercise, not a technical ordeal
Pseudonym
Andrew Bromage
184 posts / 1 project

Research engineer, resident maths nerd (Erdős number 3).

#14844 Handmade Hero Day 440 - Introduction to Function Approximation with Andrew Bromage
3 months, 2 weeks ago Edited by Andrew Bromage on April 6, 2018, 2:13 a.m.

Thanks for posting this!

A few random comments:

Using __builtin_memcpy instead of unions or pointer hackery is an interesting approach. It's obviously not completely portable as written (MSVC doesn't have that), but it's certainly worth knowing about on platforms that support it. I didn't know that GCC optimised this.

Of course, this is a temporary hack anyway. For SIMD registers, on most compilers you can just do this:

1
2
__m128 x;
__m128i xi = (__m128i)x;


Except on MSVC, where you would use _mm_castps_si128 and _mm_castsi128_ps. Moving data between the floating point and register files is a potential bottleneck if you're not careful, so you'd like to keep your bit hacking code in SIMD registers as much as possible. (Fun fact: the CVTPS2PI instruction is still slightly cheaper than CVTSS2SI, one of the few operations where the "packed" version is cheaper than the "single" version.)

Also for this reason, you should try the RoundToInteger trick instead of casting to see if that has a measurable performance impact.

The four-stage Cody-Waite reduction here gives you 12 bits of precision to play with, which means this works up to about sin(6400) or so. That's probably overkill for HMH (HMH is never going to call sin and cosine with an angle more than, say, 4pi), but it's good to see a worked example.

An earlier version of the code mentioned Estrin's method, which I alluded to on the stream (see around the 1:09:00 mark). We do need to talk about that at some point.

I like the idea of conditionally swapping sine and cosine after calculating them both, rather than swapping the arguments before calculating. This could save some register pressure. I'll have to think a bit more about this.

There are still a lot of branches here.

sub f{($f)[email protected]_;print"$f(q{$f});";}f(q{sub f{($f)[email protected]_;print"$f(q{$f});";}f});
mmozeiko
Mārtiņš Možeiko
1716 posts / 1 project
#14845 Handmade Hero Day 440 - Introduction to Function Approximation with Andrew Bromage
3 months, 2 weeks ago Edited by Mārtiņš Možeiko on April 6, 2018, 6:55 a.m.

Unless we talk about 32-bit code, the 64-bit uses simd registers for float ops. So it does not cost much reinterpreting them as ints, because I believe compiler can do int operations on same simd registers. But yeah, using float add/sub to round it without leaving floating point type is a nice trick. I use it for floor/ceil functions in SSE2.

Also are the branches so bad? Most of them can be simd'ified very easily. For example:
1
2
3
  if (q & 2) { // maybe negate sine
    s = -s;
  }

Can be done in 3 simd instructions:
1
2
3
tmp1 = q & 2
tmp2 = tmp1 << 30
s = s ^ tmp2

If shift is expensive, then 4 instructions:
1
2
3
4
tmp1 = q & 2
tmp2 = tmp1 != 0
tmp3 = tmp2 & 0x80000000
s = s ^ tmp3


To avoid branch for handling denormals you can simply "flush" value to 0. So instead of this:
1
2
3
4
5
  if (i < kRangeTiny) {                       // |x| < ~0.000244141
    sc[0] = x;
    sc[1] = 1.0f;
    return;
  }
do this:
1
2
3
  if (i < small_number) {
    x = 0.f;
  }

which can be done pretty reasonable with SSE.

Btw, if somebody was not aware there is a nice "zlib" licensed header file that implements SSE and Neon based sin/cos (and other functions) using very good precision: http://gruntthepeon.free.fr/ssemath/

Author claims it is max error is 2^-24 in [-8192,+8192] interval. The code is pretty short and branch free - it does "bitwise trickery" to figure out which quadrant the argument value is and adjusts input or output to get correct result. To get good precision it evaluates two polynomials for quadrant, each is more precise in specific part of function and then chooses one or another.

The code is based on scalar implementation of cephes library that is in public domain: http://www.netlib.org/cephes/ (look for single.tgz file).
Pseudonym
Andrew Bromage
184 posts / 1 project

Research engineer, resident maths nerd (Erdős number 3).

#14847 Handmade Hero Day 440 - Introduction to Function Approximation with Andrew Bromage
3 months, 1 week ago

For the record, I'm not saying it can't be branch-free, I'm just saying it's not (yet) branch-free.

sub f{($f)[email protected]_;print"$f(q{$f});";}f(q{sub f{($f)[email protected]_;print"$f(q{$f});";}f});
0
4 posts
#14852 Handmade Hero Day 440 - Introduction to Function Approximation with Andrew Bromage
3 months, 1 week ago

I posted this version because it is very straightforward to simdify for anyone familiar in doing so. The branches at the end, really aren't branches in the normal sense, they can all be handled with bit math. The denormal and signed zero handling was just for completeness, but can be completely ignored if you don't care. __builtinmemcpy was just a carryover from my codebase, memcpy with a constant number of bytes is sufficient and works everywhere and will compile to the correct thing. There are definitely faster/cheaper approximations, this code isn't meant to be that sort of approximation, it is meant as a replacement for std::sin and std::cos with extremely tight error bounds. The code is written in MultiplyAdd form so that readers can see that the whole thing can be turned into FusedMultiplyAdds when available. Hopefully that helps clarify some things.