0
4 posts

#14816
Handmade Hero Day 440  Introduction to Function Approximation with Andrew Bromage 1 year, 1 month ago Edited by 0 on April 4, 2018, 6:26 p.m.
As promised, here is a reasonably straightforward sincos implementation that will just work. The has less <= 2 ULP(units of last place) of error within the range [39000, +39000].
I tried to remove most of the C++ isms, so that this will get more use. The bit_cast really is the most portable way to do this, you need to call memcpy to be safe, unions, reinterpret_cast, etc.. don't cut it. The memcpy will compile away (I've checked on every major version of compiler that people usually care about). This is hacked together from my personal codebase, so if there are any problems let me know and I'll do a more thorough verification. 

mmozeiko
Mārtiņš Možeiko
1923 posts
/ 1 project

#14817
Day 440  Handmade Hero Day 440  Introduction to Function Approximation with Andrew Bromage 1 year, 1 month ago
[ code ] bbtags please.

0
4 posts

#14818
Day 440  Handmade Hero Day 440  Introduction to Function Approximation with Andrew Bromage 1 year, 1 month ago
Fixed.

Pseudonym
Andrew Bromage
184 posts
/ 1 project
Research engineer, resident maths nerd (Erdős number 3). 
#14844
Handmade Hero Day 440  Introduction to Function Approximation with Andrew Bromage 1 year, 1 month ago Edited by Andrew Bromage on April 6, 2018, 2:13 a.m.
Thanks for posting this!
A few random comments: Using __builtin_memcpy instead of unions or pointer hackery is an interesting approach. It's obviously not completely portable as written (MSVC doesn't have that), but it's certainly worth knowing about on platforms that support it. I didn't know that GCC optimised this. Of course, this is a temporary hack anyway. For SIMD registers, on most compilers you can just do this:
Except on MSVC, where you would use _mm_castps_si128 and _mm_castsi128_ps. Moving data between the floating point and register files is a potential bottleneck if you're not careful, so you'd like to keep your bit hacking code in SIMD registers as much as possible. (Fun fact: the CVTPS2PI instruction is still slightly cheaper than CVTSS2SI, one of the few operations where the "packed" version is cheaper than the "single" version.) Also for this reason, you should try the RoundToInteger trick instead of casting to see if that has a measurable performance impact. The fourstage CodyWaite reduction here gives you 12 bits of precision to play with, which means this works up to about sin(6400) or so. That's probably overkill for HMH (HMH is never going to call sin and cosine with an angle more than, say, 4pi), but it's good to see a worked example. An earlier version of the code mentioned Estrin's method, which I alluded to on the stream (see around the 1:09:00 mark). We do need to talk about that at some point. I like the idea of conditionally swapping sine and cosine after calculating them both, rather than swapping the arguments before calculating. This could save some register pressure. I'll have to think a bit more about this. There are still a lot of branches here. sub f{($f)[email protected]_;print"$f(q{$f});";}f(q{sub f{($f)[email protected]_;print"$f(q{$f});";}f}); 
mmozeiko
Mārtiņš Možeiko
1923 posts
/ 1 project

#14845
Handmade Hero Day 440  Introduction to Function Approximation with Andrew Bromage 1 year, 1 month ago Edited by Mārtiņš Možeiko on April 6, 2018, 6:55 a.m.
Unless we talk about 32bit code, the 64bit uses simd registers for float ops. So it does not cost much reinterpreting them as ints, because I believe compiler can do int operations on same simd registers. But yeah, using float add/sub to round it without leaving floating point type is a nice trick. I use it for floor/ceil functions in SSE2.
Also are the branches so bad? Most of them can be simd'ified very easily. For example:
Can be done in 3 simd instructions:
If shift is expensive, then 4 instructions:
To avoid branch for handling denormals you can simply "flush" value to 0. So instead of this:
which can be done pretty reasonable with SSE. Btw, if somebody was not aware there is a nice "zlib" licensed header file that implements SSE and Neon based sin/cos (and other functions) using very good precision: http://gruntthepeon.free.fr/ssemath/ Author claims it is max error is 2^24 in [8192,+8192] interval. The code is pretty short and branch free  it does "bitwise trickery" to figure out which quadrant the argument value is and adjusts input or output to get correct result. To get good precision it evaluates two polynomials for quadrant, each is more precise in specific part of function and then chooses one or another. The code is based on scalar implementation of cephes library that is in public domain: http://www.netlib.org/cephes/ (look for single.tgz file). 
Pseudonym
Andrew Bromage
184 posts
/ 1 project
Research engineer, resident maths nerd (Erdős number 3). 
#14847
Handmade Hero Day 440  Introduction to Function Approximation with Andrew Bromage 1 year, 1 month ago
For the record, I'm not saying it can't be branchfree, I'm just saying it's not (yet) branchfree.
sub f{($f)[email protected]_;print"$f(q{$f});";}f(q{sub f{($f)[email protected]_;print"$f(q{$f});";}f}); 
0
4 posts

#14852
Handmade Hero Day 440  Introduction to Function Approximation with Andrew Bromage 1 year, 1 month ago
I posted this version because it is very straightforward to simdify for anyone familiar in doing so. The branches at the end, really aren't branches in the normal sense, they can all be handled with bit math. The denormal and signed zero handling was just for completeness, but can be completely ignored if you don't care. __builtinmemcpy was just a carryover from my codebase, memcpy with a constant number of bytes is sufficient and works everywhere and will compile to the correct thing. There are definitely faster/cheaper approximations, this code isn't meant to be that sort of approximation, it is meant as a replacement for std::sin and std::cos with extremely tight error bounds. The code is written in MultiplyAdd form so that readers can see that the whole thing can be turned into FusedMultiplyAdds when available. Hopefully that helps clarify some things.

Kapsy
16 posts
Based in Japan. Building a fully featured music production system with a twist. 
#16074
Handmade Hero Day 440  Introduction to Function Approximation with Andrew Bromage 9 months ago
Was the source code for this episode ever made available?
I only just watched this episode and would like to dive into the math more deeply. 
mmozeiko
Mārtiņš Možeiko
1923 posts
/ 1 project

#16076
Handmade Hero Day 440  Introduction to Function Approximation with Andrew Bromage 9 months ago
It was not uploaded.
