0
4 posts

#14816
Handmade Hero Day 440  Introduction to Function Approximation with Andrew Bromage 4 months, 2 weeks ago Edited by 0 on April 4, 2018, 6:26 p.m.
As promised, here is a reasonably straightforward sincos implementation that will just work. It has at most 2 ULP (units in the last place) of error within the range [-39000, +39000].
I tried to remove most of the C++-isms so that this will get more use. The memcpy-based bit_cast really is the most portable way to do this; you need to call memcpy to be safe, since unions, reinterpret_cast, etc. don't cut it. The memcpy will compile away (I've checked on every major compiler version that people usually care about). This is hacked together from my personal codebase, so if there are any problems, let me know and I'll do a more thorough verification.
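The memcpy-based bit_cast being described looks roughly like this (a minimal sketch; the function names are mine, not from the posted code):

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret the bits of a float as a uint32_t. memcpy with a constant
// size is the only strictly portable way to do this in C++; optimizing
// compilers turn it into a single register move (or nothing at all).
static inline uint32_t bits_from_float(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    return u;
}

// And the reverse direction, for rebuilding a float from bit-hacked results.
static inline float float_from_bits(uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof(f));
    return f;
}
```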

mmozeiko
Mārtiņš Možeiko
1741 posts
/ 1 project

#14817
Handmade Hero Day 440  Introduction to Function Approximation with Andrew Bromage 4 months, 2 weeks ago
[ code ] bbtags please.

0
4 posts

#14818
Handmade Hero Day 440  Introduction to Function Approximation with Andrew Bromage 4 months, 2 weeks ago
Fixed.

d7samurai
69 posts
innovation through innovation™ 
#14837
Handmade Hero Day 440  Introduction to Function Approximation with Andrew Bromage 4 months, 2 weeks ago Edited by d7samurai on April 5, 2018, 8:50 a.m.
You removed the simple version? I was curious about how that compares to the quick'n'dirty one I've been using:
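[The snippet originally posted here was not preserved. Going by the description later in the thread (no runtime divides, max error < 0.001 over a period), a typical quick'n'dirty sine of that kind looks roughly like the well-known parabola trick; this is a hedged reconstruction, not d7samurai's actual code:]

```cpp
#include <cmath>

// Parabola-based sine approximation, valid for x in [-PI, PI].
// The divides below are all compile-time constant folds.
static inline float SinApprox(float x) {
    const float B = 4.0f / 3.14159265f;                   // 4/pi
    const float C = -4.0f / (3.14159265f * 3.14159265f);  // -4/pi^2
    float y = B * x + C * x * std::fabs(x);  // parabola through the zeros of sine
    const float P = 0.225f;                  // extra-precision blend factor
    return P * (y * std::fabs(y) - y) + y;   // max error roughly 0.001
}
```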
software development should be a creative exercise, not a technical ordeal 
0
4 posts

#14838
Handmade Hero Day 440  Introduction to Function Approximation with Andrew Bromage 4 months, 2 weeks ago Edited by 0 on April 5, 2018, 9:38 a.m.
Your version will work fine as an approximation, and that's often all people need, but its error is way worse. The code I suggested is meant as a replacement for more general-purpose use: it differs in the last bit by at most 2, and the average error is more like 0.01 ULPs. I removed all the template code I had before to make it easier to understand. Horner form is not the fastest way to evaluate polynomials, but I didn't want that to confuse what I was trying to get across.
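[Horner form, for reference: one multiply and one add per coefficient, evaluated from the highest-degree term down. This is a generic sketch with placeholder coefficients, not the ones from the posted sincos code:]

```cpp
// Evaluate c[0] + c[1]*x + ... + c[n-1]*x^(n-1) in Horner form.
static inline float PolyHorner(float x, const float *c, int n) {
    float r = c[n - 1];
    for (int i = n - 2; i >= 0; --i)
        r = r * x + c[i];  // one multiply-add per coefficient
    return r;
}
```

The serial dependency between steps is why Horner form is not the fastest option (Estrin's method trades extra multiplies for parallelism), but it is the easiest to read.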

d7samurai
69 posts
innovation through innovation™ 
#14839
Handmade Hero Day 440  Introduction to Function Approximation with Andrew Bromage 4 months, 2 weeks ago Edited by d7samurai on April 7, 2018, 8:11 p.m.
There are no divides in this code (the divides you see are all compile-time). EDIT: I just noticed you removed the comment on the divides, so disregard.
It compiles down to this:
Max error is < 0.001 vs canonical sine, which is good enough for a surprisingly large range of uses (3D graphics, sound synthesis..). Anyway, I know yours is intended to be a proper replacement for the CRT sin/cos. I had just noticed a "SinePolySimple" function in your originally posted code, and at the time I took that to be a standalone approximative alternative in the same vein as mine (hence curious). Looking back, I guess that part was just a[n interchangeable] subcomponent of the full-on algorithm (and the other options since got pruned for clarity, as you said). It's a nice general-purpose implementation..
software development should be a creative exercise, not a technical ordeal
Pseudonym
Andrew Bromage
184 posts
/ 1 project
Research engineer, resident maths nerd (Erdős number 3). 
#14844
Handmade Hero Day 440  Introduction to Function Approximation with Andrew Bromage 4 months, 2 weeks ago Edited by Andrew Bromage on April 6, 2018, 2:13 a.m.
Thanks for posting this!
A few random comments:

Using __builtin_memcpy instead of unions or pointer hackery is an interesting approach. It's obviously not completely portable as written (MSVC doesn't have that), but it's certainly worth knowing about on platforms that support it. I didn't know that GCC optimised this. Of course, this is a temporary hack anyway. For SIMD registers, on most compilers you can just do this:
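[The snippet that followed here was not preserved; a sketch of both forms, with function names that are mine:]

```cpp
#include <emmintrin.h>  // SSE2

// GCC and Clang let you reinterpret between __m128 and __m128i with a
// plain cast of the vector types:
static inline __m128i AsIntGcc(__m128 x) { return (__m128i)x; }

// The portable spelling, which is also what you'd use on MSVC. These
// intrinsics compile to nothing, since both types live in XMM registers:
static inline __m128i AsIntPortable(__m128 x)  { return _mm_castps_si128(x); }
static inline __m128  AsFloat(__m128i x)       { return _mm_castsi128_ps(x); }
```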
Except on MSVC, where you would use _mm_castps_si128 and _mm_castsi128_ps. Moving data between the floating-point and integer register files is a potential bottleneck if you're not careful, so you'd like to keep your bit-hacking code in SIMD registers as much as possible. (Fun fact: the CVTPS2PI instruction is still slightly cheaper than CVTSS2SI, one of the few operations where the "packed" version is cheaper than the "single" version.) Also for this reason, you should try the RoundToInteger trick instead of casting, to see if that has a measurable performance impact.

The four-stage Cody-Waite reduction here gives you 12 bits of precision to play with, which means this works up to about sin(6400) or so. That's probably overkill for HMH (HMH is never going to call sine and cosine with an angle of more than, say, 4π), but it's good to see a worked example.

An earlier version of the code mentioned Estrin's method, which I alluded to on the stream (see around the 1:09:00 mark). We do need to talk about that at some point.

I like the idea of conditionally swapping sine and cosine after calculating them both, rather than swapping the arguments before calculating. This could save some register pressure. I'll have to think a bit more about this.

There are still a lot of branches here.

sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
mmozeiko
Mārtiņš Možeiko
1741 posts
/ 1 project

#14845
Handmade Hero Day 440  Introduction to Function Approximation with Andrew Bromage 4 months, 2 weeks ago Edited by Mārtiņš Možeiko on April 6, 2018, 6:55 a.m.
Unless we're talking about 32-bit code, 64-bit code uses SIMD registers for float ops anyway. So reinterpreting them as ints does not cost much, because I believe the compiler can do int operations on the same SIMD registers. But yeah, using float add/sub to round without leaving the floating-point type is a nice trick. I use it for floor/ceil functions in SSE2.
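[The add/sub rounding trick being referred to, in scalar form; my sketch, not the SSE2 code from mmozeiko's codebase:]

```cpp
// Adding and subtracting 1.5 * 2^23 forces a float to an integer value
// in the current rounding mode (round-to-nearest-even by default),
// without ever converting to an integer type. Only valid for inputs
// comfortably smaller than 2^22. volatile stops the compiler from
// folding the add/sub pair away at compile time.
static inline float RoundNearest(float x) {
    volatile float magic = 12582912.0f;  // 1.5 * 2^23
    volatile float r = x + magic;        // low bits of x get rounded off here
    return r - magic;
}
```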
Also, are the branches so bad? Most of them can be SIMD'ified very easily. For example:
Can be done in 3 simd instructions:
If shift is expensive, then 4 instructions:
To avoid a branch when handling denormals, you can simply "flush" the value to 0. So instead of this:
which can be done pretty reasonably with SSE.

Btw, if somebody was not aware, there is a nice zlib-licensed header file that implements SSE- and NEON-based sin/cos (and other functions) with very good precision: http://gruntthepeon.free.fr/ssemath/ The author claims its max error is 2^-24 in the [-8192, +8192] interval. The code is pretty short and branch-free: it does "bitwise trickery" to figure out which quadrant the argument value is in and adjusts the input or output to get the correct result. To get good precision it evaluates two polynomials per quadrant, each more precise in a specific part of the function, and then chooses one or the other. The code is based on the scalar implementation from the cephes library, which is in the public domain: http://www.netlib.org/cephes/ (look for the single.tgz file).
Pseudonym
Andrew Bromage
184 posts
/ 1 project
Research engineer, resident maths nerd (Erdős number 3). 
#14847
Handmade Hero Day 440  Introduction to Function Approximation with Andrew Bromage 4 months, 2 weeks ago
For the record, I'm not saying it can't be branch-free, I'm just saying it's not (yet) branch-free.
sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
0
4 posts

#14852
Handmade Hero Day 440  Introduction to Function Approximation with Andrew Bromage 4 months, 2 weeks ago
I posted this version because it is very straightforward to SIMD'ify for anyone familiar with doing so. The branches at the end really aren't branches in the normal sense; they can all be handled with bit math. The denormal and signed-zero handling was just for completeness and can be completely ignored if you don't care. __builtin_memcpy was just a carryover from my codebase; memcpy with a constant number of bytes is sufficient, works everywhere, and will compile to the correct thing. There are definitely faster/cheaper approximations; this code isn't meant to be that sort of approximation, it is meant as a replacement for std::sin and std::cos with extremely tight error bounds. The code is written in multiply-add form so that readers can see that the whole thing can be turned into fused multiply-adds when available. Hopefully that helps clarify some things.
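[The multiply-add shape being described, as a generic sketch; the function name and coefficients are placeholders, not the posted code:]

```cpp
#include <cmath>

// Every step has the shape r = r * x + c, so each line maps directly
// onto a fused multiply-add (std::fma, or the FMA3 instruction on x86)
// where the hardware supports it: one rounding per step instead of two.
static inline float PolyMulAdd(float x, float c0, float c1, float c2) {
    float r = c2;
    r = std::fma(r, x, c1);
    r = std::fma(r, x, c0);
    return r;
}
```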
