mmozeiko
What do you mean by alpha=128? That means 0.5 transparency, so you need to do the same calculations as when alpha=100 or alpha=50. The only other value you could special-case, like alpha=0, is alpha=255.
In the test assets there's a +0.5 alpha bias. So when alpha is 1.0 the alpha byte will be 128.
mmozeiko
But that would involve exactly the same branch. So probably not useful.
Yes, but wouldn't it only suffer on mispredictions? So it could still pay off if there is enough redundancy.
I am not sure, but it seems to me the problem is less one of misprediction and more that runtimes become less predictable (since the dropped and cached pixels become superfast, while the (my) alpha calculation is still slow). More stochastic runtimes.
I want to try to improve it fully in "scalar" before turning to SSE2, because if I can get an average of 100 cy/p in scalar, then I optimistically suspect ~25 in SSE2, and then another 4x from threading. At least that's my "working assumption". And by that time it becomes really interesting.
Given the way CPUs are going, and how slow I am at coding, by the time my game is finished I shouldn't need a GPU ;)
According to the video with Sergiy Migdalskiy, scalar is not always bad. All 4 texels fit in a cache line, so accessing 3 of them should be practically "free", even in scalar? And who knows whether mixing scalar with SSE2 will also help, until it's tried?
mmozeiko
This routine is easy. If you look at what calculations he does, it's not much: u/v calculation, bilinear texel fetch, squaring, bilinear interpolation, blending, and writing back. And if your original code fits on roughly one or two screens, then it's an easy routine by default :) And Casey uses bigger fonts, so his original code (the inner loop) for this function fits on one screen.
Wait till we get to optimizing the rectangle drawing routine with normal maps.. :)
:)
This routine almost broke my ego! :((( I needed therapy to find the energy to figure out how it works. :-)