Livet Ersomen Strøm
163 posts
Day 93
Hi. Thanks for an excellent show. I've watched every one.

But only today did I get around to implementing the DrawBitmapSlowly function from day 93, which I greatly enjoyed seeing implemented in SSE2 in the later episodes.

Now I have a question. I found that if the alpha byte is zero (in the source bitmap), I can drop the pixel entirely and still see the same image stretched and rotated. It seems like this can remove a lot of lerps and pixel ops, especially when scaling.

But is this a "valid" optimization?

(Please note I have not implemented the 3 lerps yet (subpixel), nor the SSE2 changes; this is from halfway into day 093.)
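
To show what I mean, here's a minimal sketch of the idea (my own illustrative code, not Casey's, assuming 32-bit ARGB pixels with alpha in the top byte):

    #include <stdint.h>

    // Sketch: drop the pixel entirely when the source texel is fully
    // transparent, skipping all the lerp/blend work for it.
    static void BlendSpan(uint32_t *Dest, const uint32_t *Source, int Count)
    {
        for (int I = 0; I < Count; ++I)
        {
            uint32_t Texel = Source[I];
            uint32_t A = Texel >> 24;
            if (A == 0)
            {
                continue; // destination stays as-is, no work done
            }

            // Plain (non-premultiplied) alpha blend, channel by channel:
            uint32_t D = Dest[I];
            uint32_t R = (((Texel >> 16) & 0xFF) * A + ((D >> 16) & 0xFF) * (255 - A)) / 255;
            uint32_t G = (((Texel >>  8) & 0xFF) * A + ((D >>  8) & 0xFF) * (255 - A)) / 255;
            uint32_t B = (((Texel      ) & 0xFF) * A + ((D      ) & 0xFF) * (255 - A)) / 255;
            Dest[I] = (D & 0xFF000000) | (R << 16) | (G << 8) | B;
        }
    }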
Mārtiņš Možeiko
2562 posts / 2 projects
Day 93
In older times, on older computers, yes - that was pretty much always a useful optimization.

For modern computers it depends. Sometimes it is useful, sometimes it doesn't matter, and sometimes it is actually harmful. Modern CPUs have long pipelines, so they decode and execute instructions ahead of time. But if a branch is predicted incorrectly, they need to stop, undo all the speculatively executed instructions, and only then continue. And that is harmful. It is often better to simply execute all the code and mask the result at the end, before writing it back. This way the CPU can execute everything ahead of time without worrying about whether to branch or not.
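
As a tiny scalar illustration of that "mask at the end" idea (just the shape of it, not actual HH code):

    #include <stdint.h>

    // Branchy: the CPU has to guess whether the early-out is taken.
    static uint32_t SelectBranchy(uint32_t Blended, uint32_t Dest, uint32_t Alpha)
    {
        return (Alpha == 0) ? Dest : Blended;
    }

    // Branchless: always do the work, then pick the result with a mask.
    // (Alpha == 0) is 0 or 1; negating it gives all-zero or all-one bits,
    // so there is nothing for the branch predictor to guess.
    static uint32_t SelectMasked(uint32_t Blended, uint32_t Dest, uint32_t Alpha)
    {
        uint32_t Mask = (uint32_t)-(int32_t)(Alpha == 0);
        return (Dest & Mask) | (Blended & ~Mask);
    }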

Here's a very good explanation with examples: http://stackoverflow.com/a/11227902/675078 - I know, StackOverflow, and it answers a Java question, but the explanation is very good. It explains exactly why putting in the kind of "if" statement you are suggesting for the alpha value is harmful (look at the code in the question text).

And here's a really good presentation from GDC 2015 that also talks about this:
https://www.youtube.com/watch?v=Nsf2_Au6KxU
"Performance Optimization, SIMD and Cache" by Sergiy Migdalskiy
Livet Ersomen Strøm
163 posts
Day 93
mmozeiko

For modern computers it depends. Sometimes it is useful, sometimes it doesn't matter, and sometimes it is actually harmful.


Yes, thank you, I see this when measuring my code.

mmozeiko

And here's a really good presentation from GDC 2015 that also talks about this:
https://www.youtube.com/watch?v=Nsf2_Au6KxU
"Performance Optimization, SIMD and Cache" by Sergiy Migdalskiy


This was a particularly interesting video, well worth watching. Thanks.

How about caching the pixel when the alpha byte = 128 as well? Is this also "valid" (without graphical artifacts)?

Casey says this routine is "easy", but I find it to be a monster of a routine when you're not used to this. After all, we've spent 28 hours on just this one so far. A very useful exercise.
Mārtiņš Možeiko
2562 posts / 2 projects
Day 93
What do you mean by alpha=128? That means 0.5 transparency, meaning you need to do the same calculations as when alpha=100 or alpha=50. The only case you could handle like alpha=0 is alpha=255. But that would involve exactly the same kind of branch. So probably not useful.

This routine is easy. If you look at what calculations it does, it's not much - u/v calculation, bilinear texel fetch, squaring, bilinear interpolation, blending, and writing back - it's nothing. And if your original code fits on a screen or two, then it is an easy routine by default :) And Casey uses big fonts, so his original code (the inner loop) for this function fits on one screen.
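
For example, the bilinear interpolation step is just the three lerps you mentioned (a sketch with a single float channel; the names are illustrative):

    // Lerp(A, t, B) = (1-t)*A + t*B, the same argument order used on the show.
    static float Lerp(float A, float t, float B)
    {
        return (1.0f - t) * A + t * B;
    }

    // TexelA..TexelD are the 2x2 neighborhood; fx, fy are the fractional
    // parts of the sample position inside that 2x2 block.
    static float BilinearSample(float TexelA, float TexelB,
                                float TexelC, float TexelD,
                                float fx, float fy)
    {
        float Top    = Lerp(TexelA, fx, TexelB); // top row:    A -- B
        float Bottom = Lerp(TexelC, fx, TexelD); // bottom row: C -- D
        return Lerp(Top, fy, Bottom);            // blend the two rows in y
    }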

Wait till we get to optimizing the rectangle drawing routine with normal maps... :)
Livet Ersomen Strøm
163 posts
Day 93
mmozeiko
What do you mean by alpha=128? That means 0.5 transparency, meaning you need to do the same calculations as when alpha=100 or alpha=50. The only case you could handle like alpha=0 is alpha=255.


In the test assets there's a +0.5 alpha bias. So when alpha is 1.0 the alpha byte will be 128.

mmozeiko

But that would involve exactly the same kind of branch. So probably not useful.


Yes, but wouldn't it only suffer on mispredictions? So it could help if there is enough redundancy.

I am not sure, but it seems to me the problem is less one of misprediction and more that runtimes become less predictable. (Because the dropped and cached pixels become super fast, while my alpha calculation is still slow.) More stochastic runtimes.

I want to try to improve it fully in "scalar" before turning to SSE2, because if I can get an average of 100 cycles/pixel in scalar, then I optimistically expect ~25 in SSE2, and then another 4x from threading. At least that's my "working assumption". And by that point it becomes really interesting.

Given the way CPUs go, and how slow I am with coding, by the time my game is finished I should not need a GPU ;)

According to the video with Sergiy Migdalskiy, scalar is not always bad. All 4 pixels fit in one cache line, so this should mean that accessing 3 of them is practically "free", even in scalar? And who knows whether mixing scalar with SSE2 will also be helpful, unless it's tried?

mmozeiko

This routine is easy. If you look at what calculations it does, it's not much - u/v calculation, bilinear texel fetch, squaring, bilinear interpolation, blending, and writing back - it's nothing. And if your original code fits on a screen or two, then it is an easy routine by default :) And Casey uses big fonts, so his original code (the inner loop) for this function fits on one screen.

Wait till we get to optimizing the rectangle drawing routine with normal maps... :)


:)

This routine almost broke my ego! :((( I needed therapy to find the energy to figure out how it works. :-)
Mārtiņš Možeiko
2562 posts / 2 projects
Day 93
Kladdehelvete
I am not sure, but it seems to me the problem is less one of misprediction and more that runtimes become less predictable.


That's the same thing. If a branch is less predictable, that means there will be many cases where it is predicted incorrectly = mispredictions.

Kladdehelvete
I want to try to improve it fully in "scalar" before turning to SSE2, because if I can get an average of 100 cycles/pixel in scalar, then I optimistically expect ~25 in SSE2, and then another 4x from threading. At least that's my "working assumption". And by that point it becomes really interesting.

Given the way CPUs go, and how slow I am with coding, by the time my game is finished I should not need a GPU ;)

By introducing the GPU you will immediately get a 128x or 1024x speedup (or some other big number). Doing 4x SSE and 4x threading will give you only a 16x speedup.

Kladdehelvete
All 4 pixels fit in one cache line, so this should mean that accessing 3 of them is practically "free", even in scalar?

It's not only about memory (which is also important), but also about the calculations. Doing things 4 at a time is obviously 4x faster. Of course SSE will be faster than scalar. There is no doubt there.

Kladdehelvete
And who knows whether mixing scalar with SSE2 will also be helpful, unless it's tried?

Yes, sometimes it is helpful. But in this case (the rectangle drawing call from HH) there is no way your scalar code will beat or help the SSE code.
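
To give the flavor, here's a minimal SSE2 sketch (illustrative only, not the actual HH loop) that handles 4 pixels at a time and replaces the transparent-pixel branch with a mask:

    #include <emmintrin.h> // SSE2
    #include <stdint.h>

    static void BlendSpan4(uint32_t *Dest, const uint32_t *Source)
    {
        __m128i Src = _mm_loadu_si128((const __m128i *)Source);
        __m128i Dst = _mm_loadu_si128((const __m128i *)Dest);

        // Per-lane mask: all ones where the source alpha (top byte) is zero.
        __m128i Alpha = _mm_srli_epi32(Src, 24);
        __m128i TransparentMask = _mm_cmpeq_epi32(Alpha, _mm_setzero_si128());

        // ...the full unpack/lerp/blend of all 4 lanes would go here;
        // for this sketch, pretend Src is already the blended result...
        __m128i Blended = Src;

        // Branchless select: keep Dst where transparent, Blended elsewhere.
        __m128i Result = _mm_or_si128(_mm_and_si128(TransparentMask, Dst),
                                      _mm_andnot_si128(TransparentMask, Blended));
        _mm_storeu_si128((__m128i *)Dest, Result);
    }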
Livet Ersomen Strøm
163 posts
Day 93
mmozeiko
What do you mean by alpha=128? That means 0.5 transparency :)


Yes, you are 100% correct. I was confusing myself. Thanks for correcting me, by the way, as this detail alone got me 10 cycles per pixel.