Kladdehelvete
"Code is so fast that it will only be limited by how often there is data available."
You're making the invalid assumption that memory access is always the bottleneck, as if program performance were entirely independent of the scope, complexity, and efficiency of the code operating on the data (for that to be true, you'd literally need a CPU with infinite speed).
So if you're performing some minimum amount of calculation for each piece of data, the speed at which you can do those calculations obviously matters. Using SIMD (SSE, AVX, etc.), you can typically do four or more operations in parallel, which is patently faster than doing the same operations sequentially, no matter what the memory access speed is. And to take full advantage of SIMD, that data should be 16-byte aligned (32-byte for AVX): the aligned load instructions require it, and unaligned accesses have traditionally carried a penalty.
If you're not using SIMD instructions (
https://software.intel.com/sites/landingpage/IntrinsicsGuide/) where applicable, you are already writing what Casey calls "slow code".
Then there are cache lines: aligning data to the cache line size (typically 64 bytes) can mean the difference between one cache miss and two, and it directly affects how well the caches are packed with relevant data over time, which in itself has performance implications.
But assuming that the data layout and memory access patterns are properly optimized in either case: if certain memory alignments allow you to take advantage of instructions that are substantially faster than the alternative, isn't it quite clear that memory alignment has ties to performance-oriented code?
The gain from all of this is of course proportional to the amount of work that needs to be done (as well as to the performance of the rest of your code).