Memory arenas and alignment

I know that Intel CPUs are really permissive when it comes to memory alignment (although I think there can be a performance penalty), but I'm wondering whether ARM will be as forgiving.

For those who are unaware of the issue, on x64 the following struct will have size 12:

struct Foo {
    int32 x, y, z;
};


If this is the first object put into the arena, the next object will occupy memory that is aligned on a 4 byte but not an 8 byte boundary. If the next struct that gets pushed on requires 8 byte alignment (an x64 pointer, say), then there will be a problem. On Intel, I think there is just a performance penalty (at least until you get to SIMD stuff).
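To make the offsets concrete, here is a minimal standalone sketch of the scenario (using standard types instead of the codebase's typedefs, and plain arithmetic rather than a real arena):

#include <cstdint>
#include <cstdio>

struct Foo {
    int32_t x, y, z;    // sizeof(Foo) == 12 on x64
};

int main()
{
    uint64_t Used = 0;      // bytes pushed onto the arena so far

    Used += sizeof(Foo);    // first push: Used == 12

    // The next push would start at offset 12: 4-byte aligned, not 8-byte aligned.
    printf("next offset: %llu, 8-byte aligned: %d\n",
           (unsigned long long)Used, (int)(Used % 8 == 0));
    return 0;
}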

I guess if we compile 32 bit on ARM it will probably work out. The Pi only has 1 GB of memory anyway...

Edited by Patrick Lahey on
So when dishing out new segments, we can just align them to an 8 byte boundary:
inline void* Align8(void* address)
{
    return (void*)(((uint64)(address) + 0x7) & 0xfffffffffffffff8);
}

..or a 16 byte boundary (to make SIMD happy):
inline void* Align16(void* address)
{
    return (void*)(((uint64)(address) + 0xf) & 0xfffffffffffffff0);
}
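Both of these are the same power-of-two rounding with a different mask, so they could be collapsed into one helper. A sketch (assuming the same uint64 typedef as above):

inline void* AlignPow2(void* address, uint64 alignment)
{
    // alignment must be a power of two; e.g. alignment 16 gives mask ~0xf.
    uint64 mask = alignment - 1;
    return (void*)(((uint64)(address) + mask) & ~mask);
}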

Edited by d7samurai on
I'm not quite sure what you are suggesting. If you are suggesting that every allocation be forced to a particular byte boundary (which I thought of too), then although that works, it doesn't "feel" right. It prevents allocating a bunch of closely packed objects that are, say, 12 bytes each. If we do that, we are forced to waste 4 bytes per object, and we can no longer treat them as an array using the pointer to the first (at least not without the alignment details leaking through the abstraction, or without allocating them as an array of known size up front).

If you are suggesting that we introduce an "AlignArena" function:

inline void
AlignArena(memory_arena *Arena, int32 Alignment)
{
    Assert(IsPowerOf2(Alignment));
    // Round Used up to the next multiple of Alignment
    // (-Alignment is the same mask as ~(Alignment - 1) in two's complement).
    Arena->Used = (Arena->Used + Alignment - 1) & (-Alignment);
}


then I agree. That is the best idea I can think of (but I still don't like it very much).
I am suggesting aligning each new segment of memory we allocate like that (according to need). By segment I mean a stretch of memory dedicated to holding a bunch of objects of the same type and so on.

I don't think we will be allocating space for single, small objects much, nor that we'll be placing single, small objects of varying sizes after each other (like a 12 byte struct followed by an 8 byte pointer, for instance), at least not ones where alignment matters (or that need to be read by SIMD instructions).

What will happen then is that we go for an SoA ('structure of arrays') arrangement rather than an AoS ('array of structs') arrangement. Instead of storing, say, a hundred 12-byte Foo objects sequentially in memory, we will be storing a hundred 4-byte x values sequentially, then a hundred y values and then a hundred z values. So instead of
Foo entities[100];

we'll be doing
int32 x[100];
int32 y[100];
int32 z[100];

This way we only require a 4 byte alignment, and we can process four x-values at a time using SIMD and not have a problem (we'll just make sure the first one / the start of that segment is 16-byte aligned). Similarly with the y and z values, of course.
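As an illustration of that last point (this assumes SSE2 intrinsics and C++11's alignas, none of which comes from the thread itself): with the start of the x segment 16-byte aligned, four values go through each instruction:

#include <emmintrin.h>   // SSE2
#include <cstdint>

alignas(16) int32_t x[100];

void AddToAllX(int32_t amount)
{
    __m128i add = _mm_set1_epi32(amount);
    for (int i = 0; i < 100; i += 4)    // 100 is a multiple of 4
    {
        // _mm_load_si128 is the *aligned* load: it faults if x + i is not
        // 16-byte aligned, which is why the segment start must be aligned.
        __m128i v = _mm_load_si128((__m128i *)(x + i));
        _mm_store_si128((__m128i *)(x + i), _mm_add_epi32(v, add));
    }
}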

..or just make sure structs are 16 bytes (and start aligned):
struct Foo {
    int32 x, y, z, w;
};

Edited by d7samurai on
Ok, got it. The 12 byte struct was just an example though. All it takes is an odd number of structs whose size is divisible by 4 but not by 8 (so they can't, for example, contain a pointer).
Yes, I think I know what you meant - I just kept your Foo for the sake of simplicity :)

You were imagining a Foo like this (with no padding), right?
struct Foo {
    void* ptr;
    int32 a;
};

Edited by d7samurai on
Typically, for things you care about, you always 16-byte align them, and for things that you don't care about, well, you don't care :)

The way you do this with arenas is as suggested by the replies here - you just align the pushes that need to be aligned as necessary. It's very simple.
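For instance, something along these lines (a sketch only; the memory_index type and the Base/Size fields are assumptions about the arena struct, and Base itself is assumed to be sufficiently aligned):

inline void *
PushSizeAligned(memory_arena *Arena, memory_index Size, memory_index Alignment)
{
    Assert(IsPowerOf2(Alignment));

    // Advance Used to the next multiple of Alignment, then hand out Size bytes.
    memory_index AlignedUsed = (Arena->Used + Alignment - 1) & ~(Alignment - 1);
    Assert(AlignedUsed + Size <= Arena->Size);

    void *Result = Arena->Base + AlignedUsed;
    Arena->Used = AlignedUsed + Size;
    return Result;
}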

- Casey
I have tested jump alignment, code alignment and data alignment on my entire codebase, and found zero perceived difference in speed. I don't think it's worth it, unless a few cycles actually matter to you.

But if it does, it's time to try out other solutions to the problem, as cycle counting is mostly a worthless activity. In fact it's the only case where "premature optimization" is premature. In most other cases, "premature optimization" is a very dangerous myth, because strategic optimizations (which are the only ones that count) must be evolved! You don't just arrive at them on the first try. They follow from many iterations over the problem, or from proper "code compression" if you will; from seeing the problem differently, because you suddenly grasp it a different way.

All info posted to the net in the last two decades about cycle counting has proven worthless. It turned out rep stosb was better than aligning and using rep stosd. I am of the *opinion* that the reason is obvious: the CPU is only so fast, and once the cache is saturated, it can't go faster. Most memory-related instructions are, more or less, MEMORY bound after a few iterations, and the speed difference of various instructions then becomes utterly irrelevant. (I believe.)
Kladdehelvete
I have tested jump alignment, code alignment and data alignment on my entire codebase, and found zero perceived difference in speed. I don't think it's worth it, unless a few cycles actually matter to you.


It doesn't matter what you perceive the speed to be; you can only trust actual measured data. But data alignment in general is not just about speed. On x86 you only get slowed down by misaligned memory accesses, but on other architectures like ARM your program will crash on a misaligned access. If you want to stay portable you always have to consider memory alignment.
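One cheap way to catch such accesses during development, even on x86 where the hardware tolerates them (a sketch, reusing the Assert style from earlier in the thread):

// Fails fast in debug builds instead of crashing later on ARM.
#define AssertAligned(Pointer, Alignment) \
    Assert(((uint64)(Pointer) & ((Alignment) - 1)) == 0)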


Kladdehelvete

But if it does, it's time to try out other solutions to the problem, as cycle counting is mostly a worthless activity. In fact it's the only case where "premature optimization" is premature. In most other cases, "premature optimization" is a very dangerous myth, because strategic optimizations (which are the only ones that count) must be evolved! You don't just arrive at them on the first try. They follow from many iterations over the problem, or from proper "code compression" if you will; from seeing the problem differently, because you suddenly grasp it a different way.


I think that's not what people mean in general when they talk about premature optimization. It's just optimization before the time is right for it. A former coworker of mine once spent three days optimizing a function pretty much at the beginning of a project. He got its runtime down from 10 ms to 1 ms, which was rather impressive. But in the finished product this function was called exactly once, during startup. I think that's pretty much the definition of premature optimization. Don't optimize anything until you know (because you measured it) that a certain part of the code needs to go faster.
@d7samurai


You imagined if we perhaps had a Foo like this (with no padding), right?
struct Foo {
    void* ptr;
    int32 a;
};

Not quite. The compiler automatically pads out any struct that contains something requiring 8 byte alignment so that its size is a multiple of 8. This is done to allow arrays of these structs to be allocated "closely" packed (with the quotes because of the compiler-added padding). The concern is that allocations of "stuff" prior to allocating a struct requiring 8 byte alignment result in an Arena->Used value that is not divisible by 8.
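For instance (a standalone check; the sizes assume a typical x64 compiler):

#include <cstdio>

struct Padded {
    void *ptr;    // 8 bytes, needs 8-byte alignment
    int a;        // 4 bytes
};                // + 4 bytes of trailing padding so arrays stay aligned

int main()
{
    printf("%zu\n", sizeof(Padded));    // prints 16, not 12
    return 0;
}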

It sounds like the plan is to have something like an "AlignArena" function to deal with this.
@Kladdehelvete

On Intel there may be no real-world consequences to ignoring alignment. On other kinds of hardware, as 5sw points out, you just crash.

I had a former colleague develop a VM, working and testing exclusively on x86. As soon as he tried to run it on real hardware it crashed everywhere due to alignment requirements. Because of how it was written, a lot of code needed to be modified. Not fun.
rathersleepy
Not quite. The compiler automatically pads out any struct that contains something requiring 8 byte alignment so that its size is a multiple of 8


That is why I wrote "with no padding" (you even quoted it).

Edited by d7samurai on
Clearly I misunderstood what you were trying to say. My mistake.
5sw


It doesn't matter what you perceive the speed to be; you can only trust actual measured data.


I'd say it's the only thing that matters (for desktop software, which is what I normally write). But I agree of course, in theory. I do of course measure ALL my code, and always have. I even created a significant application just for measuring the speed of my allocator and for testing various pieces of code. All written in assembly.

But since you insist: now I have a working game, and I have on-screen measurements in milliseconds of every key piece of code. So I will enable those macros and see if there actually is a difference. Just a second...

It means actually nothing. There is absolutely no difference. The average ms is the same, the maximum and minimum ms are the same, and the percentages of frame time spent on physics, drawing and sound are the same. There are 800 "entities" on screen (10x10 pixels, drawn as two filled circles). There are about 400 bullets in total coming from those entities, flying through the "air" (lines). There is a blinking red light when a bullet hits an "entity", and lighting and sounds of fire, like it was the 4th of July or the end times. This is GDI, though. But GDI can do 280 fps on this machine, so you can still do a lot. I have lighting calculations for the background and for every single "TILE" (66x50 of them). The tiles are drawn with one FillRect and one FrameRect per tile; the background is just a single clear. All "entities" are constantly in flight, using both acceleration and deceleration from and to new positions. They signal when hit, they perform target scans, and they hit-test bullets against all other "entities" each frame, as well as against the TILES.

Aligning all routine labels and jump labels for every loop to 16 bytes, and memory too, gives ZERO difference, except the exe file gets a lot bigger. This alignment thing is just like the cycle counting thing: it is a MYTH. Or, to be 100% fair, all I can say is that to me, for my specific code, it appears to be a myth. And it has been a myth on something like 30-40 applications, so I am not just pulling this out of my ass. It very consistently does not matter.

It may mean something for testing small pieces of code, but then you are only fooling yourself. When you are using every resource for all kinds of data in a real application, it does not matter anymore. At that point, what matters is efficient cache usage (the memory bottleneck). You don't have the luxury of doing perfect measurements, nor of writing perfect code. The most you can do is a reasonable, workable compromise between your various subsystems, and you can tune and tune and tune, sometimes making big leaps that are then nullified, for reasons unknown and only guessed at, over the next week of added code. And then you do it again. And you know you only have time for relatively few "tuning" tests out of the incredible space of all possible tuning tests you could make to squeeze out more performance. In this situation, data alignment no longer matters. It is an academic "plaything" that has no real value. People who measure cycles of instructions are wasting their time, and that of others.


5sw


I think that's not what people mean in general when they talk about premature optimization. It's just optimization before the time is right for it. A former coworker of mine once spent three days optimizing a function pretty much at the beginning of a project. He got its runtime down from 10 ms to 1 ms, which was rather impressive. But in the finished product this function was called exactly once, during startup. I think that's pretty much the definition of premature optimization. Don't optimize anything until you know (because you measured it) that a certain part of the code needs to go faster.


Yes, I agree, but that's what I am trying to say too. Only strategic optimization matters much, and apart from the ones others have already found, you have to discover them by starting somewhere. But you also need to test along the way, to learn something. You test each subsystem, then you test them together, which is never the same thing. Then you add something, like lighting or shadows, and now the entire thing is skewed totally, turning you into a question mark. I mean, the first week I could do 3000 entities without any perceived slowdown, but now I can do at most 1000 in total. Each day I learn new ways to kill cycles, but it's not an exact science. Yes, in a sense it is exact, but my brain is not. There are too many variables to accurately measure all of them happening at once, and across many different situations. I now average "only" 1.8 ms per frame, and even the worst frame is still 26 ms. Those numbers are more or less confusing. What REALLY tells me something is right or wrong is how I _perceive_ playing the game: how responsive it feels to my keypresses when a lot is going on.

Edited by Livet Ersomen Strøm on
rathersleepy
@Kladdehelvete

On Intel there may be no real-world consequences to ignoring alignment. On other kinds of hardware, as 5sw points out, you just crash.

I had a former colleague develop a VM, working and testing exclusively on x86. As soon as he tried to run it on real hardware it crashed everywhere due to alignment requirements. Because of how it was written, a lot of code needed to be modified. Not fun.


Noted. But I have no need for other platforms, as I only (or mostly) write assembly code.