Ep 4: Unaligned Access

Hi Guys,

I am on Episode 3 of Handmade Hero, but I'm embarrassed to say that I have already encountered something that confused me.

https://youtu.be/hNKU8Jiza2g?t=342
Casey Says (In the Above Link): "In general on the x86 architecture often times there is a penalty for doing what's called unaligned accessing."

He goes on to explain that an unaligned access is when you operate on, say, a 32-bit value at an address that is not a multiple of 4 bytes. However, I do not see how this relates to him allocating an extra byte of padding for the bitmap's RGB. I would understand if he said he allocated the extra pad byte for something like SIMD, or to access the pixel as a single 32-bit integer as opposed to multiple chars. I would also understand if he said the OS had some faster path for DWORD-sized data, because it can access them with a single 32-bit variable.
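To make the alignment point concrete, here is a minimal sketch in C that counts how many pixels in a row start on a 4-byte boundary when each pixel takes 3 bytes versus 4. The helper names (is_aligned, pixel_offset, count_aligned_pixels) are made up for illustration and are not from the episode; the sketch also assumes the row itself starts on a 4-byte boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if `address` is a multiple of `alignment`.
   `alignment` must be a nonzero power of two. */
static int is_aligned(uintptr_t address, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (address & (alignment - 1)) == 0;
}

/* Byte offset of pixel `index` within a row of tightly packed pixels. */
static size_t pixel_offset(size_t index, size_t bytes_per_pixel)
{
    return index * bytes_per_pixel;
}

/* Counts how many of the first `count` pixels begin on a 4-byte boundary,
   assuming (hypothetically) that the row starts at a 4-byte-aligned address. */
static size_t count_aligned_pixels(size_t count, size_t bytes_per_pixel)
{
    size_t aligned = 0;
    for (size_t i = 0; i < count; ++i) {
        if (is_aligned((uintptr_t)pixel_offset(i, bytes_per_pixel), 4)) {
            ++aligned;
        }
    }
    return aligned;
}
```

With 4 bytes per pixel every pixel starts on a 4-byte boundary; with 3 bytes per pixel only every fourth pixel does (offsets 0, 12, 24, ...), so treating each pixel as one 32-bit value would mostly mean unaligned accesses.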

Is unaligned access really relevant here? I don't see how it makes sense, unless Casey's plan is to access these pixels on a 32-bit boundary. Also, can someone explain why unaligned access is slower for the CPU at all? My only thought is that the data may straddle a cache line, so the CPU needs to fetch a byte at a time and then combine everything together to make sure it has actually read all the data.

Edited by braksten on
Reading a whole pixel as a single 32-bit word and doing bit operations on it (shifts, masks, etc.) is indeed very common in image processing code - especially when using SIMD, which I believe Casey does later on in the series.
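As a rough sketch of what "one 32-bit word plus shifts and masks" looks like in C: the channel layout below (0x00RRGGBB, with the top byte as padding) is an assumption for illustration, and the function names are hypothetical rather than taken from the series.

```c
#include <stddef.h>
#include <stdint.h>

/* Packs three 8-bit channels into one 32-bit pixel.
   Assumed layout: 0x00RRGGBB, top byte unused (the pad byte). */
static uint32_t pack_pixel(uint8_t r, uint8_t g, uint8_t b)
{
    return ((uint32_t)r << 16) | ((uint32_t)g << 8) | (uint32_t)b;
}

/* Extract individual channels again with a shift and a mask. */
static uint8_t pixel_red(uint32_t p)   { return (uint8_t)((p >> 16) & 0xFF); }
static uint8_t pixel_green(uint32_t p) { return (uint8_t)((p >> 8) & 0xFF); }
static uint8_t pixel_blue(uint32_t p)  { return (uint8_t)(p & 0xFF); }

/* Fills a row with one store per pixel instead of three byte stores. */
static void fill_row(uint32_t *row, size_t width, uint32_t pixel)
{
    for (size_t x = 0; x < width; ++x) {
        row[x] = pixel;
    }
}
```

The point of the pad byte in this sketch is that each pixel is exactly one uint32_t, so a row can be indexed as a plain array of 32-bit values.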
I have no experience with those things, but you can have a look at chapter 5.7 in this Intel document: Memory optimizations.
There are at least two reasons given there for unaligned being slower (to my understanding):
- the data crossing a cache line boundary during a load or store;
- if you store some data, and then try to access it unaligned, the CPU needs to wait for the store to be complete and then read from memory. If the data is aligned, the CPU can read the data directly from the register without waiting (to my understanding the store and load can happen at the same time in the CPU pipeline).
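For the first reason, here is a small C sketch of how one could check whether an access touches two cache lines. The 64-byte line size is an assumption (common on current x86 parts, but not guaranteed), and crosses_cache_line is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size in bytes; adjust for the actual target CPU. */
#define CACHE_LINE_SIZE 64u

/* Returns 1 if an access of `size` bytes starting at `address`
   touches two different cache lines, 0 otherwise. */
static int crosses_cache_line(uintptr_t address, size_t size)
{
    assert(size > 0);
    uintptr_t first_line = address / CACHE_LINE_SIZE;
    uintptr_t last_line  = (address + size - 1) / CACHE_LINE_SIZE;
    return first_line != last_line;
}
```

A 4-byte access at a 4-byte-aligned address never returns 1 here, since 64 is a multiple of 4; an unaligned one can, for example 4 bytes starting at address 62.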

I thought that there might be a performance difference between the aligned and unaligned SSE load instructions, but that doesn't seem to be the case.