Day 127: Why 16-byte alignment of the framebuffer address?

On day 126 Casey talks about aligning the memory for the tiled renderer. As I understand it he sets up three goals:

- Tile widths should be a multiple of 4
- Framebuffer memory should have a little row right-padding
- Framebuffer address should be 16 byte aligned

The first two I understand. Tile widths that are a multiple of 4 mean that two threads won't mess with the same data. The extra memory padding is for the right-most tile in cases where the actual screen width doesn't line up with the SSE size of 16 bytes.
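To make the padding idea concrete, here is a minimal sketch of how a padded row pitch could be computed. It assumes 32-bit pixels, and the names `align_up` and `padded_pitch` are made up for this illustration; this is not the code from the stream.

```c
#include <assert.h>
#include <stddef.h>

#define BYTES_PER_PIXEL 4   /* assumption: 32-bit pixels */
#define SIMD_ALIGN      16  /* one SSE register is 16 bytes = 4 pixels */

/* Round value up to the next multiple of align (align must be a power of two). */
static size_t align_up(size_t value, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (value + align - 1) & ~(align - 1);
}

/* Bytes per framebuffer row, padded on the right so that a 4-pixel-wide
   write starting in the last group of pixels still lands inside the row. */
static size_t padded_pitch(size_t width_in_pixels)
{
    return align_up(width_in_pixels * BYTES_PER_PIXEL, SIMD_ALIGN);
}
```

With a padded pitch like this and a 16-byte-aligned base address, every row also starts on a 16-byte boundary, because the pitch itself is a multiple of 16.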

But why is it necessary to make the framebuffer *address* 16 byte aligned?
https://youtu.be/blcNbU70I9o?t=1790

I can see how the 16 byte address alignment might help regarding CPU cache lines and also it lets us use `_mm_store_si128` instead of `_mm_storeu_si128`. But I don't hear Casey mentioning these at all when discussing the need for the 16 byte alignment so I suspect there is some other reason that I just don't understand.

Thank you Casey for a great show and thank you everyone else for this great community :)
That's exactly the reason: to use aligned store/load. On older CPUs, unaligned SSE load/store is very expensive. Starting with the Nehalem microarchitecture they actually fixed this, and an unaligned store is almost as fast as an aligned store. CPUs that have the AVX instruction set can also handle unaligned load/store pretty efficiently. See Agner Fog's optimizing_assembly.pdf file, "11.4 Alignment of data" and "13.5 Accessing unaligned data".

I asked Casey about this in the Q&A of Day 117, ~1:24. His answer there was about the reasons to keep alignment.
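For reference, the difference between the two store intrinsics can be sketched like this. It is only an illustration, assuming SSE2 and 32-bit pixels; `is_aligned16` and `fill_row` are hypothetical helpers, not functions from the Handmade Hero code.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* True if the pointer sits on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15) == 0;
}

/* Fill `count` pixels (count must be a multiple of 4) with one color.
   Each store advances by 16 bytes, so alignment only needs checking once. */
static void fill_row(uint32_t *row, int count, uint32_t color)
{
    assert(count % 4 == 0);
    __m128i wide = _mm_set1_epi32((int)color);

    if (is_aligned16(row)) {
        /* Aligned store: faults if the address is not 16-byte aligned. */
        for (int x = 0; x < count; x += 4)
            _mm_store_si128((__m128i *)(row + x), wide);
    } else {
        /* Unaligned store: works for any address, slower on older CPUs. */
        for (int x = 0; x < count; x += 4)
            _mm_storeu_si128((__m128i *)(row + x), wide);
    }
}
```

If the framebuffer base and pitch are both guaranteed to be multiples of 16, and each tile starts on a 4-pixel boundary, the check and the unaligned branch can be dropped and only `_mm_store_si128` is needed.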

Edited by Mārtiņš Možeiko on
Thank you mmozeiko! :)