Kladdehelvete
"Code is so fast that it will only be limited by how often there is data available."
You're making the invalid assumption that memory access is always the bottleneck, as if program performance were entirely independent of the scope, complexity, and efficiency of the code operating on the data (for that to be true, you'd literally need a CPU with infinite speed).
So if you're performing some minimum amount of calculation for each piece of data, the speed at which you can do those calculations obviously matters. Using SIMD (SSE, AVX, etc.), you can typically do four or more operations in parallel, which is patently faster than doing the same operations sequentially, no matter what the memory access speed is. And to take full advantage of SIMD, that data should be 16-byte aligned (32-byte for AVX): the aligned load instructions require it, and unaligned accesses have traditionally carried a penalty.
If you're not using SIMD instructions (
https://software.intel.com/sites/landingpage/IntrinsicsGuide/) where applicable, you are already writing what Casey calls "slow code".
Then there are cache lines: aligning data to the cache line size (typically 64 bytes) can mean the difference between one cache miss and two, and it directly affects how well the caches are packed with relevant data over time, which in itself has performance implications.
But assuming that the data layout and memory access patterns are properly optimized in either case: if certain memory alignments allow you to take advantage of instructions that are substantially faster than the alternative, isn't it quite clear that memory alignment has ties to performance-oriented code?
The gain from all of this is of course proportional to the amount of work that needs to be done (as well as to the performance of the rest of your code).