Handmade Hero»Forums»Code
65 posts
Cache Question
Edited by Shastic on Reason: Initial post
I understand the cache holds a small amount of data the CPU can access quickly, and that there is a series of slower but bigger caches going back out to RAM. When I look at the wiki entry for Structure of Arrays they have this:

struct pointlist3D {
    float x[N];
    float y[N];
    float z[N];
};

Based on that struct, I think the data is laid out as all the x values first, then all the y values, then all the z values.

If I understand it correctly, a function that needs to add x[i], y[i], and z[i] together needs all three values inside the fast cache for the operation to be fast. But if N is large, the first x will be far away from the first y and z, so they can't all be in the same cache at the same time. Or does the CPU have more than one super-fast cache, and it puts the xs, ys, and zs in separate ones so they can all be accessed quickly?

This isn't a question about data-oriented design, it's about the cache. Is there more than one fast cache, or is there just one and the CPU is doing something I am not seeing?
Mārtiņš Možeiko
2404 posts / 2 projects
Cache Question
Edited by Mārtiņš Možeiko on
If you're accessing only one x, y, and z from each array to do something, then yeah - this way will be slower.
But if you're processing many of them, then it does not matter: the CPU will need to load the same amount of memory into the cache regardless of whether you store it the xyzxyz... way or the xxx...yyy...zzz... way.

The L1 cache is typically 32KB of data (on newer CPUs it is 64KB or more). So worrying about how three floats will fit there does not make much sense - they will all fit just fine. The CPU copies data from main memory into the L1 cache at cache-line granularity, which is 64 or 128 bytes. That's why working on many elements is more important than working on one. If your cache line is 64 bytes and you're loading floats from an array, then 16 of them fit into one cache line - or in other words, operating on 16 consecutive elements from each array will cost you the same "memory impact" as operating on 1 element from each.
Simon Anciaux
1240 posts
Cache Question
I think the question was not about adding a single xyz, but whether, if you use a struct of arrays and there are enough x values, the first y and z could be out of the cache.

I'm not an expert and hopefully mmozeiko will correct me if I'm wrong:
As mmozeiko said, the cache is organized in cache lines, which are 64 bytes (I didn't know until today that they could be 128), so you can fit 16 floats in one cache line. When you try to access an x value, the processor will check whether it's in an already-loaded cache line, and if it's not, it will load that cache line. I believe cache lines are 64-byte aligned, so if the x is not at the start of a cache line, part of the line will be "wasted" (you won't use the data it contains). It will do the same for the y and z values in other cache lines. Which basically means it will probably load a part of the x values in one cache line, a part of the y values in another, and the same for z.

In the case where you just want to add x, y, and z, it doesn't help to use a structure of arrays, as the values you need are already packed closely in memory (assuming you have an array of vec3, not vec3s scattered around in memory) and will probably make good use of the cache. It would help when using SIMD (I think).
ratchetfreak
505 posts
Cache Question
Edited by ratchetfreak on

Your understanding seems to be that the cache is a single block copy of memory; that is a wrong assumption.

Instead, the cache is a multitude of little blocks (cache lines), each holding a copy of a small chunk of main memory - but the chunks being cached don't have to be adjacent in any way.

Opoiregwfetags
42 posts
Cache Question
Edited by Opoiregwfetags on

To those who know: do cache lines adjacent to the one being used get cached too? If that were true, iterating over 1MB would take 1 cache miss; otherwise it'd take 16384 (2^20 / 2^6 = 2^14). Maybe they get cached into L2 or something?

Mārtiņš Možeiko
2404 posts / 2 projects
Cache Question
Edited by Mārtiņš Možeiko on
Replying to Opoiregwfetags (#24969)

Cache lines are prefetched in advance when the CPU detects some kind of access pattern. These patterns, and the situations in which prefetching happens, are documented in the Intel/AMD optimization guides. On Intel (I haven't checked AMD, but I would expect it is the same), the hardware prefetcher does not fetch across page boundaries - so only 4KB in advance.

See https://software.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-optimization-reference-manual.html page 132 "3.7.3 Hardware Prefetching for Second-Level Cache"