I wish I could have asked this during a pre-stream, but I can't really seem to catch the stream live.
During HMH I learned why working on data in a small locality, with coherent access patterns that help the cache, is vital for performance. However, this was mostly discussed in the context of reading memory (or so I interpreted it).
Caching and writing were discussed in conjunction with multithreading, the MESI protocol, and cache invalidation, but I think there's more going on and I'm still a bit confused about it.
The questions are: what's the real cost of writing when there's no multithreading and no fighting over cache lines going on?
What's the cost of a cache miss on write?
On one hand, the cache line has to be read before it can be written, so a cache miss on a write should be as bad as one on a read. On the other hand, the CPU shouldn't have to stall on a write, since in general the following operations don't depend on the value being written.
I don't know if I expressed that clearly, so I'll give a practical example:
This doubt has been haunting me since I started wondering about swizzling the pixels of an image (into weird patterns, sort of like GPUs do) in order to make filtering faster. That would require reading each pixel in order and writing it to the swizzled image buffer at locations that may not be easily cache-predictable. I was wondering whether it's generally worth moving data around like that, and if cache misses on writes are as bad as those on reads, it might not be worth it, maybe?
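To make the access pattern concrete, here's a rough sketch of the kind of swizzle I have in mind, using a Z-order (Morton) layout as the example (the function names and the choice of Morton order are my own; GPUs use various tilings):

```c
#include <stdint.h>

/* Hypothetical Z-order (Morton) index: interleave the bits of x and y,
 * with x's bits in the even positions and y's bits in the odd ones. */
static uint32_t morton2(uint32_t x, uint32_t y)
{
    uint32_t index = 0;
    for (int bit = 0; bit < 16; ++bit) {
        index |= ((x >> bit) & 1u) << (2 * bit);
        index |= ((y >> bit) & 1u) << (2 * bit + 1);
    }
    return index;
}

/* Copy a row-major image (assumed power-of-two square, width * height
 * pixels) into a Z-order-swizzled buffer. The reads are perfectly
 * sequential; the writes jump around in a pattern the prefetcher can't
 * easily follow -- which is exactly the case I'm asking about. */
static void swizzle_image(const uint32_t *src, uint32_t *dst,
                          uint32_t width, uint32_t height)
{
    for (uint32_t y = 0; y < height; ++y) {
        for (uint32_t x = 0; x < width; ++x) {
            dst[morton2(x, y)] = src[y * width + x]; /* scattered write */
        }
    }
}
```

So the question boils down to: do those scattered writes in `swizzle_image` cost as much as scattered reads would, or does the store path hide most of it?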