The cost of Writing

I wish I could have asked this during a pre-stream, but I never seem to manage to catch the stream live.

From HMH I learned why working with good locality, using coherent access patterns that help the cache, is vital for performance. However, this was mostly discussed in terms of reading memory (or so I interpreted it).
Caching and writing were discussed in conjunction with multithreading, the MESI protocol, and cache invalidation, but I think there might be more going on and I'm still a bit confused about it.

The questions are: what's the real cost of writing when there's no multithreading and no contention over cache lines?
What's the cost of a cache miss on a write?

On one hand, the cache line has to be read in order to be written, so cache misses on write should be as bad as those on read. On the other hand, the CPU shouldn't stall on a write, since in general it doesn't need the value it's writing in order to proceed with the following operations.

I don't know if I expressed it clearly, so I'll give a practical example case:
This doubt has been haunting me since I started wondering about swizzling the pixels of an image (in unusual patterns, somewhat like GPUs do) in order to make filtering faster. This would require reading each pixel in order and writing it to the swizzled image buffer at locations that may not be cache-friendly or easy to predict. I was wondering whether it's generally worth moving the data around like this; if cache misses on write are as bad as those on read, it might not be.
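
To make the access pattern concrete, here is a minimal sketch of such a copy. It assumes a Z-order (Morton) layout as the swizzle, which is only one of many possible patterns, and a square image whose side is a power of two. The names part1by1, morton_index and swizzle_image are hypothetical, chosen for illustration. Reads walk the source linearly; writes land wherever the swizzled index points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Spread the low 16 bits of v so they occupy the even bit positions. */
static uint32_t part1by1(uint32_t v)
{
    v &= 0x0000ffffu;
    v = (v | (v << 8)) & 0x00ff00ffu;
    v = (v | (v << 4)) & 0x0f0f0f0fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

/* Z-order (Morton) index: x bits go to even positions, y bits to odd ones. */
static uint32_t morton_index(uint32_t x, uint32_t y)
{
    return part1by1(x) | (part1by1(y) << 1);
}

/* Copy a square image of side `size` from row-major order into Z-order.
   Assumption: `size` is a power of two no larger than 65536, so that the
   Morton index maps the image one-to-one onto [0, size * size). */
static void swizzle_image(const uint32_t *src, uint32_t *dst, uint32_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0 && size <= 65536u);
    for (uint32_t y = 0; y < size; ++y) {
        for (uint32_t x = 0; x < size; ++x) {
            /* Linear read, scattered write. */
            dst[morton_index(x, y)] = src[(size_t)y * size + x];
        }
    }
}
```

With this particular layout, consecutive pixels in a row still land fairly close together in the destination, so the writes are scattered but not random; other swizzle patterns may behave differently.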
quien

On one hand, the cache line has to be read in order to be written, so cache misses on write should be as bad as those on read. On the other hand, the CPU shouldn't stall on a write, since in general it doesn't need the value it's writing in order to proceed with the following operations.


Yes, this comes close.

If the data you want to modify is already in the cache, it is changed in the cache and written back to main memory later (a write hit).
So subsequent reads and writes of the same data are cheap as long as it is still in the cache.
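
One rough way to see this for yourself is to time stores at different strides over a large buffer. The sketch below is only an illustration under stated assumptions: the names write_strided and seconds_per_store are hypothetical, the 64-byte line size is typical for x86 but not guaranteed, and clock() is a coarse timer, so the numbers it gives are indicative at best.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

enum { CACHE_LINE_BYTES = 64 }; /* assumption: typical x86 cache line size */

/* Write one byte every `stride` bytes; returns the number of stores issued.
   `volatile` keeps the compiler from removing or merging the stores. */
static size_t write_strided(volatile uint8_t *buf, size_t size, size_t stride)
{
    assert(stride > 0);
    size_t stores = 0;
    for (size_t i = 0; i < size; i += stride) {
        buf[i] = (uint8_t)i;
        ++stores;
    }
    return stores;
}

/* Average CPU time per store, in seconds, for one pass at the given stride. */
static double seconds_per_store(uint8_t *buf, size_t size, size_t stride)
{
    clock_t start = clock();
    size_t stores = write_strided(buf, size, stride);
    clock_t end = clock();
    if (stores == 0) {
        return 0.0;
    }
    return (double)(end - start) / CLOCKS_PER_SEC / (double)stores;
}
```

Calling seconds_per_store on a buffer much larger than the last-level cache, once with a stride of 1 and once with a stride of CACHE_LINE_BYTES, compares stores that mostly hit a line already brought in with stores that each touch a new line. Hardware prefetching and store buffering will blur the difference, so treat the result as a hint rather than a measurement.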

It gets a bit tricky if multithreading comes into play, though you don't seem to be concerned with that.

(See also the Intel 64 and IA-32 Architectures Software Developer's Manual, Vol. 3A, Chapter 11.)