Cost of context switching?

I don't know how applicable this is to optimization via hyperthreading, but the two links below seem to suggest that knowing the L2 cache size and staying below it yields shorter penalties for the inevitable context switching that occurs during processing. It looks like the optimal amount of data to process on a core is about half the size of the L2? I wonder if this is relevant anymore...

http://www.cs.rochester.edu/u/cli/research/switch.pdf

http://blog.tsunanet.net/2010/11/...does-it-take-to-make-context.html
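In case it is still relevant, the kind of thing I have in mind would look roughly like this - pick the per-core chunk size from the L2 size instead of hard-coding it. This is only a sketch; the half-L2 figure is just the heuristic from the papers, and ProcessElement stands in for whatever the real work is:

    #include <stddef.h>

    /* Stand-in per-element work; whatever the real processing is would go here. */
    static void ProcessElement(float *Element)
    {
        *Element = *Element * 2.0f + 1.0f;
    }

    /* Walk Data in chunks sized to roughly half the L2, per the idea in the
       papers above. L2Bytes would really come from CPUID or the OS; it is a
       parameter here because I don't know the target chip. */
    static void ProcessInChunks(float *Data, size_t Count, size_t L2Bytes)
    {
        size_t ChunkElements = (L2Bytes / 2) / sizeof(float);
        if(ChunkElements == 0)
        {
            ChunkElements = 1;
        }

        for(size_t Start = 0; Start < Count; Start += ChunkElements)
        {
            size_t OnePastLast = Start + ChunkElements;
            if(OnePastLast > Count)
            {
                OnePastLast = Count;
            }

            /* Everything touched between Start and OnePastLast should stay
               resident in L2, so a context switch mid-chunk loses less of
               the working set. */
            for(size_t Index = Start; Index < OnePastLast; ++Index)
            {
                ProcessElement(&Data[Index]);
            }
        }
    }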

I can see the logic in keeping the processing of a chunk of data local to a single core. It would follow that less of the (presumably expensive) MESI propagation would kick in, even with two hardware threads working on different parts of the same chunk.
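One sketch of how I would try to keep that MESI traffic down when two hardware threads share a chunk: split the chunk on a cache-line boundary so neither thread writes to a line the other owns. The 64-byte line size and the even split are assumptions on my part, not anything from the project:

    #include <stddef.h>

    #define CACHE_LINE_BYTES 64  /* typical x64 line size; an assumption */

    typedef struct
    {
        float *Base;
        size_t Count;
    } work_range;

    /* Split one chunk between the two hardware threads of a core so the
       boundary lands on a cache line boundary (assuming Chunk itself is
       line-aligned). Since neither thread then writes to a line the other
       owns, there is no false sharing for MESI to propagate between them. */
    static void SplitChunkForTwoThreads(float *Chunk, size_t Count,
                                        work_range *ThreadA, work_range *ThreadB)
    {
        size_t FloatsPerLine = CACHE_LINE_BYTES / sizeof(float);
        size_t Half = ((Count / 2) / FloatsPerLine) * FloatsPerLine;

        ThreadA->Base = Chunk;
        ThreadA->Count = Half;

        ThreadB->Base = Chunk + Half;
        ThreadB->Count = Count - Half;
    }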

I guess the concern with hardware threading is inadvertently hurting performance in an attempt to improve it :)
The problem with any of this is that you really don't know until you actually test it with your workload. We obviously need to be multithreaded to take advantage of multiple cores, so at that point we should at least try hyperthreading and see whether it helps us get through the work units any faster.

In our particular case, our memory accesses are extremely coherent, so it's also possible that the L2 cache is less relevant, because we may always be able to pre-cache. We need to work out our memory bandwidth and check a few things there, though, before we make any decisions. But either way I wanted to get hyperthreading in the mix, because we want to be able to turn it on and off and see what happens to the total throughput.
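Something along these lines is probably the simplest way to run that experiment from the app side: run the same per-thread work once with the physical core count and once with the logical (hyperthreaded) count, and compare work units per second. The 4/8 thread counts and the busy-loop are placeholders, not the real work queue:

    #include <windows.h>
    #include <stdio.h>
    #include <stdint.h>

    /* Placeholder work unit; in reality this would pull from the real work queue. */
    static DWORD WINAPI DoOneWorkUnit(LPVOID Param)
    {
        (void)Param;
        volatile uint64_t Sum = 0;
        for(uint64_t I = 0; I < 200000000ULL; ++I)
        {
            Sum += I;
        }
        return 0;
    }

    /* Run ThreadCount work units in parallel and report work units per second. */
    static double MeasureWorkUnitsPerSecond(DWORD ThreadCount)
    {
        HANDLE Threads[MAXIMUM_WAIT_OBJECTS];
        LARGE_INTEGER Freq, Start, End;

        QueryPerformanceFrequency(&Freq);
        QueryPerformanceCounter(&Start);

        for(DWORD I = 0; I < ThreadCount; ++I)
        {
            Threads[I] = CreateThread(0, 0, DoOneWorkUnit, 0, 0, 0);
        }
        WaitForMultipleObjects(ThreadCount, Threads, TRUE, INFINITE);

        QueryPerformanceCounter(&End);
        for(DWORD I = 0; I < ThreadCount; ++I)
        {
            CloseHandle(Threads[I]);
        }

        double Seconds = (double)(End.QuadPart - Start.QuadPart) / (double)Freq.QuadPart;
        return (double)ThreadCount / Seconds;
    }

    int main(void)
    {
        /* 4 vs 8 assumes a 4-core/8-thread part; substitute the real counts. */
        printf("4 threads: %.2f work units/sec\n", MeasureWorkUnitsPerSecond(4));
        printf("8 threads: %.2f work units/sec\n", MeasureWorkUnitsPerSecond(8));
        return 0;
    }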

- Casey
It's hard to ignore 8 processors. After all, they're just sitting there, unused. I'm pretty sure hardware multithreading can be turned into a performance advantage in this project, especially with the iteration over large arrays you initially explored.

I've found this part of optimization very interesting. I also liked the VLIW stuff you did earlier. I am learning a lot of new things I had only read about in school.

Great series! Keep up the good work.
Just to be clear, it was SIMD stuff we did (Single Instruction, Multiple Data), not VLIW (Very Long Instruction Word). I don't think x64 really qualifies as a VLIW architecture - Itanium was Intel's VLIW technology, and in general it is not used in gaming or desktop computing.

I am not a chip architect, but I believe the idea is this: on x64, the processor actively looks at an instruction stream, the instructions go to various ports for execution, and it is up to the processor to try to fill all of its ports as often as it can. VLIW architectures, by contrast, are designed so that the instruction stream itself encodes what to issue to _all_ the ports on every cycle.

Well, that's probably overstating it. It's more that the instruction stream itself issues things to ports explicitly, so that the compiler, rather than the processor, has explicit control of the instruction-level parallelism.
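To put the kind of instruction-level parallelism we're talking about in concrete terms, take something like the loop below (just an illustration, not code from the project). The four accumulators are independent, so a superscalar x64 can discover at runtime that those adds can go to its ports in parallel; a VLIW compiler would have to bundle the independent adds into the wide instruction words itself.

    /* Four independent accumulators: no add in one chain depends on the result
       of another chain, so several adds can be in flight at once. On x64 the
       processor finds that dependence-freedom dynamically and fills its ports;
       on a VLIW machine the compiler would have to encode it into the bundles. */
    float SumWithILP(const float *Values, int Count)
    {
        float Sum0 = 0.0f, Sum1 = 0.0f, Sum2 = 0.0f, Sum3 = 0.0f;
        int I = 0;
        for(; I + 4 <= Count; I += 4)
        {
            Sum0 += Values[I + 0];
            Sum1 += Values[I + 1];
            Sum2 += Values[I + 2];
            Sum3 += Values[I + 3];
        }
        for(; I < Count; ++I)
        {
            Sum0 += Values[I];  /* scalar tail */
        }
        return (Sum0 + Sum1) + (Sum2 + Sum3);
    }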

- Casey
Yes, I got VLIW somehow confused with SIMD, not sure why.

SIMD executes one instruction on several pieces of data at once.
Superscalar (and VLIW) processors issue several instructions per cycle from a single instruction stream.

Superscalar does instruction ordering/dispatching dynamically in hardware, while VLIW does it entirely in the compiler.

I think it's possible that either architecture could have SIMD units.
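For what it's worth, the textbook way to see the SIMD half of that distinction in code is something like the following, using plain SSE intrinsics (nothing project-specific here): one _mm_add_ps adds four pairs of floats with a single instruction, and a superscalar or VLIW machine could additionally issue several such instructions per cycle.

    #include <xmmintrin.h>

    /* One instruction (ADDPS, via _mm_add_ps) operates on four floats at once:
       single instruction, multiple data. Issuing several such instructions per
       cycle is the superscalar/VLIW part, and is orthogonal to the SIMD width. */
    void AddArrays(float *Dest, const float *A, const float *B, int Count)
    {
        int I = 0;
        for(; I + 4 <= Count; I += 4)
        {
            __m128 VA = _mm_loadu_ps(A + I);
            __m128 VB = _mm_loadu_ps(B + I);
            _mm_storeu_ps(Dest + I, _mm_add_ps(VA, VB));
        }
        for(; I < Count; ++I)
        {
            Dest[I] = A[I] + B[I];  /* scalar tail */
        }
    }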