I don't know how applicable this is to optimization via hyperthreading, but the two links below seem to suggest that knowing the L2 cache size and keeping your working set below it yields shorter penalties for the inevitable context switches that occur during processing. The optimal amount of data to process on a core appears to be about half the size of the L2 (roughly what the sketch after the links does). I wonder if this is still relevant...
http://www.cs.rochester.edu/u/cli/research/switch.pdf
http://blog.tsunanet.net/2010/11/...does-it-take-to-make-context.html
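Just to make "about half the L2" concrete, here's a minimal sketch of what I have in mind. It assumes a 256 KiB L2 and a made-up scale-then-sum pass over a float array; the chunk size and the work itself are my own placeholders, not anything taken from the papers:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t kL2Bytes    = 256 * 1024;    // assumed L2 size
constexpr std::size_t kChunkBytes = kL2Bytes / 2;  // "half the L2" rule of thumb
constexpr std::size_t kChunkElems = kChunkBytes / sizeof(float);

float process(std::vector<float>& data) {
    float sum = 0.0f;
    // Finish all work on one chunk before touching the next, so a context
    // switch (or the second pass below) mostly hits data still resident in L2.
    for (std::size_t base = 0; base < data.size(); base += kChunkElems) {
        const std::size_t end = std::min(base + kChunkElems, data.size());
        for (std::size_t i = base; i < end; ++i) data[i] *= 2.0f;  // pass 1
        for (std::size_t i = base; i < end; ++i) sum += data[i];   // pass 2, warm cache
    }
    return sum;
}
```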
I can see the logic in keeping the processing of a chunk of data local to a single core. It would follow that less of the (presumably expensive) MESI traffic would kick in, even with two hardware threads working on different parts of the same chunk.
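To make that "two hardware threads, one chunk" idea concrete, here's a minimal, Linux-specific sketch. I'm assuming logical CPUs 0 and 1 are SMT siblings on the same physical core (the real mapping is in /sys/devices/system/cpu/cpu*/topology/thread_siblings_list, so adjust the ids); each thread pins itself and scales a disjoint half of the same chunk, so the lines stay in the core's shared L2 and no cross-core MESI invalidations should be triggered:

```cpp
// build (Linux): g++ -O2 -pthread smt_chunk.cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for pthread_setaffinity_np
#endif
#include <pthread.h>
#include <sched.h>

#include <cstddef>
#include <thread>
#include <vector>

// Pin the calling thread to one logical CPU.
static void pin_self(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    std::vector<float> chunk(32 * 1024);  // one roughly L2-sized chunk of floats
    const std::size_t half = chunk.size() / 2;

    // Each hyperthread scales a disjoint half of the same chunk. Because both
    // logical CPUs share the core's L2 (and L1), the data stays local to that
    // core instead of bouncing between cores.
    auto work = [&](int cpu, std::size_t begin, std::size_t end) {
        pin_self(cpu);
        for (std::size_t i = begin; i < end; ++i) chunk[i] *= 2.0f;
    };

    std::thread a(work, 0, std::size_t(0), half);  // assumed SMT sibling pair:
    std::thread b(work, 1, half, chunk.size());    // adjust CPU ids for your topology
    a.join();
    b.join();
    return 0;
}
```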
I guess the concern with hardware threading is inadvertently hurting performance in an attempt to improve it :)