Using SIMD with hardware threads?

I was wondering if it is possible to combine hyperthreading (HT) with SIMD.

They are different types of parallelism (thread-level vs. data-level), so it sounds theoretically possible. Would there be any optimization advantage to doing this?
In theory, as long as the SIMD unit executes only one hyperthread's work at a time, it'll be as fast as a thread running on a separate core. And since the code always contains other instructions too, it's quite likely the SIMD unit isn't kept busy 100% of the time by a single hyperthread. The devil is in the details, as in exactly which bits can be shared between hyperthreads.

In any case, you'll definitely be faster using SIMD instructions than not.

You can look at this a lot like the port pressure stuff we were doing with IACA (Intel's Architecture Code Analyzer). With hyperthreads, you're trying to combine code where one thread can fill the unblocked ports on a cycle where the other thread has no instructions for those ports, if that makes sense.

There is no such thing as "the SIMD unit", as the different types of operations SIMD performs on x64 are actually done on _different units_. So it's also important to know which port (unit) you are blocked on, etc., and whether you are waiting on memory, etc. IACA can help here, as can looking at a profile of your cache misses and so on (we'll be doing more of that later on in HH).

- Casey
You're right of course. If the hardware resource doing SIMD operation X is busy, the other hyperthread can't use it at the same time (unless, of course, said core happens to have two or more copies of said hardware resource). I don't know how granular the different resources are, and that will probably also vary from one chip to another.

...but you're still better off trying to use them than not =)
I was under the impression there was more than 1 SIMD register set you could work with. If you could pin HTs to discrete sets of SIMD registers you could combine the two techniques for a kind of double whammy speed increase.

Also, what is IACA? I got in on Day 90, so I think I'm missing something here.
The problem is not the register file, it's the work units. For example, the CPU that we're using on stream has only 1 SIMD multiply unit. So if both hyperthreads were exclusively doing long strings of multiplications, you would literally get no speed increase from hyperthreading, because both threads would just sit there fighting over who gets to issue a multiply to the SIMD multiply unit.

However, the SIMD shift unit, for example, is a separate work unit, so if one hyperthread were doing all shifts and the other hyperthread were doing all multiplies, then you would get a 2x speed increase, because they would not contend with each other at all.

Does that make sense?

So the entire thing comes down to whether or not a single thread executing on the core has any stalls, either due to memory latency or execution port contention, and whether or not you can put a second thread on that core that will be able to fill in those stalls with instructions that themselves are not blocking on the same resources.

- Casey
OK, I see now. You're talking about the underlying contention for hardware resources between hardware threads on a single core, which completely makes sense.

Using hardware threads on different cores might then make more sense for something like this.

It totally makes sense that if Hyperthreads were used this way, they would have to work on some different SIMD operation that uses some other hardware unit on the core.
Allan Bowhill
It totally makes sense that if Hyperthreads were used this way, they would have to work on some different SIMD operation that uses some other hardware unit on the core.

Or some completely different operation altogether. And if the smart guys who design CPUs see from statistics that some operation is used a lot, they may add another copy of that in a future CPU, which is much cheaper than adding another complete core.
Another approach would be to simply split a string of SIMD operations into two approximately same-sized bundles, A and B, and let the threads operate on those bundles in sequence, like a 2-stage pipeline.

Let the hardware resource collisions fall where they may; they can do no harm. Even if resource contention does cause some delay, just subtract it from the efficiency gained from the parallelism, and you would probably still come out ahead.

Does that even sound realistic?