Allan Bowhill
noxy_key
23 posts
#3794 Using SIMD with hardware threads?
3 years, 1 month ago

I was wondering if it is possible to combine HT with SIMD.

They are different types of parallelism (instruction vs. data), so it sounds theoretically possible. Would there be any optimization advantage to doing this?
sol_hsa
Jari Komppa
40 posts / 1 project

Programmer, designer, writer, a lot of other things; http://iki.fi/sol

#3795 Using SIMD with hardware threads?
3 years, 1 month ago Edited by Jari Komppa on June 1, 2015, 4:41 p.m.

In theory, as long as the SIMD unit executes only one hyperthread's stuff at a time, it'll be as fast as a thread running on a separate core. And since the code always contains other instructions too, it's quite likely the SIMD unit isn't busy 100% of the time from a single hyperthread. The devil is in the details, as in which exact bits can be shared between hyperthreads.

In any case, you'll definitely be faster using SIMD instructions than not.

http://iki.fi/sol - my schtuphh
cmuratori
Casey Muratori
810 posts / 1 project

Casey Muratori is a programmer at Molly Rocket on the game 1935 and is the host of the educational programming series Handmade Hero.

#3798 Using SIMD with hardware threads?
3 years, 1 month ago

You can look at this a lot like the port pressure stuff we were doing with IACA. With HyperThreads, you're trying to combine code where one thread can fill the unblocked ports on a cycle where the other thread has no instructions for those ports, if that makes sense.

There is no such thing as "the SIMD unit", as the different types of operations SIMD performs on x64 are actually done on _different units_. So it's also important to know which port (unit) you are blocked on, whether you are waiting on memory, and so on. IACA can help here, as can looking at a profile of your cache misses (we'll be doing more of that later on in HH).

- Casey
sol_hsa
Jari Komppa
40 posts / 1 project

#3801 Using SIMD with hardware threads?
3 years, 1 month ago

You're right, of course. If the hardware resource doing SIMD operation X is busy, the other hyperthread can't use it at the same time (except, of course, if said core happens to have two or more of said hardware resource). I don't know how fine-grained the different resources are, and that will probably also vary from one chip to another.

..but you're still better off trying to use them than not =)

Allan Bowhill
noxy_key
23 posts
#3810 Using SIMD with hardware threads?
3 years, 1 month ago

I was under the impression there was more than 1 SIMD register set you could work with. If you could pin HTs to discrete sets of SIMD registers you could combine the two techniques for a kind of double whammy speed increase.

Also, what is IACA? I got in on Day 90, so I think I'm missing something here.
cmuratori
Casey Muratori
810 posts / 1 project

#3811 Using SIMD with hardware threads?
3 years, 1 month ago

The problem is not the register file, it's the work units. For example, the CPU that we're using on stream has only 1 SIMD multiply unit. So if both hyperthreads were exclusively doing long strings of multiplications, you would literally get no speed increase from hyperthreading, because both threads would just sit there fighting over who gets to issue a multiply to the SIMD multiply unit.

However, the SIMD shift unit, for example, is in a different work unit, so if one hyperthread were doing all shifts and the other hyperthread were doing all multiplies, then you would get a 2x speed increase, because they would not contend with each other at all.

Does that make sense?

So the entire thing comes down to whether or not a single thread executing on the core has any stalls, either due to memory latency or execution port contention, and whether or not you can put a second thread on that core that will be able to fill in those stalls with instructions that themselves are not blocking on the same resources.
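To make the two cases concrete, here is a minimal sketch (not code from the stream) of the kind of kernels being described: one that is almost all SIMD multiplies and one that is almost all SIMD shifts. The names `multiply_kernel`, `shift_kernel`, and `check_kernels` are made up for illustration, and the code assumes an x86-64 compiler with SSE2. Which execution port each instruction actually lands on varies from chip to chip, so treat this as something to measure with, not as a statement about any particular CPU.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics, baseline on x86-64 */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical multiply-heavy kernel: four independent chains of 16-bit
   SIMD multiplies, so nearly every instruction in the loop is a multiply. */
uint16_t multiply_kernel(uint16_t seed, size_t iterations)
{
    __m128i a0 = _mm_set1_epi16((short)seed), a1 = a0, a2 = a0, a3 = a0;
    const __m128i f0 = _mm_set1_epi16(3), f1 = _mm_set1_epi16(5);
    const __m128i f2 = _mm_set1_epi16(7), f3 = _mm_set1_epi16(9);
    for (size_t i = 0; i < iterations; ++i) {
        a0 = _mm_mullo_epi16(a0, f0);
        a1 = _mm_mullo_epi16(a1, f1);
        a2 = _mm_mullo_epi16(a2, f2);
        a3 = _mm_mullo_epi16(a3, f3);
    }
    /* Fold the chains together so none of them is dead code. */
    __m128i sum = _mm_add_epi16(_mm_add_epi16(a0, a1), _mm_add_epi16(a2, a3));
    return (uint16_t)_mm_extract_epi16(sum, 0);
}

/* Hypothetical shift-heavy kernel: rotates each 32-bit lane left by one bit
   per iteration using two SIMD shifts and an OR, with no multiplies. */
uint32_t shift_kernel(uint32_t seed, size_t iterations)
{
    __m128i acc = _mm_set1_epi32((int)seed);
    for (size_t i = 0; i < iterations; ++i) {
        __m128i hi = _mm_slli_epi32(acc, 1);
        __m128i lo = _mm_srli_epi32(acc, 31);
        acc = _mm_or_si128(hi, lo);
    }
    return (uint32_t)_mm_cvtsi128_si32(acc);
}

/* Quick sanity check of the arithmetic (wraps modulo 2^16 / rotates). */
void check_kernels(void)
{
    assert(multiply_kernel(1, 2) == 164); /* 9 + 25 + 49 + 81 */
    assert(shift_kernel(1, 4) == 16);     /* 1 rotated left by 4 bits */
}
```

The experiment would be to run `multiply_kernel` on both hyperthreads of one core, then `multiply_kernel` on one and `shift_kernel` on the other, and compare the timings. Pinning threads to sibling hyperthreads is OS-specific and is left out here.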

- Casey
cmuratori
Casey Muratori
810 posts / 1 project

#3812 Using SIMD with hardware threads?
3 years, 1 month ago

Allan Bowhill
noxy_key
23 posts
#3816 Using SIMD with hardware threads?
3 years, 1 month ago

OK, I see now. You're talking about the underlying contention for hardware resources between hardware threads on a single core, which completely makes sense.

Using hardware threads on different cores then might make better sense for something like this.

It totally makes sense that if Hyperthreads were used this way, they would have to work on some different SIMD operation that uses some other hardware unit on the core.
sol_hsa
Jari Komppa
40 posts / 1 project

#3818 Using SIMD with hardware threads?
3 years, 1 month ago

Allan Bowhill
It totally makes sense that if Hyperthreads were used this way, they would have to work on some different SIMD operation that uses some other hardware unit on the core.

Or some completely different operation altogether. And if the smart guys who design CPUs see from statistics that some operation is used a lot, they may add another copy of that in a future CPU, which is much cheaper than adding another complete core.

Allan Bowhill
noxy_key
23 posts
#3821 Using SIMD with hardware threads?
3 years, 1 month ago

Another approach would be to simply split a string of SIMD operations into two roughly equal-sized bundles, A and B, and let the threads operate on those bundles in sequence, like a 2-stage pipeline.

Let the hardware resource collisions occur where they may, as it can do no harm. Even if resource contention does cause some kind of delay, just subtract it from the efficiency gained from the parallelism, and you would probably still come out ahead.

Does that even sound realistic?
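One way to find out rather than guess is to build the two-bundle pipeline and time it. Below is a minimal sketch of the idea in C, assuming POSIX threads, C11 atomics, and SSE on x86-64. Everything in it is hypothetical: the names (`pipeline`, `run_bundle_a`, `run_bundle_b`, `run_pipeline`), the chunk sizes, and the stand-in operations (bundle A multiplies by 2, bundle B adds 1). It also does not pin the two threads to the same core, so the OS decides whether they actually end up on sibling hyperthreads.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE/SSE2 intrinsics and _mm_pause */
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

enum { CHUNK_FLOATS = 256, CHUNK_COUNT = 64 }; /* arbitrary sizes */

typedef struct {
    float *data;               /* CHUNK_COUNT * CHUNK_FLOATS floats */
    atomic_size_t chunks_done; /* chunks that bundle A has finished */
} pipeline;

/* Stage 1: bundle A, the first half of the operation string (x * 2). */
static void *run_bundle_a(void *arg)
{
    pipeline *p = (pipeline *)arg;
    const __m128 two = _mm_set1_ps(2.0f);
    for (size_t c = 0; c < CHUNK_COUNT; ++c) {
        float *chunk = p->data + c * CHUNK_FLOATS;
        for (size_t i = 0; i < CHUNK_FLOATS; i += 4) {
            _mm_storeu_ps(chunk + i, _mm_mul_ps(_mm_loadu_ps(chunk + i), two));
        }
        /* Publish the finished chunk so stage 2 may start on it. */
        atomic_store_explicit(&p->chunks_done, c + 1, memory_order_release);
    }
    return NULL;
}

/* Stage 2: bundle B, the second half (x + 1), trailing behind stage 1. */
static void *run_bundle_b(void *arg)
{
    pipeline *p = (pipeline *)arg;
    const __m128 one = _mm_set1_ps(1.0f);
    for (size_t c = 0; c < CHUNK_COUNT; ++c) {
        /* Wait until stage 1 has published chunk c. */
        while (atomic_load_explicit(&p->chunks_done, memory_order_acquire) <= c) {
            _mm_pause(); /* spin-wait hint to the core */
        }
        float *chunk = p->data + c * CHUNK_FLOATS;
        for (size_t i = 0; i < CHUNK_FLOATS; i += 4) {
            _mm_storeu_ps(chunk + i, _mm_add_ps(_mm_loadu_ps(chunk + i), one));
        }
    }
    return NULL;
}

/* Runs both stages on two threads over data; returns 0 on success. */
int run_pipeline(float *data)
{
    pipeline p;
    pthread_t a, b;
    assert(data != NULL);
    p.data = data;
    atomic_init(&p.chunks_done, 0);
    if (pthread_create(&a, NULL, run_bundle_a, &p) != 0) {
        return -1;
    }
    if (pthread_create(&b, NULL, run_bundle_b, &p) != 0) {
        pthread_join(a, NULL);
        return -1;
    }
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}
```

Timing this against a single thread that runs A and B back to back on each chunk would show whether the handoff cost and the shared execution units leave any gain. If A and B lean on the same unit, the earlier posts suggest the second thread may add little.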