Multithreading vs hyperthreading

Livet Ersomen Strøm

#3746

May 28, 2015

just delete this

Edited by Livet Ersomen Strøm on May 28, 2015, 2:56am

Mārtiņš Možeiko

#3747

May 28, 2015

1. There is no such thing as 8.1 cores. So it is not correct to say that 1 thread is using 1/16 of the cpu. You can never get 16x speedup from 16 logical cores on an 8 core processor. At most 7.99

I think you are not understanding hyperthreading well. Using 4 CPU's 100% doesn't mean that CPU is 100% busy. Some of his execution units/ports are still free because instructions can not be ordered arbitrary because of dependencies or because of memory stalls. Executing additional thread in parallel will actually do some useful work. So you won't get just 4x speedup if you have 4 physical cores. You will get something from 4x to 8x speedup depending on workload you are executing.

This is pretty easy to show. Make 4 threads to do multiplies, and 4 threads to do just additions. And see how fast you it executes serially vs multi-threaded. Here's code:

#include <stdio.h>
#include <windows.h>

struct Arg
{
  float value;
  int count;
};

static volatile bool start = false;

static DWORD WINAPI MulThread(LPVOID param)
{
  Arg* arg = (Arg*)param;
  int count = arg->count;
  float value = arg->value;
  float result = 1.f;

  while (!start) {}

  for (int i=0; i<count; i++)
  {
    result *= value;
  }

  return (DWORD)result;
}

static DWORD WINAPI AddThread(LPVOID param)
{
  Arg* arg = (Arg*)param;
  int count = arg->count;
  float value = arg->value;
  float result = 0.f;

  while (!start) {}
  
  for (int i=0; i<count; i++)
  {
    result += value;
  }

  return (DWORD)result;
}


int main()
{
   LARGE_INTEGER f, t1, t2;
   QueryPerformanceFrequency(&f);

   volatile float value = 1.1f;
   volatile int count = 1000*1000*1000;
   
   Arg arg = { value, count };
   HANDLE threads[8];

   for (int i=0; i<8; i++)
   {
     threads[i] = CreateThread(NULL, 0, i < 4 ? MulThread : AddThread, &arg, 0, NULL);
   }

   // multiple threads
   {
     QueryPerformanceCounter(&t1);
     start = true;
     WaitForMultipleObjects(8, threads, TRUE, INFINITE);
     QueryPerformanceCounter(&t2);
   }

   double threadTime = double(t2.QuadPart - t1.QuadPart) / f.QuadPart;
   printf("Multiple Thread Time = %.2f sec\n", threadTime);

   float v = value;
   int c = count;
   volatile float result = 0;

   // single thread
   {
     QueryPerformanceCounter(&t1);

     float r = 1;
     for (int t=0; t<4; t++)
     {
       for (int i=0; i<count; i++)
       {
         r *= v;
       }
     }
     result += r;

     r = 0;
     for (int t=0; t<4; t++)
     {
       for (int i=0; i<count; i++)
       {
         r += v;
       }
     }
     result += r;

     QueryPerformanceCounter(&t2);
   }

   double singleTime = double(t2.QuadPart - t1.QuadPart) / f.QuadPart;
   printf("Single Thread Time = %.2f sec\n", singleTime);

   printf("Speedup = %.1f\n", singleTime / threadTime);
}

And here is me compiling it and executing on i7-4790K CPU which is 4 core Haswell with hyperthreading enabled - so 8 logical cores:

C:\test>g++ -O2 test.cpp -o test.exe

C:\test>test.exe
Multiple Thread Time = 1.22 sec
Single Thread Time = 7.75 sec
Speedup = 6.4

So while you don't get perfect 8x speedup (because of loop overhead and other integer instructions executing and taking free ports) you still got significantly larger speedup than just 4x.

Edited by Mārtiņš Možeiko on May 28, 2015, 3:24am

Livet Ersomen Strøm

#3751

May 28, 2015

mmozeiko
1. There is no such thing as 8.1 cores. So it is not correct to say that 1 thread is using 1/16 of the cpu. You can never get 16x speedup from 16 logical cores on an 8 core processor. At most 7.99

I think you are not understanding hyperthreading well. Using 4 CPU's 100% doesn't mean that CPU is 100% busy.

Yes it means exactly so. To confirm my results all you need to do is write a buffermemory in parallell on 4 cores, with no overlapping writes. And then you will find that 4 cores have a performance that cannot be improved by adding more threads. That would be a real example. I have already done this, both in testing and in a game.

My threads does not use ANY serialization primitives, at all. My threads NEVER writes to any shared memory. Ever. This should be forbidden in multithreaded code, and is less then optimal.

I run similar code to that in HMH, (conseptually) except i run a range of jobs before waiting for an event, instead of just one. (but if i need to i can issue just one). But no locks. No need for ordering of writes. Since threads never are allowed to share memory. Each thread is named after a core, and is assigned spesific tasks. While true that a task then could suffer from amdahl, this is shown by timing each core, and then take care of that spesific problem then. Not by bugging down the execution of all other jobs.

For 100% sure is at least that a 8 core system _cannot_ under any circumstance speed up wellwritten REAL code by more then a factor of 7,99 and it is factually incorrect to teach this.

Edited by Livet Ersomen Strøm on May 28, 2015, 1:27pm

Livet Ersomen Strøm

#3752

May 28, 2015

Please take note that I am 100% confident that Casey will do extremely well with his game eventually. He seems to be an EXCELLENT coder, way above my level. So please don't take this as some kind of critisism. But everyone makes mistakes and this one is one. It's just not correct to say that 16 hyperthreads can give a 16, or even a 9x speedup. Its just that windows will say 50% when you push all physical cores to 100%, but this is very misleading, as if you then double the work it will say 100% without actually doing more work per unit of time.

I posted about this before, but I kind of think it's a pretty significant "detail".

Casey Muratori

#3753

May 28, 2015

Hyperthreading definitely does allow you to potentially double the work done per core. The reason for this is that the processor core is actually not just one monolithic thing, it is a series of work units that can each accept one thing to do on each cycle. When a core has only one hyperthread to execute, it may find that its work units are often fallow because there are no instructions in that thread that are not serially dependent, and thus it cannot fill all its work units every cycle (well, up to the instruction-per-cycle issue limit, which I believe is 4). Adding a second hyperthread gives the processor a whole other instruction window from which to pull instructions so it can keep the work units full.

Obviously, the degree of speedup you get depends entirely on a) how underutilized the work units are in the first thread when executing by itself and b) how well the second thread can fill in the unused slots. But it is certainly theoretically possible to get 2x the performance, and in practice you can often get non-trivial speedups here (probably not the full 2x, but definitely not nothing).

Finally, none of this has anything to do with the "percentage of the CPU used" that Windows reports. All that tells you is whether you're at least stuffing all the CPUs with actual threads. It doesn't actually measure your performance, which is why we used rdtsc (and eventually other things) to look at how well we're doing optimization-wise. What we care about is the total time taken to perform work, and what we want to measure is whether adding a hypterthread decreases that time.

- Casey

Mārtiņš Možeiko

#3758

May 28, 2015

Kladdehelvete, did you read all the text I posted or just the first sentence?

If what you are saying is true, then on 4 core CPU for non I/O bound work you would never get more than 3.99x speedup from single-threaded case, right? Then how do you explain result with my code? I'm running it on 4 core CPU with hyper-threading enabled. Code that runs is not I/O bound - it just performs float addition and multiply. And it runs 6.25x faster than same code running in single thread. Where does 2.25 additional cores come from if not hyper-threading? IMHO it clearly shows than hyper-threading can significantly help (note word CAN, not WILL - it all depends on code you are executing). If you don't like 6.25 number I can figure out different sample code than will give lager speedup - something like 7.5x or so.

Edited by Mārtiņš Možeiko on May 28, 2015, 5:36pm

Andrew Bromage

#3765

May 29, 2015

cmuratori
Obviously, the degree of speedup you get depends entirely on a) how underutilized the work units are in the first thread when executing by itself and b) how well the second thread can fill in the unused slots. But it is certainly theoretically possible to get 2x the performance, and in practice you can often get non-trivial speedups here (probably not the full 2x, but definitely not nothing).

All of this is true. I've found that "1.5x speedup" is a reasonably good rule of thumb for most general-purpose workloads, though if you're fine-tuning your inner loops you may do better.

One other thing to keep in mind is that hyperthreading increases the number of instructions issues without increasing the size of your cache. Back when I was paid to write database servers, there were some workloads for which we'd advise customers to turn HT off. Yes, you got more cycles to do things with, but the increase in working set size could result in a net loss.

Admittedly, this was a long time ago, and the discrepancy may not be as big now. Also, hyperthreading was relatively new, and nobody really understood how to use it effectively.

Livet Ersomen Strøm

#3819

June 2, 2015

mmozeiko
Kladdehelvete, did you read all the text I posted or just the first sentence?

MMozeiko:

Yes. I definitly read everything you post. But we had this discussion before, and as I recall I already agreed then, to the fact that if a core is underutilized, then of course, running more work on it will have more work done.

But this is not the issue I bring up.

If hyperthreading brought more power to a CPU, Intel basically could restart
moores law, by just subdividing each core into twice as many new hyper threads, every 18 months! Wouldn't that be just marvellous?

Now I do not only read every post you make in this forum. I code programs
to actually test various claims being made. And right now I come to you
with depressing results.

Right now, I am close to questioning if my Nehalem 4 CORE i7 cpu
is even working correctly.

The result I am staring at now freaks me out. I am like: "noone is going to believe me".

I post 4 short slightly different asm routines. (to make things simple).

Each of them has been duplicated into 4 "unique" versions where each version uses a seperate memoryarea containing 10 million 16 bytes vectors.
So even if I post 1 version of each, there is 4 of each in the testcode.

There is 4 versions of [DoMulPsWork1] called
DoMulPsWork1 - DoMulPsWork2 - DoMulPsWork3 - DoMulPsWork4

Same apply to the other routines. But I post only the 1 version of each here, as they are the same except for working on a different data area.

ABOUT THE DATA.
Each memory area contains the exact same data, but are SEPERATE areas.
The data has been generated randomly using whole numbers between 0 and 3000
that has then been converted to floats, and stored into each vector component.

That memory has then been copied to 8(7) seperate areas

(I chose this range to make sure the muls dont overflow.
But slightly higher ranges yields the same result, as long as no INF is happening.

If INF happens the routine will seem to complete *much* faster
but timings will then be very wrong).

DoMulPsWork 1 - 4:
does a simple MULPS of 2 x 16 bytes vectors, and stores result back into
the first vector in the data. It then skips both vectors (ECX+32) and does it again until the end of the array. (5 million times in a loop).

DoFPUMulWork1 - 4:
does the exact same thing as DoMulPsWork, except it uses the FPU. Therefore it must do 4 seperate muls, loads and stores to achieve the same thing.

The two other routines below does the same, except they do more work.
This was to give the SSE code and FPU more todo, so that the SSE code
would have a stronger advantage....it IS faster after all. So in addition to the MULPS, they also do a ADDPS and one ANDPS.

But the results are still depressing. Not only is the FPU code sometimes almost as fast as the SSE code, espesially on cached data, and if calling "Refreshdata" afterwords, or inbetween several runs (not shown)
but neighter the FPU code NOR the SSE code takes almost any advantage from running on multiple cores. The time taken by 2 cores, is twice that of one core, and so on. Not exactly, but it is still depressing.

So the time taken for 4 cores, is almost exactly 4 times the time taken on one core. I am like:

"I dont believe this, I must have been making some some mistake and I just cant see it".

So if you please. Can you verify and hopefully dispute my results? Or explain them?

First the 4 code samples. Then the results further down.

;MaxVectors = 10 million
;5 million x 4 muls

DoMulPsWork1:
    mov D$NumSSE2Muls1 0
    mov eax D$MulPsData1
    mov ecx 0
    While ecx < ((MaxVectors*16) / 2)
        movaps xmm0 O$eax+ecx
        movaps xmm1 O$eax+ecx+16
        mulps xmm0 xmm1
        movaps O$eax+ecx xmm0
        add ecx 32
        add D$NumSSE2Muls1 4
    End_While
    ;call Refreshdata
ret


DoFPUMulWork1:
    mov D$NumFPUMuls1 0
    mov eax D$FPUMulData1
    mov ecx 0
    While ecx < ((MaxVectors*16)/2)
        fld F$eax+ecx
        fmul F$eax+ecx+16
        fstp F$eax+ecx
        add ecx 4
        fld F$eax+ecx
        fmul F$eax+ecx+16
        fstp F$eax+ecx
        add ecx 4
        fld F$eax+ecx
        fmul F$eax+ecx+16
        fstp F$eax+ecx
        add ecx 4
        fld F$eax+ecx
        fmul F$eax+ecx+16
        fstp F$eax+ecx
        add ecx (4+16)
        add D$NumFPUMuls1 4
    End_While
    ;call Refreshdata
ret

Because I was flabbergasted by timing these on 1 CORE, 2 CORES 3 CORES and 4 CORES i added more code into the mix, to give the SSE code more advantage and created the following routines:

DoMOREMulPsWork1:
    mov D$NumSSE2More1 0
    mov eax D$MulPsData1
    mov ecx 0
    While ecx < ((MaxVectors*16) / 2)
        movaps xmm0 O$eax+ecx
        movaps xmm1 O$eax+ecx+16
        mulps xmm0 xmm1
        addps xmm0 xmm1
        andps xmm0 xmm1
        movaps O$eax+ecx xmm0
        add ecx 32
        add D$NumSSE2More1 4
    End_While
    ;call Refreshdata
ret



DoMoreFPUMulWork1:
    mov D$NumFPUMore1 0
    mov eax D$FPUMulData1
    mov ecx 0
    push 0
    While ecx < ((MaxVectors*16)/2)
        fld F$eax+ecx
        fmul F$eax+ecx+16
        fadd F$eax+ecx+16
        fstp D$esp
        mov edx D$esp
        and edx F$eax+ecx+16
        mov F$eax+ecx edx
        add ecx 4

        fld F$eax+ecx
        fmul F$eax+ecx+16
        fadd F$eax+ecx+16
        fstp D$esp
        mov edx D$esp
        and edx F$eax+ecx+16
        mov F$eax+ecx edx
        add ecx 4

        fld F$eax+ecx
        fmul F$eax+ecx+16
        fadd F$eax+ecx+16
        fstp D$esp
        mov edx D$esp
        and edx F$eax+ecx+16
        mov F$eax+ecx edx
        add ecx 4

        fld F$eax+ecx
        fmul F$eax+ecx+16
        fadd F$eax+ecx+16
        fstp D$esp
        mov edx D$esp
        and edx F$eax+ecx+16
        mov F$eax+ecx edx

        add ecx (4+16)
        add D$NumFPUMore1 4
    End_While
    add esp 4
    ;call Refreshdata
ret

SSE SIMPLE
RESULTS
CODE DoMulPsWork1
All threads took 13.41 millseconds.

Thread1: 13.41
Thread2: 0
Thread3: 0
Thread4: 0

DoMulPsWork1+2
All threads took 18.31 millseconds.

Thread1: 18.23
Thread2: 18.31
Thread3: 0
Thread4: 0

DoMulPsWork1->3
All threads took 29.95 millseconds.

Thread1: 29.63
Thread2: 29.79
Thread3: 29.95
Thread4: 0

DoMulPsWork1->4

All threads took 37.16 millseconds.

Thread1: 31.30
Thread2: 36.38
Thread3: 36.84
Thread4: 37.16

**************

FPU SIMPLE
RESULTS
CODE DoFPUMulWork1 alone
All threads took 18.92 millseconds.

Thread1: 18.92
Thread2: 0
Thread3: 0
Thread4: 0

CODE DoFPUMulWork1 + 2
All threads took 34.03 millseconds.

Thread1: 33.83
Thread2: 34.03
Thread3: 0
Thread4: 0

CODE DoFPUMulWork1 + 2 + 3
All threads took 39.52 millseconds.

Thread1: 39.37
Thread2: 39.45
Thread3: 39.52
Thread4: 0

CODE DoFPUMulWork1 + 2 + 3 + 4
All threads took 53.70 millseconds.

Thread1: 48.62
Thread2: 49.01
Thread3: 53.52
Thread4: 53.70

--------

SSE2 MORE
RESULTS
CODE DoMOREMulPsWork1
All threads took 14.10 millseconds.

Thread1: 14.10
Thread2: 0
Thread3: 0
Thread4: 0

DoMOREMulPsWork1-2
All threads took 23.06 millseconds.

Thread1: 22.66
Thread2: 23.06
Thread3: 0
Thread4: 0

DoMOREMulPsWork1-3
All threads took 32.05 millseconds.

Thread1: 29.80
Thread2: 30.55
Thread3: 32.05
Thread4: 0

DoMOREMulPsWork1-4

All threads took 40.44 millseconds.

Thread1: 38.44
Thread2: 38.84
Thread3: 39.29
Thread4: 40.44

FPU MORE
RESULTS
CODE DoMoreFPUMulWork1
All threads took 32.83 millseconds.

Thread1: 32.83
Thread2: 0
Thread3: 0
Thread4: 0

DoMoreFPUMulWork1-2
All threads took 52.24 millseconds.

Thread1: 51.87
Thread2: 52.24
Thread3: 0
Thread4: 0

DoMoreFPUMulWork1-3
All threads took 60.15 millseconds.

Thread1: 59.64
Thread2: 59.76
Thread3: 60.15
Thread4: 0

DoMoreFPUMulWork1-4
All threads took 98.43 millseconds.

Thread1: 80.55
Thread2: 94.56
Thread3: 98.17
Thread4: 98.43

Edited by Livet Ersomen Strøm on June 2, 2015, 1:44pm

Ossi Waenerberg

#3820

June 2, 2015

I don't think there's anything that surprising about the results - you're just bound by available memory bandwidth. 10 000 000 xmm vectors, at 16 bytes each, is ~152 Mb, that's not gonna fit even in L3 cache (and besides, you don't mention anything about warming caches beforehand - but with the size of the dataset, it's a moot point). And when you use multiple threads, you multiply the data size by their amount, so no wonder it's going to take longer, since all the cores share the same bandwidth from memory to the cpu. And besides, when you actually do more work, the SSE code is demonstrably faster - isn't that just what we would like to see?

Mārtiņš Možeiko

#3822

June 2, 2015

@Kladdehelvete - You clearly don't understand hyperthreading. Hyperthreading is about using underutilized units. It won't magically create new caches or allow to work around memory bounded workload. It's about using mul unit when only add unit is busy (and similar).

And you are clearly not reading what I'm writing. You said that you can NEVER get more speedup that 4x on 4 core cpu (even when HT is enabled). I showed program that DOES 6.25x speedup on 4x core CPU. That obviously disproves what you are saying. You can run it on your machine and see for yourself. I said that there are programs where you CAN get >Nx speedup on N core CPU if HT is enabled. and that is what I showed. I'm not claiming that you can ALWAYS get >Nx speedup. Just that there are class of programs that can do that.

So if you please, run my code and explain 6.25x (or any other number you get on your CPU) before continuing this discussion.

So if you please. Can you verify and hopefully dispute my results?

How can I do that? What you posted is not fully working code (like I did). And it uses some obscure asm syntax. How do I know how to compile it?? When proving something with code you need to post short, self contained complete code. Not just a fragment of something.

So putting the talk about memory bound workload aside - you are using float multiply operation in all threads. If your CPU model has only 1 execution unit for float multiply operation then hyper-threading won't help much (it still could, but not as much). As far as I can tell Nehalem has only one port that does float multiply (port 0): https://upload.wikimedia.org/wiki.../800px-Intel_Nehalem_arch.svg.png

Edited by Mārtiņš Možeiko on June 2, 2015, 9:54pm

Livet Ersomen Strøm

#3883

June 7, 2015

owaenerberg
I don't think there's anything that surprising about the results - you're just bound by available memory bandwidth. 10 000 000 xmm vectors, at 16 bytes each, is ~152 Mb, that's not gonna fit even in L3 cache (and besides, you don't mention anything about warming caches beforehand - but with the size of the dataset, it's a moot point). And when you use multiple threads, you multiply the data size by their amount, so no wonder it's going to take longer, since all the cores share the same bandwidth from memory to the cpu. And besides, when you actually do more work, the SSE code is demonstrably faster - isn't that just what we would like to see?

That is not how a cache works.
All code that works on registers or performs a computation is
always doing it on data in the cache. This code don't benefit from
working with less data, because its only doing a few computations,
and never sees that data again. So if data is cut by 10, the time
moves a decimal to the left, and so on.

If I had repeatedly done many operations on the same data you'd be correct. But in that case both versions should benefit equally.

Sse would benefit from performing more work. Like we did with DrawRectangleQuickly. But not a straight 4x, more like 2x or a little more then 2x perhaps.(in a 4x core cpu)

But the reason I posted it was I was expecting an additional 4x speedup from using 4 physical cores. I dont think I can expect that, anymore.

If you do something simple like just writing a 1/4 of a bitmap (on each core) without any computations, the 4x speedup is easy to reach. But this I suspect, requires each core only need to see 1/4 of the data.

Both sse and multithreads are still worth it of course. It's just not always clear how much. It depends on what you do.

But I think it clear that if you are streaming data, then 4x is possible (3.99) on 4 physical cores. But if doing both streaming, and a modest amount of floating point operations, the 4x becomes harder to reach. Then doing more computations, per data item will help.

If I am not very mistaken, the frametime for HMH went from 46m/s to 16-17 m/s, once multicores are enabled in day 126. Thats a 3,5x speedup. Not 8.0x. And certainly not 16x.

And this is my point.

Edited by Livet Ersomen Strøm on June 7, 2015, 7:03am

Mārtiņš Možeiko

#3884

June 7, 2015

If I am not very mistaken, the frametime for HMH went from 46m/s to 16-17 m/s, once multicores are enabled in day 126. Thats a 3,5x speedup. Not 8.0x. And certainly not 16x.

That's because GameUpdateAndRender is limited to 60Hz (by calling Sleep in win32_handmade.cpp). 60Hz = 1/60ms = 16.7ms. If Casey would disable Sleep call, you would see smaller number than 16ms.

Livet Ersomen Strøm

#3885

June 7, 2015

mmozeiko
If I am not very mistaken, the frametime for HMH went from 46m/s to 16-17 m/s, once multicores are enabled in day 126. Thats a 3,5x speedup. Not 8.0x. And certainly not 16x.

That's because GameUpdateAndRender is limited to 60Hz (by calling Sleep in win32_handmade.cpp). 60Hz = 1/60ms = 16.7ms. If Casey would disable Sleep call, you would see smaller number than 16ms.

All right. But would be nice if he spent more time to document the exact improvements.

Imo, it should be on the framebuffer always, details on times. max, min and avg. m/s, and % of frametime going to physics, sound and drawing, for instance. As we move forward. As you can discover sometimes things that way that otherwise would pass by unnoticed. And it has great interest to me as the viewer. Just a suggestion, btw, but I find it very important.

Anyway I calculated it wrong. (2,875)

btw. The reason I made this assumption is I saw a 17 m/s which made me think it would not reach 60fps

Edited by Livet Ersomen Strøm on June 7, 2015, 7:31am

Mārtiņš Možeiko

#3887

June 7, 2015

Yes, we'll see those numbers. Casey said he'll do more measurements once the time comes.
Then you'll see that you're wrong :) We will get 4x speedup easily. More than 4x I'm sure (as my small sample code proves).

Edited by Mārtiņš Možeiko on June 7, 2015, 7:41am

Ossi Waenerberg

#3888

June 7, 2015

@Kladdehelvete - I mentioned caches only to point out that there are multiple reasons why the data is not in any level of cache. It has to be fetched from main memory, which gives a reasonable explanation for your results (time taken scales almost linearly with multiple cores/multiplied data, do more work SSE versions take about the same time as their do less work counterparts since the work can more or less be done while waiting for more memory to arrive).

The thing is, you have to keep all the possible resources/bottlenecks in a processor in mind while thinking about performance. In your (artificial) example, the bottleneck looks to be memory bandwidth. In Casey's case, there already has been some talk on how utilize caches efficiently to minimize its impact (what should the tile size be, how to use hyperthreads).