mmozeiko
Kladdehelvete, did you read all the text I posted or just the first sentence?
MMozeiko:
Yes, I definitely read everything you post. But we have had this discussion before, and as I recall I already agreed then that if a core is underutilized, then of course running more work on it will get more work done.
But that is not the issue I am bringing up here.
If hyperthreading brought more power to a CPU, Intel could basically restart
Moore's law by just subdividing each core into twice as many new hyperthreads every 18 months! Wouldn't that be just marvellous?
Now, I do not only read every post you make in this forum; I also write programs
to actually test various claims being made. And right now I come to you
with depressing results.
Right now I am close to questioning whether my 4-core Nehalem i7 CPU
is even working correctly.
The result I am staring at freaks me out. I am like: "no one is going to believe me".
I post 4 short, slightly different asm routines (to keep things simple).
Each of them has been duplicated into 4 "unique" versions, where each version uses a separate memory area containing 10 million 16-byte vectors.
So even though I post 1 version of each, there are 4 of each in the test code.
There are 4 versions of [DoMulPsWork1], called
DoMulPsWork1 - DoMulPsWork2 - DoMulPsWork3 - DoMulPsWork4
The same applies to the other routines, but I post only 1 version of each here, as they are identical except for working on a different data area.
ABOUT THE DATA.
Each memory area contains the exact same data, but they are SEPARATE areas.
The data has been generated randomly using whole numbers between 0 and 3000,
which were then converted to floats and stored into each vector component.
That memory has then been copied to 8 (7) separate areas.
(I chose this range to make sure the muls don't overflow.
Slightly higher ranges yield the same result, as long as no INF occurs.
If INF happens, the routine will seem to complete *much* faster,
but the timings will then be very wrong.)
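If it helps, here is roughly how that data setup looks as a small C sketch (MakeTestData, areas and the constants are just illustrative names, and it assumes C11 aligned_alloc; the real test code sets this up in asm):

    #include <stdlib.h>
    #include <string.h>

    #define MAXVECTORS 10000000            /* 10 million 16-byte vectors  */
    #define AREA_BYTES (MAXVECTORS * 16)   /* 160 MB per area             */
    #define NUM_AREAS  8                   /* the "8 (7)" separate areas  */

    static float *areas[NUM_AREAS];

    /* Fill the first area with random whole numbers 0..3000 stored as floats,
       then copy it to the other areas.  16-byte alignment because the SSE
       routines use MOVAPS. */
    static void MakeTestData(void)
    {
        areas[0] = aligned_alloc(16, AREA_BYTES);
        for (size_t i = 0; i < (size_t)MAXVECTORS * 4; i++)
            areas[0][i] = (float)(rand() % 3001);   /* 0..3000, so the muls never hit INF */

        for (int a = 1; a < NUM_AREAS; a++) {
            areas[a] = aligned_alloc(16, AREA_BYTES);
            memcpy(areas[a], areas[0], AREA_BYTES);
        }
    }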
DoMulPsWork1 - 4:
does a simple MULPS of 2 x 16-byte vectors and stores the result back into
the first vector in the data. It then skips past both vectors (ECX+32) and does it again until the end of the array (5 million times in a loop).
DoFPUMulWork1 - 4:
does the exact same thing as DoMulPsWork, except it uses the FPU. It therefore has to do 4 separate loads, muls and stores to achieve the same thing.
The two other routines below do the same, except they do more work.
This was to give the SSE code and the FPU more to do, so that the SSE code
would have a stronger advantage... it IS faster, after all. So in addition to the MULPS, they also do an ADDPS and an ANDPS.
But the results are still depressing. Not only is the FPU code sometimes almost as fast as the SSE code, especially on cached data and when calling "Refreshdata" afterwards or in between several runs (not shown),
but neither the FPU code NOR the SSE code gains any real advantage from running on multiple cores. The time taken with 2 cores is roughly twice that of one core, and so on. Not exactly, but it is still depressing.
So the time taken with 4 cores is almost exactly 4 times the time taken on one core. I am like:
"I don't believe this, I must be making some mistake and I just can't see it."
So if you please: can you verify and hopefully dispute my results? Or explain them?
First the 4 code samples. Then the results further down.
;MaxVectors = 10 million
;5 million x 4 muls
DoMulPsWork1:
mov D$NumSSE2Muls1 0
mov eax D$MulPsData1
mov ecx 0
While ecx < ((MaxVectors*16) / 2)
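; each iteration: multiply one 16-byte vector by the one that follows it, store the result back into the first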
movaps xmm0 O$eax+ecx
movaps xmm1 O$eax+ecx+16
mulps xmm0 xmm1
movaps O$eax+ecx xmm0
add ecx 32
add D$NumSSE2Muls1 4
End_While
;call Refreshdata
ret
DoFPUMulWork1:
mov D$NumFPUMuls1 0
mov eax D$FPUMulData1
mov ecx 0
While ecx < ((MaxVectors*16)/2)
fld F$eax+ecx
fmul F$eax+ecx+16
fstp F$eax+ecx
add ecx 4
fld F$eax+ecx
fmul F$eax+ecx+16
fstp F$eax+ecx
add ecx 4
fld F$eax+ecx
fmul F$eax+ecx+16
fstp F$eax+ecx
add ecx 4
fld F$eax+ecx
fmul F$eax+ecx+16
fstp F$eax+ecx
add ecx (4+16)
add D$NumFPUMuls1 4
End_While
;call Refreshdata
ret
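For anyone not used to RosAsm syntax, here is roughly what the two simple kernels above do, as plain C with SSE intrinsics (function names are only illustrative, this is a sketch and not the actual test code):

    #include <xmmintrin.h>   /* SSE intrinsics: _mm_load_ps, _mm_mul_ps, _mm_store_ps */
    #include <stddef.h>

    #define MAXVECTORS 10000000

    /* DoMulPsWork: one MULPS per pair of 16-byte vectors, result back into the first */
    static void MulPsWork(float *data)
    {
        for (size_t i = 0; i < (size_t)MAXVECTORS * 4 / 2; i += 8) {  /* While ecx < (MaxVectors*16)/2 */
            __m128 a = _mm_load_ps(data + i);        /* movaps xmm0 O$eax+ecx    */
            __m128 b = _mm_load_ps(data + i + 4);    /* movaps xmm1 O$eax+ecx+16 */
            _mm_store_ps(data + i, _mm_mul_ps(a, b));
        }
    }

    /* DoFPUMulWork: the same work done as 4 separate scalar loads, muls and stores */
    static void FpuMulWork(float *data)
    {
        for (size_t i = 0; i < (size_t)MAXVECTORS * 4 / 2; i += 8) {
            for (int c = 0; c < 4; c++)
                data[i + c] *= data[i + c + 4];      /* fld / fmul / fstp per component */
        }
    }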
Because I was flabbergasted by the timings of these on 1 CORE, 2 CORES, 3 CORES and 4 CORES, I added more code into the mix to give the SSE code more of an advantage, and created the following routines:
DoMOREMulPsWork1:
mov D$NumSSE2More1 0
mov eax D$MulPsData1
mov ecx 0
While ecx < ((MaxVectors*16) / 2)
movaps xmm0 O$eax+ecx
movaps xmm1 O$eax+ecx+16
mulps xmm0 xmm1
addps xmm0 xmm1
andps xmm0 xmm1
movaps O$eax+ecx xmm0
add ecx 32
add D$NumSSE2More1 4
End_While
;call Refreshdata
ret
DoMoreFPUMulWork1:
mov D$NumFPUMore1 0
mov eax D$FPUMulData1
mov ecx 0
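; the x87 has no ANDPS, so a scratch dword is reserved on the stack: the result
; is stored there with fstp, and the AND is done on its raw bits with integer code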
push 0
While ecx < ((MaxVectors*16)/2)
fld F$eax+ecx
fmul F$eax+ecx+16
fadd F$eax+ecx+16
fstp D$esp
mov edx D$esp
and edx F$eax+ecx+16
mov F$eax+ecx edx
add ecx 4
fld F$eax+ecx
fmul F$eax+ecx+16
fadd F$eax+ecx+16
fstp D$esp
mov edx D$esp
and edx F$eax+ecx+16
mov F$eax+ecx edx
add ecx 4
fld F$eax+ecx
fmul F$eax+ecx+16
fadd F$eax+ecx+16
fstp D$esp
mov edx D$esp
and edx F$eax+ecx+16
mov F$eax+ecx edx
add ecx 4
fld F$eax+ecx
fmul F$eax+ecx+16
fadd F$eax+ecx+16
fstp D$esp
mov edx D$esp
and edx F$eax+ecx+16
mov F$eax+ecx edx
add ecx (4+16)
add D$NumFPUMore1 4
End_While
add esp 4
;call Refreshdata
ret
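Again in rough C form, this is what the "MORE" kernels above boil down to (illustrative names only, a sketch rather than the real code); note that the FPU variant has to round-trip through an integer register for the AND, since the x87 has no bitwise instructions:

    #include <xmmintrin.h>
    #include <string.h>
    #include <stddef.h>

    #define MAXVECTORS 10000000

    /* DoMOREMulPsWork: MULPS + ADDPS + ANDPS per vector pair */
    static void MoreMulPsWork(float *data)
    {
        for (size_t i = 0; i < (size_t)MAXVECTORS * 4 / 2; i += 8) {
            __m128 a = _mm_load_ps(data + i);
            __m128 b = _mm_load_ps(data + i + 4);
            a = _mm_mul_ps(a, b);                    /* mulps xmm0 xmm1 */
            a = _mm_add_ps(a, b);                    /* addps xmm0 xmm1 */
            a = _mm_and_ps(a, b);                    /* andps xmm0 xmm1 */
            _mm_store_ps(data + i, a);
        }
    }

    /* DoMoreFPUMulWork: same thing per component, with the AND done on the raw
       float bits, mirroring the fstp-to-stack / integer-AND trick above */
    static void MoreFpuMulWork(float *data)
    {
        for (size_t i = 0; i < (size_t)MAXVECTORS * 4 / 2; i += 8) {
            for (int c = 0; c < 4; c++) {
                float r = data[i + c] * data[i + c + 4] + data[i + c + 4];
                unsigned ri, bi;
                memcpy(&ri, &r, 4);                  /* fstp D$esp / mov edx D$esp */
                memcpy(&bi, &data[i + c + 4], 4);
                ri &= bi;                            /* and edx F$eax+ecx+16       */
                memcpy(&data[i + c], &ri, 4);        /* mov F$eax+ecx edx          */
            }
        }
    }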
SSE SIMPLE
RESULTS
CODE DoMulPsWork1
All threads took 13.41 milliseconds.
Thread1: 13.41
Thread2: 0
Thread3: 0
Thread4: 0
DoMulPsWork1+2
All threads took 18.31 milliseconds.
Thread1: 18.23
Thread2: 18.31
Thread3: 0
Thread4: 0
DoMulPsWork1->3
All threads took 29.95 milliseconds.
Thread1: 29.63
Thread2: 29.79
Thread3: 29.95
Thread4: 0
DoMulPsWork1->4
All threads took 37.16 milliseconds.
Thread1: 31.30
Thread2: 36.38
Thread3: 36.84
Thread4: 37.16
**************
FPU SIMPLE
RESULTS
CODE DoFPUMulWork1 alone
All threads took 18.92 milliseconds.
Thread1: 18.92
Thread2: 0
Thread3: 0
Thread4: 0
CODE DoFPUMulWork1 + 2
All threads took 34.03 milliseconds.
Thread1: 33.83
Thread2: 34.03
Thread3: 0
Thread4: 0
CODE DoFPUMulWork1 + 2 + 3
All threads took 39.52 milliseconds.
Thread1: 39.37
Thread2: 39.45
Thread3: 39.52
Thread4: 0
CODE DoFPUMulWork1 + 2 + 3 + 4
All threads took 53.70 milliseconds.
Thread1: 48.62
Thread2: 49.01
Thread3: 53.52
Thread4: 53.70
--------
SSE2 MORE
RESULTS
CODE DoMOREMulPsWork1
All threads took 14.10 milliseconds.
Thread1: 14.10
Thread2: 0
Thread3: 0
Thread4: 0
DoMOREMulPsWork1-2
All threads took 23.06 milliseconds.
Thread1: 22.66
Thread2: 23.06
Thread3: 0
Thread4: 0
DoMOREMulPsWork1-3
All threads took 32.05 milliseconds.
Thread1: 29.80
Thread2: 30.55
Thread3: 32.05
Thread4: 0
DoMOREMulPsWork1-4
All threads took 40.44 milliseconds.
Thread1: 38.44
Thread2: 38.84
Thread3: 39.29
Thread4: 40.44
FPU MORE
RESULTS
CODE DoMoreFPUMulWork1
All threads took 32.83 milliseconds.
Thread1: 32.83
Thread2: 0
Thread3: 0
Thread4: 0
DoMoreFPUMulWork1-2
All threads took 52.24 milliseconds.
Thread1: 51.87
Thread2: 52.24
Thread3: 0
Thread4: 0
DoMoreFPUMulWork1-3
All threads took 60.15 milliseconds.
Thread1: 59.64
Thread2: 59.76
Thread3: 60.15
Thread4: 0
DoMoreFPUMulWork1-4
All threads took 98.43 milliseconds.
Thread1: 80.55
Thread2: 94.56
Thread3: 98.17
Thread4: 98.43
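Finally, since each data area is 160 MB (far bigger than any cache on this CPU), here is a small back-of-the-envelope sketch, using nothing but the numbers already posted above, of how many bytes one pass of DoMulPsWork moves and what rate the posted timings would imply:

    #include <stdio.h>

    /* Pure arithmetic from the figures above: the loop walks the first half of a
       160 MB area in 32-byte steps, loading two 16-byte vectors and storing one. */
    int main(void)
    {
        const double iters    = 80e6 / 32.0;              /* loop iterations per pass */
        const double per_pass = iters * (32.0 + 16.0);    /* bytes read + written     */

        printf("bytes moved per pass, per area: %.0f MB\n", per_pass / 1e6);             /* 120 MB */
        printf("1 thread  @ 13.41 ms: %.1f GB/s\n", per_pass / 0.01341 / 1e9);           /* ~8.9   */
        printf("4 threads @ 37.16 ms: %.1f GB/s\n", 4.0 * per_pass / 0.03716 / 1e9);     /* ~12.9  */
        return 0;
    }

Whether that is what is eating the scaling here is exactly the kind of explanation I am hoping for.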