Livet Ersomen Strøm
163 posts
Multicore tests
If you are interested, I'd like you to write code that does the following, on 1, 4 and 8 threads:

Have each thread allocate 1 megabyte of memory, and in a loop increment each 32-bit DWORD of that memory until all of them equal 0xFFFFFFFF.

Do this first with 1 thread, then with 4 threads, and then with 8 threads on a multicore processor.
(More specifically: do it with as many threads as you have cores, and then with as many threads as you have hardware threads.)
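Something like this is what I have in mind. A minimal sketch in C with the Windows API; the names are placeholders, and PASSES stands in for the full count up to 0xFFFFFFFF so the sketch stays short:

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

#define BUFFER_BYTES (1024 * 1024)                  // 1 megabyte per thread
#define BUFFER_DWORDS (BUFFER_BYTES / sizeof(DWORD))
#define PASSES 100000                               // stand-in for the full count to 0xFFFFFFFF

static DWORD WINAPI worker(LPVOID param)
{
    DWORD *buffer = (DWORD *)param;
    for (DWORD pass = 0; pass < PASSES; ++pass)
    {
        for (size_t i = 0; i < BUFFER_DWORDS; ++i)
        {
            buffer[i] += 1;                         // increment every 32-bit DWORD once per pass
        }
    }
    return 0;
}

static void run_test(int thread_count)
{
    HANDLE threads[64];
    DWORD *buffers[64];

    DWORD start = GetTickCount();
    for (int t = 0; t < thread_count; ++t)
    {
        buffers[t] = (DWORD *)calloc(BUFFER_DWORDS, sizeof(DWORD));  // each thread gets its own 1 MB
        threads[t] = CreateThread(NULL, 0, worker, buffers[t], 0, NULL);
    }
    WaitForMultipleObjects(thread_count, threads, TRUE, INFINITE);
    DWORD elapsed = GetTickCount() - start;

    printf("%d thread(s): %lu ms\n", thread_count, (unsigned long)elapsed);

    for (int t = 0; t < thread_count; ++t)
    {
        CloseHandle(threads[t]);
        free(buffers[t]);
    }
}

int main(void)
{
    run_test(1);
    run_test(4);
    run_test(8);
    return 0;
}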

On my i7 it takes about 22000 ms to complete the counting on 1 thread, and a little more on 4 threads (but roughly the same). On 8 threads it takes almost exactly twice the time, roughly 44000 ms.

The thing I find strange is that when this is happening on 4 cores, Windows shows almost exactly 50% CPU usage. But when doing it on 8 threads it shows exactly 100% usage.

What I would expect is that the CPU would be at 100% in both of those cases, 4 and 8 threads, and that the time would be roughly the same. While it is true that the 8 threads do twice the work, they also use twice the CPU, and thus they should be able to finish sooner than they do. Or the other way around: the 4 threads that use just 50% of the CPU should take longer.

I don't understand these results. I'd be happy if anyone could fill me in.

ps. Thanks for an excellent and highly interesting show with Handmade Hero. I follow every one, with deep interest.
Mārtiņš Možeiko
2559 posts / 2 projects
Multicore tests
Edited by Mārtiņš Možeiko on
Most likely you have a quad-core CPU with Hyper-Threading enabled in the BIOS. That's why 4 threads use only 50% of all logical CPUs (of which there are 8). Threads with this kind of computational work don't benefit from Hyper-Threading; they are competing for the same resources, so doing twice as much work will take 2x the time. By using an advanced profiler (like Intel VTune or AMD CodeXL) you should be able to see exactly that.
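You can also confirm the split yourself. A small sketch (not from your code) that counts physical cores versus logical processors on Windows:

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    DWORD length = 0;
    GetLogicalProcessorInformation(NULL, &length);   // first call just reports the required size

    SYSTEM_LOGICAL_PROCESSOR_INFORMATION *info =
        (SYSTEM_LOGICAL_PROCESSOR_INFORMATION *)malloc(length);
    GetLogicalProcessorInformation(info, &length);

    DWORD count = length / sizeof(*info);
    int physical = 0, logical = 0;
    for (DWORD i = 0; i < count; ++i)
    {
        if (info[i].Relationship == RelationProcessorCore)
        {
            physical += 1;
            ULONG_PTR mask = info[i].ProcessorMask;
            while (mask) { logical += (int)(mask & 1); mask >>= 1; }  // one bit per logical processor on this core
        }
    }
    printf("physical cores: %d, logical processors: %d\n", physical, logical);
    free(info);
    return 0;
}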
Livet Ersomen Strøm
163 posts
Multicore tests
Thanks.

But are you saying that the 50% reported is thus wrong? (That they really utilize 100%?) Or are you saying something else?

These are default BIOS settings, so I would expect that to be the case on most client computers as well, and that this is therefore the usual case.

If the 8 threads can't do more work than 4 threads, why ever use them in a game? The only effect would be to show the user that more work is being done? I guess I understand what you are saying, just not what its implications are.

About competing for resources: there's an interesting series from Carnegie Mellon on computer architecture that goes into those issues, but I haven't completely grasped all of the implications yet. There is an issue discovered by MS Research, a paper that states that one core can starve all the other cores by getting cache priority. However, I am unsure that should be the case here, since the parallelization actually seems almost suspiciously symmetric. And 8 megabytes should fit amply in the cache as well. Also, the threads do not "compete" for this memory, since they each have their own megabyte; they only compete for CPU resources. So I guess I am still left with a question mark.

What kinds of work would benefit from this setup then?

Regards.

https://www.youtube.com/user/cmu18447/videos
Mārtiņš Možeiko
2559 posts / 2 projects
Multicore tests
Edited by Mārtiņš Možeiko on
The number is correct. Your code uses 50% of logical CPU cores.

Using 8 cores still makes sense because no game uses all 8 cores to the max. There will always be some threads doing some light work, or stalling on branch mispredictions, or executing operations in long pipelines, or something similar where Hyper-Threading helps. Or threads will simply be sleeping because some driver decided to wait on disk I/O or GPU synchronization.

8MB is only L3 cache. L2 cache is smaller than that (something around 256KB). And L1 cache is even smaller - typically 32KB.

And then there is cache associativity. This means that real memory cannot simply be placed in any location of the cache; it can be mapped only to one or a few (2 or 4) cache-line locations. That means relative alignment matters for those 1MB buffers!
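For example, something like this (a hypothetical sketch; the function name and offset constant are just for illustration): over-allocate and offset each thread's 1MB buffer so the buffers don't all start at the same alignment relative to the cache.

#include <windows.h>
#include <stdio.h>

#define BUFFER_BYTES (1024 * 1024)
#define STAGGER_BYTES 4096   // illustrative offset, a multiple of the 64-byte cache line

static DWORD *allocate_staggered(int thread_index, int thread_count)
{
    // Reserve room for the stagger of every thread, then shift this thread's start.
    SIZE_T total = BUFFER_BYTES + (SIZE_T)thread_count * STAGGER_BYTES;
    char *raw = (char *)VirtualAlloc(NULL, total, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    return (DWORD *)(raw + (SIZE_T)thread_index * STAGGER_BYTES);
}

int main(void)
{
    for (int t = 0; t < 4; ++t)
    {
        printf("thread %d buffer starts at %p\n", t, (void *)allocate_staggered(t, 4));
    }
    return 0;
}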

Did you mean thread "priority inversion"? That is a thing, but it doesn't apply to this case. It matters only for thread scheduling where threads run with different priorities and are accessing the same resource (handle/mutex/etc.). As long as you haven't used SetThreadPriority (or the similar POSIX call), it doesn't apply here.
Livet Ersomen Strøm
163 posts
Multicore tests
Edited by Livet Ersomen Strøm on
mmozeiko
The number is correct. Your code uses 50% of logical CPU cores.


Yes, that is correct. However, if I try to use 100%, nothing is gained. And that's why I posted this.


mmozeiko

Using 8 cores still makes sense because no game uses all 8 cores to the max.


Yes, I suspect this is true. But... what game reuses less than 16 megabytes, again and again, within a timeframe of 22 seconds? I would suspect a game would use hundreds of megabytes per second, or even per frame? In other words, what's to gain from spreading work over more cores if the problem is cache utilization? A game will still use much more memory per unit of time, and all of that must come from main memory? The 8 megabytes are all in the L3 cache. I guess I am missing something here?

mmozeiko

There will always be some threads doing some light work, or stalling on branch mispredictions, or executing operations in long pipelines, or something similar where Hyper-Threading helps. Or threads will simply be sleeping because some driver decided to wait on disk I/O or GPU synchronization.


This I have already had problems with. But the effect I saw was that during I/O all other threads were simply almost dead. I don't know the reason, because the I/O happened inside a library (gdiplus). And much like you say, setting relative priorities didn't help at all; in fact, it worsened the problem. I guess if I could write those library functions myself, I could make sure that no I/O read would be allowed to read more than a small amount of memory at a time, so it would not stall other threads for too long. But I don't know. And I hate not knowing^^

mmozeiko

8MB is only L3 cache. L2 cache is smaller than that (something around 256KB). And L1 cache is even smaller - typically 32KB.


Yes, OK. I believe the L3 caches are actually 16 megs or something. However, each core has its own L1 cache? L2? I need to find out... But why do I see super-nice scaling to 4 cores, and then completely failed scaling to 8 logical cores? I mean, what I see is that 8 cores actually add nothing but heat. And don't you find that at least a little peculiar? 4 cores do exactly the same work per unit of time as all 8 logical cores.

mmozeiko

And then there is cache associativity. This means that real memory cannot simply be placed in any location of the cache; it can be mapped only to one or a few (2 or 4) cache-line locations. That means relative alignment matters for those 1MB buffers!


The L1 cache is integrated into each core. Also, the L2 cache is often private to each core today. Most likely the i7 uses a cache strategy called "full associativity", which, unless I am mistaken, means that it can cache any address without collision. (EDIT: Nope, this was wrong.)

mmozeiko

Did you mean thread "priority inversion"? That is a thing, but it doesn't apply to this case. It matters only for thread scheduling where threads run with different priorities and are accessing the same resource (handle/mutex/etc.). As long as you haven't used SetThreadPriority (or the similar POSIX call), it doesn't apply here.


I am talking about "Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems". A very serious issue; I was frankly shocked to read it. Is this the same as what you are referring to?

The paper is here:

http://users.ece.cmu.edu/~omutlu/pub/mph_usenix_security07.pdf
Mārtiņš Možeiko
2559 posts / 2 projects
Multicore tests
Edited by Mārtiņš Možeiko on
But the effect I saw was that during I/O all other threads were simply almost dead.
That could mean your other threads were waiting on some resource (mutex, critical section, etc.) that was locked by the I/O thread. Or some other issue, who knows what code you wrote... Without seeing actual code it's just a lot of guessing and "what if". Work must be correctly divided between threads to benefit from multithreading.
Livet Ersomen Strøm
163 posts
Multicore tests
mmozeiko

That could mean your other threads were waiting on some resource (mutex, critical section, etc.) that was locked by the I/O thread.


I forgot to mention that the original implementation worked well on the i7, under Win7 (IF I set priorities other than the "intuitive" ones; I ended up using Above Normal for both threads, and then it worked like butter. I seem to recall it also worked well if I did not set any priorities).

However, when running on XP the entire thing broke down, whatever I did; it slowed to a crawl. That's why I even mention it. On the i7, threading usually works very well. Surprisingly well, I would say. And I have long since changed the code so it works in both cases. I was just recalling it when reading your previous reply.

mmozeiko

Or some other issue, who knows what code you wrote... Without seeing actual code it's just a lot of guessing and "what if".

Yes, true. But the problem is that if I let you see the code, you would probably ask to be paid to read it, and then you would ask for a raise. Because it's assembly code, and it's pretty complex. The task it is doing is relatively simple to explain, but the code is optimized on many levels, including dynamic queuing and requeuing of graphical elements. It can abort itself at once and reprioritize which graphical elements are drawn, based on user input. And it is doing this inside a component that is a tree-based structure, which also has a tree-based graphical structure: it has one tree for its data, and another for displaying the graphics. In short. You just have to take my word for it.

And this is a problem. Implementing simple threads is very easy to get right, but talking about complex things is close to impossible. Just like I think you are saying.

mmozeiko

Work must be correctly divided between threads to benefit from multithreading.


This last statement left me a little confused. On the one hand, you say that my use case from above (the test code) is typically code that does not scale well to parallelization. Yet the parallelization seems damn near perfectly balanced, to my eyes, which is also what I would expect.

And it also resembles the task I wrote it to test: writing to a software backbuffer in parallel. It seems I can get a 4x speedup of updating the backbuffer this way, which right now would give my code a significant, much-needed boost. Roughly what I do is sketched below.
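A simplified sketch of that backbuffer split (the names and the pixel-fill work are placeholders, not my real code): the backbuffer is divided into horizontal bands, one per thread.

#include <windows.h>

typedef struct
{
    DWORD *pixels;     // whole backbuffer, 32-bit pixels
    int width;
    int row_start;     // first row this thread fills
    int row_end;       // one past the last row
} BandWork;

static DWORD WINAPI fill_band(LPVOID param)
{
    BandWork *work = (BandWork *)param;
    for (int y = work->row_start; y < work->row_end; ++y)
    {
        DWORD *row = work->pixels + y * work->width;
        for (int x = 0; x < work->width; ++x)
        {
            row[x] = 0xFF000000;   // placeholder fill; the real drawing work goes here
        }
    }
    return 0;
}

// Caller side: divide the rows evenly, start 4 threads, wait for all of them.
static void fill_backbuffer(DWORD *pixels, int width, int height)
{
    enum { THREADS = 4 };
    BandWork work[THREADS];
    HANDLE threads[THREADS];
    int rows_per_thread = height / THREADS;

    for (int t = 0; t < THREADS; ++t)
    {
        work[t].pixels = pixels;
        work[t].width = width;
        work[t].row_start = t * rows_per_thread;
        work[t].row_end = (t == THREADS - 1) ? height : (t + 1) * rows_per_thread;
        threads[t] = CreateThread(NULL, 0, fill_band, &work[t], 0, NULL);
    }
    WaitForMultipleObjects(THREADS, threads, TRUE, INFINITE);
    for (int t = 0; t < THREADS; ++t)
    {
        CloseHandle(threads[t]);
    }
}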

(I forgot to mention that this test code also updates a progress bar for each thread each time it has finished the loop. So it adds 1 to each integer in a loop, then updates the progress. This will explain it if the code is slower than you expect, but I don't think it really matters, since the threads still scale almost perfectly on 4 cores.)

I'd like to ask a few questions on the use of Events, if you care to consider them.

Yesterday I found I could, with great ease, port my physics code to 4 cores. This made a significant difference, and I haven't even begun the work of optimizing how it scales. I just divided the tasks over 4 cores, and then I wait for the cores to finish.

When they have finished their work, they SET an event, and the main thread waits for those 4 events to be set before it continues. This initially seems to work perfectly. And it made the physics so fast that I could now easily double the amount of work done. Therefore my drawing code went from using around 20-30% of the frame time to using about 50% of the frame time. But it not only pushed the FPS to a steady 30 Hz; it also reached 60, with 4000 units on the screen.

This made my head boil with excitement when I saw it. So I immediately went ahead and did the same for the sorting of the entities, which is done just before drawing them.

But now I suddenly got strange behaviour. It actually seemed like the cores (threads) were continuing to do physics even though they should be waiting. (I use another Event in each thread on which the thread waits until its work queue has been refilled.)

My question for you is whether Events are the right way to do this kind of thing?

Is it, for instance, possible for a thread to first CLEAR the "working" flag and then SET the event, and then for the main thread, waiting for the event to become signaled (set), to still read the "working" flag as set? (While the thread is now stuck waiting for the QUUEFILLED event?)

Because that is what appears to be happening. But it happens very, very seldom. For several seconds everything works well, and then suddenly the main thread continues after WaitForSingleObject and still finds the "working" flag set.
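Roughly, the pattern looks like this (a simplified sketch with made-up names, not my actual code):

#include <windows.h>

static HANDLE queue_filled_event;   // main thread signals: new work is ready
static HANDLE work_done_event;      // worker signals: finished this batch
static volatile LONG working;       // the "working" flag in question

static DWORD WINAPI worker(LPVOID param)
{
    (void)param;
    for (;;)
    {
        WaitForSingleObject(queue_filled_event, INFINITE);  // wait for the refilled queue
        // ... do the physics / sorting work here ...
        InterlockedExchange(&working, 0);                   // clear the flag first
        SetEvent(work_done_event);                          // then signal completion
    }
}

static void setup(void)
{
    // bManualReset = FALSE makes these auto-reset events.
    queue_filled_event = CreateEvent(NULL, FALSE, FALSE, NULL);
    work_done_event = CreateEvent(NULL, FALSE, FALSE, NULL);
    CreateThread(NULL, 0, worker, NULL, 0, NULL);
}

// Main-thread side of one batch:
static void run_one_batch(void)
{
    InterlockedExchange(&working, 1);
    SetEvent(queue_filled_event);                   // hand the work to the worker
    WaitForSingleObject(work_done_event, INFINITE);
    // Question: can "working" still read as 1 here?
}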
Mārtiņš Možeiko
2559 posts / 2 projects
Multicore tests
Edited by Mārtiņš Možeiko on
In short. You just have to take my word for it.
If I had a penny for every time people said this about their "optimized" code...

You should really write code in C (or whatever higher-level language you use) before optimizing stuff in assembly. It is much easier to tweak, profile, or diagnose multithreading problems in a higher-level language. Only when you are satisfied with the multithreading approach/algorithm you chose should you profile and optimize the bottlenecks with assembly code, if that is your choice.

If you want to discuss the problems you are having in more detail, then please create a small, standalone piece of code that reproduces the behavior or problem, and we can discuss how to improve performance or fix the problem. Otherwise it is just more guessing...

This last statement left me a little confused. On the one hand, you say that my use case from above (the test code) is typically code that does not scale well to parallelization.
No, I said that work doesn't scale for hyper-threading. If you set thread count = physical core count, it will scale perfectly.

Events on Windows can be auto-reset or manual-reset. Auto-reset means they automatically become unsignaled after a WaitFor..Object call succeeds. With manual-reset, you must put them back in the unsignaled state yourself with ResetEvent after the WaitFor..Object call (otherwise they will still be signaled and the next WaitFor..Object call will essentially become a no-op).
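For example (a minimal sketch, not from your code):

#include <windows.h>

int main(void)
{
    // Auto-reset: the event returns to non-signaled automatically after one
    // successful wait, so each SetEvent releases exactly one waiter.
    HANDLE auto_event = CreateEvent(NULL, FALSE, FALSE, NULL);
    SetEvent(auto_event);
    WaitForSingleObject(auto_event, INFINITE);   // consumes the signal
    // A second wait here would block: the event reset itself.

    // Manual-reset: the event stays signaled until ResetEvent is called, so a
    // forgotten ResetEvent makes every later wait return immediately.
    HANDLE manual_event = CreateEvent(NULL, TRUE, FALSE, NULL);
    SetEvent(manual_event);
    WaitForSingleObject(manual_event, INFINITE); // returns; the event is STILL signaled
    WaitForSingleObject(manual_event, INFINITE); // also returns immediately
    ResetEvent(manual_event);                    // now it is non-signaled again

    CloseHandle(auto_event);
    CloseHandle(manual_event);
    return 0;
}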

Long story short, don't take this the wrong way, but you should really focus on one specific topic in this forum. Instead you start telling long and irrelevant (to Handmade Hero or the topic) stories, which are not useful to anybody. Seriously, get a blog, Google+ page, or Twitter. People will respond to that much better.
Livet Ersomen Strøm
163 posts
Multicore tests
mmozeiko

Long story short, don't take this the wrong way, but you should really focus on one specific topic in this forum. Instead you start telling long and irrelevant (to Handmade Hero or the topic) stories, which are not useful to anybody. Seriously, get a blog, Google+ page, or Twitter. People will respond to that much better.


Yes, you're right. This is about Handmade Hero. However, I thought it could be interesting to debate these things, because everything I do now is because of Handmade Hero. I try to use the things I learn there in my own code.

No, I will never take it the wrong way. I love and appreciate all of the things I have learned from the Handmade series, and I think it is an extremely important and well-played-out idea. All of it. I am glad you are frank. I was kind of thinking the same thing actually, that I may have gone too far off topic. It's just that I guess I wanted, or hoped, this could be a vibrant community for all kinds of game/code-related questions.

But I will not post more on these issues then. Sorry for wasting your time.