Possible bug with thread queue (Episode 126)

Apologies if I'm mistaken, but I just watched episode 126 and spotted what might be a rare bug. Hopefully I'm right and this saves the pain of one of those nasty threading bugs that could cause a lot of grief later.

In win32_handmade.cpp, ThreadProc does the WaitForSingleObjectEx and then calls Win32DoNextWorkQueueEntry, which can fail to operate on a job if its InterlockedCompareExchange fails.

So my thought is that the following could happen:

1) 2 jobs added to queue
2) 2 threads pass WaitForSingleObjectEx and enter Win32DoNextWorkQueueEntry at the same time
3) 1 thread fails the InterlockedCompareExchange and doesn't do any job
4) Now the semaphore is at 0, so WaitForSingleObjectEx won't let another thread through, but there's still 1 job in the queue

So the solution would be that when InterlockedCompareExchange fails, the thread does ReleaseSemaphore(Queue->SemaphoreHandle, 1, 0); so that another thread can pick up the job that was missed.
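Roughly what I have in mind, as a sketch (I'm paraphrasing the stream code from memory, so these types and names are mine and won't match the source exactly):

[code]
#include <windows.h>

// Hypothetical paraphrase of the queue, not the actual Handmade Hero source.
struct work_queue_entry
{
    void (*Callback)(void *Data);
    void *Data;
};

struct work_queue
{
    HANDLE SemaphoreHandle;
    LONG volatile NextEntryToDo;
    LONG volatile EntryCount;
    work_queue_entry Entries[256];
};

static void DoNextWorkQueueEntry(work_queue *Queue)
{
    LONG OriginalNextEntryToDo = Queue->NextEntryToDo;
    if(OriginalNextEntryToDo < Queue->EntryCount)
    {
        LONG Index = InterlockedCompareExchange(&Queue->NextEntryToDo,
                                                OriginalNextEntryToDo + 1,
                                                OriginalNextEntryToDo);
        if(Index == OriginalNextEntryToDo)
        {
            work_queue_entry *Entry = Queue->Entries + Index;
            Entry->Callback(Entry->Data);
        }
        else
        {
            // The fix I'm proposing: we lost the race for this entry, so put
            // the count back so another thread is woken for the missed job.
            ReleaseSemaphore(Queue->SemaphoreHandle, 1, 0);
        }
    }
}

static DWORD WINAPI ThreadProc(LPVOID Param)
{
    work_queue *Queue = (work_queue *)Param;
    for(;;)
    {
        // How I was picturing the loop: wait on the semaphore every iteration.
        WaitForSingleObjectEx(Queue->SemaphoreHandle, INFINITE, FALSE);
        DoNextWorkQueueEntry(Queue);
    }
}
[/code]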
Why do you think the semaphore is at 0 at step 4? The semaphore decreases only when WaitForSingleObjectEx is called. So after step 3, when one thread skips the job because the other thread fetched it first, and assuming there are two threads in total, the semaphore value is still 2 (if the other thread is not "sleeping") or 1 (if the other thread went to "sleep"). If fetching the job failed, nobody calls WaitForSingleObjectEx, because WeShouldSleep == false. In any case, the thread that skipped the job will try again to fetch a new job, and will either succeed, fail again, or go to sleep.

Edited by Mārtiņš Možeiko on
Ok, well I grabbed the source code, and looking through it properly I can see there's no bug. I don't think you're entirely correct about the semaphore counts at those steps, because at step 2 the semaphore value is 0 (WaitForSingleObjectEx has passed twice). But I see now that WeShouldSleep is not set to true for the thread where (Index == OriginalNextEntryToRead) is false, so it will indeed loop around again without waiting and do the work on the other entry.
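For anyone following along, the structure is roughly this (simplified and from memory, reusing the sketch types from my first post; the real source differs in the details):

[code]
// Simplified paraphrase of the actual structure, not the exact source.
static bool DoNextWorkQueueEntry(work_queue *Queue)
{
    bool WeShouldSleep = false;

    LONG OriginalNextEntryToDo = Queue->NextEntryToDo;
    if(OriginalNextEntryToDo < Queue->EntryCount)
    {
        LONG Index = InterlockedCompareExchange(&Queue->NextEntryToDo,
                                                OriginalNextEntryToDo + 1,
                                                OriginalNextEntryToDo);
        if(Index == OriginalNextEntryToDo)
        {
            work_queue_entry *Entry = Queue->Entries + Index;
            Entry->Callback(Entry->Data);
        }
        // If the compare-exchange failed, we fall through with WeShouldSleep
        // still false, so the caller loops straight back around.
    }
    else
    {
        WeShouldSleep = true;
    }

    return(WeShouldSleep);
}

static DWORD WINAPI ThreadProc(LPVOID Param)
{
    work_queue *Queue = (work_queue *)Param;
    for(;;)
    {
        if(DoNextWorkQueueEntry(Queue))
        {
            // Only wait when the queue looked empty, not on every iteration.
            WaitForSingleObjectEx(Queue->SemaphoreHandle, INFINITE, FALSE);
        }
    }
}
[/code]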

I wasn't really thinking about the WeShouldSleep part and was expecting the thread to be waiting on the semaphore every iteration of the loop. Glad I was wrong anyway. It does seem a little odd that they don't perform the wait every iteration, because it means that after performing an entry a thread will always attempt to do work the next time around, even though there may be nothing to do or another thread may have done all the remaining work. Maybe doing that check again is faster than doing WaitForSingleObjectEx though?

I think having the semaphore count be the same as the number of entries left to do (which is how it would be the way I was describing it) is "cleaner"/more obvious in terms of how I'd think about it, and would lead to fewer unneeded checks for more work to do, but it's definitely not a bug the way it is.

Edited by Matthew Carr on
Oh, you're right. The value will be 0 if no other jobs are being added. I don't know what I was thinking when I wrote that :)

[quote=Matthew Carr]Maybe doing that check again is faster than doing WaitForSingleObjectEx though?
Checking if there is work involves very few instructions (the main one is InterlockedCompareExchange). WaitForSingleObject involves a potential context switch, depending on what the OS wants to do.

Edited by Mārtiņš Možeiko on
If I were to implement something like this, I'd probably go the route of doing the WaitForSingleObjectEx every loop iteration (and releasing on the fail condition as I said in my first post) so that the semaphore count matches the number of entries in the queue.

Since the threads will continue to operate on jobs while there are jobs in the queue, and won't wait until the queue is empty (which is when WeShouldSleep is set to true), the semaphore count will keep rising, and then once all the jobs are done the threads will loop around as many more times as the semaphore count, cumulatively. So say there were 8 worker threads and 100,000 jobs added to the queue, and the threads took longer to complete them than the jobs took to be added: the 8 worker threads might have been sleeping initially and hit WaitForSingleObjectEx once each, but then wouldn't again until there were no jobs left in the queue, so the semaphore count would still be 99,992 and they'd loop around that many more times between them.

I think in the Handmade Hero case the semaphore limit is set to the number of threads, so that's not going to be that bad at all and is probably fine as is. I'll have to test and look into whether there are any significant performance negatives to having a large semaphore limit (e.g. setting it to the size of the queue), because that could be another point against the way I'm thinking of doing it.

Edited by Matthew Carr on
Such an approach is fine. It will work. Just don't forget to increase the max count; a semaphore's value has a maximum it can take.
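For example (a sketch; QueueCapacity here is a placeholder for whatever your entry array size is):

[code]
// With one semaphore count per pending entry, the max count must be sized
// to the queue capacity rather than to the thread count.
HANDLE SemaphoreHandle = CreateSemaphoreEx(0,
                                           0,             // initial count
                                           QueueCapacity, // max count
                                           0, 0,
                                           SEMAPHORE_ALL_ACCESS);
[/code]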

But AFAIK, if you call WaitForSingleObject, the OS sees that as a good point to do a context switch to another process's thread if that thread hasn't been running recently. So this way you are giving the OS a chance to do more context switches to the thousand other processes running on your machine (Chrome, Skype, etc.). Without the WaitForSingleObject call, the OS will theoretically do a little less context switching. But who knows how much... it may actually be a very hard difference to measure.
The point of the queue here is not to maximize the amount of sleeping, but rather minimize the amount of time spent dealing with the queue, if that makes sense. This is a _performance oriented queue_, which means you do not want to waste time checking a semaphore for no reason. It's totally fine for someone to do an extra iteration of a loop (which takes a scant few cycles) rather than calling into the operating system every time and doing whatever Wait* does, which could be a significant amount of work (we don't even know).

Put another way, we want to make sure that while there is work in the queue, we spend _no time_ waiting for the operating system to do things. That is why the queue is currently set up the way it is. We don't care about the fact that once the queue is empty, threads may loop extra times with no work to count down the semaphore.

Does that make sense?

- Casey
Yep, it makes sense and I agree it's the better option. I was initially thinking the implementation was different than it is.

Do you think there might be any performance gain then from always releasing the full semaphore count? If you're typically adding multiple jobs at a time, then that might result in the threads responding more quickly to the 2nd, 3rd, etc. jobs that are added. If it takes longer for one of them to wake up and do the check than it takes to add the 2nd job, then I'd think there'd be a slight gain. If not, then it wouldn't be any slower, unless releasing more than 1 count at a time from a semaphore is slower than releasing 1. That might be more likely if/once other threads can queue jobs too.
If there's a performance gain to be had, it's more likely to be in having the job queue take n job additions at once, so that the OS doesn't have to be told as often. I.e., if a producer knows it will be adding 8 jobs, it can call AddMultipleJobs() and pass 8 and an array, so that the jobs get added to the queue and the OS only has to be told once to release 8 semaphore counts.
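Something like this, as a sketch (AddMultipleJobs is hypothetical, the types are the sketch ones from earlier in the thread, and it assumes only one thread ever adds work; no bounds checking):

[code]
// Hypothetical batch add: one call into the OS for the whole batch.
static void AddMultipleJobs(work_queue *Queue, work_queue_entry *Jobs, LONG Count)
{
    for(LONG JobIndex = 0; JobIndex < Count; ++JobIndex)
    {
        Queue->Entries[Queue->EntryCount + JobIndex] = Jobs[JobIndex];
    }

    _WriteBarrier();            // MSVC compiler barrier: entries before count
    Queue->EntryCount += Count; // publish the whole batch at once

    ReleaseSemaphore(Queue->SemaphoreHandle, Count, 0);
}
[/code]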

- Casey
Makes sense. I guess I'm more curious about there being some OS magic in between releasing semaphore counts and the threads passing the waits. Like if you release 8 counts, would there be some variable time for each thread passing the wait as determined by the OS?

That's where my thinking was in sort of pre-firing the other threads by releasing the semaphore n*ThreadCount early, so they'd start checking and looping in advance. You'd still call the release after adding the jobs to ensure everything gets operated on, but if the time between the pre-release and adding the jobs to the queue is less than however long the OS takes to pass the wait for the threads, then it could save that time.

It sounds pretty unlikely, but I only raise it because I don't know what goes on between the semaphore release and the wait passing on the OS side. I doubt it's worth worrying about and for all I know the extra release (the pre-release) call that would be required in this multiple job add version might be slower than what would be saved anyway.

Edited by Matthew Carr on
The point here is that the only possible way to actually have what you're talking about work is _if you add more things to the queue first_, hence my "push n" suggestion. If you _do not_, then the threads will likely wake up and see that there's no work, then go back to sleep, by the time you round-trip back to the application and have it call you back to add another job. Make sense? So if you're going to wake up more threads, you have to make sure they're going to actually find something.

Alternatively, you could have the threads spin-loop for some count before sleeping, like "check the queue 100 times" or something, but that sounds like a dicey idea to me for other reasons.
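Purely as a sketch of that, using the loop from earlier in the thread (the 100 is completely arbitrary, which is part of the problem):

[code]
// Dicey sketch: spin-check the queue some arbitrary number of times before
// actually going to sleep on the semaphore.
static DWORD WINAPI ThreadProc(LPVOID Param)
{
    work_queue *Queue = (work_queue *)Param;
    for(;;)
    {
        bool WeShouldSleep = true;
        for(int Attempt = 0; WeShouldSleep && (Attempt < 100); ++Attempt)
        {
            WeShouldSleep = DoNextWorkQueueEntry(Queue);
        }

        if(WeShouldSleep)
        {
            WaitForSingleObjectEx(Queue->SemaphoreHandle, INFINITE, FALSE);
        }
    }
}
[/code]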

- Casey
Right now in the doNextWorkQueueEntry function, we first check if nextEntry < entryCount and if this is not the case, "weShouldSleep" is set to true. But as far as I understand, that means that when multiple threads start the same entry simultaneously, "weShouldSleep" will be set to true, and the semaphore will be decremented by 1, even though the entry hasn't necessarily completed.

Shouldn't "weShouldSleep" be set to true, only if the result of interlockedCompareExchange == originalNextEntry?
[quote=elle]Right now in the doNextWorkQueueEntry function, we first check if nextEntry < entryCount and if this is not the case, "weShouldSleep" is set to true.

I don't have the code in front of me right now, but I believe WeShouldSleep is only set to true in the case that there are no more entries to perform in the queue. So if, say, 2 entries were added and 2 semaphore counts were released, then 2 threads would start going. If they both tried to operate on the same entry, then WeShouldSleep would still be false for the one that failed the InterlockedCompareExchange. This is correct, because it wouldn't sleep and would loop back around and do the other entry. I think that must be how it is, because initially I thought something similar to you, but upon checking the code it was returning false when it needed to.


[quote=cmuratori]Alternatively, you could have the threads spin-loop for some count before sleeping, like "check the queue 100 times" or something, but that sounds like a dicey idea to me for other reasons.

Agreed, but I guess if it was an optimisation you were doing for a certain project on a certain platform where the wait had been identified as an issue, you could have a queue variable "ExpectedEntryCount" that the main thread that's queueing up jobs sets before it starts the work of queueing them, and it could then release the threads. ExpectedEntryCount would be decremented as jobs were added, and threads would then only sleep if ExpectedEntryCount was 0. As long as it was properly set (and likely forced to 0 when that collection of jobs had finished being added), it would probably be a less dicey/ambiguous way of having the threads spin for a bit in anticipation while the main thread adds the jobs. It could also just be a bool like ImAddingJobsNowSoDontSleep that's set to true while the queueing of jobs is happening.
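Sketchy pseudocode of what I mean (all of these names are mine and hypothetical, and it assumes the sketch queue struct from earlier gains a LONG volatile ExpectedEntryCount member):

[code]
// Producer: call before starting to build a batch of BatchSize jobs.
static void BeginAddingJobs(work_queue *Queue, LONG BatchSize, LONG ThreadCount)
{
    InterlockedExchangeAdd(&Queue->ExpectedEntryCount, BatchSize);

    // Pre-fire the workers so they start checking before the jobs exist.
    ReleaseSemaphore(Queue->SemaphoreHandle, ThreadCount, 0);
}
// ...each AddEntry() then does InterlockedDecrement(&Queue->ExpectedEntryCount)...

// Worker: only sleep when the queue is empty AND no more jobs are promised.
static DWORD WINAPI ThreadProc(LPVOID Param)
{
    work_queue *Queue = (work_queue *)Param;
    for(;;)
    {
        if(DoNextWorkQueueEntry(Queue) && (Queue->ExpectedEntryCount == 0))
        {
            WaitForSingleObjectEx(Queue->SemaphoreHandle, INFINITE, FALSE);
        }
    }
}
[/code]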

Edited by Matthew Carr on
[quote=elle]But as far as I understand, that means that when multiple threads start the same entry simultaneously
Yes, but only one will get it; the others will try to do it and fail. That's why we use the InterlockedCompareExchange instruction. It is there to guarantee that only one thread gets job number N. If another thread also tries to get job number N, it will fail and then retry later with job N+1.
Yes, I understand, but if a thread fails to get an entry, it will also decrement the semaphore count, I think. So if a lot of threads fail, the semaphore might get to 0 before all the entries can be completed, or am I wrong?

Oh, never mind: the semaphore can only be decremented when nextEntry >= entryCount.

Edited by elle on