Atomic add necessary for EndTicketMutex?

After watching day 325 (Ticket Mutexes), I'm left wondering why an AtomicAddU64 is necessary to release the ticket mutex:

inline void
EndTicketMutex(ticket_mutex *Mutex)
{
    AtomicAddU64(&Mutex->Serving, 1);
}
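
(For context, the acquire side pairs an atomic ticket grab with a spin on Serving. Roughly like this, reconstructed from memory, so treat it as a sketch rather than the exact code from the episode:)

struct ticket_mutex
{
    u64 volatile Ticket;  // next ticket number to hand out
    u64 volatile Serving; // ticket currently allowed into the critical section
};

inline void
BeginTicketMutex(ticket_mutex *Mutex)
{
    u64 Ticket = AtomicAddU64(&Mutex->Ticket, 1); // returns the pre-increment value, i.e. our ticket
    while(Ticket != Mutex->Serving) {}            // spin until it's our turn
}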


Isn't the access to Mutex.Serving already protected by the mutex itself? Wouldn't a regular increment followed by a compiler write barrier do the trick?
Yes, that was probably me just not thinking it through. I double-checked on my Molly engine code and I indeed just do a compiler barrier plus a regular increment!

I'm not 100% sure about whether this is OK on older platforms like PowerPC et al which had, IIRC, weaker memory ordering and so on. But I believe on x64 and Neon the memory ordering is such that there is no need for another atomic. Whether there might be platforms where some additional CPU fence is necessary there to ensure all pending memory ops complete before the ticket increment, I don't know!
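
In other words, something like this (just a sketch, not the actual Molly code; _WriteBarrier is MSVC's compiler-only write barrier, and it goes before the increment so earlier writes can't be moved past the release):

inline void
EndTicketMutex(ticket_mutex *Mutex)
{
    _WriteBarrier();  // compiler barrier only - no instruction is emitted
    ++Mutex->Serving; // plain increment; only the holder of the mutex ever writes Serving
}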

- Casey
I am not sure either, but I would guess the second lock is important.

Not because of ordering, but because of visibility: the other cores won't see changes made on one core until they are written back (hence fences).

I would keep it in and hope that it causes the waiting thread to exit the loop a little earlier (and maybe gain something in terms of performance, if not in this particular case then in general).
I was unaware that locks did anything to change the visibility of a memory access, though - are you sure that is the case? Do you have a citation for that somewhere?

- Casey
From MSDN

"This function generates a full memory barrier (or fence) to ensure that memory operations are completed in order."

https://msdn.microsoft.com/en-us/...esktop/ms683590%28v=vs.85%29.aspx


In my opinion, based on blogs I have read and my own experience, this "order" thing gets confused with visibility.

Fences make sure other cores can see the change, while ordering is about the sequence of events (on a single core).

Why on a single core? Well, if you always had to ensure a global sequence across multiple cores, the whole thing would be rather pointless, because the advantage is that things can happen at the same time.

Fences allow the cores to be synchronized only _when needed_. If there is no need for a core to see results from another core, this allows better performance, since they don't have to share that particular data immediately.


I am not an expert on x64 MESI architecture, but I'm pretty sure none of that is true. The cache protocol ensures ownership of cache lines as appropriate and knows when one core is trying to read the memory that another core owns. To the best of my knowledge, there is no difference in visibility between an atomic add and a non-atomic add other than the automatic fencing, which you would need to add in this case if you removed the atomic (e.g., you need a write barrier). This is not for visibility, it's to ensure that any prior writes aren't reordered past the ticket increment.

However, note that we're talking about a compiler fence here, not a CPU fence. My understanding is that you also don't need a CPU fence in this case, because unless you are writing to a write-combined memory region, you don't actually have to fence CPU writes, since they always appear to be strongly ordered to other cores AFAIK (meaning that while their order of completion is not defined, no earlier write can ever _appear_ to be incomplete to another core if some later write _does_ appear to be complete at that time).
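
To be concrete about that distinction (illustrative only; the exact spellings depend on the toolchain):

// Compiler fence: emits no instruction at all, it only stops the compiler
// from reordering memory accesses across this point.
_ReadWriteBarrier();              // MSVC
// asm volatile("" ::: "memory"); // GCC/Clang equivalent

// CPU fence: emits an actual fence instruction (MFENCE), which also orders
// the hardware's memory operations.
_mm_mfence();                     // MSVC and GCC/Clang, from <emmintrin.h>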

- Casey
Yes, after re-reading Debiatan's question, I realize he also adds the barrier.

If I understand you correctly, both the LOCK and the FENCE instructions make the changes "visible globally" (Intel manual).
That's why a simple add, followed by a write fence, would make the change visible?

But we do need either the LOCK or the FENCE?

And yes, those are there to make the change "visible globally"; that is more or less exactly what the Intel manual states. Otherwise we would not need to think about shared memory as programmers; it would be handled for free by the CPU. Or am I wrong in thinking that? Testing suggests I am not.

My guess now, after reading Casey's reply, is that LOCK (maybe) is more specific than MFENCE (which is what a full memory barrier means).

The Intel manual does not spell this out very strongly, except to leave us to interpret that SFENCE, LFENCE and MFENCE (both) concern ALL prior memory reads/writes before the fence, while LOCK *maybe*? only concerns the specific instruction and data involved in the lock (a bus lock). It may say this elsewhere in the manual, though, and I will look for it.

So I would therefore choose the LOCK (if, like you say, it has an implicit barrier).

I don't know what the difference between a compiler fence and a CPU fence is, unfortunately.

Anyway. Thanks for an interesting series and discussion.

Well, again, I'm not the expert on these things, so take it with a grain of salt but:

On x64, you do not need CPU fences unless you are working with "special" memory. What I mean by "special" is that it is not going to another core, but rather to another type of processor entirely, like a graphics card or a network card, which does not share the memory controller. If you are only talking about multiple x64 cores on the main motherboard, then you never need a CPU fence (_mm_mfence, etc.).

The reason that CPU fences exist in x64, as far as I understand it, is just to deal with write-combining buffers on memory shared by other things besides cores. In this case, my understanding is that the interface between the x64 memory controller and whatever bus is involved has buffer flush rules that allow it to hold writes as part of a write-combining buffer. These writes can be held until after other, later writes complete as part of other buffers. So the fencing is there to prevent writes (or reads) from moving around in this kind of scenario only - they have no meaningful effect (that I'm aware of) if you're only writing to or reading from memory shared between x64 cores on the same memory controller.

The reason for the atomic (the LOCK prefix) is because in order to do an increment, the x64 needs to read the value of a cache line and increment one of the values while ensuring that no other core takes the cache line back before the increment occurs. So the LOCK just means "I want to read and modify this value without any possibility of another core doing something in between". That is why the LOCK is necessary - it has nothing to do with visibility. It is to prevent another core from modifying that cache line while we have the value in our register doing the add.
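
For reference, an AtomicAddU64 like the one in the listing typically maps to an interlocked intrinsic, which the compiler turns into a LOCK XADD (a sketch with assumed names, not necessarily the exact code from the series):

#include <intrin.h>

inline u64
AtomicAddU64(u64 volatile *Value, u64 Addend)
{
    // _InterlockedExchangeAdd64 compiles to LOCK XADD: a read-modify-write that
    // keeps ownership of the cache line for the whole operation.
    u64 Result = (u64)_InterlockedExchangeAdd64((__int64 volatile *)Value, (__int64)Addend);
    return(Result); // the value before the add, i.e. the ticket we were handed
}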

Finally, the reason we need a compiler fence - not a CPU fence - on the exit of the ticket mutex is because without a compiler fence, the compiler itself could reorder the ticket increment to be before some of the work we were doing inside the mutex! This is because, without the fence, the compiler has no idea that we are using that ticket to prevent other cores from accessing the thing we're accessing. It just sees an atomic increment and a (non-atomic) increment. For the atomic increment, it is smart enough to know that it should treat that as a barrier (because obviously it is something multithread-related), but for the regular non-atomic increment at the end, it just sees an increment of a global variable. It has no idea it is related to multi-thread synchronization! So it might just move it to a more convenient location that is better for optimized code, not realizing that moving it may cause a bug in our program because of the hidden meaning of the increment (that it tells other threads they can continue).
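
Concretely (SharedThing and GlobalMutex are made-up names; the comment describes what the compiler is allowed to do, not what it must do):

BeginTicketMutex(&GlobalMutex);
SharedThing->Count += 1;      // the work the mutex is protecting
EndTicketMutex(&GlobalMutex); // if this inlines to a plain, unfenced ++Serving, the
                              // compiler is free to move the Count write below it,
                              // i.e. outside the critical section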

Does that help?

- Casey
cmuratori

Does that help?

- Casey

Yes, it does help. It prompted me to start reading the "64-ia-32-architectures-software-developer-manual-325462", the latest version I could find.

So far it seems a little confusing to me.

But I will read and reread chapter 8 on multiprocessor management, until I get a clearer picture.

Thanks for listening.


If you haven't seen these blog posts, take a look:
https://fgiesen.wordpress.com/2014/07/07/cache-coherency/
https://fgiesen.wordpress.com/2014/08/18/atomics-and-contention/

They have good information on this topic and will give you relevant keywords/topics to look up in the Intel manual.


Excellent.

This blog answers my questions.

Especially this segment:

"Things gets a bit more complicated once we allow stores, i.e. memory writes. There’s two basic approaches here: write-through and write-back."

....

"In other words, in write-back caches we lose the “at all times” qualifier and replace it with a weaker condition: either the cache contents match memory (this is true for all clean cache lines), or they contain values that eventually need to get written back to memory (for dirty cache lines).

....

Some (mostly older) CPUs use write-through caches everywhere..."


I have two very important cases in the game I am making, inspired by Handmade Hero and everything I learned here, that support the use of fences at key points in the code. Key points being wherever isolation of threads is required and key data is shared.

And both cases strongly support what Giesen is saying here, namely that fences are absolutely required where immediate data updates between cores are needed. The sad part is that it took many, many hours of testing the game to find one of the crashes. But after adding fences to my exclusive locks, the game has now been running for 23 hours and 37 minutes, and I am eager to see if it reaches the 24-hour mark. Which is maybe why I write this at all. A little excited, 20 minutes left. Probably crashing as I write this :)


Please forgive me if I am rambling. I find this topic to be very interesting and important.

Reading more of Giesen's blog makes me realize that he believes the x86 has a very strong memory ordering protocol that makes guarantees, as Casey also seems to argue? Correct me if I misunderstood. And that this is somehow a good thing. I read more about Giesen's thoughts on MESI and the like.

"The end result is that x86 processors provide very strong guarantees for all memory operations – not quite sequential consistency, though."

But in my view, "very strong" is not the same as "complete", which makes this "guarantee" not worth much in some important edge cases, such as sharing hot data across multiple cores, if it's not sequential.

But I also disagree that a consistent guarantee (on the die) is what we want. What we want is optimal speed, and optimal control, but only when required. To my mind this is exactly what an x86 provides. It does not always provide guarantees of ordering.

see "8.2.3.4 Loads May Be Reordered with Earlier Stores to Different Locations"
This case is, to my mind, what we talk about here (look at the example in the manual)
If you don't look at the example, it is possible to misinterpret the full meaning of "different locations".

It is also the thing that (very likely) burned me in my own code, and it was solved by adding fences. Earlier I had regular but "random" crashes between 3.5 and 8 hours of auto-played game. After adding fences, I reached the 24-hour mark on the first try. Fences also solved other problems in my code.

To me the requirement of fences for a 100% guarantee is a good thing, because it allows me as the coder to know when there is a guarantee and when there is not, and to know that when I don't need fences, performance is gained.

Not all Intel families provide the same rules, and in an earlier version of the manuals they state that they recommend not relying on strong guarantees in future architectures.

So yes, it is sufficient to use an add with a fence, which is exactly what Debiatan proposed, so I agree fully that this is sufficient.

I also agree that this particular code would work even without the fences, since sooner or later the loop would read the updated value.

But in other cases, where the _first read_ needs to be up to date, like a shared counter for a list for instance, a fence is absolutely required.




Unlegushwanion

To me the requirement of fences for a 100% guarantee is a good thing, because it allows me as the coder to know when there is a guarantee and when there is not, and to know that when I don't need fences, performance is gained.


Fences don't necessarily impact performance. They prevent the compiler from reordering some code, yes, and thus some memory operations may not be scheduled in the optimal sequence. But the end game here is to communicate between CPU cores correctly (i.e. visibility of data changes), and at that, fences are the optimal solution.

Unlegushwanion

I also agree that this particular code would work even without the fences, since sooner or later the loop would read the updated value.

But in other cases, where the _first read_ needs to be up to date, like a shared counter for a list for instance, a fence is absolutely required.


In multithreading there is correct code and incorrect code. Using fences to make sure your changes become visible to other cores in the correct order makes correct code. Not using fences in the appropriate places makes the code incorrect. It won't always show, and that makes multithreaded programming very hard, but the code will still be wrong.

Fences will _not_ ensure that the _first read_ is the updated value, since in a multithreading scenario what constitutes the first read is dependent on the thread the OS will schedule first, preemption, machine load, etc... What fences ensure is that when you actually read that value, previous memory operations (stores or other loads) will have also finished.
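
For example (an illustrative fragment with made-up names; the barriers are MSVC compiler-only fences):

#include <intrin.h>

static int SharedData;
static int volatile DataReady;

void ProducerThread(void)
{
    SharedData = 42; // write the payload first
    _WriteBarrier(); // compiler fence: keep the payload write before the flag write
    DataReady = 1;   // publish
}

void ConsumerThread(void)
{
    while(!DataReady) {}    // will eventually see the store thanks to cache coherency
    _ReadBarrier();         // compiler fence: don't let the payload read move above the flag check
    int Value = SharedData; // on x86(_64) this is guaranteed to be 42, not a stale value
    (void)Value;
}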

Jeff Preshing has a set of very interesting and well-written posts about multithreading and atomics, like this one.
Fences will _not_ ensure that the... loads) will have also finished.


Yes, important point. But I mean the first read, after the write. There are ways to know if this is the case, like for instance WaitForMultipleObjects in some cases, or locking in other cases.

Or is even this wrong?
Unlegushwanion

Yes, important point. But I mean the first read, after the write. There are ways to know if this is the case, like for instance WaitForMultipleObjects in some cases, or locking in other cases.

Or is even this wrong?


Again, compiler fences don't offer any guarantees on when the updated value will be read; they just guarantee that whenever you read the value, previous memory operations will have completed (e.g. after you read the "shared counter" you know that previous writes to the data structure it protects have finished).

In x86(_64), what guarantees that you will read the value after it has been written is the strong memory model enforced by the cache subsystem. When you want to read a value, the CPU will check with the cache, which will either have the line in the Invalid state (since the core that wrote the value will have the same line in the Modified state, or Exclusive if it has written it back to a higher-level cache or main RAM) or won't have the line at all, and will have to fetch it.

The distinction is important since different CPU architectures have different memory models and, while compiler fences are enough on x86(_64), they might not be on other ones (e.g. ARM), which might need CPU fences to properly synchronize data between CPU cores.
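
A rough way to picture that difference (GCC/Clang syntax; the macro name is made up):

#if defined(__x86_64__)
    // x86-64: the hardware already keeps stores ordered, so a compiler-only
    // barrier is enough when handing data to another core.
    #define RELEASE_BARRIER() __asm__ __volatile__("" ::: "memory")
#elif defined(__aarch64__)
    // 64-bit ARM has a weaker memory model, so an actual barrier instruction
    // is needed as well.
    #define RELEASE_BARRIER() __asm__ __volatile__("dmb ish" ::: "memory")
#endif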

=======

I'm not sure what you mean by
Unlegushwanion
There are ways to know if this is the case, like for instance WaitForMultipleObjects in some cases, or locking in other cases.
but if you are referring to Semaphores, Mutexes and the like, they are higher-level synchronization structures implemented in terms of fences and/or atomics at the low level (+ OS/kernel-space data structures to keep a record of what to notify, etc.).