OK, I went back to re-read this thread to try to see where it went off the rails. I think the problem might be that if you actually thought the example in 8.2.3.4 is the one you quoted here, that's the issue right there :) The actual manual contains no such example. I never read the example as you posted it because I just went and read the actual manual, since I wanted to see all the text.
The example in the actual manual is:
| Proc 0               Proc 1
| mov [ _x], 1         mov [ _y], 1
| mov r1, [ _y]        mov r2, [ _x]
|
| Initially x = y = 0
| r1 = 0 and r2 = 0 is allowed
So, hopefully you can see from this example that fences have nothing to do with visibility even in this case. The fences you might employ here (if for some reason you cared, though I'm not sure why you would) are about preventing the out-of-order instruction window on a single core from moving a load to a place you didn't want it. It has nothing to do with multi-threading or cache line contention - it's about allowing the out-of-order window on the core to issue loads early, which you generally want.
Also, hopefully it is clear that this in no way impacts a ticket mutex. It is entry to the ticket mutex that is important, and that is an atomic op and therefore implicitly fenced. On exit there is no need for an additional barrier, since only reads can be moved, and they can only be moved earlier - thus only as early as the ticket take.
Does that help make it clear why there is no bug if EndTicketMutex doesn't use a CPU fence? It still needs a compiler fence, of course, because the compiler, unlike the x64, does allow loads to move after stores, and that is a bug because the loads can't be allowed to happen outside the mutex. So that's why we said that if we remove the atomic, we need a compiler fence.
Again, I really want to emphasize here that 8.2.3.4 has nothing to do with whether cores see each other's operations. It's just about a single core's normal operation. In fact it is something that will happen even in a non-multithreading scenario - so for example if you were counting on the first access to a memory location being the write (due to some kind of a page fault scenario or something?), you would be in trouble without a barrier because the load could be reordered by the processor and happen first.
Yet another way to say why this isn't about visibility is to say that x64 provides strong memory coherence for the order in which the cores choose to execute their commands. But the out-of-order window still allows sane reordering of things that don't affect multithreading, such as moving loads before stores.
- Casey