OK, I went back to re-read this thread to try to see where it went off the rails. I think the problem might be that if you actually thought the example in 8.2.3.4 is the one you quoted here, that's the issue right there :) The actual manual contains no such example. I never read the example as you posted it because I just went and read the actual manual, since I wanted to see all the text.
The example in the actual manual is:
| Proc 0               Proc 1
| mov [ _x], 1         mov [ _y], 1
| mov r1, [ _y]        mov r2, [ _x]
|
| Initially x = y = 0
| r1 = 0 and r2 = 0 is allowed
So, hopefully you can see from this example that fences have nothing to do with visibility even in this case. The fences you might employ here (if for some reason you cared, though I'm not sure why you would) are about preventing the out-of-order instruction window on a single core from moving a load to a place you didn't want it. It has nothing to do with multi-threading or cache line contention - it's about allowing the out-of-order window on the core to issue loads early, which you generally want.
Also, hopefully it is clear that this in no way impacts a ticket mutex. It is entry to the ticket mutex that is important, and that is an atomic op and therefore implicitly fenced. On exit there is no need for an additional barrier, since only reads can be moved, and they can only be moved earlier - thus only as early as the ticket take.
Does that help make it clear why there is no bug if EndTicketMutex doesn't use a CPU fence? It still needs a compiler fence, of course, because the compiler, unlike the x64, does allow loads to move after stores, and that is a bug because the loads can't be allowed to happen outside the mutex. So that's why we said that if we remove the atomic, we need a compiler fence.
Again, I really want to emphasize here that 8.2.3.4 has nothing to do with whether cores see each other's operations. It's just about a single core's normal operation. In fact it is something that will happen even in a non-multithreading scenario - so for example if you were counting on the first access to a memory location being the write (due to some kind of a page fault scenario or something?), you would be in trouble without a barrier because the load could be reordered by the processor and happen first.
Yet another way to say why this isn't about visibility is to say that x64 provides strong memory coherence for the order in which the cores choose to execute their commands. But the out-of-order window still allows sane reordering of things that don't affect multithreading, such as moving loads before stores.
- Casey