Day 133: GCC WriteBarrier/Interlocked equivalents

In order to get HH compiling again I would like to have CompletePreviousWritesBeforeFutureWrites and AtomicCompareExchangeUInt32 in handmade_intrinsics.h defined for GCC. I guess the following would do the trick:

#define CompletePreviousWritesBeforeFutureWrites asm volatile("" ::: "memory");

inline uint32 AtomicCompareExchangeUInt32(uint32 volatile *Value, uint32 Expected, uint32 New)
{
    uint32 Result = __sync_bool_compare_and_swap(Value, Expected, New);
    return(Result);
}


asm volatile("" ::: "memory") seems to be an equivalent of _ReadWriteBarrier rather than _WriteBarrier. Is there a better option for GCC? Also, I am not sure whether this is supported by LLVM.

/Kim

Edited by Kim Jørgensen on
I don't think there is a better way for GCC to do just a write barrier.
Clang should support both of these just fine.
In the first case, you want a store fence:

#define CompletePreviousWritesBeforeFutureWrites asm volatile("sfence" ::: "memory");


In most Intel CPUs, writes are not reordered with other writes in most circumstances (there are some very interesting extensions). However, this is not true of AMD 64-bit CPUs, which have a slightly different memory model.

The sfence instruction is supported on the Pentium III and later (lfence and mfence are Pentium 4 and later), so it will work on any CPU you care about. This is the officially blessed way of doing what you want.
Pseudonym73, are you sure? Wouldn't we see a lot of crashing software on AMD 64-bit CPUs if they behaved differently from Intel ones regarding memory ordering?

This https://www.cs.cmu.edu/~410/doc/Intel_Reordering_318147.pdf says:
Stores are not reordered with other stores.

And this http://developer.amd.com/wordpress/media/2012/10/24593_APM_v21.pdf says:
Generally, out-of-order writes are not allowed. Write instructions executed out of order cannot commit (write) their result to memory until all previous instructions have completed in program order

I don't think you need to use sfence here. Casey uses the CompletePreviousWritesBeforeFutureWrites macro in places where only a compiler write barrier is needed (that's what I understood from the stream). He previously had an actual sfence but removed it because it is not needed. That's why on MSVC he uses _WriteBarrier. So asm volatile("" ::: "memory") is OK to use.
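To make that concrete, here is a minimal sketch of the pattern where only a compiler barrier is needed. The work_queue type and PushEntry function are hypothetical stand-ins, not Casey's actual code: the payload store must not be moved past the store that publishes it.

```c
#include <stdint.h>

// Compiler-only barrier: emits no instructions, it just stops the
// compiler from moving stores across this point.
#define CompletePreviousWritesBeforeFutureWrites __asm__ __volatile__("" ::: "memory")

typedef struct
{
    uint32_t volatile EntryCount;
    uint32_t Entries[256];
} work_queue;

// Hypothetical single-producer push: write the payload first, then
// publish it by bumping EntryCount. On x86/x64 the hardware already
// keeps ordinary stores in order, so the compiler barrier is all
// that is required here.
void PushEntry(work_queue *Queue, uint32_t Value)
{
    Queue->Entries[Queue->EntryCount] = Value;
    CompletePreviousWritesBeforeFutureWrites;
    ++Queue->EntryCount;
}
```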

But if you really want to use sfence, then it would be nicer to use the intrinsic instead; that works for all x86/x64 compilers (MSVC, GCC, Clang):
#define CompletePreviousWritesBeforeFutureWrites _mm_sfence();


But better would be to support more architectures:
#if defined(_M_AMD64) || defined(_M_IX86) || defined(__i386__) || defined(__x86_64__)
  #define CompletePreviousWritesBeforeFutureWrites _mm_sfence();
#elif defined(__arm__)
  #define CompletePreviousWritesBeforeFutureWrites __asm__ __volatile__("dsb" ::: "memory"); // or "dmb", not sure
#else
  #error "Don't know how to do memory barrier"
#endif

Edited by Mārtiņš Možeiko on
Just a quick note: you do not want sfence for write barriers on x64. x64 CPUs always complete writes to normal memory in order, so there is no need for an sfence instruction. It's for non-temporal stores and write-combining memory only.

The write barrier that we're talking about here is a compiler write barrier. So you need whatever LLVM's equivalent of a write barrier is - something that does not emit any ASM, but that just tells the compiler not to reorder the stores around that point.

Hope that clears things up.

- Casey
Thank you for the answers.

Casey are you willing to implement CompletePreviousWritesBeforeFutureWrites and AtomicCompareExchangeUInt32 for GCC/LLVM, so that we can have the game compile again?

/Kim
Oops! I just realized that my implementation of AtomicCompareExchangeUInt32 is bogus. I shouldn't return a boolean value.

I guess I could use __sync_val_compare_and_swap, but that doesn't seem to work (no trees show up). Are the return value semantics different from _InterlockedCompareExchange?

Edited by Kim Jørgensen on
Oh, yes, the bool variant is the wrong one. But __sync_val_compare_and_swap should work fine.

_InterlockedCompareExchange returns the old value, as does __sync_val_compare_and_swap.
Have you checked what happens to the values when you step through this function with a debugger? (Set the thread count to 1 for the work queue for easier debugging.)
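To illustrate those semantics: both intrinsics return the value that was in memory before the operation, so a caller detects success by comparing the return value against what it expected. A minimal sketch (TrySetState is a hypothetical helper, not from the game code):

```c
#include <stdint.h>

// __sync_val_compare_and_swap(Ptr, Expected, New) atomically does:
//   if (*Ptr == Expected) *Ptr = New;
// and returns the value *Ptr held before the operation, just like
// _InterlockedCompareExchange (note MSVC's argument order is
// (Destination, Exchange, Comparand), i.e. new before expected).
static inline uint32_t AtomicCompareExchangeUInt32(uint32_t volatile *Value,
                                                   uint32_t Expected,
                                                   uint32_t New)
{
    return __sync_val_compare_and_swap(Value, Expected, New);
}

// The swap happened only if the old value matches what we expected.
int TrySetState(uint32_t volatile *State, uint32_t From, uint32_t To)
{
    return AtomicCompareExchangeUInt32(State, From, To) == From;
}
```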
I think I found the bug. In LoadBitmap, starting a task can fail, but the state is not reset when this happens.

/Kim
I think the "InterlockedCompareExchange" call in the "AtomicCompareExchangeUInt32" function has its second and third arguments reversed. msdn link

This causes the state to never get updated.

This obscures the bug Kim mentioned, where the task can fail to be acquired by "BeginTaskWithMemory" in "LoadBitmap", but the state is not returned to "AssetState_Unloaded", so it is never retried.

Adding an else onto the "if(Task)" that returns the state to "AssetState_Unloaded" seems to fix it.
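A hedged sketch of that fix, with stand-in types and names (task_with_memory, BeginTaskWithMemory, the asset states) that only approximate the real Handmade Hero code:

```c
#include <stdint.h>
#include <stddef.h>

enum { AssetState_Unloaded, AssetState_Queued, AssetState_Loaded };

// Stand-in for the game's task pool: NULL means no free task slot.
typedef struct { int InUse; } task_with_memory;

static task_with_memory *BeginTaskWithMemory(task_with_memory *Slot)
{
    return Slot->InUse ? NULL : Slot;
}

static void LoadBitmapSketch(uint32_t volatile *State, task_with_memory *Slot)
{
    // Claim the asset by moving it from Unloaded to Queued.
    if (__sync_val_compare_and_swap(State, AssetState_Unloaded,
                                    AssetState_Queued) == AssetState_Unloaded)
    {
        task_with_memory *Task = BeginTaskWithMemory(Slot);
        if (Task)
        {
            // ... hand the actual load off to the task here ...
        }
        else
        {
            // The fix: no task was available, so put the state back
            // to Unloaded or this asset will never be retried.
            *State = AssetState_Unloaded;
        }
    }
}
```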

I have attached a diff against the code after day 135; it includes the fix plus all the cross-platform stuff for anyone who wants it.

Also, for the compiler memory barrier, there is __sync_synchronize(), which I think works with GCC/Clang, but I can't find out whether it is just a compiler barrier or a CPU barrier as well.
So we got two nasty bugs.
I think we need to get an owl of shame emoji added to the forum chat :)

As far as I can tell, __sync_synchronize is not just a compiler write barrier but a full memory barrier; see Wikipedia.

/Kim
You're right about __sync_synchronize: it outputs an mfence. I should have looked at the asm output at the start :)

And looking at GCC's docs on the asm "memory" clobber, and the asm it generates, asm volatile("" ::: "memory") looks like the right thing for a compiler read/write barrier.
cmuratori
Just a quick note: you do not want sfence for write barriers on x64. x64's always complete writes in order to normal memory, so there is no need for an sfence instruction. It's for non-temporal stores and write-combining memory only.

Also, string operations (the movs and stos family of instructions).

Sorry, mmozeiko and Casey are correct and I was wrong. If you're implementing a lock, you need the fence operation, because you can never be sure what happened before the lock is released, but this is lock-free code and you control all the stores.
I would like to have Handmade Hero compile on GCC again. Will somebody ask Casey about this during the stream?

The bugs in LoadBitmap and AtomicCompareExchangeUInt32 should probably also be fixed :)
Living in Europe, I am not able to watch the stream live :(

Will somebody ask Casey during the stream to implement CompletePreviousWritesBeforeFutureWrites and AtomicCompareExchangeUInt32 for GCC/LLVM? Pretty please?

The following code should do it:
#define CompletePreviousWritesBeforeFutureWrites asm volatile("" ::: "memory");
inline uint32 AtomicCompareExchangeUInt32(uint32 volatile *Value, uint32 Expected, uint32 New)
{
    uint32 Result = __sync_val_compare_and_swap(Value, Expected, New);
    
    return(Result);
}