Alternative to __COUNTER__

With Casey stating he is hesitant to keep using the __COUNTER__ to keep track of the DebugRecords, I was thinking if there could be an alternative way to do this by counting ourselves. Casey mentioned he was going to use hashtables, which sounds expensive, and I was wondering if the way I was thinking of would be better/faster.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
extern u32 GlobalDebugCounter;

#define GET_COUNTER__(Counter, FileNameInit, LineNumberInit, BlockNameInit, ...) \
	static u32 Init_##Counter = 0; \
	static u32 Counter = 0;\
	if (Init_##Counter != 2) {\
		if (AtomicCompareExchangeUInt32(&Init_##Counter, 1, 0) == 0) {\
			Counter = AtomicAddU32(&GlobalDebugCounter, 1);\
			\
			CompletePreviousWritesBeforeFutureWrites;\
			Init_##Counter = 2;\
			Assert(Counter < MAX_DEBUG_RECORD_COUNT); \
			debug_record *Record = GlobalDebugTable->Records[TRANSLATION_UNIT_INDEX] + Counter; \
		    Record->FileName = FileNameInit;                                        \
		    Record->LineNumber = LineNumberInit;                                    \
		    Record->BlockName = BlockNameInit;                                      \
		} else {\
			volatile u32* test = &Init_##Counter;\
			while (*test != 2);\
		}\
	}
#define GET_COUNTER_(Name, FileNameInit, LineNumberInit, BlockNameInit, ...) GET_COUNTER__(Name, FileNameInit, LineNumberInit, BlockNameInit)
#define GET_COUNTER(Name, FileNameInit, LineNumberInit, BlockNameInit, ...) GET_COUNTER_(Name, FileNameInit, LineNumberInit, BlockNameInit)

#define FRAME_MARKER() \
     { \
     GET_COUNTER(Counter, __FILE__, __LINE__, "Frame Marker");                                   \
     RecordDebugEvent(Counter, DebugEvent_FrameMarker); \
} 

#define TIMED_BLOCK__(BlockName, Number, ...) \
	GET_COUNTER(Counter_##Number, __FILE__, __LINE__, BlockName, ## __VA_ARGS__); \
	timed_block TimedBlock_##Number(Counter_##Number)
#define TIMED_BLOCK_(BlockName, Number, ...) TIMED_BLOCK__(BlockName, Number, ## __VA_ARGS__)
#define TIMED_BLOCK(BlockName, ...) TIMED_BLOCK_(#BlockName, __COUNTER__, ## __VA_ARGS__)
#define TIMED_FUNCTION(...) TIMED_BLOCK_(__FUNCTION__, __COUNTER__, ## __VA_ARGS__)

#define BEGIN_BLOCK_(Counter) RecordDebugEvent(Counter, DebugEvent_BeginBlock);
#define END_BLOCK_(Counter) RecordDebugEvent(Counter, DebugEvent_EndBlock);
    
#define BEGIN_BLOCK(Name) \
    GET_COUNTER(Counter_##Name, __FILE__, __LINE__, #Name);                       \
    BEGIN_BLOCK_(Counter_##Name);

#define END_BLOCK(Name) \
    END_BLOCK_(Counter_##Name);
    
struct timed_block
{
    int Counter;
    
    timed_block(int CounterInit)
    {
        // TODO(casey): Record the hit count value here?
        Counter = CounterInit;
        BEGIN_BLOCK_(Counter);
    }
    
    ~timed_block()
    {
        END_BLOCK_(Counter);
    }
};


You also need AtomicAddU32:
1
2
3
4
5
6
7
inline u32 AtomicAddU32(u32 volatile *Value, u32 Addend)
{
    // NOTE(casey): Returns the original value _prior_ to adding
    u32 Result = _InterlockedExchangeAdd((long volatile *)Value, Addend);

    return(Result);
}


In build.bat there should be only 2 distinct TRANSLATION_UNIT_INDEX with this method, as only the executable and the dll will have separate GlobalDebugCounter variables, which will fix the problem with the TRANSLATION_UNIT being different in the inline/macro RecordDebugEvent

Each of these compiled blocks will then need some additional info.
win32_handmade.cpp and for instance handmade_debug.cpp now need an extra line:
1
u32 GlobalDebugCounter = 0;


Each compiled block also needs to set the GlobalDebugTable->RecordCount to GlobalDebugCounter. But this is currenty not really used as far as I know.
1
GlobalDebugTable->RecordCount[TRANSLATION_UNIT_INDEX] = GlobalDebugCounter;


I also moved the initialisation of the DebugRecord to the GET_COUNTER part that gets the next available counter, so this is also only done once. Which could eliminate some overhead (though small).

In the end I think this only adds one conditional jump to the normal process of gathering the profile info, which should not be that bad. It does have a spinwait to keep other threads that might contest the counter initialisation waiting for the correct value. This could even be removed to let each contesting thread increase the count, which would result in more debugrecords being used. But each counter would only have a max duplicate count of the thread count, so it is limited and should not corrupt the results. But the spinwait is only for the first time inialising and should complete quickly.

Regards,
Mox


BTW, I still use __COUNTER__ in this code, but only to create unique identifiers for the counter variables, not for their values.
My understanding why Casey want's to avoid __COUNTER__ is because it requires using TRANSLATION_UNIT_INDEX. Your solution still uses it. Although it is now something like BINARY_FILE_INDEX (meaning exe or dll) which is less problematic than than translation unit index.

Solution with hashes doesn't need to be expensive. You can calculate hash only once:
1
2
3
4
5
#define GET_COUNTER(... arguments, FileName, LineNumber)   \
{                                                          \
    static int CounterNumber = Hash(FileName, LineNumber); \
    ... // use CounterNumber here                          \
}


hash will be calculated only on first execution of this counter. After that it will be as free as using regular integer variable.

Edited by Mārtiņš Možeiko on
Yes, my idea was that this eliminates the actual problem of the translation units that the original had. The BINARAY_FILE_INDEX as a rename could make this more obvious indeed.

I think hashes have more to them then only calculating the integer part. If it was only an integer, it would be the same as my counter. Hashes usually have a way to handle collisions and also have overhead if you don't use all the entries in the table. So in the end you also need more code to traverse the hash. But maybe this is not bad if it can be contained in the collating part. Though I don't know if it can with the possibility of collisions.

The upside to using a hash to construct the index is that you don't need the thread safe functions when setting it up, as each pass of the initialisation will return the same hash/index. Under the hood the compiler might still use a 'is this static initialised yet' variable, although this might be optimized away because the calculation should be constant. But a simple tests seems to indicate that msvc does not do this, not even with -O2. It get's even worse, the assembly that it produces is not even thread safe: a second thread getting the same static value might see the flag that it has been initialised before the actual value is stored in some conditions (two statics back to back) if my assembly reading skills are not deserting me. Bad compiler bug :ohmy:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
	static u32 test = (u32)(__FUNCTION__ + 1);
000007FE967656D7  mov         eax,dword ptr [test+4h (07FED6C28AB0h)]  
000007FE967656DD  test        al,1  
000007FE967656DF  jne         DrawRectangleQuickly+119h (07FE967656F9h)  
000007FE967656E1  lea         rcx,[string "DrawRectangleQuickly"+1h (07FE96791061h)]  
000007FE967656E8  or          eax,1  
000007FE967656EB  mov         dword ptr [test (07FED6C28AACh)],ecx  
000007FE967656F1  mov         dword ptr [test+4h (07FED6C28AB0h)],eax  
000007FE967656F7  jmp         DrawRectangleQuickly+11Fh (07FE967656FFh)  
000007FE967656F9  mov         ecx,dword ptr [test (07FED6C28AACh)]  
	static u32 test2 = (u32)(__FUNCTION__ + 2);
000007FE967656FF  test        al,2  
000007FE96765701  jne         DrawRectangleQuickly+13Bh (07FE9676571Bh)  
000007FE96765703  or          eax,2  
000007FE96765706  mov         dword ptr [test+4h (07FED6C28AB0h)],eax  
000007FE9676570C  lea         rax,[string "DrawRectangleQuickly"+2h (07FE96791062h)]  
000007FE96765713  mov         dword ptr [test2 (07FED6C28AB4h)],eax  
000007FE96765719  jmp         DrawRectangleQuickly+141h (07FE96765721h)  
000007FE9676571B  mov         eax,dword ptr [test2 (07FED6C28AB4h)]

[test2 (07FED6C28AB4h)] is mov'ed to after bit 2 of [test+4h (07FED6C28AB0h)] is set

So if you really compare the two solutions, my counter example only has some overhead at initialisation due to the thread safety requirements. After that my solution will have 'perfect' indexing, while the hash will at the least have empty buckets to look through at collation time, or at the worst some extra chains to get to the right item with each lookup, even at event recording.

I guess the best way to solve this is to actually use metaprogramming and use a custom 'macro' that will become a global counter. That way there can be a single table and there would be no need for TRANSLATION_UNIT_INDEX or BINARY_FILE_INDEX, and getting the table index would be free. And you would not even need a hashing table.
Once you introduce a static, it doesn't matter how expensive the lookup is. That's the key thing. So replacing COUNTER is very simple - it's just:

1
2
3
4
5
6
static int OurIndex;
if(OurIndex == 0)
{
   // Do expensive initialization with hash table here
}
// Do the stuff here


The hash table stuff only gets called once, so it doesn't matter how expensive it is.

- Casey
Mox
After that my solution will have 'perfect' indexing, while the hash will at the least have empty buckets to look through at collation time, or at the worst some extra chains to get to the right item with each lookup, even at event recording.

Same with hash solution. You do all this (chain/bucket traversal, collision resolving) inside "Hash" function in my example. It gets called only first time. You get out index which is direct index in hash table which you now access as regular array.

It get's even worse, the assembly that it produces is not even thread safe: a second thread getting the same static value might see the flag that it has been initialised before the actual value is stored in some conditions (two statics back to back) if my assembly reading skills are not deserting me. Bad compiler bug :ohmy:
This is kind of tricky. C++ standard before C++11 didn't specified thread-safety for static initialization. So that code is perfectly valid code for compiler with version less than C++11.

C++11 standard specifies that in such case only one thread should execute initialization and rest of thread should wait until it finishes. If you look at MSVC as C++11 compatible compiler then yes, it is a bug.

In this page it lists features for modern C++: https://msdn.microsoft.com/en-us/library/vstudio/hh567368.aspx
Look for "magic statics". It says VS2015 supports it.

Here's this code
1
2
3
4
5
6
int g();
int f()
{
  static int a = g();
  return a;
}


Compiled with VS2015 64-bit:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
?f@@YAHXZ PROC						; f, COMDAT
$LN7:
	sub	rsp, 40					; 00000028H
	mov	ecx, DWORD PTR _tls_index
	mov	rax, QWORD PTR gs:88
	mov	edx, OFFSET FLAT:_Init_thread_epoch
	mov	rax, QWORD PTR [rax+rcx*8]
	mov	ecx, DWORD PTR [rdx+rax]
	cmp	DWORD PTR ?$TSS0@?1??f@@YAHXZ@4HA, ecx
	jle	SHORT $LN4@f
	lea	rcx, OFFSET FLAT:?$TSS0@?1??f@@YAHXZ@4HA
	call	_Init_thread_header
	cmp	DWORD PTR ?$TSS0@?1??f@@YAHXZ@4HA, -1
	jne	SHORT $LN4@f
	call	?g@@YAHXZ				; g
	lea	rcx, OFFSET FLAT:?$TSS0@?1??f@@YAHXZ@4HA
	mov	DWORD PTR ?a@?1??f@@YAHXZ@4HA, eax
	call	_Init_thread_footer
$LN4@f:
	mov	eax, DWORD PTR ?a@?1??f@@YAHXZ@4HA
	add	rsp, 40					; 00000028H
	ret	0
?f@@YAHXZ ENDP						; f

Something is happening with threads there. Probably calling Microsoft internal CRT functions.

You can disable this behavior (for example, for performance reasons) and revert back to pre-C++11 thread-unsafe code with "/Zc:threadSafeInit-" compiler flag, as documented here: https://msdn.microsoft.com/en-us/library/y5f6w579.aspx

Edited by Mārtiņš Možeiko on
I believe making the inline function static would also solve the problem. It would prevent the linker from merging the different versions of this function in each translation unit into a single function with the wrong values.
To solve TRANSLATION_UNIT_INDEX problem? Yes, that should solve that.

Another options is to use it only in place where macro gets expanded. Same way how we use __FILE__ and __LINE__ macros. Imho this is much better option. Like this:
1
#define TIMED_FUNCTION(...) TIMED_BLOCK_(__FUNCTION__, __COUNTER__, TRANSLATION_UNIT_INDEX, ## __VA_ARGS__)


Then there is no doubt what value will be used. Rest of processing can happen in inline function, static function, class constructor or macro - whatever.
cmuratori
Once you introduce a static, it doesn't matter how expensive the lookup is. That's the key thing.

Wouldn't this be valid for my "real" counter code too? What is the reason for choosing a hash table over a flat indexed table then? No offense intended, just wondering if there are any other considerations that I might have missed.


mmozeiko

This is kind of tricky. C++ standard before C++11 didn't specified thread-safety for static initialization. So that code is perfectly valid code for compiler with version less than C++11.

I also looked into "Magic statics" before posting this topic. That is what led me to my implementation with the atomic add and exchange. Which would probably also be needed to setup the hash table in a thread safe way. Though duplicates would probably be less of a deal there.

The compiler "bug" is just weird in my eyes if you look at the fact that it is actually placing some known compile-time constants into the initialisation of a static. This should be doable without all that complicated code. But I guess the compiler is not recognizing this as a real constant value.


skeeto
I believe making the inline function static would also solve the problem. It would prevent the linker from merging the different versions of this function in each translation unit into a single function with the wrong values.

Yes, but I started this new thread to go away from that tricky behavior and see what alternatives are applicable. I mentioned the static solution myself in the other thread ;)


mmozeiko
Another options is to use it only in place where macro gets expanded. Same way how we use __FILE__ and __LINE__ macros. Imho this is much better option.

That is actually a very good option.
Mox
Wouldn't this be valid for my "real" counter code too?

Yes it will. But there are more advantages for hash - no need to deal with translation unit index (or binary index). That means no need to store it structures (less memory/cache used). No need to compare it when collating, just the counter value is enough (performance)

But I guess the compiler is not recognizing this as a real constant value.
Gcc is recognizing that :)

This is from gcc 5.2.0 (32-bit code):
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
__Z1fv:
	movl	__ZZ1fvE4test, %eax
	ret

...
__ZZ1fvE12__FUNCTION__:
	.ascii "f\0"

__ZZ1fvE4test:
	.long	__ZZ1fvE12__FUNCTION__+1
Not really a shocker that gcc is better at this than msvc :P

I can't really picture yet how hashing would have those properties over my counter. Coudn't you adjust my solution to some global array in the same way the hash table would use one?

When thinking about my option I was also thinking about how to completely remove the translation unit, but didn't want to introduce extra complexity without testing the basics first. I might look in to it just for fun.
There would be no need to store TranslationUnit member in debug_event structure in RecordDebugEventCommon function.

It should look something like this (ignoring thread safety to keep things simpler):
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
static u16 GetDebugRecordIndex(char *FileName, int LineNumber, char *BlockName)
{
    int DebugRecordIndex = ReallyGoodHashFunction(FileName, LineNumber);

    // Records now is simply debug_record Records[MAX_DEBUG_RECORD_COUNT]
    // no need for MAX_DEBUG_TRANSLATION_UNITS
    debug_record *Record = GlobalDebugTable->Records[DebugRecordIndex];
    debug_record *InitialRecord = Record;
   
    // no need to compare FileName and LinenNumber values, because for
    // each (FileName, LineNumber) pair function will be called only once,
    // so simply find first unused place in table
    while (Record->FileName)
    {
        DebugRecordIndex = (DebugRecordIndex + 1) % MAX_DEBUG_RECORD_COUNT;
        Record = GlobalDebugTable->Records[DebugRecordIndex];

        // if we examined every entry in hash table that means hash table has less entries
        // than debug records we are putting in source code! Increase MAX_DEBUG_RECORD_COUNT!
        Assert(Record != InitialRecord);
    }

    Record->FileName = FileName;
    Record->LineNumber = LineNumber;
    Record->BlockName = BlockName;

    return DebugRecordIndex;
}

inline debug_event *RecordDebugEventCommon(u16 RecordIndex, debug_event_type EventType)
{
    u64 ArrayIndex_EventIndex = AtomicAddU64(&GlobalDebugTable->EventArrayIndex_EventIndex, 1);
    u32 EventIndex = ArrayIndex_EventIndex & 0xFFFFFFFF;
    Assert(EventIndex < MAX_DEBUG_EVENT_COUNT);
    debug_event *Event = GlobalDebugTable->Events[ArrayIndex_EventIndex >> 32] + EventIndex;
    Event->Clock = __rdtsc();
    Event->DebugRecordIndex = RecordIndex;
    Event->Type = (u8)EventType;                                    
    return Event;
}

inline void RecordDebugEvent(u16 RecordIndex, debug_event_type EventType)
{
    debug_event* Event = debguRecordDebugEventCommon(RecordIndex, EventType);
    Event->TC.CoreIndex = 0;
    Event->TC.ThreadID = (u16)GetThreadID();
}

#define FRAME_MARKER(SecondsElapsedInit) 											   \
{ 																			           \
     static u16 RecordIndex = GetDebugRecordIndex(__FILE__, __LINE__, "Frame Marker"); \
     debug_event *Event = RecordDebugEventCommon(RecordIndex, DebugEvent_FrameMarker); \
     Event->SecondsElapsed = SecondsElapsedInit;                                       \
} 

#if HANDMADE_PROFILE

#define TIMED_BLOCK__(BlockName, Number, ...)                                        \
    static u16 Record_##Number = GetDebugRecordIndex(__FILE__, __LINE__, BlockName); \
    timed_block TimedBlock_##Number(Record_##Number, ## __VA_ARGS__)
#define TIMED_BLOCK_(BlockName, Number, ...) TIMED_BLOCK__(BlockName, Number, ## __VA_ARGS__)
#define TIMED_BLOCK(BlockName, ...) TIMED_BLOCK_(#BlockName, __LINE__, ## __VA_ARGS__)
#define TIMED_FUNCTION(...) TIMED_BLOCK_(__FUNCTION__, __LINE__, ## __VA_ARGS__)

#define BEGIN_BLOCK_(RecordIndex) RecordDebugEvent(RecordIndex, DebugEvent_BeginBlock);
#define END_BLOCK_(RecordIndex) RecordDebugEvent(RecordIndex, DebugEvent_EndBlock);
    
#define BEGIN_BLOCK(Name) \
    static u16 Record_##Name = GetDebugRecordIndex(__FILE__, __LINE__); \
    BEGIN_BLOCK_(Record_##Name);

#define END_BLOCK(Name) \
    END_BLOCK_(Record_##Name);
    
struct timed_block
{
    u16 RecordIndex;
    
    timed_block(u16 RecordInit, u32 HitCountInit = 1)
    {
        // TODO(casey): Record the hit count value here?
        RecordIndex = RecordInit;
        BEGIN_BLOCK_(Record);
    }
    
    ~timed_block()
    {
        END_BLOCK_(Record);
    }
};


And for thread safety you do something similar what you did for GET_COUNTER__.
Stick this code at beginning of GetDebugRecordIndex function:
1
2
3
4
5
6
// initialized with 0 by default, that means nobody is running this function
static volatile u32 Lock;
// try to get a "lock", repeat if didn't succeeded
while (AtomicCompareExchangeUInt32(&Lock, 1, 0) != 0)
{
}


And this at the end of it (before return):
1
2
// release a "lock"
*Lock = 0;


You can move Lock variable to GlobalDebugTable structure of course, no need to keep it static.

Not the very best solution, but because this code is executed only once, who cares.

Edited by Mārtiņš Možeiko on
I got something working with minimal changes to my counter code. But the DebugTable and platform API did require some more changes.

Because I only set the DebugRecord with the function name once, I was having trouble reading the strings from the platform layer with the original code. This was because the platform code was setting up the data in the GlobalDebugTable copy it has before the Game DLL has an opportunity to return it its pointer. But since I was trying to set all the data in one place I decided to make the platform structure leading. So it passes a pointer to a debug_table in the game_memory struct now. This debug_table also has the RecordCount, which I use now instead of my old GlobalDebugCounter, so there is one variable for the exe and the dll to index the table.

Doing things in one table could have one drawback. The live code editing feature will let the DebugRecord count keep growing, as each reload will make the DLL increment the count again for each timer. Though the maximum is now set at 65536, so it will probably take a lot of code changes before you would reach an overflow ;)


As far as I can tell your GetDebugRecordIndex method is just a really expensive way to increment a counter. I don't know why you would not just use the Lock and a Result = GlobalDebugTable->RecordCount++;

And with your lookups into the GlobalDebugTable->Records you would need a global lock instead of a function static, or other threads in other functions could mess things up.

PS. should it be named HashFunctionThatHasToBeImproved? ;)
EDIT: What I wrote here before is wrong. it won't work due to game dll reloading.

You need such "complex" GetDebugRecordIndex function because game DLL can be reloaded and you need to get same record index as before. When dll is reloaded all static initializers will be executed again.

This actually means that you need to compare filename and string number in while loop inside this function. What I wrote above is not correct.

Live loop editing shouldn't change record count. Record count is changed only when time blocks is executed first time. After that - doesn't matter how many times it is executed, nothing will change. That's the point of static.

And you are right about keeping file name strings alive between game dll reloads. In my code you need to move assinging filename and block name from GetDebugRecordIndex back to FRAME_MARKER and BEGIN_BLOCK_ macros.

Edited by Mārtiņš Možeiko on