Question about thread local storage

On day 178, someone asked Casey why he considered using atomics for the timers instead of thread local storage. He responded that thread local storage has multiple problems and may hurt performance more, while atomics are "almost entirely free".

But in the recent Engine Simplification video, Casey said that his multithreading coding style has changed a lot. He then explained why using atomics here is very bad and said that he considered using thread local storage instead.

And on day 665, while discussing local static variables, he again said thread local storage is "very slow" and that implementing local static variables that way is terrible.

This all seems very confusing to me. Is thread local storage fast or slow? Should I use it? When should I use it? What about atomics? If they're all terrible then what are the alternatives?

"Very slow" depends on context. Thread local storage can be very slow in some situations and perfectly fine in others. If you access it in an inner loop, it can be slower than some alternative solutions.

Atomics and thread local storage are two very different things. They are not replacements for each other. Occasionally the same problem can be solved with either one in a similar manner, but most of the time that is not the case. So when you have a problem to solve, you choose one approach with all its advantages and disadvantages. It is not a coin flip between atomics and TLS.


Edited by Mārtiņš Možeiko on

What are the advantages and disadvantages of each approach, and when should I use one or the other? Take the performance counter code in the debug system of Handmade Hero, for example: why does Casey choose one over the other?


Replying to mmozeiko (#26884)

Think of it this way: atomics are a CPU feature that lets you tell the CPU how to access and manipulate memory locations shared between CPU cores/threads so they stay in a consistent state. TLS is an OS feature that maps the same "variable" to different places in memory, so different threads (which are themselves an OS feature) can use different memory without bothering each other.

So for this kind of performance timing workflow, TLS is probably the best choice. Each thread recording these timers does not care that other threads are recording them too, so the records do not need to be stored together. Yes, accessing TLS variables has some overhead, but it won't affect the memory used by, or the code running on, other cores. All you need is an extra pass at the end of the frame to collect and merge the data from the places where the individual threads recorded it.

Still, to really know the difference you would need to implement both and measure them. It's really hard to predict how big the change will be.
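
For contrast, here is a minimal sketch of what the shared-atomic version of such a counter might look like. It is written in portable C++ rather than the Win32 style used elsewhere in this thread, and `SharedCounter`, `RecordBlocks` and `RunSharedCounter` are hypothetical names, not code from Handmade Hero. Every thread updates the same two memory locations, so the cores have to keep handing the same cache line back and forth:

```cpp
#include <atomic>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical debug counter that every thread writes to directly.
struct SharedCounter {
    std::atomic<uint64_t> hits{0};
    std::atomic<uint64_t> cycles{0};
};

// Each worker records `iterations` timed blocks of `cyclesPerBlock` cycles each.
inline void RecordBlocks(SharedCounter& counter, int iterations, uint64_t cyclesPerBlock) {
    for (int i = 0; i < iterations; i++) {
        // Each fetch_add is an atomic read-modify-write on memory that
        // all the other worker threads are modifying as well.
        counter.hits.fetch_add(1, std::memory_order_relaxed);
        counter.cycles.fetch_add(cyclesPerBlock, std::memory_order_relaxed);
    }
}

// Starts `threadCount` workers, waits for them, and returns the total hit count.
inline uint64_t RunSharedCounter(int threadCount, int iterations) {
    SharedCounter counter;
    std::vector<std::thread> threads;
    for (int t = 0; t < threadCount; t++)
        threads.emplace_back(RecordBlocks, std::ref(counter), iterations, uint64_t{10});
    for (auto& th : threads)
        th.join();
    return counter.hits.load();
}
```

Calling `RunSharedCounter(4, 1000)` gives a total of 4000 hits. The result is correct either way; the open question, as said above, is what it costs, and that has to be measured.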


Replying to longtran2904 (#26887)

TLS is an OS feature that maps the same "variable" to different places in memory, so different threads (which are themselves an OS feature) can use different memory without bothering each other.

So TLS is kind of a relative pointer for each of the threads? How does accessing a TLS pointer work?

All you need is an extra pass at the end of the frame to collect and merge the data from the places where the individual threads recorded it.

How would you do it? Isn't the data local to a thread? How can other threads access and merge it?


Replying to mmozeiko (#26889)

TLS is allocated in a special place that is accessed through a register that is set up differently for each thread (fs/gs). You can read more about how TLS works in this series of articles (for Windows): http://www.nynaeve.net/?p=190
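
The visible effect can be checked with a small sketch. This one uses the standard C++ `thread_local` keyword instead of `__declspec(thread)`, and the names `tlsValue`, `sharedValue` and the two helper functions are made up for illustration. It does not show the fs/gs mechanism itself, only that the same name resolves to a different address in each thread, while an ordinary global does not:

```cpp
#include <cassert>
#include <thread>

thread_local int tlsValue = 0;  // one instance per thread
int sharedValue = 0;            // one instance for the whole process

// Returns true if a new thread sees a different address for tlsValue
// than the calling thread does.
inline bool TlsAddressDiffersInNewThread() {
    int* callerAddress = &tlsValue;
    int* workerAddress = nullptr;
    std::thread worker([&workerAddress] { workerAddress = &tlsValue; });
    worker.join();
    return workerAddress != callerAddress;
}

// Returns true if a new thread sees the same address for sharedValue
// as the calling thread does.
inline bool GlobalAddressMatchesInNewThread() {
    int* callerAddress = &sharedValue;
    int* workerAddress = nullptr;
    std::thread worker([&workerAddress] { workerAddress = &sharedValue; });
    worker.join();
    return workerAddress == callerAddress;
}
```

Both helpers return true. How the compiler turns `&tlsValue` into an address depends on the OS and compiler, which is what the linked articles cover for Windows.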

To collect the data recorded by different threads, you simply store pointers to common storage in the TLS variable:

// somewhere globally
__declspec(thread) Data* globalData; // set up from ThreadProc callback


// in main thread
Data data[MAX_THREADS];
for (int i=0; i<count; i++)
  CreateThread(.., &ThreadProc, &data[i]); // pass different pointer to different threads
                                           // each thread writes argument to "globalData" variable


// ... sometime later in main or some other thread
for (int i=0; i<count; i++)
  ProcessThreadData(&data[i]);

Replying to longtran2904 (#26892)

Thanks for the link, I will check it out.

As for your example, does the globalData of each thread point to the corresponding element in the data array?


Replying to mmozeiko (#26893)

Yes, the ThreadProc would look something like this:

static DWORD WINAPI ThreadProc(LPVOID arg)
{
    Data* data = (Data*)arg; // pointer passed as the parameter to CreateThread
    globalData = data;       // write to TLS, each thread gets its own copy of the pointer

    // ... now the rest of the code can use globalData from all the functions called here

    return 0;
}
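
Putting the two snippets together, a portable sketch of the whole pattern might look like the following. It swaps `CreateThread` and `__declspec(thread)` for `std::thread` and `thread_local`, and the fields of `Data`, the `RecordTimedBlock` helper and `RunAndMerge` are assumptions made up for the example:

```cpp
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical per-thread record; the real Data could hold anything.
struct Data {
    uint64_t hits = 0;
    uint64_t cycles = 0;
};

thread_local Data* globalData = nullptr;  // each thread has its own pointer

// Can be called from anywhere on a worker thread.
// It only touches the slot that belongs to the calling thread.
inline void RecordTimedBlock(uint64_t cycles) {
    globalData->hits += 1;
    globalData->cycles += cycles;
}

inline void ThreadProc(Data* data, int iterations) {
    globalData = data;  // write to TLS
    for (int i = 0; i < iterations; i++)
        RecordTimedBlock(10);
}

// Runs the workers, then merges their slots into one total.
inline Data RunAndMerge(int threadCount, int iterations) {
    std::vector<Data> data(threadCount);  // common storage, one slot per thread
    std::vector<std::thread> threads;
    for (int t = 0; t < threadCount; t++)
        threads.emplace_back(ThreadProc, &data[t], iterations);
    for (auto& th : threads)
        th.join();  // after join, the workers' writes are visible here

    Data total;
    for (const Data& d : data) {
        total.hits += d.hits;
        total.cycles += d.cycles;
    }
    return total;
}
```

`RunAndMerge(4, 1000)` returns 4000 hits and 40000 cycles with no atomics in the recording path. This sketch merges only after the workers have finished; merging once per frame while they keep running would need some synchronization point. Adjacent slots may also share a cache line, so padding each slot is a common refinement.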

Replying to longtran2904 (#26894)

I haven't read the link you sent yet, but which is slower: accessing a thread-local variable, or using atomics like atomic_add or atomic_exchange? Assume that the data is already in the cache.

So I kind of understand the pros and cons of thread-local storage, and now I want to know about atomics. When should I use them? What sort of things do I need to think about when using them? What happens behind the scenes? It would be helpful if you could give me some examples, like the debug system in Handmade Hero.


Replying to mmozeiko (#26897)

It does not matter which one is slower. You use them in different situations so they will affect performance differently. Comparing them is useless. It's like asking whether add or mul is faster, which does not matter as you cannot replace one with another.

You use atomics when you want to access shared data across multiple threads. They are instructions that tell the CPU how to interact with memory. There is a good series of articles here:
https://preshing.com/20120612/an-introduction-to-lock-free-programming/
https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/
https://preshing.com/20120930/weak-vs-strong-memory-models/
https://preshing.com/20140709/the-purpose-of-memory_order_consume-in-cpp11/
(just follow linked articles & search for relevant keywords in archives, there are many more)
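
As a small taste of what those articles cover, here is a sketch of one thread publishing plain data to another through an atomic flag. It uses C++11 `std::atomic`, and the function name `PublishAndRead` is made up for the example. The release store and the acquire load are what guarantee that the reader sees the payload written before the flag was set:

```cpp
#include <atomic>
#include <thread>

// One thread writes a payload and raises a flag; the caller waits for the
// flag and then reads the payload.
inline int PublishAndRead() {
    int payload = 0;                 // ordinary, non-atomic data
    std::atomic<bool> ready{false};  // shared flag both threads touch

    std::thread producer([&] {
        payload = 42;
        // Release: the write to payload above cannot move below this store.
        ready.store(true, std::memory_order_release);
    });

    // Acquire: once we observe true, the producer's earlier writes are visible.
    while (!ready.load(std::memory_order_acquire))
        std::this_thread::yield();
    int seen = payload;

    producer.join();
    return seen;
}
```

`PublishAndRead()` returns 42. With a plain `bool` instead of the atomic this would be a data race, and the compiler and CPU would be free to reorder the two writes.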


Replying to longtran2904 (#26898)

I was asking out of curiosity. AFAIK, multiplication takes more cycles than addition. I know they're two different things, I was just interested in the pure cycle count.


Replying to mmozeiko (#26899)

What is the same: multiplication vs addition, or TLS vs atomics? Also, the link you sent was about SIMD instructions. What about normal operations?


Replying to mmozeiko (#26901)

Not sure what you mean by "same". Everything is different. They are not comparable. It's like asking how writing to a file differs from drawing a pixel on screen. Also, TLS depends on the compiler implementation. Different compilers implement TLS support differently, so it matters which OS you are compiling for and which compiler you are using, and some compilers let you switch to a different TLS implementation with compiler arguments. And depending on the architecture, there are different atomic instructions available with different features and properties, so it matters which ones you actually need in a specific situation. See the links above to the preshing.com articles.


Replying to longtran2904 (#26902)

You said:

It depends, often it is exactly the same.

So I didn't understand what you were referring to. Did you mean addition and multiplication execute at the same speed or did you mean TLS and atomics run as fast as each other?


Replying to mmozeiko (#26903)