Question about thread local storage

Alexey

#26906

October 7, 2022

They meant that mulps/mulss and addps/addss have the same latency on many x64 chips, but that doesn't mean that one of them is "faster" than the other. It's like asking what is faster: an apple or an orange? Depends on how you use it, I guess ;)

longtran2904

#26907

October 8, 2022

Just to rephrase my question: I was just curious how many cycles it takes to use TLS or atomics. I know they're different from each other. If you don't like the word "faster" then I won't use it, just interested in the cycle count. When he said: "it is often exactly the same", I didn't know what the "it" was.

Also, the mulps and addps are SIMD instructions, what about normal instructions?

Replying to aolo2 (#26906)

Mārtiņš Možeiko

#26908

October 8, 2022

By "it" I meant mulps and addps having the same cycle count. It was a reply to your statement that mul takes more cycles than add.

You should not think about TLS and atomics as individual instructions. You can measure how long they take, but the result will differ depending on the situation. It is not as fixed as it is for regular instructions. Again, TLS is not an instruction; it is a concept of the OS and compiler that is implemented as a set of instructions. You could write code, see what instructions it generates, and then add them up or measure to get the cycle count. The same goes for atomics: it is not possible to give an exact cycle count because it depends on the situation where they are used, how the other cores run, and how they all access shared memory. In the worst case it will take as many cycles as accessing main memory, because it needs to synchronize with other cores.
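As a rough illustration of "write code and see what instructions it generates", here is a minimal sketch in portable C++11. The names `g_plain`, `t_counter`, `g_atomic` and the `bump_*` functions are made up for this example. Compiling it with optimizations and asking for an assembly listing (for example `g++ -O2 -S`, or `/FA` with MSVC) shows the three accesses side by side; the exact instructions depend on the compiler, OS and target.

```cpp
#include <atomic>

// Hypothetical names, chosen only for this illustration.
int g_plain = 0;                  // ordinary global, shared by all threads
thread_local int t_counter = 0;   // implicit TLS: one copy per thread
std::atomic<int> g_atomic{0};     // shared counter with atomic access

// Plain read-modify-write on a global.
int bump_plain() { return ++g_plain; }

// Same operation on a thread-local variable. The compiler first has to
// locate this thread's copy, then does an ordinary increment.
int bump_tls() { return ++t_counter; }

// Atomic read-modify-write; returns the value after the increment.
int bump_atomic() {
    return g_atomic.fetch_add(1, std::memory_order_relaxed) + 1;
}
```

The listing only gives the instruction sequence. How many cycles that sequence costs still depends on cache state and on what other cores are doing, so it has to be measured in the situation you care about.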

Edited by Mārtiņš Možeiko on October 8, 2022, 3:34am

Replying to longtran2904 (#26907)

longtran2904

#26910

October 9, 2022

I didn't expect to be given a single number, I was more interested in a range of numbers. Kind of like: if only one core needs to write to that data and the data is already in the cache, then it takes somewhere around x cycles. But if there are two cores that both need the data, or there's a core that is writing to it and another core suddenly needs it, then it will be way slower and you're looking at a number between a and b cycles. I just wanted to know the scale rather than the actual number.
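One way to get a feel for that scale on a particular machine is to measure the two cases directly. The sketch below is only an assumption about how such a test could look: `Result` and `hammer_counter` are hypothetical names, and the timing uses wall-clock time (including thread start-up), so it gives a rough average and not a per-instruction cycle count.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical result type: final counter value and average cost per op.
struct Result {
    std::uint64_t total;
    double ns_per_op;
};

// Starts `threads` workers that each perform `iters` relaxed fetch_add
// operations on ONE shared counter. With threads == 1 the cache line stays
// with a single core; with more threads the line bounces between cores.
Result hammer_counter(int threads, std::uint64_t iters) {
    std::atomic<std::uint64_t> counter{0};
    auto start = std::chrono::steady_clock::now();

    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, iters] {
            for (std::uint64_t i = 0; i < iters; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& th : pool) th.join();

    auto stop = std::chrono::steady_clock::now();
    double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    // Wall-clock time divided by the total number of operations.
    return {counter.load(), ns / (double(threads) * double(iters))};
}
```

Comparing `hammer_counter(1, n).ns_per_op` with `hammer_counter(4, n).ns_per_op` for a large `n` gives the uncontended and contended figures for that machine. The numbers vary a lot between CPUs, so treat them as a scale only.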

Replying to mmozeiko (#26908)

longtran2904

#30196

June 5, 2024

You can read how TLS works more in these series of articles (for Windows): http://www.nynaeve.net/?p=190

So I finally finished this and it's very useful. There are a couple of things that are confusing though:

In part 3, he explained how TLS callbacks work. What's the point of these callbacks? Is it for __declspec(thread) variables that are initialized by a function/constructor?
The series starts by explaining how explicit TLS (TlsAlloc, TlsGetValue, etc) is implemented using the two arrays TlsSlots and TlsExpansionSlots inside the TEB structure. Later, when talking about implicit TLS (__declspec(thread)), he introduced a new pointer called ThreadLocalStoragePointer. When I first read this, I was confused about the relationship between ThreadLocalStoragePointer and TlsSlots/TlsExpansionSlots, but after some digging around, it seems that it's just another field inside the TEB struct that works like TlsSlots/TlsExpansionSlots but only for implicit TLS. So rather than using TlsAlloc, the compiler just uses a completely separate array for implicit TLS. Each slot in this array is for a module (presumably a module is the current main process, a DLL, or a child process). Is what I'm saying correct?
If this is the case, then why does the ThreadLocalStoragePointer array even need to be dynamically allocated? Can't it just work like the TlsSlots where you just have a fixed size array (1024 or 1088)?
When I use __declspec(thread) in combination with __declspec(dllexport/dllimport), the compiler complains about it. Is there any way for me to declare that a variable is both TLS and exported/imported to/from another DLL?
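
The layout described in the questions above can be pictured with a toy model. This is a sketch under assumptions, not the real TEB definition: `FakeTeb`, `explicit_get` and `implicit_address` are hypothetical names, only the three field names come from the article series, and the real loader logic is more involved. The sizes are the 64 built-in slots plus 1024 expansion slots that add up to the 1088 mentioned above.

```cpp
#include <cstddef>

// Toy stand-in for the TLS-related fields of the TEB; NOT the real layout.
constexpr std::size_t kTlsSlots = 64;             // fixed array inside the TEB
constexpr std::size_t kTlsExpansionSlots = 1024;  // separate, allocated on demand

struct FakeTeb {
    void*  TlsSlots[kTlsSlots] = {};             // explicit TLS, indices 0..63
    void** TlsExpansionSlots = nullptr;          // explicit TLS, indices 64..1087
    void** ThreadLocalStoragePointer = nullptr;  // implicit TLS, one entry per module
};

// Explicit TLS: roughly the lookup a TlsGetValue-style call performs.
void* explicit_get(const FakeTeb& teb, std::size_t index) {
    if (index < kTlsSlots)
        return teb.TlsSlots[index];
    if (index < kTlsSlots + kTlsExpansionSlots && teb.TlsExpansionSlots)
        return teb.TlsExpansionSlots[index - kTlsSlots];
    return nullptr;  // out of range, or expansion array never allocated
}

// Implicit TLS: the module's index selects this thread's copy of the
// module's TLS block; the variable sits at a fixed offset inside it.
void* implicit_address(const FakeTeb& teb, std::size_t module_tls_index,
                       std::size_t offset) {
    auto* block = static_cast<unsigned char*>(
        teb.ThreadLocalStoragePointer[module_tls_index]);
    return block + offset;
}
```

In this picture the explicit slots hold one pointer-sized value per index, while each implicit entry points at a whole per-module block whose size is only known once the module is loaded.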

Replying to mmozeiko (#26893)