Question about thread local storage

aolo2

Alexey

16 posts

#26906

1 year, 9 months ago

They meant that mulps/mulss and addps/addss have the same latency on many x64 chips, but that doesn't mean that one of them is "faster" than the other. It's like asking what is faster: an apple or an orange? Depends on how you use it, I guess ;)

longtran2904

251 posts

#26907

1 year, 9 months ago

Replying to aolo2 (#26906)

Just to rephrase my question: I was just curious how many cycles it takes to use TLS or atomics. I know they're different from each other. If you don't like the word "faster" then I won't use it, just interested in the cycle count. When he said: "it is often exactly the same", I didn't know what the "it" was.

Also, the mulps and addps are SIMD instructions, what about normal instructions?

mmozeiko

Mārtiņš Možeiko

2583 posts / 2 projects

#26908

1 year, 9 months ago Edited by Mārtiņš Možeiko on October 8, 2022, 3:34am

Replying to longtran2904 (#26907)

By "it" I meant mulps and addps having the same cycle count. It was a reply to your statement that mul takes more cycles than add.

You should not think about TLS and atomics as individual instructions. You can measure how long they take, but it will differ depending on the situation. It is not as fixed as regular instructions. Again, TLS is not an instruction; it is a concept of the OS and compiler that is implemented as a set of instructions. You could write code, see what instructions it generates, and then add them up or measure to know the cycle count. Similarly for atomics: it is not possible to tell an exact cycle count because it depends on the situation where it is used, how other cores run, and how they all access shared memory. In the worst case it will take as many cycles as accessing main memory, because it needs to synchronize with other cores.
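As a rough illustration of "write code and see what instructions it generates", here is a minimal sketch in portable C11 rather than MSVC's `__declspec(thread)`. The function names `bump_tls` and `bump_atomic` are made up for this example. Compiling it with optimizations and asking the compiler for assembly output (for instance `-O2 -S` on GCC/Clang) shows what each access turns into on a given target.

```c
#include <stdatomic.h>

/* One copy of this variable exists per thread. */
static _Thread_local long tls_counter;

/* One copy shared by every thread; updates must be atomic. */
static atomic_long shared_counter;

/* Hypothetical helper: increments the calling thread's own counter. */
long bump_tls(void)
{
    return ++tls_counter;
}

/* Hypothetical helper: atomically increments the shared counter
   and returns the new value. */
long bump_atomic(void)
{
    return atomic_fetch_add(&shared_counter, 1) + 1;
}
```

The generated code for `bump_tls` depends on the platform and TLS model, which is part of why a single cycle count does not exist for "using TLS".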

longtran2904

251 posts

#26910

1 year, 9 months ago

Replying to mmozeiko (#26908)

I didn't expect to be given a single number; I was more interested in a range of numbers. Kind of like: if only one core needs to write to the data and the data is already in the cache, then it takes somewhere around x cycles. But if there are two cores that both need the data, or one core is writing to it and another core suddenly needs it, then it will be way slower and you're looking at somewhere between a and b cycles. I just wanted to know the scale rather than the actual number.
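One way to get a feel for that scale on your own machine is to time the same atomic increment with one thread and then with several. The sketch below assumes a POSIX system (pthreads and `clock_gettime`); `ns_per_increment` and `final_count` are names invented for this example, and the measurement includes thread start-up cost, so it only gives a coarse picture when `iters` is large.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

#define MAX_THREADS 16

static atomic_long counter;

/* Each worker hammers the same shared counter. */
static void *worker(void *arg)
{
    long iters = *(long *)arg;
    for (long i = 0; i < iters; i++)
        atomic_fetch_add(&counter, 1);
    return NULL;
}

/* Runs `nthreads` workers doing `iters` increments each and returns the
   average wall-clock nanoseconds per increment (thread creation and
   joining are included in the timing). */
double ns_per_increment(int nthreads, long iters)
{
    pthread_t threads[MAX_THREADS];
    struct timespec start, end;

    if (nthreads < 1) nthreads = 1;
    if (nthreads > MAX_THREADS) nthreads = MAX_THREADS;

    atomic_store(&counter, 0);
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < nthreads; i++)
        pthread_create(&threads[i], NULL, worker, &iters);
    for (int i = 0; i < nthreads; i++)
        pthread_join(threads[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (double)(end.tv_sec - start.tv_sec) * 1e9
              + (double)(end.tv_nsec - start.tv_nsec);
    return ns / ((double)nthreads * (double)iters);
}

/* Value of the shared counter after the last run. */
long final_count(void)
{
    return atomic_load(&counter);
}
```

Comparing `ns_per_increment(1, n)` against `ns_per_increment(4, n)` for a large `n` shows the uncontended versus contended cost on that particular machine; the numbers will vary between CPUs and runs.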

longtran2904

251 posts

#30196

1 month, 3 weeks ago

Replying to mmozeiko (#26893)

You can read how TLS works more in these series of articles (for Windows): http://www.nynaeve.net/?p=190

So I finally finished this and it's very useful. There are a couple of things that are confusing though:

In part 3, he explained about how TLS callbacks work. What's the point of these callbacks? Is it for __declspec(thread) variables that are initialized by a function/constructor?
The series starts by explaining how explicit TLS (TlsAlloc, TlsGetValue, etc) is implemented using the two arrays TlsSlots and TlsExpansionSlots inside the TEB structure. Later, when talking about implicit TLS (__declspec(thread)), he introduces a new pointer called ThreadLocalStoragePointer. When I first read this, I was confused about the relationship between ThreadLocalStoragePointer and TlsSlots/TlsExpansionSlots, but after some digging around, it seems that it's just another field inside the TEB struct that works like TlsSlots/TlsExpansionSlots but only for implicit TLS. So rather than using TlsAlloc, the compiler just uses a completely separate array for implicit TLS. Each slot in this array is for a module (presumably a module is the current main process, a DLL, or a child process). Is what I'm saying correct?
If this is the case, then why does the ThreadLocalStoragePointer array even need to be dynamically allocated? Can't it just work like the TlsSlots where you just have a fixed size array (1024 or 1088)?
When I use __declspec(thread) in combination with __declspec(dllexport/dllimport), the compiler complains about it. Is there any way for me to declare that a variable is both TLS and exported/imported to/from another DLL?
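To make the explicit/implicit distinction concrete, here is a small sketch using the POSIX analogue rather than the Windows API: `pthread_key_create`/`pthread_setspecific`/`pthread_getspecific` play a role similar to `TlsAlloc`/`TlsSetValue`/`TlsGetValue`, and C11 `_Thread_local` plays a role similar to `__declspec(thread)`. This says nothing about how Windows lays out the TEB; the helper names (`set_both`, `get_explicit`, `get_implicit`, `other_thread_sees_zero`) are invented for the example.

```c
#include <pthread.h>
#include <stdint.h>

/* Explicit TLS: a key obtained at run time, looked up through API calls. */
static pthread_key_t slot;
static pthread_once_t slot_once = PTHREAD_ONCE_INIT;

/* Implicit TLS: the compiler and loader arrange per-thread storage. */
static _Thread_local intptr_t implicit_value;

static void make_slot(void)
{
    pthread_key_create(&slot, NULL);
}

/* Stores v in both kinds of thread-local storage for the calling thread. */
void set_both(intptr_t v)
{
    pthread_once(&slot_once, make_slot);
    pthread_setspecific(slot, (void *)v);
    implicit_value = v;
}

intptr_t get_explicit(void)
{
    pthread_once(&slot_once, make_slot);
    return (intptr_t)pthread_getspecific(slot);
}

intptr_t get_implicit(void)
{
    return implicit_value;
}

/* A new thread starts with its own NULL/zero values in both storages. */
static void *other_thread(void *out)
{
    intptr_t *seen = out;
    seen[0] = get_explicit();
    seen[1] = get_implicit();
    set_both(99); /* changes only this thread's copies */
    return NULL;
}

/* Returns 1 if a freshly created thread saw zero in both storages. */
int other_thread_sees_zero(void)
{
    intptr_t seen[2] = { -1, -1 };
    pthread_t t;
    if (pthread_create(&t, NULL, other_thread, seen) != 0)
        return 0;
    pthread_join(t, NULL);
    return seen[0] == 0 && seen[1] == 0;
}
```

In both cases each thread gets its own value; the difference is whether the program asks for a slot and goes through function calls, or the toolchain sets up the storage and the access is emitted inline.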