Just to rephrase my question: I was just curious how many cycles it takes to use TLS or atomics. I know they're different from each other. If you don't like the word "faster" then I won't use it, just interested in the cycle count. When he said: "it is often exactly the same", I didn't know what the "it" was.
Also, the mulps and addps are SIMD instructions, what about normal instructions?
By "it" I meant mulps and addps having the same cycle count. It was a reply to your statement that mul takes more cycles than add.
You should not think about TLS and atomics as individual instructions. You can measure how long they take, but the result will differ depending on the situation. It is not as fixed as for regular instructions. Again, TLS is not an instruction; it is a concept of the OS and compiler that is implemented as a set of instructions. You could write code, see what instructions it generates, and then add up their costs or measure them to get the cycle count. It is similar for atomics: it is not possible to give an exact cycle count because it depends on the situation in which they are used, what the other cores are doing, and how they all access shared memory. In the worst case it will take as many cycles as accessing main memory, because the core needs to synchronize with the other cores.
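As a rough illustration of the "write code and measure it" suggestion, here is a minimal single-threaded sketch in portable C++. The names tls_counter, atomic_counter, average_ns, measure_tls and measure_atomic are made up for this example. It uses thread_local rather than __declspec(thread), and it reports wall-clock nanoseconds from std::chrono, not cycles, so treat the numbers only as a relative indication on one machine.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical counters for illustration only.
// volatile forces a real load and store on every iteration, so the
// compiler cannot collapse the whole loop into a single addition.
thread_local volatile std::uint64_t tls_counter = 0;  // implicit TLS
std::atomic<std::uint64_t> atomic_counter{0};         // shared atomic

// Runs op() `iterations` times and returns the average time per call
// in nanoseconds (loop overhead is included in the result).
template <typename Op>
double average_ns(Op op, std::uint64_t iterations) {
    assert(iterations > 0);
    auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        op();
    }
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}

// Average cost of incrementing the calling thread's TLS variable.
double measure_tls(std::uint64_t iterations) {
    return average_ns([] { tls_counter = tls_counter + 1; }, iterations);
}

// Average cost of an uncontended atomic increment (one thread only).
double measure_atomic(std::uint64_t iterations) {
    return average_ns(
        [] { atomic_counter.fetch_add(1, std::memory_order_relaxed); },
        iterations);
}
```

Calling measure_tls(1000000) and measure_atomic(1000000) from the same thread gives two averages to compare. Looking at the generated assembly for the two lambdas shows which instructions the compiler actually emitted for each access.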
I didn't expect to be given a single number; I was more interested in a range of numbers. Kind of like: if only one core needs to write to that data and the data is already in the cache, then it takes somewhere around x cycles. But if there are two cores that both need the data, or one core is writing to it and another core suddenly needs it, then it will be way slower and you're looking at somewhere between a and b cycles. I just wanted to know the scale rather than the actual number.
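One way to get a feel for that scale on a particular machine is to time the same atomic increment with one thread and then with several threads hitting the same variable. The sketch below is only an assumption about how such a test could look; Result and run_increments are hypothetical names. It measures wall-clock time including thread start-up, so it needs a large iteration count to be meaningful, and it reports nanoseconds rather than cycles.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical result type for this sketch.
struct Result {
    double ns_per_increment;    // wall time divided by increments per thread
    std::uint64_t final_count;  // total increments observed on the atomic
};

// Starts `thread_count` threads that each perform `per_thread` increments
// on one shared atomic. With thread_count == 1 the cache line stays with a
// single core; with more threads the cores compete for it.
Result run_increments(unsigned thread_count, std::uint64_t per_thread) {
    assert(thread_count > 0 && per_thread > 0);
    std::atomic<std::uint64_t> shared{0};

    auto worker = [&shared, per_thread] {
        for (std::uint64_t i = 0; i < per_thread; ++i) {
            shared.fetch_add(1, std::memory_order_relaxed);
        }
    };

    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < thread_count; ++t) {
        threads.emplace_back(worker);
    }
    for (auto& th : threads) {
        th.join();
    }
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::nano> elapsed = stop - start;
    return Result{elapsed.count() / static_cast<double>(per_thread),
                  shared.load()};
}
```

Comparing run_increments(1, n) with run_increments(2, n) for a large n shows how much slower each increment becomes once a second core wants the same data. The ratio depends on the CPU and on which cores the threads land on.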
You can read more about how TLS works in this series of articles (for Windows): http://www.nynaeve.net/?p=190
So I finally finished this and it's very useful. There are a couple of things that are confusing though:
1. What about __declspec(thread) variables that are initialized by a function/constructor?
2. Explicit TLS (TlsAlloc, TlsGetValue, etc) is implemented using the two arrays TlsSlots and TlsExpansionSlots inside the TEB structure. Later, when talking about implicit TLS (__declspec(thread)), he introduced a new pointer called ThreadLocalStoragePointer. When I first read this, I was confused about the relationship between ThreadLocalStoragePointer and TlsSlots/TlsExpansionSlots, but after some digging around, it seems that it's just another field inside the TEB struct that works like TlsSlots/TlsExpansionSlots but only for implicit TLS. So rather than using TlsAlloc, the compiler just uses a completely separate array for implicit TLS. Each slot in this array is for a module (presumably a module is the current main process, a DLL, or a child process). Is what I'm saying correct?
3. Why does the ThreadLocalStoragePointer array even need to be dynamically allocated? Can't it just work like TlsSlots, where you just have a fixed-size array (1024 or 1088)?
4. When I use __declspec(thread) in combination with __declspec(dllexport/dllimport), the compiler complains about it. Is there any way for me to declare that a variable is both TLS and exported/imported to/from another DLL?
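On the question of combining __declspec(thread) with __declspec(dllexport/dllimport): one workaround that is often used is to keep the thread-local variable private to the DLL and export a small accessor function that returns its address for the calling thread. The sketch below assumes that approach. EXAMPLE_API, tls_value, example_tls_value and threads_see_separate_copies are hypothetical names, and it uses the portable thread_local keyword so the same file also builds outside Windows, where the macro expands to nothing.

```cpp
#include <cassert>
#include <thread>

// Hypothetical export macro: dllexport on Windows, nothing elsewhere.
#if defined(_WIN32)
#  define EXAMPLE_API __declspec(dllexport)
#else
#  define EXAMPLE_API
#endif

namespace {
// The thread-local variable itself is NOT exported.
thread_local int tls_value = 0;
}

// Exported accessor: returns the address of the calling thread's copy.
// A client of the DLL would declare this with dllimport and call it
// instead of touching the variable directly.
extern "C" EXAMPLE_API int* example_tls_value() {
    return &tls_value;
}

// Sanity check that each thread really gets its own copy.
bool threads_see_separate_copies() {
    int* main_ptr = example_tls_value();
    assert(main_ptr != nullptr);
    *main_ptr = 1;  // write to this thread's copy

    bool distinct_address = false;
    int other_initial = -1;
    std::thread worker([&] {
        int* other_ptr = example_tls_value();
        distinct_address = (other_ptr != main_ptr);
        other_initial = *other_ptr;  // a fresh copy should still be 0
        *other_ptr = 2;              // must not affect the first thread
    });
    worker.join();

    return distinct_address && other_initial == 0 && *main_ptr == 1;
}
```

The cost of this design is one function call per access, although a caller can cache the returned pointer for as long as it stays on the same thread.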