Alexey
16 posts
Question about thread local storage

They meant that mulps/mulss and addps/addss have the same latency on many x64 chips, but that doesn't mean that one of them is "faster" than the other. It's like asking what is faster: an apple or an orange? Depends on how you use it, I guess ;)

217 posts
Replying to aolo2 (#26906)

Just to rephrase my question: I was just curious how many cycles it takes to use TLS or atomics. I know they're different from each other. If you don't like the word "faster" then I won't use it, just interested in the cycle count. When he said: "it is often exactly the same", I didn't know what the "it" was.

Also, mulps and addps are SIMD instructions. What about normal instructions?

Mārtiņš Možeiko
2562 posts / 2 projects
Replying to longtran2904 (#26907)

By "it" I meant mulps and addps having the same cycle count. It was a reply to your statement that mul takes more cycles than add.

You should not think about TLS and atomics as individual instructions. You can measure how long they take, but the result will differ depending on the situation. It is not as fixed as it is for regular instructions. Again, TLS is not an instruction; it is a concept of the OS and compiler that is implemented as a set of instructions. You could write code, see what instructions it generates, and then add them up or measure to get the cycle count. It is similar for atomics: it is not possible to give an exact cycle count because it depends on where the atomic is used, how the other cores run, and how they all access shared memory. In the worst case it will take as many cycles as accessing main memory, because it needs to synchronize with other cores.
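As a rough illustration of "write code and measure", here is a minimal C++ sketch. All names in it (tls_counter, shared_counter, bump_tls, bump_atomic, ns_per_iteration) are made up for this example and are not from any library. It assumes a C++11 compiler and only times the single-threaded, uncontended case, so the numbers say nothing about what happens once other cores touch the same cache line.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical names, chosen only for this sketch.
thread_local std::uint64_t tls_counter = 0;    // each thread gets its own copy
std::atomic<std::uint64_t> shared_counter{0};  // one copy shared by all threads

// Increment the calling thread's private counter n times.
// Note: an optimizing compiler may collapse this loop into a single add,
// so inspect the generated assembly before trusting any timing.
std::uint64_t bump_tls(std::uint64_t n) {
    for (std::uint64_t i = 0; i < n; ++i) {
        tls_counter += 1;
    }
    return tls_counter;
}

// Increment the shared counter n times with an atomic read-modify-write.
std::uint64_t bump_atomic(std::uint64_t n) {
    for (std::uint64_t i = 0; i < n; ++i) {
        shared_counter.fetch_add(1, std::memory_order_relaxed);
    }
    return shared_counter.load(std::memory_order_relaxed);
}

// Average wall-clock nanoseconds per iteration of fn(n).
// This is a coarse measurement, not a cycle-accurate one.
template <typename F>
double ns_per_iteration(F fn, std::uint64_t n) {
    assert(n > 0);
    auto start = std::chrono::steady_clock::now();
    fn(n);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count() /
           static_cast<double>(n);
}
```

Calling ns_per_iteration(bump_tls, n) and ns_per_iteration(bump_atomic, n) with a large n gives a first impression on one machine. To see which instructions the TLS access actually turns into, compile with optimizations and look at the disassembly for your compiler and OS.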

217 posts
Replying to mmozeiko (#26908)

I didn't expect to be given a single number, I was more interested in a range of numbers. Kind of like: if only one core needs to write to that data and the data is already in the cache, then it takes somewhere around x cycles. But if there are two cores that both need the data, or there's a core that is writing to it and another core suddenly needs it, then it will be way slower and you're looking at a number between a and b cycles. I just wanted to know the scale rather than the actual number.
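One way to get a feel for that scale on a particular machine is to run the same atomic increment loop with one thread and then with several threads hitting the same counter. The sketch below is only an assumption about how such a test could look; ContentionResult and run_increments are hypothetical names. The timing includes thread startup and join, so it is only meaningful for large iteration counts.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical result type for this sketch.
struct ContentionResult {
    std::uint64_t total;  // final value of the shared counter
    double ns_per_op;     // wall-clock nanoseconds divided by total increments
};

// Run thread_count threads that each do per_thread atomic increments
// on one shared counter, and report the average cost per increment.
ContentionResult run_increments(unsigned thread_count, std::uint64_t per_thread) {
    assert(thread_count > 0);
    std::atomic<std::uint64_t> counter{0};
    std::vector<std::thread> workers;
    workers.reserve(thread_count);

    // The measured interval also covers thread creation and joining.
    auto start = std::chrono::steady_clock::now();
    for (unsigned t = 0; t < thread_count; ++t) {
        workers.emplace_back([&counter, per_thread] {
            for (std::uint64_t i = 0; i < per_thread; ++i) {
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    auto stop = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    std::uint64_t total = counter.load();
    return {total, total ? ns / static_cast<double>(total) : 0.0};
}
```

Comparing run_increments(1, n).ns_per_op with run_increments(4, n).ns_per_op for a large n shows the gap between the uncontended and the contended case on that machine. The numbers will vary between CPUs and runs.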