Strange speed test results.

Can someone help me to explain what I just observed?

The other day, I was running some code here to test the speed of making a call to some function 100000 times, as supposed to inlining a function.

From the last time I did it, I had saved away the results. For referance. And I've made this test a number of times, and they always stayed the same, to a certainty of about a couple 100s of cycles.

And to my great astonishment the times was CONSISTENTLY halved, from the last time I made the test. Which should be impossible. I have an invariant RDTSC in my computer. The only way it could be wrong is if the OS (or something else, like a driver) write to it, in ring 0.

I checked my processlist, and I found that a backup-program I have, that takes backups of my sourcefiles, each time they change, had gone into a loop. And it was taking like 7% of the CPU.

But how could this HALVE my timings? The only thing that can change an invariant TSC is ring 0 code, is it not?

Edited by Livet Ersomen Strøm on
Hard to say without looking into it more, but keep in mind that there are many things that affect performance which other processes on the computer can cause to happen (cache pollution, inopportune thread switches, etc.)

This is one of the reasons why typically when you are timing things, you do not do loops that average their time, you do loops that take the minimum time from all loop executions (sometimes they have to be loops within loops if the code is too small, but still), because you know that some (possibly high) portion of your test loops may get borked by Windows.

- Casey
cmuratori

Hard to say without looking into it more, but keep in mind that there are many things that affect performance which other processes on the computer can cause to happen (cache pollution, inopportune thread switches, etc.)


Yes, this is true. But let me give you my view on things, then you can correct my thinking if it is wrong.

As far as I know, each thread get's a timeslice from the system. And within that timeslice, no other thread will be allowed to run on that cpu core/thread. I mean, this is simply impossible, by the very nature of things. =>> unless the CPU is interrupted, then it will execute code where it's at.

So if my stuff can run to completion, within the timeslice, it will clock correctly.
Actually, if it couldn't we could just take a hammer and smash the crystal to bits.

But the beauty of this Invariant TSC, that afaik comes from a crystal, is that it is ALSO consistent across all the cores of the system. This is true for all new processors, and it is a real beauty, that intel got right. So if it is left alone, it will not be wrong.

It is not 100% correct, over time, it slides, but it is in fact the most accurate of the clocks in the system, unless I am mistaken. Particularly within short periods of time. And it's frequency on my machine, seems to correspond almost exactly to the 3.2Ghz of the machine. So that if I time 200 ms worth of GETTICKCOUNT calls, in a loop, (and time it, exit it, at 200 ms) - and then scale the number, I will get a number that says that my machine can execute 320 cycles, in 100 nano seconds. or 3,2 nanoseconds per cycle. Which is consistently, pretty fucking amazingly cool!

I have over and over (close to 100 times) clocked this particular CALLING code to take about 6 out of 20 cycles MORE TIME, then the same code run inline.

In other words, the code that makes a call inside the loop, runs at about 2 000 000 cycles, while the same code, inlined in the loop is about 1 400 000 cycles. I have seen this numbers consistently since I bought the machine, and they have never changed. And I am running them right now as I write this, and they are still like that. No other software is running. Just the "naked" system, and my test.

But as I said above, as this other software was hanging, a week ago or so, my timings was more then halfed. Every time I ran them. They stopped doing that, once I killeed the other prosess. I would have not mentioned it, if the timings doubled. But they MORE then halfed. It went quicker. Much quicker. It was like 600 000 cycles or something, the inline version, and 1 000 000 or something for the callversion. Repeatedly.

And that makes no sense to me. And I think it is important. To try to understand what may have cause it to act like that. Cause it can only be 1 of 2 things, imo. Either the counter has been changed by something, or the CPU (system) *actually* spent less time completing that code. Which means it is normally takes more time. For instance, it may have been that the other process, was hanging because of a system process was also hanging/sleeping or something, and that this systemprocess, then was simply yielding CPU? It's just a lame guess. Maybe the fact that my OS, by default, runs 600+ threads, for basically doing "nothing", is a rather important reason for my temptation at guessing. But don't tell me that we really have no way in hell to know, because then we should imidiatly go an execute something else, besides our code.

I understand that you can't just take my word for it. I understand that I am a potential source of error in this equation. I also realize you may not be interessted, since it's likely seems too vague. But if you for a moment assumed that the numbers I give are valid. What then would the conclusion be?

Because frankly, if we cannot conclude something, then we know nothing. Because the invariant TSC is documented, and also microsoft has documents that state the same as I do. Unless of course I missread them. And they say, that the TSC counter, is *invariant*, across cores, and that it can only be changed in ring 0.



cmuratori

This is one of the reasons why typically when you are timing things, you do not do loops that average their time, you do loops that take the minimum time from all loop executions (sometimes they have to be loops within loops if the code is too small, but still), because you know that some (possibly high) portion of your test loops may get borked by Windows.

- Casey



Could you post some code, or pseudocode that explains what kind of loop you mean? The way I see it, in the real life, we do real loops. And those are the once we like to time. If you take the minimum of several loops, what you get is the ideal time. Which will be quite unrealistic. But I am more interessted in the typical time. Because that is the one we get.

For instance. When some person posted here timings for a Copy, these are very unrealistic, since the PAGING time for the source and dest is not acounted for. And that is even MORE time then the copy itself. And in most cases, you will actually have to eat that extratime. Whereas if you used unpaged memories, you would never have to eat it. But when you copy as much as 1 GIGA of ram, in most REAL cases, you would page. Even in handmade, since you do not do it all the time, but only when you need to. Unless you made sure you would not. If you then timed the ideal time, you time just the hardware, and the numbers will be extemely unrealistic, to what you could expect under actuall usage.


PS. Today I again vitnessed this behaviour. And it was when Window7 was post booting after installation of servicepack 1. The timings are again halfed, or more then halfed. The CALL form of my loop is halfed, almost exactly. while the inline-version of the loop is MORE then halved in crystal cycles. It is now 600 000 cycles, almost to the cycle. I d be extremely happy if someone could sheed some light on this.

Edited by Livet Ersomen Strøm on
Kladdehelvete
When some person posted here timings for a Copy, these are very unrealistic, since the PAGING time for the source and dest is not acounted for. And that is even MORE time then the copy itself. And in most cases, you will actually have to eat that extratime. Whereas if you used unpaged memories, you would never have to eat it. But when you copy as much as 1 GIGA of ram, in most REAL cases, you would page.

Are you talking about code I posted here? I was running that on machine with 16GB and with page file (on Windows) and swap area (on Linux) DISABLED. There was nothing to page out or in, because there is no page file. Everything was fully in RAM.

Answering your questions about strange timings - as Casey said, it is hard to guess what could be wrong. What you should do is reduce test code to minimal amount possible (preferably one small function) that measures time and can show strange behavior. Then post code so we can analyze it and produce reasonable hypothesis. Otherwise it will be neverending guessing game.

Edited by Mārtiņš Možeiko on
mmozeiko

Are you talking about code I posted here? I was running that on machine with 16GB and with page file (on Windows) and swap area (on Linux) DISABLED. There was nothing to page out or in, because there is no page file. Everything was fully in RAM.


That's exactly why I say it is unrealistic. As compared to the everyday life you expect on the average computer. You will only reach a throughput of that much, when paging is unaccounted for. If it accounted for, you will find that the throughput for the copy, will be less than 50%

mmozeiko

Answering your questions about strange timings - as Casey said, it is hard to guess what could be wrong. What you should do is reduce test code to minimal amount possible (preferably one small function) that measures time and can show strange behavior. Then post code so we can analyze it and produce reasonable hypothesis. Otherwise it will be neverending guessing game.


I would be happy to accept that, if it was not for the fact that I have an *invariant* TSC. The entire point of having one, is that you will not experience the things you worry about.
Ah, I see what you meant. That benchmark was not to get expected memcpy numbers on user machine. It was to show that you can not expect to get memory bandwidth number specified in Intel CPU datasheets (what Casey was expecting).

What CPU are you using? Invariant TSC is not present on all CPU's.
You can use CPUID to check if you have invariant TSC. Check bit nr8 in EDX for 0x80000007 category: http://www.gossamer-threads.com/lists/xen/devel/185419
mmozeiko
Ah, I see what you meant. That benchmark was not to get expected memcpy numbers on user machine. It was to show that you can not expect to get memory bandwidth number specified in Intel CPU datasheets (what Casey was expecting).


Yes, but I would think it would be the memorychips, and not Intel to "blame" for that.

mmozeiko

What CPU are you using? Invariant TSC is not present on all CPU's.
You can use CPUID to check if you have invariant TSC. Check bit nr8 in EDX for 0x80000007 category: http://www.gossamer-threads.com/lists/xen/devel/185419


Invariant TSC are present in all new CPUs since at least 4 years ago. I don't know exactly, but you will find this info on MSDN. I will write the code, as you suggest later, but I already know I have one.

Edited by Livet Ersomen Strøm on
Of course I'm not blaming Intel for that. I'm simply stating that those numbers are not what you should expect. There are many different reasons why you won't get that - memory chips, their speed, motherboard, chipset, etc...

So what CPU you have? If it is SandyBridge, Ivy Bridge or Haswell, then I believe you have it. No need to write code then :) For other CPU's I'm not sure, there may be models that don't have it (I've no idea if that is true or not). And no idea about AMD cpus, I haven't used them for a long time.
mmozeiko

So what CPU you have? If it is SandyBridge, Ivy Bridge or Haswell, then I believe you have it. No need to write code then :) For other CPU's I'm not sure, there may be models that don't have it (I've no idea if that is true or not). And no idea about AMD cpus, I haven't used them for a long time.


its an i7.

If I recall well, all newer intel cpus, and amds have invariant TSC. That's how I "know".

.. so I just tested it with this code:

mov eax, 0x80000007
cpuid
test edx, Bit8
jz NoSupport
;yes we have invariant TSC, which is amazing !

NoSupport:

be careful when you test it, so you don't test for bit 7 instead of 8. They are zerobased.

Edited by Livet Ersomen Strøm on