cmuratori
Hard to say without looking into it more, but keep in mind that there are many things that affect performance which other processes on the computer can cause to happen (cache pollution, inopportune thread switches, etc.)
Yes, this is true. But let me give you my view on things, then you can correct my thinking if it is wrong.
As far as I know, each thread gets a timeslice from the system, and within that timeslice no other thread is allowed to run on that CPU core/hardware thread. This is simply impossible, by the very nature of things: unless the CPU is interrupted, it will keep executing the code it is at.
So if my stuff can run to completion within the timeslice, it will clock correctly.
Actually, if it couldn't, we could just take a hammer and smash the crystal to bits.
But the beauty of this invariant TSC, which afaik is derived from a crystal, is that it is ALSO consistent across all the cores of the system. This is true for all new processors, and it is a real beauty that Intel got right. So if it is left alone, it will not be wrong.
It is not 100% correct; over time it drifts, but it is in fact the most accurate of the clocks in the system, unless I am mistaken, particularly within short periods of time. And its frequency on my machine seems to correspond almost exactly to the 3.2 GHz of the machine. So if I time 200 ms worth of GetTickCount calls in a loop (and time it, exit it, at 200 ms) and then scale the number, I get a result that says my machine can execute 320 cycles in 100 nanoseconds, or 3.2 cycles per nanosecond. Which is consistently, pretty fucking amazingly cool!
I have, over and over (close to 100 times), clocked this particular CALLING code taking about 6 cycles out of 20 MORE TIME than the same code run inline.
In other words, the code that makes a call inside the loop runs at about 2 000 000 cycles, while the same code, inlined in the loop, is about 1 400 000 cycles. I have seen these numbers consistently since I bought the machine, and they have never changed. And I am running them right now as I write this, and they are still like that. No other software is running. Just the "naked" system, and my test.
But as I said above, while this other software was hanging, a week or so ago, my timings were more than halved, every time I ran them. They stopped doing that once I killed the other process. I would not have mentioned it if the timings had doubled. But they MORE than halved. It went quicker. Much quicker. It was like 600 000 cycles or something for the inline version, and 1 000 000 or something for the call version. Repeatedly.
And that makes no sense to me. And I think it is important to try to understand what may have caused it to act like that. Because it can only be one of two things, imo: either the counter has been changed by something, or the CPU (system) *actually* spent less time completing that code, which would mean it normally takes more time. For instance, it may have been that the other process was hanging because a system process was also hanging/sleeping or something, and that this system process was then simply yielding the CPU? It's just a lame guess. Maybe the fact that my OS, by default, runs 600+ threads for basically doing "nothing" is a rather important reason for my temptation to guess. But don't tell me that we really have no way in hell to know, because then we should immediately go and execute something else besides our code.
I understand that you can't just take my word for it. I understand that I am a potential source of error in this equation. I also realize you may not be interested, since it likely seems too vague. But if you for a moment assumed that the numbers I give are valid, what would the conclusion be?
Because frankly, if we cannot conclude something, then we know nothing. The invariant TSC is documented, and Microsoft also has documents that state the same as I do, unless of course I misread them. And they say that the TSC counter is *invariant* across cores, and that it can only be changed in ring 0.
cmuratori
This is one of the reasons why, typically, when you are timing things, you do not write loops that average their time; you write loops that take the minimum time over all loop executions (sometimes they have to be loops within loops if the code is too small, but still), because you know that some (possibly high) portion of your test loops may get borked by Windows.
- Casey
Could you post some code, or pseudocode, that explains what kind of loop you mean? The way I see it, in real life we do real loops, and those are the ones we like to time. If you take the minimum of several loops, what you get is the ideal time, which will be quite unrealistic. But I am more interested in the typical time, because that is the one we get.
For instance: when someone posted timings here for a copy, those were very unrealistic, since the PAGING time for the source and dest is not accounted for. And that is even MORE time than the copy itself. And in most cases you will actually have to eat that extra time, whereas if you used non-paged memory you would never have to eat it. But when you copy as much as 1 GB of RAM, in most REAL cases you would page. Even in Handmade, since you do not do it all the time, but only when you need to, unless you made sure you would not. If you then timed the ideal time, you would be timing just the hardware, and the numbers would be extremely unrealistic compared to what you could expect under actual usage.
PS. Today I again witnessed this behaviour, and it was when Windows 7 was post-booting after installation of Service Pack 1. The timings are again halved, or more than halved. The CALL form of my loop is halved almost exactly, while the inline version of the loop is MORE than halved in crystal cycles. It is now 600 000 cycles, almost to the cycle. I'd be extremely happy if someone could shed some light on this.