Results of some quick research on timing in Win32
by Ryan Geiss - 16 August 2002 (...with updates since then)


You might be thinking to yourself: this is a pretty simple thing
to be posting; what's the big deal?  The deal is that somehow,
good timing code has eluded me for years.  Finally, frustrated,
I dug in and did some formal experiments on a few different computers,
testing the timing precision they could offer, using various win32
functions.  I was fairly surprised by the results!

I tested on three computers; here are their specs:

    Gemini:   933 MHz desktop, win2k
    Vaio:     333 MHz laptop,  win98
    HP:       733 MHz laptop,  win2k

Also, abbreviations to be used hereafter:

    ms: milliseconds, or 1/1,000 of a second
    us: microseconds, or 1/1,000,000 of a second


timeGetTime - what they don't tell you 
    
First, I tried to determine the precision of timeGetTime().  
In order to do this, I simply ran a loop, constantly polling
timeGetTime() until the time changed, and then printing the 
delta (between the prev. time and the new time).  I then looked
at the output, and for each computer, took the minimum of all
the deltas that occurred.  (Usually, the minimum was very solid,
occurring about 90% of the time.)  The results:

              Resolution of timeGetTime()
    Gemini:   10 ms
    Vaio:     1 ms
    HP:       10 ms
    
For now, I am assuming that it was the OS kernel that made the
difference: win2k offers a default timeGetTime() resolution of 10 ms,
while win98 is much better, at 1 ms.  I assume that WinXP would also
have a resolution of 10 ms, and that Win95 would be ~1 ms, like Win98.
(If anyone tests this out, please let me know either way!)

(Note that using timeGetTime() unfortunately requires linking to 
winmm.lib, which slightly increases your file size.  You could use 
GetTickCount() instead, which doesn't require linking to winmm.lib,
but it tends not to have as good a timer resolution... so I would 
recommend sticking with timeGetTime().)

Next, I tested Sleep().  A while back I noticed that when you call
Sleep(1), it doesn't really sleep for 1 ms; it usually sleeps for longer
than that.  I verified this by calling Sleep(1) ten times in a row,
and taking the difference in timeGetTime() readings from the beginning
to the end.  Whatever delta there was for these ten sleeps, I just divided
it by 10 to get the average duration of Sleep(1).  This turned out to be:

              Average duration of Sleep(1)
    Gemini:   10 ms  (10 calls to Sleep(1) took exactly 100 ms)
    Vaio:     ~4 ms  (10 calls to Sleep(1) took 35-45 ms)
    HP:       10 ms  (10 calls to Sleep(1) took exactly 100 ms)

Now, this was disturbing, because it meant that if you call Sleep(1)
and Sleep(9) on a win2k machine, there is no difference - it still
sleeps for 10 ms!  "So *this* is the reason all my timing code sucks,"
I sighed to myself.

Given that, I decided to give up on Sleep() and timeGetTime().  The
application I was working on required really good fps limiting, and
10ms Sleeps were not precise enough to do a good job.  So I looked
elsewhere.

UPDATE: Matthijs de Boer points out that the timeGetTime function 
returns a DWORD value, which wraps around to 0 every 2^32 
milliseconds (about 49.71 days), so you should write your code 
to handle this possibility.
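One way to stay safe, as a minimal sketch: keep the timestamps as DWORDs
and do the subtraction in unsigned arithmetic, which yields the correct
elapsed time even across a single wrap (as long as the two readings are
taken less than ~49.7 days apart):

        // sketch: wrap-safe elapsed-time calculation with timeGetTime().
        // Unsigned 32-bit subtraction gives the right answer even if the
        // counter wrapped once between the two readings.
        DWORD start_ms = timeGetTime();
        // ... some time passes ...
        DWORD elapsed_ms = timeGetTime() - start_ms;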



timeBeginPeriod / timeEndPeriod

HOWEVER, I should not have given up so fast!  It turns out that there
is a win32 function, timeBeginPeriod(), which solves our problem:
it lowers the granularity of Sleep() to whatever value (in milliseconds)
you give it.  So if you're on Windows 2000 and you call timeBeginPeriod(1) 
and then Sleep(1), it will truly sleep for just 1 millisecond, rather than 
the default 10!

Strictly speaking, timeBeginPeriod() raises the resolution of the system
timer itself (Windows honors the lowest period requested by any running
process), but the request is released again by timeEndPeriod(), so don't 
worry about permanently messing up the system with it.  Just be sure you 
call timeEndPeriod() when your program exits, with the same parameter you 
fed into timeBeginPeriod() when your program started (presumably 1).
Both of these functions are in winmm.lib, so you'll have to link to it
if you want to lower your Sleep() granularity down to 1 ms.
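As a minimal sketch, the usage pattern is just a matched pair of calls
bracketing your program:

        // minimal usage sketch: raise the timer resolution for the life of the program
        #include <windows.h>
        #pragma comment(lib, "winmm.lib")   // timeBeginPeriod / timeEndPeriod / timeGetTime
        
        int main()
        {
            timeBeginPeriod(1);      // request 1 ms timer resolution
        
            // ... your program; Sleep(1) now really sleeps ~1-2 ms ...
            Sleep(1);
        
            timeEndPeriod(1);        // release the request (same value as above)
            return 0;
        }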

How reliable is it?  I have yet to find a system for which timeBeginPeriod(1) 
does not drop the granularity of Sleep(1) to 1 or, at most, 2 milliseconds.
If anyone out there finds one, please let me know; I'd like to hear 
about it, and I will post a warning here.

Note that calling timeBeginPeriod() also affects the granularity of some
other timing calls, such as CreateWaitableTimer() and WaitForSingleObject(); 
however, some functions are still unaffected, such as _ftime().  (Special 
thanks to Mark Epstein for pointing this out to me!)
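For example, here is a minimal sketch of a ~1 ms wait using a waitable
timer (due times are given in 100-nanosecond units, negative meaning
"relative to now"); with timeBeginPeriod(1) in effect, a short wait like
this comes back much closer to its nominal due time:

        // sketch: a ~1 ms wait via a waitable timer (granularity improves
        // after timeBeginPeriod(1), just like Sleep()).
        HANDLE hTimer = CreateWaitableTimer(NULL, TRUE, NULL);  // manual-reset, unnamed
        if (hTimer)
        {
            LARGE_INTEGER due;
            due.QuadPart = -10000;                 // 1 ms, in 100-ns units; negative = relative
            SetWaitableTimer(hTimer, &due, 0, NULL, NULL, FALSE);
            WaitForSingleObject(hTimer, INFINITE); // returns when the timer fires
            CloseHandle(hTimer);
        }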


some convenient test code

The following code will tell you two things:
    1. The granularity, or minimum resolution, of calls to timeGetTime() 
on your system.  In other words, if you sit in a tight loop calling timeGetTime(),
only noting when the returned value changes, what deltas do you see?  This 
granularity tells you, more or less, what kind of potential error to expect in 
the result when calling timeGetTime().
    2. How long your machine really sleeps when you call Sleep(1).
Often this is actually 2 or more milliseconds, so be careful!

NOTE that these tests are performed after calling timeBeginPeriod(1), so if
you forget to call timeBeginPeriod(1) in your own init code, you might not get 
as good a granularity as you see from this test!

        
        #include        <stdio.h>
        #include        "windows.h"
        #pragma comment(lib, "winmm.lib")   // timeGetTime, timeBeginPeriod, timeEndPeriod
        
        int main(int argc, char **argv)
        {
            const int count = 64;
        
            timeBeginPeriod(1);
        
            printf("1. testing granularity of timeGetTime()...\n");
            int its = 0;
            DWORD cur = 0, last = timeGetTime();
            while (its < count) {
                cur = timeGetTime();
                if (cur != last) {
                    printf("%u ", cur-last);    // delta, in ms, between changes in timeGetTime()
                    last = cur;
                    its++;
                }
            }
            
            printf("\n\n2. testing granularity of Sleep(1)...\n  ");
            last = timeGetTime();
            for (int n=0; n<count; n++) {
                Sleep(1);
                cur = timeGetTime();
                printf("%u ", cur-last);        // how long each Sleep(1) really took, in ms
                last = cur;
            }
            printf("\n");
        
            timeEndPeriod(1);   // match the timeBeginPeriod(1) above
        
            return 0;
        }
        




RDTSC: Eh, no thanks

On the web, I found several references to the "RDTSC" Pentium instruction,
which stands for "Read Time Stamp Counter."  This assembly instruction returns
an unsigned 64-bit integer reading of the processor's internal high-precision 
timer.  In order to get the frequency of the timer (how much the return
value will increment in 1 second), you can read the registry for the machine's 
speed (in MHz - millions of cycles per second), like this:
        
        // WARNING: YOU DON'T REALLY WANT TO USE THIS FUNCTION
        bool GetPentiumClockEstimateFromRegistry(unsigned __int64 *frequency) 
        { 
            HKEY                        hKey; 
            DWORD                       cbBuffer; 
            LONG                        rc; 

            *frequency = 0;
        
            rc = RegOpenKeyEx( 
                     HKEY_LOCAL_MACHINE, 
                     "Hardware\\Description\\System\\CentralProcessor\\0", 
                     0, 
                     KEY_READ, 
                     &hKey 
                 ); 
        
            if (rc == ERROR_SUCCESS) 
            { 
                cbBuffer = sizeof (DWORD); 
                DWORD freq_mhz;
                rc = RegQueryValueEx 
                     ( 
                       hKey, 
                       "~MHz", 
                       NULL, 
                       NULL, 
                       (LPBYTE)(&freq_mhz), 
                       &cbBuffer 
                     ); 
                if (rc == ERROR_SUCCESS)
                    *frequency = freq_mhz*1024*1024;   // note: strictly, 1 MHz = 1,000,000 Hz, so this overestimates by ~5%
                RegCloseKey (hKey); 
            } 
        
            return (*frequency > 0); 
        }


              Result of GetPentiumClockEstimateFromRegistry()
    Gemini:   975,175,680 Hz
    Vaio:     FAILED.
    HP:       573,571,072 Hz   <-- strange...

              Empirical tests: RDTSC delta after Sleep(1000)
    Gemini:   931,440,000 Hz
    Vaio:     331,500,000 Hz
    HP:        13,401,287 Hz


However, as you can see, this failed on the Vaio (the win98 laptop).  
Worse yet, on the HP, the value in the registry does not match the 
MHz rating of the machine (733).  That would be okay if the value 
were actually the rate at which the timer ticked; but, after doing 
some empirical testing, it turns out that the HP's timer frequency 
is really about 13 MHz.  Trusting the registry reading on the HP 
would be a big, big mistake!

So, one conclusion is: don't try to read the registry to get the 
timer frequency; you're asking for trouble.  Instead, do it yourself.

Just call Sleep(1000) to allow 1 second (plus or minus ~1%) to pass, 
calling GetPentiumTimeRaw() (below) at the beginning and end, and then
simply subtract the two unsigned __int64's, and voila, you now know
the frequency of the timer that feeds RDTSC on the current system.
(*watch out for timer wraps during that 1 second, though...)

Note that you could easily do this in the background, though, using 
timeGetTime() instead of Sleep(), so there wouldn't be a 1-second pause 
when your program starts.

        int GetPentiumTimeRaw(unsigned __int64 *ret)
        {
            // returns 0 on failure, 1 on success 
            // warning: watch out for wraparound!
        
            // get high-precision time:
            __try
            {
                unsigned __int64 *dest = (unsigned __int64 *)ret;
                __asm 
                {
                    _emit 0xf        // these two bytes form the 'rdtsc' asm instruction,
                    _emit 0x31       //  available on Pentium I and later.
                    mov esi, dest
                    mov [esi  ], eax    // lower 32 bits of tsc
                    mov [esi+4], edx    // upper 32 bits of tsc
                }
                return 1;
            }
            __except(EXCEPTION_EXECUTE_HANDLER)
            {
                return 0;
            }
        
            return 0;
        }
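Putting that together, the calibration itself is only a few lines.  Here is
a sketch (the function name is mine, not part of any standard API); it blocks
for about a second because of the Sleep(1000), and it trusts that the counter
doesn't wrap during that second:

        // sketch: estimate the rdtsc frequency by bracketing a ~1 second Sleep()
        unsigned __int64 GetPentiumFrequencyEstimate()
        {
            unsigned __int64 t0 = 0, t1 = 0;
            if (!GetPentiumTimeRaw(&t0))
                return 0;
            Sleep(1000);                   // ~1 second, plus or minus a little
            if (!GetPentiumTimeRaw(&t1))
                return 0;
            return (t1 - t0);              // ticks per ~1 second == approx. frequency, in Hz
        }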

Once you figure out the frequency, using this 1-second test, you can now 
translate readings from the cpu's timestamp counter directly into a real 
'time' reading, in seconds:
        
        double GetPentiumTimeAsDouble(unsigned __int64 frequency)
        {
            // returns < 0 on failure; otherwise, returns current cpu time, in seconds.
            // warning: watch out for wraparound!
        
            if (frequency==0)
                return -1.0;
        
            // get high-precision time:
            __try
            {
                unsigned __int64 high_perf_time;
                unsigned __int64 *dest = &high_perf_time;
                __asm 
                {
                    _emit 0xf        // these two bytes form the 'rdtsc' asm instruction,
                    _emit 0x31       //  available on Pentium I and later.
                    mov esi, dest
                    mov [esi  ], eax    // lower 32 bits of tsc
                    mov [esi+4], edx    // upper 32 bits of tsc
                }
                __int64 time_s     = (__int64)(high_perf_time / frequency);  // unsigned->sign conversion should be safe here
                __int64 time_fract = (__int64)(high_perf_time % frequency);  // unsigned->sign conversion should be safe here
                // note: here, we wrap the timer more frequently (once per week) 
                // than it otherwise would (VERY RARELY - once every 585 years on
                // a 1 GHz cpu), to alleviate floating-point precision errors that start 
                // to occur when you get to very high counter values.  
                double ret = (time_s % (60*60*24*7)) + (double)time_fract/(double)((__int64)frequency);
                return ret;
            }
            __except(EXCEPTION_EXECUTE_HANDLER)
            {
                return -1.0;
            }
        
            return -1.0;
        }

This works pretty well, runs on ALL Pentium I and later processors, and offers 
AMAZING precision.  However, it can be messy, especially working that 1-second
test in there with all your other code, so that it runs in the background.
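As mentioned above, one way to avoid the 1-second pause is to do the
calibration in the background, using timeGetTime() to measure the elapsed
wall-clock time during normal execution.  Here is a sketch of one way that
could look (the function and variable names are just placeholders; call it
once per frame, and use the frequency once it becomes nonzero):

        // sketch: background calibration - no 1-second pause at startup.
        static int              g_started    = 0;
        static DWORD            g_start_ms   = 0;
        static unsigned __int64 g_start_tsc  = 0;
        static unsigned __int64 g_rdtsc_freq = 0;    // 0 until the estimate is ready
        
        void UpdateRdtscFrequencyEstimate()
        {
            if (g_rdtsc_freq != 0)            // already calibrated
                return;
        
            if (!g_started)                   // first call: record the starting point
            {
                g_start_ms = timeGetTime();
                if (GetPentiumTimeRaw(&g_start_tsc))
                    g_started = 1;
                return;
            }
        
            DWORD elapsed_ms = timeGetTime() - g_start_ms;   // wrap-safe (unsigned)
            if (elapsed_ms >= 1000)           // ~1 second of normal execution has passed
            {
                unsigned __int64 now_tsc;
                if (GetPentiumTimeRaw(&now_tsc))
                    g_rdtsc_freq = (now_tsc - g_start_tsc) * 1000 / elapsed_ms;
            }
        }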

UPDATE: Ross Bencina was kind enough to point out to me that rdtsc "is a per-cpu
operation, so on multiprocessor systems you have to be careful that multiple calls
to rdtsc are actually executing on the same cpu."  (You can do that using the 
SetThreadAffinityMask() function.)  Thanks Ross!
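As a sketch, pinning the current thread to the first CPU before taking any
rdtsc readings looks like this:

        // sketch: keep all rdtsc readings on the same processor
        SetThreadAffinityMask(GetCurrentThread(), 1);   // bit 0 = first CPU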


QueryPerformanceFrequency & QueryPerformanceCounter: Nice

There is one more item in our bag of tricks.  It is simple, elegant, and as far 
as I can tell, extremely accurate and reliable.  It is a pair of win32 functions: 
QueryPerformanceFrequency and QueryPerformanceCounter.

QueryPerformanceFrequency returns the amount that the counter will increment over
1 second; QueryPerformanceCounter returns a LARGE_INTEGER (a 64-bit *signed* integer)
that is the current value of the counter.  
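Basic usage is just two counter readings and a division by the frequency;
here is a minimal sketch that measures an interval in seconds:

        // sketch: timing an interval with QueryPerformanceCounter
        LARGE_INTEGER freq, t0, t1;
        QueryPerformanceFrequency(&freq);   // counts per second (returns FALSE if unsupported)
        QueryPerformanceCounter(&t0);
        
        // ... do the work you want to time ...
        
        QueryPerformanceCounter(&t1);
        double seconds = (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;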

Perhaps I am lucky, but it works flawlessly on my 3 machines.  The MSDN library
says that it should work on Windows 95 and later.  

Here are some results:

              Return value of QueryPerformanceFrequency
    Gemini:   3,579,545 Hz
    Vaio:     1,193,000 Hz
    HP:       3,579,545 Hz
    
              Maximum # of unique readings I could get in 1 second
    Gemini:   658,000  (-> 1.52 us resolution!)
    Vaio:     174,300  (-> 5.73 us resolution!)
    HP:       617,000  (-> 1.62 us resolution!)

I was pretty excited to see timing resolutions in the low-microsecond
range.  Note that for the latter test, I avoided printing any text 
during the 1-second interval, as it would drastically affect the outcome.
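For reference, the "unique readings in 1 second" test can be reconstructed
with a sketch like this (note that the printing happens only after the
1-second loop ends):

        // sketch: count how many distinct QueryPerformanceCounter() values
        // can be observed in one second of tight polling.
        LARGE_INTEGER freq, start, cur, last;
        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&start);
        last = start;
        
        long unique_readings = 0;
        do
        {
            QueryPerformanceCounter(&cur);
            if (cur.QuadPart != last.QuadPart)
            {
                unique_readings++;
                last = cur;
            }
        } while (cur.QuadPart - start.QuadPart < freq.QuadPart);   // ~1 second
        
        printf("%ld unique readings in 1 second\n", unique_readings);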

Now, here is my question to you: do these two functions work for you?  
What OS does the computer run, what is the MHz rating, and is it a laptop
or desktop?  What was the result of QueryPerformanceFrequency?  
What was the max. # of unique readings you could get in 1 second?  
Can you find any computers that it doesn't work on?  Let me know, and
I'll collect & publish everyone's results here.

So, until I find some computers that QueryPerformanceFrequency & 
QueryPerformanceCounter don't work on, I'm sticking with them.  If they fail,
I've got backup code that will kick in, which uses timeGetTime(); I didn't
bother to use RDTSC because of the calibration issue, and I'm hopeful that
these two functions are highly reliable.  I suppose only feedback from
readers like you will tell... =)

UPDATE: a few people have written e-mail pointing me to a Microsoft Knowledge
Base article which outlines some cases in which the QueryPerformanceCounter
function can unexpectedly jump forward by a few seconds.

UPDATE: Matthijs de Boer points out that you can use the SetThreadAffinityMask() 
function to make your thread stick to one core or the other, so that 'rdtsc' and 
QueryPerformanceCounter() don't have timing issues in dual core systems.



Accurate FPS Limiting / High-precision 'Sleeps'

So now, when I need to do FPS limiting (limiting the framerate to some
maximum), I don't just naively call Sleep() anymore.  Instead, I use 
QueryPerformanceCounter in a loop that calls Sleep(0).  Sleep(0) simply 
gives up your thread's current timeslice to another waiting thread; it 
doesn't really sleep at all.  So, if you just keep calling Sleep(0) 
in a loop until QueryPerformanceCounter() says you've hit the right time, 
you'll get ultra-accurate FPS limiting.
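In its simplest (CPU-hungry) form, that loop looks something like this
sketch, where frame_start is the counter value taken at the start of the
frame and ticks_to_wait is the desired frame period in counter ticks (both
are placeholders here):

        // sketch: naive fps limiting - burn the remaining frame time with Sleep(0)
        LARGE_INTEGER now;
        do
        {
            Sleep(0);                        // yield the timeslice, but don't really sleep
            QueryPerformanceCounter(&now);
        } while (now.QuadPart - frame_start.QuadPart < ticks_to_wait);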

There is one problem with this kind of fps limiting: it will use up
100% of the CPU.  Even though the computer WILL remain
quite responsive, because the app sucking up the idle time is being very
"nice", this will still look very bad on the CPU meter (which will stay
at 100%) and, much worse, it will drain the battery quite quickly on 
laptops.  

To get around this, I use a hybrid algorithm that uses Sleep() to do the
bulk of the waiting, and QueryPerformanceCounter() to do the finishing
touches, making it accurate to ~10 microseconds, but still wasting very
little processor.

My code for accurate FPS limiting looks something like this, and runs
at the end of each frame, immediately after the page flip:

        
        // note: BE SURE YOU CALL timeBeginPeriod(1) at program startup!!!
        // note: BE SURE YOU CALL timeEndPeriod(1) at program exit!!!
        // note: that will require linking to winmm.lib
        // note: never use static initializers (like this) with Winamp plug-ins!
        // note: m_high_perf_timer_freq is assumed to have been filled in, at startup,
        //       by a call to QueryPerformanceFrequency(&m_high_perf_timer_freq).
        static LARGE_INTEGER m_prev_end_of_frame = {0};  
        int max_fps = 60;
        
        LARGE_INTEGER t;
        QueryPerformanceCounter(&t);

        if (m_prev_end_of_frame.QuadPart != 0)
        {
            int ticks_to_wait = (int)m_high_perf_timer_freq.QuadPart / max_fps;
            int done = 0;
            do
            {
                QueryPerformanceCounter(&t);
                
                int ticks_passed = (int)((__int64)t.QuadPart - (__int64)m_prev_end_of_frame.QuadPart);
                int ticks_left = ticks_to_wait - ticks_passed;

                if (t.QuadPart < m_prev_end_of_frame.QuadPart)    // time wrap
                    done = 1;
                if (ticks_passed >= ticks_to_wait)
                    done = 1;
                
                if (!done)
                {
                    // if > 0.002s left, do Sleep(1), which will actually sleep some 
                    //   steady amount, probably 1-2 ms,
                    //   and do so in a nice way (cpu meter drops; laptop battery spared).
                    // otherwise, do a few Sleep(0)'s, which just give up the timeslice,
                    //   but don't really save cpu or battery, but do pass a tiny
                    //   amount of time.
                    if (ticks_left > (int)m_high_perf_timer_freq.QuadPart*2/1000)
                        Sleep(1);
                    else                        
                        for (int i=0; i<10; i++) 
                            Sleep(0);  // causes thread to give up its timeslice
                }
            }
            while (!done);            
        }

        m_prev_end_of_frame = t;

...which is trivial to convert into a high-precision Sleep() function.
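For example, here is a sketch of such a function (the name PreciseSleep is
mine; it assumes timeBeginPeriod(1) has already been called at startup): it
sleeps for a requested number of microseconds, using Sleep(1) while more
than ~2 ms remain and Sleep(0) for the rest.

        // sketch: sleep for 'microseconds' us, using Sleep(1) for the bulk of the
        // wait and a Sleep(0)/QueryPerformanceCounter loop for the last bit.
        void PreciseSleep(__int64 microseconds)
        {
            LARGE_INTEGER freq, start, now;
            QueryPerformanceFrequency(&freq);
            QueryPerformanceCounter(&start);
        
            __int64 ticks_to_wait = freq.QuadPart * microseconds / 1000000;
            for (;;)
            {
                QueryPerformanceCounter(&now);
                __int64 ticks_left = ticks_to_wait - (now.QuadPart - start.QuadPart);
                if (ticks_left <= 0)
                    break;
                if (ticks_left > freq.QuadPart*2/1000)   // more than ~2 ms left
                    Sleep(1);                            // cheap, coarse wait
                else
                    Sleep(0);                            // just give up the timeslice
            }
        }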


Conclusions & Summary 

Using regular old timeGetTime() to do timing is not reliable on many Windows-based 
operating systems because the granularity of the system timer can be as high as 10-15 
milliseconds, meaning that timeGetTime() is only accurate to 10-15 milliseconds.
[Note that the high granularities occur on NT-based operating systems like Windows NT,
2000, and XP.  Windows 95 and 98 tend to have much better granularity, around 1-5 ms.]

However, if you call timeBeginPeriod(1) at the beginning of your program (and 
timeEndPeriod(1) at the end), timeGetTime() will usually become accurate to 1-2 
milliseconds, and will provide you with extremely accurate timing information.

Sleep() behaves similarly; the length of time that Sleep() actually sleeps for 
goes hand-in-hand with the granularity of timeGetTime(), so after calling 
timeBeginPeriod(1) once, Sleep(1) will actually sleep for 1-2 milliseconds, Sleep(2)
for 2-3, and so on (instead of sleeping in increments as high as 10-15 ms).

For higher precision timing (sub-millisecond accuracy), you'll probably want to avoid 
using the assembly mnemonic RDTSC because it is hard to calibrate; instead, use 
QueryPerformanceFrequency and QueryPerformanceCounter, which are accurate to less 
than 10 microseconds (0.00001 seconds).  

For simple timing, both timeGetTime and QueryPerformanceCounter work well, and 
QueryPerformanceCounter is obviously more accurate.  However, if you need to do 
any kind of "timed pauses" (such as those necessary for framerate limiting), you 
need to be careful of sitting in a loop calling QueryPerformanceCounter, waiting 
for it to reach a certain value; this will eat up 100% of your processor.  Instead, 
consider a hybrid scheme, where you call Sleep(1) (don't forget timeBeginPeriod(1) 
first!) whenever you need to pass more than 1 ms of time, and then only enter the 
QueryPerformanceCounter 100%-busy loop to finish off the last < 1/1000th of a 
second of the delay you need.  This will give you ultra-accurate delays (accurate 
to 10 microseconds), with very minimal CPU usage.  See the code above.

Please Note: Several people have written me over the years, offering additions 
or new developments since I first wrote this article, and I've added 'update' 
comments here and there.  The general text of the article DOES NOT reflect the
'UPDATE' comments yet, so please keep that in mind, if you see any contradictions.



UPDATE: Matthijs de Boer points out that you should watch out for variable 
CPU speeds, in general, when running on laptops or other power-conserving 
(perhaps even just eco-friendly) devices.  (Thanks Matthijs!)

This document copyright (c)2002+ Ryan M. Geiss.