How to estimate the thread context switching overhead?

While you said you don't want to write a test application, I did this for a previous test on an ARM9 Linux platform to find out what the overhead is. It was just two threads that would boost::thread::yield() (or, you know) and increment some variable, and after a minute or so (without other running processes, at least none that do something), the app printed how many context switches it could do per second. Of course this is not really exact, but the point is that both threads yielded the CPU to each other, and it was so fast that it just didn't make sense any more to think about the overhead. So, simply go ahead and just write a simple test instead of thinking too much about a problem that may be non-existent.

Other than that, you might try like 1800 suggested with performance counters.

Oh, and I remember an application running on Windows CE 4.X, where we also have four threads with intensive switching at times, and never ran into performance issues. We also tried to implement the core threading thing without threads at all, and saw no performance improvement (the GUI just responded much slower, but everything else was the same). Maybe you can try the same, by either reducing the number of context switches or by removing threads completely (just for testing).


I doubt you can find this overhead somewhere on the web for any existing platform. There exists just too many different platforms. The overhead depends on two factors:

  • The CPU, as the necessary operations may be easier or harder on different CPU types
  • The system kernel, as different kernels will have to perform different operations on each switch

Other factors include how the switch takes place. A switch can take place when

  1. the thread has used all of its time quantum. When a thread is started, it may run for a given amount of time before it has to return control to the kernel that will decide who's next.

  2. the thread was preempted. This happens when another thread needs CPU time and has a higher priority. E.g. the thread that handles mouse/keyboard input may be such a thread. No matter what thread owns the CPU right now, when the user types something or clicks something, he doesn't want to wait till the current threads time quantum has been used up completely, he wants to see the system reacting immediately. Thus some systems will make the current thread stop immediately and return control to some other thread with higher priority.

  3. the thread doesn't need CPU time anymore, because it's blocking on some operation or just called sleep() (or similar) to stop running.

These 3 scenarios might have different thread switching times in theory. E.g. I'd expect the last one to be slowest, since a call to sleep() means the CPU is given back to the kernel and the kernel needs to setup a wake-up call that will make sure the thread is woken up after about the amount of time it requested to sleep, it then must take the thread out of the scheduling process, and once the thread is woken up, it must add the thread again to the scheduling process. All these steeps will take some amount of time. So the actual sleep-call might be longer than the time it takes to switch to another thread.

I think if you want to know for sure, you must benchmark. The problem is that you usually will have to either put threads to sleep or you must synchronize them using mutexes. Sleeping or Locking/Unlocking mutexes has itself an overhead. This means your benchmark will include these overheads as well. Without having a powerful profiler, it's hard to later on say how much CPU time was used for the actual switch and how much for the sleep/mutex-call. On the other hand, in a real life scenario, your threads will either sleep or synchronize via locks as well. A benchmark that purely measures the context switch time is a synthetically benchmark as it does not model any real life scenario. Benchmarks are much more "realistic" if they base on real-life scenarios. Of what use is a GPU benchmark that tells me my GPU can in theory handle 2 billion polygons a second, if this result can never be achieved in a real life 3D application? Wouldn't it be much more interesting to know how many polygons a real life 3D application can have the GPU handle a second?

Unfortunately I know nothing of Windows programming. I could write an application for Windows in Java or maybe in C#, but C/C++ on Windows makes me cry. I can only offer you some source code for POSIX.

#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>
#include <pthread.h>
#include <sys/time.h>
#include <unistd.h>

uint32_t COUNTER;
pthread_mutex_t LOCK;
pthread_mutex_t START;
pthread_cond_t CONDITION;

void * threads (
    void * unused
) {
    // Wait till we may fire away
    pthread_mutex_lock(&START);
    pthread_mutex_unlock(&START);

    pthread_mutex_lock(&LOCK);
    // If I'm not the first thread, the other thread is already waiting on
    // the condition, thus Ihave to wake it up first, otherwise we'll deadlock
    if (COUNTER > 0) {
        pthread_cond_signal(&CONDITION);
    }
    for (;;) {
        COUNTER++;
        pthread_cond_wait(&CONDITION, &LOCK);
        // Always wake up the other thread before processing. The other
        // thread will not be able to do anything as long as I don't go
        // back to sleep first.
        pthread_cond_signal(&CONDITION);
    }
    pthread_mutex_unlock(&LOCK); //To unlock
}

int64_t timeInMS ()
{
    struct timeval t;

    gettimeofday(&t, NULL);
    return (
        (int64_t)t.tv_sec * 1000 +
        (int64_t)t.tv_usec / 1000
    );
}


int main (
    int argc,
    char ** argv
) {
    int64_t start;
    pthread_t t1;
    pthread_t t2;
    int64_t myTime;

    pthread_mutex_init(&LOCK, NULL);
    pthread_mutex_init(&START, NULL);   
    pthread_cond_init(&CONDITION, NULL);

    pthread_mutex_lock(&START);
    COUNTER = 0;
    pthread_create(&t1, NULL, threads, NULL);
    pthread_create(&t2, NULL, threads, NULL);
    pthread_detach(t1);
    pthread_detach(t2);
    // Get start time and fire away
    myTime = timeInMS();
    pthread_mutex_unlock(&START);
    // Wait for about a second
    sleep(1);
    // Stop both threads
    pthread_mutex_lock(&LOCK);
    // Find out how much time has really passed. sleep won't guarantee me that
    // I sleep exactly one second, I might sleep longer since even after being
    // woken up, it can take some time before I gain back CPU time. Further
    // some more time might have passed before I obtained the lock!
    myTime = timeInMS() - myTime;
    // Correct the number of thread switches accordingly
    COUNTER = (uint32_t)(((uint64_t)COUNTER * 1000) / myTime);
    printf("Number of thread switches in about one second was %u\n", COUNTER);
    return 0;
}

Output

Number of thread switches in about one second was 108406

Over 100'000 is not too bad and that even though we have locking and conditional waits. I'd guess without all this stuff at least twice as many thread switches were possible a second.


You can't estimate it. You need to measure it. And it's going to vary depending on the processor in the device.

There are two fairly simple ways to measure a context switch. One involves code, the other doesn't.

First, the code way (pseudocode):

DWORD tick;

main()
{
  HANDLE hThread = CreateThread(..., ThreadProc, CREATE_SUSPENDED, ...);
  tick = QueryPerformanceCounter();
  CeSetThreadPriority(hThread, 10); // real high
  ResumeThread(hThread);
  Sleep(10);
}

ThreadProc()
{
  tick = QueryPerformanceCounter() - tick;
  RETAILMSG(TRUE, (_T("ET: %i\r\n"), tick));
}

Obviously doing it in a loop and averaging will be better. Keep in mind that this doesn't just measure the context switch. You're also measuring the call to ResumeThread and there's no guarantee the scheduler is going to immediately switch to your other thread (though the priority of 10 should help increase the odds that it will).

You can get a more accurate measurement with CeLog by hooking into scheduler events, but it's far from simple to do and not very well documented. If you really want to go that route, Sue Loh has several blogs on it that a search engine can find.

The non-code route would be to use Remote Kernel Tracker. Install eVC 4.0 or the eval version of Platform Builder to get it. It will give a graphical display of everything the kernel is doing and you can directly measure a thread context switch with the provided cursor capabilities. Again, I'm certain Sue has a blog entry on using Kernel Tracker as well.

All that said, you're going to find that CE intra-process thread context switches are really, really fast. It's the process switches that are expensive, as it requires swapping the active process in RAM and then doing the migration.