Buddhike's Weblog

WCF Threading Internals

Along with its many features, WCF carries a lot of performance optimizations as well (you think I'm kidding? Then see for yourself: http://msdn2.microsoft.com/en-us/library/bb310550.aspx). As part of this, the WCF team has given a lot of thought to the threading model it uses behind the scenes.

At this point you are probably thinking:

"Well… I know it uses the thread pool API. There is not much to it. I just know that it performs well." Well, you are correct, but not entirely. I hope you want to go down to the metal. So please read on… :)

WCF uses CLR thread pool threads to do things asynchronously. Interestingly, however, it uses the IO worker threads in the thread pool instead of the regular thread pool worker threads (don't fuss if you are not aware of these two kinds of threads). The theory behind IO worker threads reveals a lot about why WCF uses them, so I thought I would dedicate this post to talking a little bit about it.
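If the two kinds of threads are new to you, here is a quick way to see that the thread pool tracks them separately. This is only a small sketch (the class name is made up for illustration); the ThreadPool API calls the IO worker threads "completion port threads".

```csharp
using System;
using System.Threading;

class ThreadPoolKinds
{
    static void Main()
    {
        int workerThreads, completionPortThreads;

        // Upper limits for the two kinds of thread pool threads.
        ThreadPool.GetMaxThreads(out workerThreads, out completionPortThreads);
        Console.WriteLine("Max worker threads:          {0}", workerThreads);
        Console.WriteLine("Max IO worker threads:       {0}", completionPortThreads);

        // How many more of each kind could be put to work right now.
        ThreadPool.GetAvailableThreads(out workerThreads, out completionPortThreads);
        Console.WriteLine("Available worker threads:    {0}", workerThreads);
        Console.WriteLine("Available IO worker threads: {0}", completionPortThreads);
    }
}
```

The actual numbers depend on the CLR version and the machine, so I won't quote any here.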

So before we actually dig in, let me ask you a simple question. When do you consider that you are getting the most out of a CPU? Is it when a single thread is trying to take up 100% of it, or when multiple threads are trying to take up 100%? I know, you said the single thread case, which is correct. When you have multiple CPU bound threads competing for the same CPU, additional costs such as context switching slow things down. Wanna see it for yourself? Write a small but lengthy loop. Measure the time it takes when you run it in a single thread. Then hand the same loop to several threads, measure the time each of them takes to finish it, and compare the values.
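Here is a rough sketch of that experiment. The class name, the iteration count and the "four threads per CPU" factor are arbitrary choices of mine, so tune them for your box. It is not a rigorous benchmark; it just makes the effect visible.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class LoopTiming
{
    const int Iterations = 200000000;

    // Written to by the loop so that the work cannot be optimized away.
    static double sink;

    // A lengthy, purely CPU bound loop that reports how long it took.
    static void TimedSpin(object label)
    {
        Stopwatch watch = Stopwatch.StartNew();
        double sum = 0;
        for (int i = 0; i < Iterations; i++)
        {
            sum += Math.Sqrt(i);
        }
        sink = sum;
        Console.WriteLine("{0}: {1} ms", label, watch.ElapsedMilliseconds);
    }

    static void Main()
    {
        // Baseline: one thread running the loop on its own.
        TimedSpin("single thread");

        // Now oversubscribe the machine: several CPU bound threads per CPU.
        int threadCount = Environment.ProcessorCount * 4;
        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++)
        {
            threads[i] = new Thread(TimedSpin);
            threads[i].Start("thread " + i);
        }
        foreach (Thread thread in threads)
        {
            thread.Join();
        }
    }
}
```

Compare the time reported by the single thread with the times reported by the individual threads in the second run.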

So… technically speaking, we can achieve the best CPU utilization only by sticking to the "one thread per CPU per execution quantum" invariant.

I/O completion ports (IOCP) were introduced in the Windows NT kernel to achieve this goal. Although it's a complex technology, the fundamental theory behind it is fairly straightforward.

Before the invention of IOCP, there were two major IO programming techniques: one thread for all IO, and one thread per IO. In the one thread for all IO model, a thread that was IO bound had to wait doing nothing, and all other IO operations were blocked until the one in progress was done. Although this model was fair enough for single threaded client apps, it did not fit server applications at all. For example, a simple server written this way can serve only one client at a time. The one thread per IO model, on the other hand, spawns a new thread for each client. But this eventually ends up with too many threads competing for the CPU. So we have either too few threads or too many threads, and both cause trouble.

To solve this problem, an IOCP works like a controller hub between two parties. On one end it receives IO completion packets saying that some work is available. On the other end it has some IO worker threads waiting for work. When an IOCP receives an IO completion packet, it makes one of the waiting threads active and hands over the available work (the thread is picked in LIFO order to avoid potential context switching). So what? How does this model help to solve the problem we discussed earlier? The secret lies in the NumberOfConcurrentThreads parameter value we pass when we create the IOCP using the CreateIoCompletionPort function. This parameter tells the IOCP how many threads we actually want to have active at the same time to process the incoming work. The preferred value for this number is the number of CPUs you have in the box. So no matter how fast the work is being queued, we only have the desired number of threads processing it concurrently. The other cool thing about this is that when an active IO worker thread goes into a wait state (maybe it's doing some more IO work), the Windows scheduler tells the IOCP it belongs to that one of its active threads is now inactive, so that the IOCP can make another thread active to perform some other work (Smart! Isn't it? ;)). Consequently, the number of active threads in an IOCP at a given time can be a little higher than the NumberOfConcurrentThreads value, because a thread that was waiting may wake up while its replacement is still running.
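To make the hub idea a bit more concrete, here is a minimal, Windows-only sketch that talks to an IOCP directly through P/Invoke. The class name, the shutdown key and the choice of two worker threads per CPU are my own assumptions for illustration; this is not how the CLR or WCF set things up internally. It creates a port with NumberOfConcurrentThreads equal to the number of CPUs, parks some threads on it, and posts a few completion packets by hand.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Threading;

class IocpSketch
{
    static readonly IntPtr InvalidHandleValue = new IntPtr(-1);
    static readonly UIntPtr ShutdownKey = UIntPtr.Zero; // hypothetical "stop" marker
    const uint Infinite = 0xFFFFFFFF;

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr CreateIoCompletionPort(IntPtr fileHandle,
        IntPtr existingCompletionPort, UIntPtr completionKey,
        uint numberOfConcurrentThreads);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool PostQueuedCompletionStatus(IntPtr completionPort,
        uint numberOfBytesTransferred, UIntPtr completionKey, IntPtr overlapped);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool GetQueuedCompletionStatus(IntPtr completionPort,
        out uint numberOfBytes, out UIntPtr completionKey,
        out IntPtr overlapped, uint milliseconds);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool CloseHandle(IntPtr handle);

    static void Worker(object state)
    {
        IntPtr port = (IntPtr)state;
        while (true)
        {
            uint bytes;
            UIntPtr key;
            IntPtr overlapped;

            // Blocks until the port hands this thread a completion packet.
            if (!GetQueuedCompletionStatus(port, out bytes, out key,
                    out overlapped, Infinite) || key == ShutdownKey)
            {
                return;
            }
            Console.WriteLine("Thread {0} got work item {1}",
                Thread.CurrentThread.ManagedThreadId, key);
        }
    }

    static void Main()
    {
        // Not bound to any file handle; at most one running thread per CPU.
        IntPtr port = CreateIoCompletionPort(InvalidHandleValue, IntPtr.Zero,
            UIntPtr.Zero, (uint)Environment.ProcessorCount);
        if (port == IntPtr.Zero)
        {
            throw new InvalidOperationException("CreateIoCompletionPort failed");
        }

        // More waiting threads than CPUs, so the port has spares to release
        // when a running thread blocks.
        Thread[] workers = new Thread[Environment.ProcessorCount * 2];
        for (int i = 0; i < workers.Length; i++)
        {
            workers[i] = new Thread(Worker);
            workers[i].Start(port);
        }

        // Queue some work, then one shutdown packet per worker.
        for (uint item = 1; item <= 10; item++)
        {
            PostQueuedCompletionStatus(port, 0, new UIntPtr(item), IntPtr.Zero);
        }
        foreach (Thread worker in workers)
        {
            PostQueuedCompletionStatus(port, 0, ShutdownKey, IntPtr.Zero);
        }
        foreach (Thread worker in workers)
        {
            worker.Join();
        }
        CloseHandle(port);
    }
}
```

Which thread picks up which item will vary from run to run, so don't expect a fixed output.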

OK. Now we know why IOCPs are so elegant. But how does WCF actually use them? Well... the CLR thread pool has an IOCP associated with it. When the thread pool creates its IOCP for the first time, it also creates IO worker threads that wait for work. So essentially, we can get an IO worker thread to do some work by sending an IO completion packet to this IOCP. To do that we can use the ThreadPool.UnsafeQueueNativeOverlapped method (which internally invokes the PostQueuedCompletionStatus function). Here is a little program to demonstrate how you could do that.

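using System;
using System.Threading;

class Program
{
  // Signaled by the callback so that Main does not exit before it runs.
  static readonly ManualResetEvent done = new ManualResetEvent(false);

  static void Main(string[] args)
  {
    unsafe
    {
      // Create an Overlapped structure and pack it with a pointer to the function
      // that we want to invoke from the IO worker thread.
      Overlapped overlapped = new Overlapped(0, 0, IntPtr.Zero, null);
      NativeOverlapped* pOverlapped = overlapped.UnsafePack(
          new IOCompletionCallback(OnIoCompletion), null);

      // Send an IO completion packet to the thread pool's IOCP
      ThreadPool.UnsafeQueueNativeOverlapped(pOverlapped);
    }

    // Thread pool threads are background threads, so wait for the callback.
    done.WaitOne();
  }

  static unsafe void OnIoCompletion(uint errorCode,
      uint numBytes,
      NativeOverlapped* pOverlapped)
  {
    Console.WriteLine("This is from an IO worker thread");

    // Release the native memory that UnsafePack allocated.
    Overlapped.Free(pOverlapped);
    done.Set();
  }
}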

WCF basically follows the same concept, but it has an elegantly designed queue based IO thread scheduler. I would like to dedicate a separate post to talking about how exactly it works. But if you are a Reflector fan like me, take a look at the System.ServiceModel.IoThreadScheduler class and you will see it with your own eyes.

So what does all this tell us? WCF uses this IoThreadScheduler to queue work items for IO worker threads. This way it preserves the "one thread per CPU per execution quantum" invariant and achieves the best CPU utilization. It never uses the ThreadPool.QueueUserWorkItem API (maybe I should say I've never seen it do so, but I have a lot of faith in the WCF team) and thus refrains from using the regular thread pool worker threads (I'm sure you now see why WCF performs a lot better than the ASMX runtime ;)).

OK, are you still not certain that WCF works this way? Cool! I guess you don't have too much faith in me. Well.. then get ready for a little exercise. Create a little service with a single operation. Make this operation do some lengthy CPU intensive work (perhaps a loop doing some math). Then try to invoke this operation from multiple clients simultaneously (or you can call it from multiple threads in the same client). How many requests can your server service concurrently? Looking forward to hearing your results :)
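If you want a starting point for the exercise, here is a minimal self-hosted sketch. Everything specific in it is my own assumption for illustration: the IBusyService/BusyService names, the net.tcp address and port, the iteration count and the number of clients. It needs a reference to System.ServiceModel, and it hosts the service and the clients in the same process just to keep things short.

```csharp
using System;
using System.ServiceModel;
using System.Threading;

[ServiceContract]
public interface IBusyService
{
    [OperationContract]
    double Spin(int iterations);
}

public class BusyService : IBusyService
{
    static int active; // number of operations running right now

    public double Spin(int iterations)
    {
        int now = Interlocked.Increment(ref active);
        Console.WriteLine("Thread {0} (pool thread: {1}), concurrent operations: {2}",
            Thread.CurrentThread.ManagedThreadId,
            Thread.CurrentThread.IsThreadPoolThread, now);

        // Lengthy CPU intensive work.
        double sum = 0;
        for (int i = 0; i < iterations; i++)
        {
            sum += Math.Sqrt(i);
        }

        Interlocked.Decrement(ref active);
        return sum;
    }
}

class Program
{
    const string Address = "net.tcp://localhost:9000/busy"; // hypothetical address

    static void Main()
    {
        using (ServiceHost host = new ServiceHost(typeof(BusyService)))
        {
            host.AddServiceEndpoint(typeof(IBusyService), new NetTcpBinding(), Address);
            host.Open();

            // Several clients calling the operation at the same time.
            Thread[] clients = new Thread[8];
            for (int i = 0; i < clients.Length; i++)
            {
                clients[i] = new Thread(CallService);
                clients[i].Start();
            }
            foreach (Thread client in clients)
            {
                client.Join();
            }
        }
    }

    static void CallService()
    {
        ChannelFactory<IBusyService> factory = new ChannelFactory<IBusyService>(
            new NetTcpBinding(), new EndpointAddress(Address));
        IBusyService proxy = factory.CreateChannel();
        proxy.Spin(200000000);
        ((IClientChannel)proxy).Close();
        factory.Close();
    }
}
```

Watch the "concurrent operations" number in the output and compare it with the number of CPUs in your box. Keep in mind that the service throttling settings also cap concurrency, so keep the client count modest.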

Posted: Aug 02 2007, 05:19 AM by Buddhike | with 1 comment(s)

Comments

Geek's Diary said:

Writing my last post about WCF threading internals compelled me to reveal some of the tests I did sometime

# November 3, 2007 1:13 PM