As someone who writes code every day to get various useful things done, I have always been fascinated by how programmers interact with machines. The interface between the machine and the programmer has changed significantly over the past few decades. Of the many improvements in this area, I see the generic ISA (instruction set architecture) and RISC as the ones that changed our world the most.
A stable ISA gave microprocessor designers the ability to optimize their microarchitectures without breaking applications. RISC, on the other hand, took away the burden of the microcode engines that were initially used to implement an ISA. (Microcode engines still exist in today's processors, but they are lightweight (?) and serve specific purposes, such as decoding lengthy x86 instructions into microarchitecture-specific micro-operations.)
As transistors became smaller and smaller, microprocessor designers, with ISA and RISC in their hands, had the luxury of increasing the transistor count and optimizing their microarchitectures for performance. In fact, they even introduced multiple integer units that could execute instructions in parallel (I was surprised when I started reading some old Intel docs on this). But these changes were not apparent to the programmer, because the ISA stayed consistent. The programmer could keep assuming that the instruction stream was executed sequentially.
However, over the past few years microprocessor designers have realized that they are not far from hitting the limits of optimizing a single processor. Therefore, instead of only trying to speed things up inside their microarchitectures, they started to ship multiple execution cores on the same die.
Although the advent of multi-core processors does not necessarily introduce significant (?) changes to the underlying ISA, having two cores visible to the programmer certainly requires a change in the way we write our programs today.
When I bought my first dual-core processor a couple of years ago (it was a power-hungry Pentium D), I certainly noticed a big improvement in responsiveness. In particular, every time I wrote a buggy loop, I no longer had to wait before I could start Process Explorer and terminate the process. But whenever I did that, I noticed that my CPU usage was only around the 50% to 60% mark, although my process was hogging the CPU. A moment later I realized that this was perfectly legitimate: my process was running the buggy loop on only one thread, so it could saturate only one of the two cores. These little incidents convinced me that we are not far from writing code specifically for multiple cores (i.e. multi-threaded code). As I mentioned before, hardware designers have already gone through this challenge in their microarchitectures, and now the time has come for software designers (especially compiler engineers) to optimize their part of the game.
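To make the 50% observation concrete, here is a minimal sketch of the same CPU-bound work written once for a single thread and once split across several. The workload (`sum_squares`) and the helper names `parallel_sum_squares` and `default_workers` are made up for illustration; only `std::thread` and `std::thread::hardware_concurrency` come from the standard library.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical CPU-bound work: sum of i*i for i in [begin, end).
// Called directly, it keeps exactly one core busy.
std::uint64_t sum_squares(std::uint64_t begin, std::uint64_t end) {
    std::uint64_t total = 0;
    for (std::uint64_t i = begin; i < end; ++i) {
        total += i * i;
    }
    return total;
}

// Splits [0, n) into one contiguous chunk per worker thread so that
// several cores can work on the same problem at once.
std::uint64_t parallel_sum_squares(std::uint64_t n, unsigned workers) {
    assert(workers > 0);
    std::vector<std::uint64_t> partial(workers, 0);  // one slot per thread
    std::vector<std::thread> threads;
    threads.reserve(workers);

    const std::uint64_t chunk = n / workers;
    for (unsigned w = 0; w < workers; ++w) {
        const std::uint64_t begin = w * chunk;
        // The last worker also takes the remainder of the range.
        const std::uint64_t end = (w + 1 == workers) ? n : begin + chunk;
        threads.emplace_back([&partial, w, begin, end] {
            partial[w] = sum_squares(begin, end);
        });
    }
    for (auto& t : threads) {
        t.join();  // wait for every chunk before combining results
    }

    std::uint64_t total = 0;
    for (std::uint64_t p : partial) {
        total += p;
    }
    return total;
}

// hardware_concurrency() may return 0 when the count is unknown,
// so fall back to a single worker in that case.
unsigned default_workers() {
    return std::max(1u, std::thread::hardware_concurrency());
}
```

Calling `parallel_sum_squares(n, default_workers())` gives the scheduler one runnable thread per core, whereas `sum_squares(0, n)` can never use more than one core no matter how many are available. Whether the threaded version is actually faster depends on the size of `n` and on the cost of creating threads, so treat this as an illustration and not as a benchmark.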
Over the past few years I have found myself having fun exploring this area. Therefore I've decided to devote this blog to my future discussions in this space.
In my journey towards highly optimized systems for parallel processing, I'll be exploring the following topics:
- Fundamentals of scheduling and threads.
- Multi-core architectures.
- C++ compiler and OpenMP.
- High-level parallel runtimes such as TPL (Task Parallel Library).
- Optimizations for parallel execution, such as data structures and synchronization.
In the past few years, I've come across various systems written for parallel execution, and each of them had its own way of dealing with the operating system's threads. Some of them did a decent job on average systems but caused serious pain when used under the wrong circumstances. Others were closely coupled to the system's scheduler and therefore came close to perfect. But now that we are moving towards advanced compilers and micro runtimes (my name for the little libraries that handle scheduling and thread management) to make our code ready for parallel execution, I would love to see a common OS-level base used by everything running on that particular OS. A good example would be an OS-wide thread pool that is natively visible to the scheduler for efficient scheduling, then a micro runtime that leverages that pool, and finally compilers that emit the code necessary to call the micro runtime functions to parallelize work. But I will leave more details on this subject for a dedicated post.
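The layering described above can be sketched in a few lines of C++. Everything here is an assumption made for illustration: `ThreadPool` is an ordinary in-process pool standing in for the OS-wide pool (which does not exist in this form), and `parallel_for` plays the role of a micro runtime entry point that a compiler could, in principle, emit calls to. Neither name is a real OS or library API.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for the shared pool. A real OS-level pool would be
// owned by the system and visible to its scheduler, not created per process.
class ThreadPool {
public:
    explicit ThreadPool(unsigned workers) {
        if (workers == 0) workers = 1;
        for (unsigned i = 0; i < workers; ++i) {
            threads_.emplace_back([this] { run(); });
        }
    }
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (auto& t : threads_) t.join();
    }
    ThreadPool(const ThreadPool&) = delete;
    ThreadPool& operator=(const ThreadPool&) = delete;

    // Queues a task; some worker thread will pick it up later.
    void submit(std::function<void()> task) {
        assert(task != nullptr);
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        wake_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stopping and queue drained
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock so workers do not serialize
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> tasks_;
    std::vector<std::thread> threads_;
    bool stopping_ = false;
};

// Hypothetical micro runtime entry point: runs body(i) for every i in [0, n)
// on the pool and blocks until all iterations have finished.
void parallel_for(ThreadPool& pool, std::size_t n,
                  const std::function<void(std::size_t)>& body) {
    std::mutex done_mutex;
    std::condition_variable done_cv;
    std::size_t remaining = n;
    for (std::size_t i = 0; i < n; ++i) {
        pool.submit([&, i] {
            body(i);
            // Notify while holding the lock so the waiter cannot return and
            // destroy done_cv before this task is finished with it.
            std::lock_guard<std::mutex> lock(done_mutex);
            if (--remaining == 0) done_cv.notify_one();
        });
    }
    std::unique_lock<std::mutex> lock(done_mutex);
    done_cv.wait(lock, [&] { return remaining == 0; });
}
```

With this in place, a loop such as `for (i = 0; i < n; ++i) out[i] = f(i);` could be rewritten, by hand or by a compiler, as `parallel_for(pool, n, [&](std::size_t i) { out[i] = f(i); });`. The sketch submits one task per iteration, which is simple but wasteful for small loop bodies; a more serious micro runtime would batch iterations into chunks.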
Wrapping up this post, I strongly believe that this will be the next major change in our industry. Having N cores and code optimized to run on them would certainly change everything from desktop publishing to multimedia to web/web services to gaming to AI to robots.
PS: If you see a (?) mark within the text, that statement probably has some uncertainty in it. I would be delighted and really grateful if you could take a minute to correct it or share more information on those subtopics.
The title has a lot of excitement as well as a bit of fear. Don, Doug and Kavita unveil their mission!!! At this stage all I know is PDC 2008 is going to be fun! :)