Times are changing. We're moving toward a model where developers like me are given sudo everywhere, on the understanding that we're versed enough not to destroy things. This gives me an unprecedented opportunity to tune my software. I built it to be tunable, but tuning has always been guesswork, because tuning is an interactive process – something that is hard to do when you have to relay commands to somebody else and you can't see what's on their screen.
My first big tuning project involves the Socorro processors. They are I/O bound multithreaded Python applications. We've always just set the number of threads equal to the number of CPUs on the machine. We've always gotten adequate throughput and have never revisited the settings.
By the way, it is important to know that the Socorro processors invoke a subprocess as part of processing a Firefox crash. That subprocess, written in C, runs on its own core. While the GIL keeps the Python code running on only one core at a time, the subprocesses are free to run simultaneously on all the cores. There is a one-to-one correspondence between threads and subprocesses: each thread spawns a single subprocess for each crash that it processes. Ten threads means ten concurrent subprocesses.
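The thread/subprocess relationship looks roughly like this sketch. The binary name `minidump_stackwalk` and the function names are illustrative stand-ins, not Socorro's actual code; the point is that `subprocess.run` releases the GIL while waiting, so N threads keep N subprocesses busy on N cores.

```python
import subprocess
import threading

# Hypothetical stand-in for the C crash-analysis tool; the real
# command and arguments will differ.
STACKWALK_CMD = ["minidump_stackwalk"]

def process_crash(crash_path, cmd=None):
    """Hand one crash off to a C subprocess and collect its output.

    While the Python thread blocks in subprocess.run, the GIL is
    released, so other threads' subprocesses run in parallel on
    other cores.
    """
    result = subprocess.run(
        (cmd or STACKWALK_CMD) + [crash_path],
        capture_output=True,
        text=True,
    )
    return result.stdout

def worker(crash_queue):
    """One worker thread: one subprocess at a time, one-to-one."""
    while True:
        crash_path = crash_queue.get()
        if crash_path is None:  # sentinel: shut down this thread
            break
        process_crash(crash_path)
        crash_queue.task_done()
```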
My colleague, Chris Lonnen, has really wanted to drop the multithreading part and run the processors as a mob of single threaded processes. In the construction of Socorro, I made the task management / producer-consumer system a pluggable component. Don't want multithreading? Drop in the single threading class instead. That eliminates the internal queuing overhead that would remain if you merely set the multithreaded class to use only one thread.
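The pluggable idea can be sketched like this. These class and method names are my own for illustration, not Socorro's real API; the single threaded variant satisfies the same interface but runs tasks inline, with no queue and no threads at all.

```python
import queue
import threading

class ThreadedTaskManager:
    """Producer/consumer: an internal queue feeds N worker threads."""

    def __init__(self, num_threads, task_func):
        self.task_func = task_func
        self.work_queue = queue.Queue()
        self.threads = [
            threading.Thread(target=self._worker, daemon=True)
            for _ in range(num_threads)
        ]
        for t in self.threads:
            t.start()

    def _worker(self):
        while True:
            item = self.work_queue.get()
            if item is None:
                break
            self.task_func(item)
            self.work_queue.task_done()

    def submit(self, item):
        self.work_queue.put(item)

    def wait(self):
        self.work_queue.join()

class SingleThreadTaskManager:
    """Drop-in replacement: same interface, zero queuing overhead."""

    def __init__(self, num_threads, task_func):
        self.task_func = task_func  # num_threads ignored by design

    def submit(self, item):
        self.task_func(item)  # run the task inline, right now

    def wait(self):
        pass  # nothing queued, nothing to wait for
```

Because both classes expose the same `submit`/`wait` interface, swapping one for the other is a configuration change, not a code change.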
The results of comparing 24 single threaded processes to one 24-thread process startled me. The mob was between 1.8 and 2.2 times faster than the 24-thread process. That did not sit well with me as a staunch supporter of multithreading. I started playing around with my multithreading and made a discovery. For the processor on the 24 core machine, throughput did not increase beyond twelve threads. The overhead of the rest of the code was such that it could not keep all 24 cores busy.
Watching the load level while running the 24 process single thread test, I could easily see that it kept the server's load steady at 24, while the multithreaded version's load rarely rose above 9. That got me thinking: could I get more throughput with multiple multithreaded processes? Could that outperform the 24 single thread processes?
The answer is yes. I first tried two 12-thread processes. That approached the throughput of the 24 single thread processes, but I noticed that the load never rose above 16. That told me there was still room for improvement. I tried twelve 2-thread processes; that matched the single thread throughput, but with the load running at eighty percent of the single thread mob test's level. Various combinations yielded more gold.
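A sweep like this can be sketched as follows. This is my own illustration of the measurement, not the tooling I actually used: `processed_count_fn` is assumed to be some callable that reports the cumulative number of crashes processed (say, from a database counter), and launching each (processes, threads) configuration is left out.

```python
import time

# Candidate (process count, thread count) configurations; the totals
# are the combinations discussed above.
CONFIGS = [(24, 1), (2, 12), (12, 2), (4, 6), (1, 24)]

def measure_throughput(processed_count_fn, duration_seconds=600):
    """Count items processed over a fixed window (10 minutes by default).

    processed_count_fn is assumed to return a cumulative count of
    processed crashes; the throughput is the difference across the
    measurement window.
    """
    start_count = processed_count_fn()
    time.sleep(duration_seconds)
    return processed_count_fn() - start_count

# For each configuration, one would launch the processors, let them
# saturate, then record measure_throughput(...) alongside the load
# average before moving to the next configuration.
```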
The best performer was four 6-thread processes:
| # of processes | # of threads | average # of items processed / 10 min |
|---|---|---|
This is just ad hoc testing with no real formal process. I ought to perform a better controlled study, but I'm not sure I have that luxury right now.