Monday, February 17, 2014

Single Process Multithread vs Multi Process Single Thread

For the first time in my years at Mozilla I have a wonderful opportunity to seriously work and experiment with Socorro.  Back in the days when there was a great brick wall between Dev and IT, I couldn't touch or even gaze upon a production or staging machine.  If something went wrong with my software, I'd have to make a request to see logs and then wait for someone in IT to respond.  Frequently, I'd ask for logs about a specific error, but receive just the log line with the error on it and no context.  I already knew the name of the error.

Times are changing.  We're moving toward a model where developers like me are given sudo everywhere, on the understanding that we are versed in not destroying things.  This gives me an unprecedented opportunity to tune my software.  I built it to be tunable, but tuning has always been guesswork because tuning is an interactive process – something that is hard to do when you have to relay commands to somebody and you can't see what's on their screen.

My first big tuning project involves the Socorro processors.  They are I/O bound multithreaded Python applications.  We've always just set the number of threads equal to the number of CPUs on the machine.  We've always gotten adequate throughput and have never revisited the settings.

By the way, it is important to know that the Socorro processors invoke a subprocess as part of processing a Firefox crash.  That subprocess, written in C, runs on its own core.  While the GIL keeps the Python code running on only one core at a time, the subprocesses are free to run simultaneously on all the cores.  There is a one-to-one correspondence between threads and subprocesses: each thread spawns a single subprocess for each crash that it processes.  Ten threads should mean ten subprocesses.
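The thread-per-subprocess arrangement can be sketched roughly as follows.  This is only an illustration, not the Socorro processor code: `process_crash`, `worker` and `run_threaded` are made-up names, and a child Python interpreter that echoes the crash id stands in for the real C program.

```python
import queue
import subprocess
import sys
import threading


def process_crash(crash_id):
    # Hypothetical stand-in for the external C program: a child process
    # that simply echoes the crash id back on stdout.
    completed = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1])", crash_id],
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.strip()


def worker(jobs, results):
    # Each thread handles one crash at a time, so it has at most one
    # subprocess alive at any moment.
    while True:
        crash_id = jobs.get()
        if crash_id is None:  # sentinel: no more work for this thread
            break
        results.append(process_crash(crash_id))


def run_threaded(crash_ids, number_of_threads):
    jobs = queue.Queue()
    results = []  # list.append is atomic under the GIL
    threads = [
        threading.Thread(target=worker, args=(jobs, results))
        for _ in range(number_of_threads)
    ]
    for thread in threads:
        thread.start()
    for crash_id in crash_ids:
        jobs.put(crash_id)
    for _ in threads:
        jobs.put(None)  # one sentinel per thread
    for thread in threads:
        thread.join()
    return results
```

While a thread waits on its subprocess it releases the GIL, which is why the subprocesses can occupy several cores even though the Python code cannot.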

My colleague, Chris Lonnen, has really wanted to drop the multithreading part and run the processors as a mob of single threaded processes.  In the construction of Socorro, I made the task management / producer consumer system a pluggable component.  Don't want multithreading?  Drop in the single threading class instead.  That eliminates all the overhead of the internal queuing that would still be there if you just set the multithread class to use only one thread.
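Here is a minimal sketch of what a pluggable task manager can look like.  The class names, the `run` method and the `TASK_MANAGERS` registry are all invented for illustration and are not Socorro's actual classes or configuration keys.  The point is only that the single threaded variant has no queue at all, while the threaded variant pays for its queue even with one thread.

```python
import queue
import threading


class SingleThreadedTaskManager:
    """Runs each job inline: no queue, no worker threads."""

    def __init__(self, task_func, number_of_threads=1):
        self.task_func = task_func

    def run(self, job_source):
        return [self.task_func(job) for job in job_source]


class ThreadedTaskManager:
    """Feeds jobs through an internal queue to a pool of worker threads."""

    def __init__(self, task_func, number_of_threads=4):
        self.task_func = task_func
        self.number_of_threads = number_of_threads

    def run(self, job_source):
        jobs = queue.Queue()
        results = []
        lock = threading.Lock()

        def consume():
            while True:
                job = jobs.get()
                if job is None:  # sentinel: shut this worker down
                    return
                result = self.task_func(job)
                with lock:
                    results.append(result)

        threads = [
            threading.Thread(target=consume)
            for _ in range(self.number_of_threads)
        ]
        for thread in threads:
            thread.start()
        for job in job_source:
            jobs.put(job)
        for _ in threads:
            jobs.put(None)
        for thread in threads:
            thread.join()
        return results


# Hypothetical registry: configuration would name one of these keys.
TASK_MANAGERS = {
    "single": SingleThreadedTaskManager,
    "threaded": ThreadedTaskManager,
}


def build_task_manager(name, task_func, **options):
    return TASK_MANAGERS[name](task_func, **options)
```

Swapping one for the other is then a configuration change, for example `build_task_manager("single", some_function)` instead of `build_task_manager("threaded", some_function, number_of_threads=12)`.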

The results of comparing 24 single threaded processes to one 24 thread process startled me.  The mob was between 1.8 and 2.2 times faster than the 24 thread process.  That did not sit well with me as a staunch supporter of multithreading.  I started playing around with my multithreading and made a discovery.  For the processor on the 24 core machine, throughput did not increase for anything more than twelve threads.  The overhead of the rest of the code was such that it could not keep all 24 cores busy.

Watching the load level while running the 24 process single thread test, I could easily see that it kept the server's load steady at 24, while the multithread version rarely rose above 9.  That got me thinking: could I get more throughput with multiple multithreaded processes?  Could that outperform the 24 single thread processes?

The answer is yes.  I first tried two 12 thread processes.  That approached the throughput of the 24 single thread processors.  I noticed that the load did not ever rise above 16.  That told me that there was still room for improvement.  I tried twelve 2 thread processes.  That came close to the single thread throughput while running at eighty percent of the load level of the single thread mob test.  Various combinations yielded more gold.
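The shape of these experiments, N processes each running M threads, can be sketched like this.  It is a toy and not how the Socorro processors are actually launched: `run_mob` and `WORKER_SOURCE` are made-up names, the "work" is just upper-casing strings, and the items are dealt out round-robin, whereas the real processors pull crashes from a shared source.

```python
import subprocess
import sys

# Source for one worker process: it runs its share of the items through a
# thread pool of the requested size and prints one result per line.
WORKER_SOURCE = """
import sys
from concurrent.futures import ThreadPoolExecutor

thread_count = int(sys.argv[1])
items = sys.argv[2:]
with ThreadPoolExecutor(max_workers=thread_count) as pool:
    for result in pool.map(str.upper, items):
        print(result)
"""


def run_mob(items, process_count, threads_per_process):
    # Deal the items out round-robin, one share per process.
    shares = [items[i::process_count] for i in range(process_count)]
    children = [
        subprocess.Popen(
            [sys.executable, "-c", WORKER_SOURCE,
             str(threads_per_process), *share],
            stdout=subprocess.PIPE,
            text=True,
        )
        for share in shares
    ]
    results = []
    for child in children:
        output, _ = child.communicate()
        results.extend(output.split())
    return results
```

With this, `run_mob(items, 24, 1)` is the single thread mob, `run_mob(items, 1, 24)` is one big multithreaded process, and `run_mob(items, 4, 6)` is a mix of the two.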

The best performer was four 6 thread processes:


# of processes   # of threads   average # of items processed / 10 min   comment
24               1              2200
1                24             1000                                    current configuration
1                12             1000
2                12             1800
12               2              2000
4                6              2600
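To put the table in relative terms, here is a small calculation of each configuration's speedup over the current one.  The numbers are copied straight from the table.  The names `RESULTS`, `BASELINE` and `summarize` are just for this sketch, and the per-worker figure simply divides throughput by processes times threads.

```python
# (processes, threads per process, average items per 10 minutes)
RESULTS = [
    (24, 1, 2200),
    (1, 24, 1000),  # current configuration
    (1, 12, 1000),
    (2, 12, 1800),
    (12, 2, 2000),
    (4, 6, 2600),
]

BASELINE = 1000  # throughput of the current 1 process x 24 thread setup


def summarize(results, baseline):
    rows = []
    for processes, threads, items in results:
        rows.append({
            "processes": processes,
            "threads": threads,
            "speedup": items / baseline,
            "items_per_worker": items / (processes * threads),
        })
    return rows


for row in summarize(RESULTS, BASELINE):
    print("{processes:>2} x {threads:<2}  {speedup:.1f}x  "
          "{items_per_worker:.1f} items per worker".format(**row))
```

By this measure the four 6 thread processes come out at 2.6 times the current configuration, against 2.2 times for the 24 single threaded processes.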

This is just ad hoc testing with no real formal process.  I ought to perform a better controlled study, but I'm not sure I have that luxury right now.