The Tale of the Unstable Connection and the Transaction Class

The Socorro project at Mozilla is perpetually plagued by unstable Thrift connections to HBase.  The combined efforts of Dev, IT, and NetOps have only patched over the problem.  For a while we think we've nailed it, but like the cat in the children's song, it comes back, frequently on the very next day.

The Socorro code base has been reactive.  It's been hacked to retry connections to HBase that fail.  Then hacks on top of those hacks retry entire transactions when the lower-level retries fail.

In the grand Socorro backend refactoring currently in progress, I've formalized and generalized the transactional retry behavior in a set of classes:
  •  TransactionExecutor
  •  TransactionExecutorWithLimitedBackoff
  •  TransactionExecutorWithInfiniteBackoff
These classes implement methods that accept a function, a connection context to some resource, and arbitrary function parameters. When instantiated and invoked, these classes call the function, passing it the connection and the additional parameters.  Raising an exception within the function indicates that the transaction failed: a rollback is automatically issued on the connection context. If the function succeeds and exits normally, then a commit is issued on the connection context.

The first class in the list above is the degenerate single-shot case. It doesn't implement any retry behavior. If the function fails by raising an exception, then a rollback is issued on the connection and the program moves on. Success results in a commit and the program moves on.
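The single-shot behavior can be sketched in a few lines of Python. This is an illustration only, not Socorro's actual TransactionExecutor: the names SingleShotExecutor and FakeConnection are hypothetical, and re-raising the exception after the rollback (so the caller decides how to move on) is an assumption.

```python
class FakeConnection:
    """Hypothetical stand-in for a real connection; it only records calls."""

    def __init__(self):
        self.calls = []

    def commit(self):
        self.calls.append("commit")

    def rollback(self):
        self.calls.append("rollback")


class SingleShotExecutor:
    """Illustrative single-shot executor (not Socorro's real class)."""

    def __init__(self, connection_factory):
        # connection_factory is any callable that returns a connection
        self.connection_factory = connection_factory

    def __call__(self, function, *args, **kwargs):
        connection = self.connection_factory()
        try:
            # the function receives the connection plus the extra parameters
            result = function(connection, *args, **kwargs)
        except Exception:
            # an exception means the transaction failed
            connection.rollback()
            raise  # assumption: let the caller decide what happens next
        # a normal exit means the transaction succeeded
        connection.commit()
        return result
```

Calling `SingleShotExecutor(FakeConnection)(save_crash, crash)` with some hypothetical `save_crash(connection, crash)` function would run the function once and then issue exactly one commit or one rollback.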

The latter two classes implement a retry behavior. If the function raises an exception, the transaction class checks to see if the exception is of a type that is eligible for retry. If it is eligible, then a delay amount is selected and the thread sleeps. When it wakes, it tries to invoke the function again with the same parameters. The time delays are specified by a list of integers representing successive numbers of seconds to wait before trying again. For the class TransactionExecutorWithLimitedBackoff, when the list of time delays is exhausted, the transaction is abandoned and the program moves on. The TransactionExecutorWithInfiniteBackoff will never give up, reusing the last delay in the list over and over until the transaction finally succeeds or somebody kills the program.
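Here is a rough sketch of that retry loop, covering both the limited and the infinite variant. It is not Socorro's implementation: run_with_backoff, RetryableError, the retry_forever flag, and the injectable sleep parameter are all hypothetical names chosen for illustration. Making one final attempt after the last delay in the limited case is also an assumption.

```python
import time


class RetryableError(Exception):
    """Hypothetical marker for failures worth retrying, such as a timeout."""


def run_with_backoff(function, connection_factory, delays=(10, 30, 60),
                     retry_forever=False, sleep=time.sleep):
    """Call function(connection), retrying eligible failures after a delay."""
    delays = list(delays)
    if not delays:
        raise ValueError("at least one delay is required")
    attempt = 0
    while True:
        connection = connection_factory()  # fresh connection for each attempt
        try:
            result = function(connection)
        except RetryableError:
            connection.rollback()
            if attempt < len(delays):
                delay = delays[attempt]    # next delay from the list
            elif retry_forever:
                delay = delays[-1]         # infinite: reuse the last delay
            else:
                raise                      # limited: list exhausted, give up
            attempt += 1
            sleep(delay)
        except Exception:
            # not eligible for retry: roll back and let the caller see it
            connection.rollback()
            raise
        else:
            connection.commit()
            return result
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without actually waiting. A real worker would more likely sleep in short slices so it can log progress and notice a shutdown request.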

Recently, while obsessively watching my latest version of the processors run in our staging environment, I caught sight of an unstable Thrift connection and my transaction object waltzing together. Here's how it went down (my commentary is on the lines that start with #).

# we start processing a new crash
2012-12-15 13:03:02,068 INFO - Thread-1 - starting job: 5ec59340-7a0c-4b40-a814-ea7092121215
2012-12-15 13:03:02,074 DEBUG - Thread-1 - about to apply rules
2012-12-15 13:03:02,075 DEBUG - Thread-1 - done applying transform rules
# processing is done, try to save to HBase via Thrift
2012-12-15 13:03:10,065 DEBUG - Thread-1 - make_connection, timeout = 5000
2012-12-15 13:03:10,066 DEBUG - Thread-1 - connection successful
# we got a connection, but...
2012-12-15 13:03:21,366 DEBUG - Thread-1 - retry_wrapper: handled exception, timed out
# it failed when we tried to use it.
# Our older HBase client code automatically retries:
2012-12-15 13:03:21,366 DEBUG - Thread-1 - retry_wrapper: about to retry connection
2012-12-15 13:03:21,366 DEBUG - Thread-1 - make_connection, timeout = 5000
2012-12-15 13:03:21,367 DEBUG - Thread-1 - connection successful
# replacement connection successfully established
2012-12-15 13:03:32,390 DEBUG - Thread-1 - retry_wrapper: handled exception, timed out
# that connection fails, too
# the TransactionExecutorWithLimitedBackoff judges that a timeout is an exception that
# is eligible for retry:
2012-12-15 13:03:32,390 CRITICAL - Thread-1 - transaction error eligible for retry
# it pulls the first delay amount off the list of delays
2012-12-15 13:03:32,391 DEBUG - Thread-1 - retry in 10 seconds
# it now sleeps the allotted time, waking to log every 5 seconds (configurable)
2012-12-15 13:03:32,391 DEBUG - Thread-1 - waiting for retry after failure in transaction: 0sec of 10sec
2012-12-15 13:03:37,397 DEBUG - Thread-1 - waiting for retry after failure in transaction: 5sec of 10sec
# done with waiting 10 seconds
# TransactionExecutorWithLimitedBackoff opens a new connection context
2012-12-15 13:03:42,405 DEBUG - Thread-1 - make_connection, timeout = 5000
2012-12-15 13:03:42,405 DEBUG - Thread-1 - connection successful
# we got a connection, but...
2012-12-15 13:03:51,233 DEBUG - Thread-1 - retry_wrapper: handled exception, timed out
# it failed again.
# the old HBase code does its retry:
2012-12-15 13:04:11,321 DEBUG - Thread-1 - retry_wrapper: about to retry connection
2012-12-15 13:04:11,324 DEBUG - Thread-1 - make_connection, timeout = 5000
2012-12-15 13:04:11,324 DEBUG - Thread-1 - connection successful
# replacement connection successfully established
# but no joy, it's #fail everywhere we look
2012-12-15 13:04:23,133 DEBUG - Thread-1 - retry_wrapper: handled exception, timed out
2012-12-15 13:04:23,166 CRITICAL - Thread-1 - transaction error eligible for retry
# the transaction class pulls the second delay amount off the list of delays
2012-12-15 13:04:23,193 DEBUG - Thread-1 - retry in 30 seconds
# it now sleeps the allotted time
2012-12-15 13:04:23,289 DEBUG - Thread-1 - waiting for retry after failure in transaction: 0sec of 30sec
2012-12-15 13:04:28,485 DEBUG - Thread-1 - waiting for retry after failure in transaction: 5sec of 30sec
2012-12-15 13:04:33,653 DEBUG - Thread-1 - waiting for retry after failure in transaction: 10sec of 30sec
2012-12-15 13:04:38,155 DEBUG - Thread-1 - waiting for retry after failure in transaction: 15sec of 30sec
2012-12-15 13:04:43,774 DEBUG - Thread-1 - waiting for retry after failure in transaction: 20sec of 30sec
2012-12-15 13:04:48,125 DEBUG - Thread-1 - waiting for retry after failure in transaction: 25sec of 30sec
# done with waiting for 30 seconds
# TransactionExecutorWithLimitedBackoff optimistically tries again
2012-12-15 13:04:53,405 DEBUG - Thread-1 - make_connection, timeout = 5000
2012-12-15 13:04:53,422 DEBUG - Thread-1 - connection successful
# woot!
# the function succeeds exactly where it left off with failure before the timeout
2012-12-15 13:04:59,617 INFO - Thread-1 - succeeded and committed: 5ec59340-7a0c-4b40-a814-ea7092121215
# this worker thread is eligible to move on to processing a new crash


This set of transaction classes works for any resource for which there is a connection that can be wrapped in a context. We use the same transaction objects for both Postgres and HBase connections.

Wait a minute, HBase doesn't support transactions or implement commit and rollback. True, but even without a transactional interface, the idea of retrying a failed set of actions is still valid. The connection context for HBase just ignores calls to commit and rollback.
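A wrapper along these lines is one way to give a non-transactional store the commit and rollback methods the transaction classes expect. It is a minimal sketch under assumptions, not Socorro's HBase connection context: NonTransactionalConnection is a hypothetical name, and the wrapped client can be any object.

```python
class NonTransactionalConnection:
    """Hypothetical wrapper for a client that has no transactions."""

    def __init__(self, client):
        self.client = client

    def commit(self):
        """Do nothing: the store has no transaction to commit."""

    def rollback(self):
        """Do nothing: a retry simply repeats the same actions."""

    def __getattr__(self, name):
        # Only called for attributes not found on the wrapper itself,
        # so every other call is passed through to the wrapped client.
        return getattr(self.client, name)
```

Since nothing is undone on rollback, a scheme like this presumably works best when the retried actions are safe to repeat, such as writing the same row again.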

This transactional behavior, along with the compound storage classes with automatic fallback, allows Socorro to keep working even when its backend data stores are not.