Or, "yet another awkwardly named class"
One of the most important tenets of Socorro is to be resilient when external resources fail. The Mozilla Socorro deployment depends on Postgres and HBase to work. However, these are two external resources that can fail.
What happens when we try to write to one of these and we find that the resource in unavailable? Earlier versions of Socorro treated the HBase and Postgres failure cases separately.
For Postgres, since it is a transactional storage system, Socorro employed the native transactional behaviors. Interacting with Postgres involves a series of steps (insert, update, delete, select) followed by commit or rollback. If one of the intervening steps were to fail, we didn't want the program to quit, nor did we want errors to be ignored. Socorro implemented a “backing off retry” behavior. On failure of a step, the code would classify failure into one of two types: retriable and fatal. In either case, a rollback would be issued. In the retry case, the code would sleep for a predetermined amount of time and then retry the transaction from the beginning. In the fatal case, there is no choice except to allow the program to shutdown.
For HBase, true transactions are not supported. However, the behavior Socorro wanted was just the same as the Postgres case: classify the failure and then, if retriable, repeat the steps until we have success. HBase doesn't have the concept of commit and rollback, but the intervening steps of a transaction may be repeated without negative consequence.
Even though the behavior was similar, the two cases were coded independently and shared no code. In the grand Configman refactoring of Socorro, the two cases were merged into one class to maximize reuse. The Postgres case was used as the canonical example. Dummy null op commit and rollback were added to the HBase connection classes to facilitate the use of the class.
How do the TransactionExector classes work? There are three of them with slightly different behaviors:
These classes implement methods that accepts a function, a connection context to some resource and arbitrary function parameters. When instantiated and invoked, these classes will call the function passing it the connection and the additional parameters. The raising of an exception within the function indicates that a failure of the transaction: a rollback is automatically issued on the connection context. If the function succeeds and exits normally, then a 'commit' is issued on the connection context.
The first class in the list above is the degenerate single-shot case. It doesn't implement any retry behavior. If the function fails by raising an exception, then a rollback is issued on the connection and program moves on. Success results in a commit and the program moves on.
The latter two classes implement a retry behavior. If the function raises an exception, the Transaction class checks to see if the exception is of a type that is eligible for retry. If it is eligible, then a delay amount is selected and the thread sleeps. When it wakes, it tries to invoke the function again with the same parameters. The time delays are specified by a list of integers representing successive numbers of seconds to wait before trying again. For the class TransactionExecutorWithLimitedBackoff, when the list of time delays is exhausted the transaction is abandoned and the program moves on. The TransactionExecutorWithInfiniteBackoff will never give up, running the last time in the delay list over and over until the transaction finally succeeds or somebody kills the program.
How does the TransactionExecutor determine if an exception is eligible for retry? The connection context object is required to have a couple instance variables and methods to assist in the determination.
First, operational_exceptions defines a collection of exceptions that are eligible for the retry behavior. If one of the exceptions from this collection is raised, the retry behavior is triggered.
conditional_exceptions is a list of ambiguous exceptions that may or may not be eligible for retry. We encountered this with Postgres using psycopg2 on the ProgrammingError exception. Normally, this type of exception would not be retriable because it indicates a fundamental problem with a query such as a syntax error. Syntax errors are not retrible. However, sometimes we get network errors disguised as ProgrammingErrors; these are retriable.
If an exception found in the conditional_exceptions collection is raised, we have to further examine the error to determine if it should result in a failure or retry. The instance method is_operational_exception implemented by the connection class is used to determine in the current exception is retriable or not. In the case of Postgres, we look to the text of the exception to see if it contains the string “EOF”. We know that's a network error, not really a programming error so we can do a retry.
Is this class named poorly? Now that we've got many more external resources using this retry behavior and only Postgres is truly transactional, it seems that the name may not be right. Perhaps ExternalResourceActionRetrier?