Since 2010, HBase has been our primary storage for Firefox crash data. Spread across roughly 70 machines, we maintained a constant cache of at least six months of crash data. It was never a pain-free system. Thrift, the layer through which Socorro communicated with HBase, seemed to develop a dislike for us from the beginning. We fought it and it fought back.
Through the adversity that embodied our relationship with Thrift/HBase, Socorro evolved fault tolerance and self-healing. All connections to external resources in Socorro are wrapped with our TransactionExecutor. It's a class that recognizes certain types of failures and executes a backing-off retry when a connection fails. It's quite generic, as it wraps our connections to HBase, PostgreSQL, RabbitMQ, Elasticsearch and now Amazon EC2. It ensures that if an external resource fails with a temporary problem, Socorro doesn't fail, too.
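The idea can be sketched in a few lines. This is a hypothetical simplification, not Socorro's actual class: the exception type and the delay schedule are made up for illustration, and the real TransactionExecutor is configured through Configman and handles transactions, not just retries.

```python
import time


class TemporaryResourceError(Exception):
    """Illustrative failure type the executor treats as retriable."""


class TransactionExecutor:
    """Sketch of a backing-off retry wrapper around an external resource.

    The real Socorro class is more elaborate; this only shows the
    recognize-failure / back-off / retry loop described above.
    """

    def __init__(self, retriable=(TemporaryResourceError,),
                 backoff_delays=(10, 30, 60, 120, 300)):
        self.retriable = retriable
        self.backoff_delays = backoff_delays  # seconds between attempts

    def __call__(self, operation, *args, **kwargs):
        for delay in self.backoff_delays:
            try:
                return operation(*args, **kwargs)
            except self.retriable:
                # temporary failure: back down and bide our time
                time.sleep(delay)
        # one last attempt; a permanent failure propagates to the caller
        return operation(*args, **kwargs)
```

Because every external-resource call goes through a wrapper like this, a flaky backend shows up as a pause in processing rather than as a crash of the crash-reporter itself.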
Periodically, HBase would become unavailable. The Socorro system, detecting the problem, would back down, biding its time while waiting for the failed resource to recover. Eventually, after probing the failed resource, Socorro would detect recovery and pick up where it left off.
Over the years, we realized that one of the major features that originally attracted us to HBase was not giving us the payoff we had hoped for. We just weren't using the MapReduce capabilities, and we found that the cost of maintaining HBase was no longer justified.
Thus came the decision to migrate away. Initially, we considered moving to Ceph and began a Ceph implementation of what we call our CrashStorage API.
Every external resource in Socorro lives encapsulated in a class that implements the CrashStorage API. Using the Python package Configman, crash storage classes can be loaded at runtime, giving us a plugin interface. Ceph turned out to be a bust when the winds of change directed us to move to Amazon S3. Because we had implemented the Ceph CrashStorage class using the Boto library, we were able to reuse the code.
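In outline, the plugin interface looks something like the sketch below. The method names and the dotted-name loader are illustrative assumptions: Socorro's real base class has a richer API, and Configman does far more than resolve a class name from a configuration value.

```python
import importlib


class CrashStorageBase:
    """Minimal sketch of the CrashStorage API.

    Each external resource (HBase, S3, Elasticsearch, ...) is wrapped in a
    subclass; the rest of the system only talks to this interface.
    """

    def save_raw_crash(self, crash_id, raw_crash):
        raise NotImplementedError

    def get_raw_crash(self, crash_id):
        raise NotImplementedError


def class_from_string(dotted_name):
    """Configman-style plugin loading: resolve 'package.module.ClassName'
    from a configuration value into the class object at runtime."""
    module_name, class_name = dotted_name.rsplit(".", 1)
    return getattr(importlib.import_module(module_name), class_name)
```

Since the concrete class is named in configuration rather than in code, swapping a Ceph backend for an S3 backend is an edit to a config value, not a code change.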
Then began the migration. Rather than just flipping a switch, our migration was gradual. We started 2014 with HBase as primary storage.
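A gradual migration like this can be expressed with an aggregate storage class that fans writes out to several backends while reads stay on the primary; cut-over is then just a matter of reordering the backends in the configuration. The sketch below is an assumption about the shape of such a class, with a toy dict-backed backend standing in for HBase and S3:

```python
class DictCrashStorage:
    """Toy in-memory backend standing in for HBase or S3."""

    def __init__(self):
        self.store = {}

    def save_raw_crash(self, crash_id, raw_crash):
        self.store[crash_id] = raw_crash

    def get_raw_crash(self, crash_id):
        return self.store[crash_id]


class PolyCrashStorage:
    """Hypothetical fan-out store: every save goes to all backends,
    reads come from the first (primary) one. Running old and new
    stores side by side this way lets the new store fill up before
    it is promoted to primary."""

    def __init__(self, primary, *secondaries):
        self.backends = (primary,) + secondaries

    def save_raw_crash(self, crash_id, raw_crash):
        for backend in self.backends:
            backend.save_raw_crash(crash_id, raw_crash)

    def get_raw_crash(self, crash_id):
        return self.backends[0].get_raw_crash(crash_id)
```

With this shape, each stage of the migration is a configuration change: HBase alone, then HBase primary with S3 shadowing, then S3 primary, then S3 alone.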
I was amused by the non-event that was the severing of Thrift from Socorro. Again, it was a matter of editing HBase out of the configuration and sending a SIGHUP, causing HBase to fall silent. Socorro didn't care. Announced several hours later on the Socorro mailing list, it seemed more like a footnote than an announcement: "oh, by the way, HBase is gone".
The primary datastore migration is not the end of the road. We still have to move the server processes themselves to Amazon's infrastructure. Because everything is captured in the Socorro configuration, however, we do not anticipate that this will be an onerous process.
I am quite proud of the success of Socorro's modular design. I think we programmers only ever really shuffle complexity around from one place to another. In my design of Socorro's crash storage system, I swung the pendulum far to one side, moving the complexity into the configuration. That has disadvantages. However, in a system that has to evolve rapidly in the face of changing demands and changing environments, we've just demonstrated a spectacular success.
Credit where credit is due: Rob Helmer spearheaded this migration as the DevOps lead. He pressed the buttons and reworked the configuration files. Credit also goes to Selena Deckelmann, who led the way to Boto for Ceph, which in turn gave us Boto for Amazon. Her work on the Boto CrashStorage class was invaluable. Me? While I wrote much of the Boto CrashStorage class and I'm responsible for the overall design, I was able to mainly just be a witness to this migration. Kind of like watching my children earn great success, I'm proud of the Socorro team and look forward to the next evolutionary steps for Socorro.