Monday, January 07, 2013

HBase as Socorro Crash Storage

Nota Bene: In this blog posting, I'm stretching my area expertise.  I suspect that there may be inaccuracies in my understanding of the specifics of how HBase, the Hadoop File System and Hadoop work.  I will make corrections to this posting as others point them out to me.

In 2010, Daniel Einspanjer and I conspired to migrate the Socorro primary data store from the file system to HBase. Instead of storing the all of our crashes in a monolithic NFS mounted filesystem, we'd save them in the Hadoop File System with HBase. Being a distributed file system, Hadoop FS would give us great headroom for growth as well as fault tolerance. Hbase, with techniques for higher level organization and indexing, brings a type of queries to the proverbial table. The ability to also execute analytical map reduce jobs over the corpus of crashes sealed the deal.

At the time of implementation, there was no agreed on standard Crash Storage API. The only thing we had to work with was the existing API for the File System Storage. It really wasn't entirely appropriate for the semantics of interacting with HBase. To expedite implementation, we agreed that Daniel would make a Pythonic API adaptation of the HBase / Thrift API for storing crashes. I would further adapt his work and, if necessary, add another layer to fit into the newly conceived Crash Storage framework (this was the predecessor to the modern Configman Crash Storage System).

Relational database programmers take for granted the indexing tools inherent in the relational data model. HBase programming is exactly unlike that. HBase code gives you a primitive data structure: the table and very little else. If you want indexing, you have to implement the technique yourself using additional HBase tables. Queries happen on tables by setting up scans that test every row of the table of interest. There are no joins. From the perspective of a relational database, this sounds horrific, but in reality it isn't so bad. The data is distributed across many machines, so the scan happens in parallel and can happen relatively quickly.

Rows in our tables in HBase have a key called a 'row_id'.   Just like we assign meaning to regions of our own string crash_id, parts of a string row_id have meaning to HBase.  The first character of the row_id controls which region of the distributed file system in which the row is stored.  It is our best interest to make sure that these are evenly distributed across the regions.

Our crash_ids start out as a UUID before we alter them.  The possible values of the first character is guaranteed to evenly distributed across the domain of hex characters.  We use that first character of a crash_id as the first character of the row_id.

We frequently want to look at sets of crashes by date.  The interface that we have for HBase to gives us the option of a prefix scan of HBase tables. A prefix scan will return table rows for which a prefix matches row_ids.  We make the second through seventh character of a row_id is the date of submission, identical to the last six digits of the crash_id. This facilitates iterating over date ranges.

For example, if we want to see all the crashes from January 1, 2013, we'd do a prefix scan sixteen times (once for each possible hex digit of the first character):


The composition of an HBase row_id from a crash_id.

The Socorro HBase Tables

Our HBase schema consists of one data table and several index and statistics tables.
  • crash_reports 
  • crash_reports_index_hang_id_submitted_time 
  • crash_reports_index_hang_id 
  • crash_reports_index_legacy_unprocessed_flag crash_reports_index_unprocessed_flag 
  • crash_reports_index_signature_ooid 
  • metrics


the main data table containing all three parts of a crash: raw_crash, dump, and processed_crash. It uses the row_id form outlined above as the row identifier.  The values are stored in rows with columns as in a relational database, but the columns don't have to be the same in every row (which in my mind means that they aren't really columns). Columns are divided into families:

  • ids : the column family for any ids associated with the crash.
    • 'ooid' – the original crash_id assigned by the collector
    • 'hang_id' – the id shared by a hang pair. This is a deprecated concept that ought to be excised from the code. It was replaced by crashes with multiple dumps.
  • raw_data : this is the column family for the binary breakpad dumps. The main raw binary crash dump is in the column 'dump'. While the auxiliary dumps use the names originally given to them by breakpad on the client at the time of the crash.
  • meta_data : this is the column family home of the raw_crash data. It contains a single column called 'json'. An opinion: I find it most unfortunate that we weren't more vigilant in naming columns. 'json' is really a type name rather than a meaningful identifier of the purpose of the data.
  • processed_data : this is the family for data regarding the output of the processor.
    • 'json' – the unfortunate name for the processed_crash in json form
    • 'signature' – the signature from the processed_crashs
  • flags : this a set of binary flags that describe the state of the data in the row:
    • 'processed' – 'Y' if the raw_crash has been processed to make a processed_crash, otherwise 'N'
    • 'legacy_processing' – 'Y' if the collector's throttling system indicated that this crash was to be processed rather than deferred, otherwise the column just doesn't exist.
  • timestamps : a column family for timestamps associated with the crash 
    • 'submitted' – the timestamp at the time of the submission to the collector
    • 'processed' – the timestamp of the time of completion of processing by the processor.


This is a deprecated table for hang pairs. Hangs used to be submitted in multiple crashes. As of release 34, Socorro will support multidump crashes instead. 


Another deprecated table involved in supporting crash hang pairs superseded by multidump.


This table serves as an index for the crash_reports table. It consists of a single column family 'ids' with a single column, 'ooid'. Every row in this table corresponds to a row in the 'crash_reports' table. It is used a queue for crashes that were marked for processing. The monitor scans this table and assigns crash_ids to the processors.


This tables serves as the list of crashes that have not been processed across all storage. Having just a single column family called 'ids', only the crash_id of a given unprocessed crash is stored in this table.  It appears that we're not directly using this table in our Socorro code.  However, values from this table are available for analysis using Hadoop tools.


This table is an indexing scheme for finding signatures. Each row has a unique key consisting of the union of a crash signature and a crash_id. The data in the row is the a single column family called 'ids' containing a single row called 'ooid'.


This is a table of raw statistics. As crashes are added or removed from HBase, counters are incremented or decremented in concert. This table also hold some statistics about aspect of the flow of crashes: the counters: year, month, day, hour, minute, total inserts, number of throttled crashes, number of hangs, number of hangs by hang type, current number of unprocessed crashes, current number of throttle crashes that are unprocessed.  An opinion: I cannot see anywhere in our codebase that we are using any of these stats.  Perhaps we ought to start using them, or stop collecting them.

The HBase Client API 

The module socorro.external.hbase.hbaseclient defines a connection class for HBase called HBaseConnection. The base class defines the semantics of the connection: establish connection, close connection. The derived class,  HBaseConnectionForCrashReports, embues a connection with Socorro domain specific methods such as saving raw and processed crashes, fetching crashes and scanning for crashes with a given attribute. This module is also is a standalone HBase query tool thah can be used at the command line to execute any of the domain specific methods.  This is very useful for debugging Socorro and Hbase.

The HBase Connection Retry System 

In the first month of our deployment of HBase and then continuing periodically for all the years since, we have trouble establishing and maintaining connections to HBase. Rather than having users of the module handle connection troubles, we added a retry decorator to all the domain methods of the HBase connection class. This means that if a domain function raises an exception from a list of exceptions eligible for retry (like timeout, or server unavailable), the decorator code catches the exception and retries the method from the beginning. The decorator uses a looping constraint, so that it will only retry a certain number of times. Should the method fail and the number of retries if exhausted, the exception will escape out for the client application to deal with it.

In the modern Crash Storage System, all HBase actions are placed in a transaction object. This object encapsulates the retry behavior using a time delay back off. When the hbase client module is refactored or rewritten, the native retry behavior will be removed in favor of the transaction object's behavior. Right now in 2013, both systems are in use and they don't interfere with each other.

The Live Command Line Program 

The module also includes a command line tool that allows any of the Socorro domain methods to be invoked from the command line. This is very handy for manual testing or collecting crash data for out-of-system analysis.

To use this feature, invoke it like this, then follow the instructions regard the methods supported.
    python ./socorro/external/hbase/ --help

    Usage: ./socorro/external/hbase/ [-h host[:port]] command [arg1 [arg2...]]

        Crash Report specific:
          get_report ooid
          get_json ooid
          get_dump ooid
          get_processed_json ooid
          get_report_processing_state ooid
          union_scan_with_prefix table prefix columns [limit]
          merge_scan_with_prefix table prefix columns [limit]
          put_json_dump ooid json dump
          put_json_dump_from_files ooid json_path dump_path
          export_jsonz_for_date YYMMDD export_path
          export_jsonz_tarball_for_date YYMMDD temp_path tarball_name
          export_jsonz_tarball_for_ooids temp_path tarball_name <stdin ooids list>
          export_sampled_crashes_tarball_for_dates sample_size
                   YYMMDD,YYMMDD,... path tarball_name
        HBase generic:
          describe_table table_name
          get_full_row table_name row_id

Up Next - The HBase and the Crash Storage API and the Future of HBase in Socorro