Friday, December 28, 2012

The Socorro Crash Storage System

My previous blog posting showed how Configman enables Socorro to be highly configurable.  In this next series of postings, I'm going into detail about the crash storage system.

Socorro defines a set of classes for use in storing and fetching crashes from external storage. Rooted in a single base class, the hierarchy of the CrashStorage classes defines the Crash Storage API for the basic save and fetch methods.

The base class for this scheme, CrashStorageBase, lives in the .../socorro/external/ file. That file defines these public methods: save_raw_crash, save_processed, save_raw_and_processed, get_raw_crash, get_raw_dump, get_raw_dumps, get_raw_dumps_as_files, get_processed, remove, new_crashes. The classes that derive from this base implement the details for a given storage medium. For example, .../socorro/external/hbase/ defines a class derived from CrashStorageBase called HbaseCrashStorage. Using the Thrift API, that class provides everything needed to save crashes to, and fetch them from, an instance of HBase.

All of the “configmanized” Socorro back-end applications employ derivatives of the CrashStorage class hierarchy to save and fetch crashes. Each application has at least one “crashstorage_class” configuration parameter that can be set to any of the crash storage implementations. That means that the Socorro back end can be mixed and matched to implement a system tailored to the scale of the operation. A small installation of Socorro could use a file system crash storage implementation as primary storage. A large installation, such as Mozilla's, eschews file system storage in favor of HBase.

At the end of 2012, Socorro has implementations for Postgres, HBase, ElasticSearch, HTTP_Post, and three flavors of FileSystems. It is easy to imagine a wide variety of crash storage schemes implemented on top of other underlying stores: MySQL, Mongo, a queuing system, pipes, etc.

The implementation classes in the hierarchy are not required to implement the entire API. Our current implementation of crash storage in Postgres does not store raw crashes, so PostgresCrashStorage is silent on the implementation of 'save_raw_crash' and 'get_raw_crash'.

The default behavior in the base class for saving operations is to silently ignore the request. This prevents an intentionally unimplemented method in one member of an aggregate storage implementation from derailing the whole aggregate. See the Aggregate Crash Storage section below for more information. Honestly, I think this ought to be revisited. A better behavior may be to raise a NotImplementedError and let the Aggregate Crash Storage decide whether to eat the exception or pass it up the call chain.

The default behavior in the base class for fetching operations is to raise a NotImplementedError. We want this behavior because fetching doesn't currently participate in Aggregate Crash Storage. If, through configuration, we've specified a crash source that doesn't implement fetching, then we clearly want that error to propagate upward and stop the system.
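To make the two defaults concrete, here is a minimal sketch of the idea. It is not the Socorro source: the class names CrashStorageSketch and ProcessedOnlyStorage, the method signatures, and the "crash_id" key in the processed crash are all assumptions made for illustration.

```python
class CrashStorageSketch(object):
    """Hypothetical stand-in for the base class described above."""

    # Saving defaults to a silent no-op, so a storage class that
    # intentionally skips a save method does not raise.
    def save_raw_crash(self, raw_crash, dumps, crash_id):
        pass

    def save_processed(self, processed_crash):
        pass

    # Fetching defaults to an error, so a misconfigured crash source
    # fails loudly instead of quietly returning nothing.
    def get_raw_crash(self, crash_id):
        raise NotImplementedError("get_raw_crash is not implemented")

    def get_processed(self, crash_id):
        raise NotImplementedError("get_processed is not implemented")


class ProcessedOnlyStorage(CrashStorageSketch):
    """Stores processed crashes only, in the spirit of the Postgres case."""

    def __init__(self):
        self._processed = {}

    def save_processed(self, processed_crash):
        # "crash_id" is an assumed key name for this sketch.
        self._processed[processed_crash["crash_id"]] = processed_crash

    def get_processed(self, crash_id):
        return self._processed[crash_id]
```

With these definitions, calling save_raw_crash on a ProcessedOnlyStorage instance does nothing, while calling get_raw_crash raises NotImplementedError.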

Aggregate Crash Storage 

The file .../socorro/external/ defines two special aggregating classes. These two classes implement the entire crash storage API and serve as a proxy for a collection of other crash storage classes.

The class PolyCrashStorage holds a collection of crash storage implementations. When called with one of the 'save' methods, it forwards the save action to each of the storage classes in its collection. This class is used primarily in the processor for saving processed crashes to Postgres and whatever primary storage scheme is in use. In Mozilla's case, primary crash storage is HbaseCrashStorage. In the future, the Processor will also save to ElasticSearch. An obvious future enhancement to PolyCrashStorage is to make it multithreaded, so that it saves to each of its subsidiary stores simultaneously.
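The forwarding idea can be sketched in a few lines. This is only an illustration under assumptions, not the real PolyCrashStorage: the names BaseSketch, PrimaryStore, ProcessedOnlyStore and PolySketch, the method signatures, and the "crash_id" key are all hypothetical.

```python
class BaseSketch(object):
    """Hypothetical base: save methods default to silent no-ops."""

    def save_raw_crash(self, raw_crash, dumps, crash_id):
        pass

    def save_processed(self, processed_crash):
        pass


class PrimaryStore(BaseSketch):
    """Keeps both raw and processed crashes in memory."""

    def __init__(self):
        self.raw = {}
        self.processed = {}

    def save_raw_crash(self, raw_crash, dumps, crash_id):
        self.raw[crash_id] = (raw_crash, dumps)

    def save_processed(self, processed_crash):
        self.processed[processed_crash["crash_id"]] = processed_crash


class ProcessedOnlyStore(BaseSketch):
    """Implements save_processed only; save_raw_crash stays a no-op."""

    def __init__(self):
        self.processed = {}

    def save_processed(self, processed_crash):
        self.processed[processed_crash["crash_id"]] = processed_crash


class PolySketch(BaseSketch):
    """Forwards every save call to each store in its collection, in order."""

    def __init__(self, stores):
        self.stores = list(stores)

    def save_raw_crash(self, raw_crash, dumps, crash_id):
        for store in self.stores:
            store.save_raw_crash(raw_crash, dumps, crash_id)

    def save_processed(self, processed_crash):
        for store in self.stores:
            store.save_processed(processed_crash)
```

Because the no-op default is inherited, forwarding save_raw_crash to a store that only handles processed crashes is harmless, which is the reason given earlier for that default.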

The second Aggregate Crash Storage is FallbackCrashStorage. This is a storage scheme that holds two other crash storage instances: primary and secondary. The secondary crash storage system is used only if the primary fails. For example, Mozilla has used this idea for an emergency backup. If the primary storage, HBase for example, is unreachable, then this class will fall back to storing in the secondary CrashStorage instance. The secondary crash storage would most likely be a local filesystem crash storage scheme.
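A rough sketch of the fallback behavior follows. It assumes that "fails" means the primary raised an exception, and it does not say what the real class does when both stores fail. FallbackSketch, MemoryStore, UnreachableStore and the save_raw_crash signature are hypothetical names for this example.

```python
class FallbackSketch(object):
    """Tries the primary store first; uses the secondary only on failure."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def save_raw_crash(self, raw_crash, dumps, crash_id):
        try:
            self.primary.save_raw_crash(raw_crash, dumps, crash_id)
        except Exception:
            # Assumption: any exception from the primary triggers fallback.
            # An error raised by the secondary propagates to the caller.
            self.secondary.save_raw_crash(raw_crash, dumps, crash_id)


class MemoryStore(object):
    """In-memory store standing in for a local file system store."""

    def __init__(self):
        self.raw = {}

    def save_raw_crash(self, raw_crash, dumps, crash_id):
        self.raw[crash_id] = (raw_crash, dumps)


class UnreachableStore(object):
    """Simulates a primary store that cannot be reached."""

    def save_raw_crash(self, raw_crash, dumps, crash_id):
        raise IOError("primary storage is unreachable")
```

When the primary is healthy the secondary never sees the crash, so the secondary holds only what arrived during an outage.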

The Aggregate Crash Storage classes are on par with all the specific implementations of crash storage. In other words, any of the aggregate crash storage classes can appear anywhere that a Crash Storage class is needed. Since the aggregates themselves require subsidiary Crash Storage instances, they can be nested recursively. For example, a PolyCrashStorage instance could have a FallbackCrashStorage instance in its collection, which in turn could hold other PolyCrashStorage instances.
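The nesting works because every class, aggregate or not, exposes the same save methods. The sketch below shows one level of nesting with stand-in classes. All names here (PolySketch, FallbackSketch, MemoryStore, UnreachableStore) are hypothetical, and only save_processed is modeled.

```python
class PolySketch(object):
    """Forwards save_processed to every store in its collection."""

    def __init__(self, stores):
        self.stores = list(stores)

    def save_processed(self, processed_crash):
        for store in self.stores:
            store.save_processed(processed_crash)


class FallbackSketch(object):
    """Uses the secondary store only if the primary raises."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def save_processed(self, processed_crash):
        try:
            self.primary.save_processed(processed_crash)
        except Exception:
            self.secondary.save_processed(processed_crash)


class MemoryStore(object):
    def __init__(self):
        self.saved = []

    def save_processed(self, processed_crash):
        self.saved.append(processed_crash)


class UnreachableStore(object):
    def save_processed(self, processed_crash):
        raise IOError("storage is unreachable")


# A poly store whose second member is itself an aggregate.
database = MemoryStore()
local_disk = MemoryStore()
storage = PolySketch([
    database,
    FallbackSketch(UnreachableStore(), local_disk),
])
storage.save_processed({"crash_id": "abc"})
```

The caller of storage.save_processed neither knows nor cares how many layers sit underneath it.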

The Crash Storage API Methods

  • save_raw_crash: Socorro receives crashes from the wild via an HTTP POST to a collector.  The collector takes the form data and creates the raw crash.  A raw crash is data in a mapping that can be serialized easily in JSON form.  With the POST to the collector come one or more binary blobs of crash data called 'dumps'.  The collector splits those out of the form data and does not include them in the JSON raw crash.  Each dump is named by its field name in the originally submitted form.  The collector takes the names and the dumps and makes a mapping from them.  The save_raw_crash method pushes the raw crash mapping and the dumps mapping to the underlying storage medium.
  • save_processed: The processed crash is created by the Processor.  It consists of transformed copies of values from the raw crash as well as the output of MDSW on each of the binary dumps.  There is no binary data in the processed crash.  This method accepts a processed crash in a JSON compatible Mapping and saves it to the underlying storage mechanism.
  • save_raw_and_processed: this is a convenience function that will save both raw and processed crashes in one call.  It is used in apps that need to move crashes from one storage system to another.
  • get_raw_crash: this call accepts a crash_id and returns a raw crash mapping.  It does not fetch binary dump information.  If the crash is not found, then it raises a not found exception.
  • get_raw_dump: this is the method that will return a single named binary dump for a given crash_id.  If no name is specified, the first dump found will be returned.  If the crash is not found, then it raises a not found exception.
  • get_raw_dumps: this method returns a mapping of all the binary dumps associated with the provided crash_id.  If the crash is not found, then it raises a not found exception.
  • get_raw_dumps_as_files: the binary dumps can be large.  Sometimes it is undesirable to load them all into memory at the same time.  This method returns a mapping of dump names to pathnames in a file system.  The user of this method is then responsible for opening and reading the files.  For crash storage schemes that actually store their files in a file system, this function may actually return just the existing pathname.  For crash storage schemes that use other methods for actual storage, temporary files will be written to house the dumps.  The user of this method ought to look in the pathnames for the string "TEMPORARY".  If that exists, then the user of this method is responsible for cleanup of the temporary files when it is done.  This method is employed by the Processor to get the binary dumps.  MDSW is a separate program that expects its input to be in the form of a file. 
  • get_processed: this method just returns the processed crash in a JSON compatible mapping.
  • remove: this method deletes all traces of both the raw and processed crashes from a crash storage scheme.
  • new_crashes:  this method is a generator.  It yields a series of crash_ids for raw crashes that arrived in the storage scheme since the last time this generator was invoked.  Its results are generally not repeatable.  Once a crash_id has been yielded, it will not be seen again unless that crash is removed from storage and then resaved.
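The "TEMPORARY" convention described for get_raw_dumps_as_files puts the cleanup burden on the caller. Here is a minimal sketch of how a caller might honor it. The helper functions, the InMemoryStorage class, and the temp file naming are assumptions for this example, not Socorro code.

```python
import os
import tempfile


def cleanup_temporary_dumps(dump_paths):
    """Delete only the files whose pathname contains "TEMPORARY".

    Returns the dump names that were removed.
    """
    removed = []
    for name, path in dump_paths.items():
        if "TEMPORARY" in path:
            os.unlink(path)
            removed.append(name)
    return removed


def total_dump_size(storage, crash_id):
    """Example consumer: read the dump files, then clean up after itself."""
    dump_paths = storage.get_raw_dumps_as_files(crash_id)
    try:
        return sum(os.path.getsize(path) for path in dump_paths.values())
    finally:
        cleanup_temporary_dumps(dump_paths)


class InMemoryStorage(object):
    """Hypothetical store with no files of its own, so it writes temp files."""

    def __init__(self, dumps_by_crash_id):
        self._dumps = dumps_by_crash_id

    def get_raw_dumps_as_files(self, crash_id):
        paths = {}
        for name, blob in self._dumps[crash_id].items():
            # The suffix marks the file as temporary for the caller.
            fd, path = tempfile.mkstemp(suffix=".TEMPORARY.dump")
            with os.fdopen(fd, "wb") as dump_file:
                dump_file.write(blob)
            paths[name] = path
        return paths
```

A storage scheme that already keeps dumps on disk would return its existing pathnames without the marker, and cleanup_temporary_dumps would leave those files alone.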

My next blog posting will go into detail about each of the crash storage schemes in Socorro.