Monday, December 31, 2012

Socorro File System Storage

In Socorro's first implementation in 2007, the collectors saved crashes to an NFS-mounted file system.  Each time a new crash was saved on NFS, a daemon of some sort (I don't recall the details) would wake and notify the processors that a new crash had arrived.  Each of the processors would scramble to try to process the same crash. Of course, they'd collide, using the existence of a record in Postgres and the subsequent failure of a transaction as their contention resolution mechanism. This system collapsed spectacularly under load for several reasons:
  1. So many crashes would come in rapid succession that the daemon's notifications to the processors quickly saturated the network to the point of paralysis.
  2. Once there are more than about a thousand files in any given directory in NFS, many NFS operations start to thrash.  Just getting a listing of such a directory could take several minutes.
  3. The processors were using the database as a locking mechanism.  Multiple processors would start processing the same crash with only one eventually winning and getting to save its results.  The processors quickly became saturated with work that was destined to be thrown away.
Why not just push crashes from the collectors directly into the relational database?  The Socorro requirements state that the collectors must not rely on a service that could go down.  The highest priority is to never lose crashes due to an outage of some sort.  So why NFS? Isn't that a service that could go down?  Well, yes it is, but at least it is ostensibly more stable than relational databases. 

Fixing this broken crash storage / queuing system was my first task when I agreed to take on Socorro as my responsibility.  I quickly worked to replace the existing flat NFS storage system with a new scheme.  The monitor application was created as arbiter to avoid contention between the processors.  (Credit where credit is due: Frank Griswold was key in assisting with the design and implementation of the File System storage.  It wouldn't have happened so quickly without his knowledge and hard work.)

Still rooted in NFS, this new storage scheme used the file system as the implementation of an old-fashioned hierarchical database.  The scheme uses radix directories and symbolic links as the underlying storage mechanism.  The crash_id, an adulterated UUID, is the index used to locate crashes in the storage system.

The file system storage consists of one file system directory branch for each day of crash storage.  The names of the day directories are of the form YYMMDD.  Given a crash_id, how can we tell which day directory it belongs to?  I lopped off the last six digits of the original UUID and encoded the date in that same YYMMDD format.  Now, given any crash_id, we know instantly which day directory it can be found in.
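
In code, the lookup is nothing more than string slicing. Here's a minimal sketch (the function name is mine, not Socorro's):

    # A minimal sketch, not the actual Socorro code: the last six
    # characters of a crash_id encode its submission date as YYMMDD,
    # which is also the name of the day directory that holds it.
    def day_directory_for(crash_id):
        return crash_id[-6:]

    # For example, a crash_id ending in "090514" was received on
    # 2009-05-14 and therefore lives under the ".../090514/" branch.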

We knew that we were going to have to access crashes by both date and name, so the next level down from the day directory is a bifurcation: literally the words “name” and “date” (though, more accurately, the latter ought to have been called “time”).  The “name” branch serves both as storage for the files that comprise a crash and as the indexing mechanism for name lookups.  The “date” branch is an indexing mechanism that allows Socorro to roughly detect the order in which crashes arrived.

The “name” Branch


I employed a radix scheme taking the hex digits in pairs from the start of crash_id.  For example, a crash_id that looks like “38A4F01E...090514” would be stored in a directory structure like this:   

.../090514/name/38/A4/...  

That limits the number of entries within the 'name' directory to 16^2, or 256.  We feared we could still end up with too many files in the leaf nodes, so we stretched the scheme out to four pairs of hex digits:  

.../090514/name/38/A4/F0/1E/...  

We figured that the leaf nodes would be sufficiently diluted that we wouldn't have the problem of too many files in the leaf directories.  At the time, we didn't know how many crashes per day we were going to get, and we wanted to be as flexible as possible.  So we took another digit from the UUID and coded it to reflect how deep the hierarchy should go: digit -7 from the right end signifies how deep the radix naming scheme goes.  In practice, we found that two levels were sufficient to spread the crashes out over all the directories.
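
Putting the pieces together, building a path on the “name” branch is roughly the following (an illustrative sketch only; the function and argument names are mine, and the real module handles many more details):

    # Illustrative sketch only -- not Socorro's actual implementation.
    # The last six digits of the crash_id give the day directory, and
    # digit -7 says how many hex-digit pairs to use as radix levels.
    def name_branch_path(root, crash_id):
        day = crash_id[-6:]                  # YYMMDD day directory
        depth = int(crash_id[-7])            # radix depth encoded in the crash_id
        pairs = [crash_id[i:i + 2] for i in range(0, depth * 2, 2)]
        return "/".join([root, day, "name"] + pairs)

    # With a depth digit of 2, a crash_id beginning "38A4F01E" and ending
    # "090514" would map to something like "/crashes/090514/name/38/A4".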

The “date” Branch


The date branch implements the indexing scheme so that we know roughly what order the crashes arrived in.  The next level below the literal “date” directory is the hour of the day in which the crashes arrived, in the form HH.  Below that is a series of buckets (referred to as slots in the code).  The buckets are named either by the arrival minute or, if there are multiple collectors working in the same file system, by the host name of the collector.  The buckets have an additional sequence number that's incremented if the number of entries in a given bucket crosses a configurable threshold.  A crash that came in at 5 minutes after noon on 2009-05-14 would have an entry like this:

.../090514/date/12/05_0/38A4F01E...090514

So what is the file in that directory?  It's a symbolic link over to the directory in the name branch that holds the actual crash.  When the symbolic link is created, a back link is also created so that leaps can be made back and forth between the two branches.
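
In sketch form, the linking step looks something like this (the relative-path handling and the back link's name are my own invention, not the real code's):

    import os

    # A rough sketch of the two-way linking described above; the real
    # implementation differs in its details.  name_dir is the directory
    # on the "name" branch holding the crash's files; date_dir is the
    # minute/slot bucket on the "date" branch.
    def link_branches(name_dir, date_dir, crash_id):
        # the "date" entry: a symbolic link named after the crash_id
        # pointing at the crash's directory on the "name" branch
        os.symlink(os.path.relpath(name_dir, date_dir),
                   os.path.join(date_dir, crash_id))
        # the back link: lets a later traversal leap from the "name"
        # branch back to the "date" entry so both can be removed together
        os.symlink(os.path.relpath(date_dir, name_dir),
                   os.path.join(name_dir, crash_id + ".date_link"))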

(Figure omitted: the decomposition of a “crash_id” and its relationship to the File System Storage directory structure.)

The monitor application was originally structured to read new crashes from the File System Storage and then divvy them out to the processors.  We wanted to process crashes in roughly the same order in which they arrived.  The monitor would walk the date branch looking for symbolic links.  On finding one, it would get the crash_id value (the adulterated UUID) from the name of the link itself, and insert it into the processing queue in the database. Sometime in the future, a processor would get the crash_id and, by its very form, know how to look it up in the name branch of the directory structure.

The monitor wants to see a new crash only once.  So here's where it gets weird. Once the monitor found a symbolic link and queued the crash for processing, it would destroy the symbolic link as well as the corresponding back link.  This ensures that the next time the monitor walks the tree looking for new crashes, it won't even see crashes that it has seen before. The “date” branch index is read-once.  After it has been read, it'll never be read again.

This seems too complicated.  Why bother with symbolic links when an empty file with just the crash_id as a name would do?  A little empirical study revealed that it is significantly faster to read a symbolic link than it is to read a file.  The symbolic link information is actually written in the underlying file system inode. Creating and deleting symbolic links is much faster than creating and deleting real files.

What is the purpose of the backlink?  When is it used?  Sometimes the system would get backlogged and the processing of any given crash could be minutes or hours away.  I created a system called “priority processing” that could, when requested, bump a crash up in the queue for immediate processing.

Priority processing is implemented by the monitor.  It finds crash_ids for priority processing in a table in the database.  For each crash_id, it goes to the file system and looks it up on the “name” branch.

If it sees there is no symbolic link from the “name” branch to the “date” branch, the monitor knows that it has already seen that crash_id.  It goes to the crash processing queue in the database and moves the crash to the head of the queue.

If it does find a symbolic link, it uses that link to jump over to the “date” branch, where it deletes the link back to the “name” branch.  Then it deletes the link going the opposite way.  This ensures that the regular queuing mechanism of the monitor will not see and queue the crash a second time for processing.
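
Reduced to a sketch, the priority pass looks something like this (the helper names, the back-link naming convention, and the queue object are hypothetical):

    import os

    # A simplified sketch of the monitor's priority pass; helper names
    # and the ".date_link" convention are hypothetical, not Socorro's.
    def prioritize(crash_name_dir, crash_id, queue):
        back_link = os.path.join(crash_name_dir, crash_id + ".date_link")
        if not os.path.islink(back_link):
            # no back link: the regular sweep has already queued this
            # crash, so just promote it to the head of the queue
            queue.move_to_head(crash_id)
        else:
            # still linked: remove the "date" entry and the back link so
            # the regular sweep can't queue it a second time, then queue
            # it for immediate processing
            date_dir = os.path.join(crash_name_dir, os.readlink(back_link))
            os.unlink(os.path.join(date_dir, crash_id))   # date -> name link
            os.unlink(back_link)                          # name -> date back link
            queue.insert_at_head(crash_id)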

Wow, that's a lot of moving parts.  Even though this system is old, crusty, and rather complicated, it remains one of the most stable storage systems that Socorro has.  File systems don't fail very often.  This file system code rarely fails either.  It is still in use as the primary storage for collectors and as an emergency backup when our new primary storage, HBase, fails or is unstable.

This file system storage is destined for refactoring/reimplementation in Q1 2013.  It was built before we instituted coding standards.  More importantly, we need it to have actual useful unit tests.  In 2010, the Mozilla project graduated away from using this scheme as the primary data storage in favor of HBase.  However, we're still using this scheme for temporary storage for the collectors.  They write all their crashes to a local disk using this system.  A separate process spools them off to HBase asynchronously.   I'll write about the HBase role in Socorro crash storage two blog posts from now.  

In my next blog posting, I'll show how variants of the File System Crash Storage comprise “standard storage”, “deferred storage” and “processed storage”.

Friday, December 28, 2012

The Socorro Crash Storage System


My previous blog posting showed how Configman enables Socorro to be highly configurable.  In this next series of postings, I'm going into detail about the crash storage system.

Socorro defines a set of classes for use in storing and fetching crashes from external storage. Rooted in a single base class, the hierarchy of the CrashStorage classes defines the Crash Storage API for the basic save and fetch methods.

The base class for this scheme, CrashStorageBase, lives in the .../socorro/external/crashstorage_base.py file. That file defines these public methods: save_raw_crash, save_processed, save_raw_and_processed, get_raw_crash, get_dump, get_dumps, get_dumps_as_files, get_processed, remove, new_crashes. The classes that derive from this base implement the details for a given storage medium. For example, .../socorro/external/hbase/crashstorage.py defines a class derived from CrashStorageBase called HbaseCrashStorage. Using the Thrift API, that class provides everything needed to save and fetch crashes to and from an instance of Hbase.
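
To give a feel for the shape of that API, here is a stripped-down sketch of a hypothetical derivative (the constructor signature is approximated, and the real classes do considerably more with their Configman configuration):

    from socorro.external.crashstorage_base import CrashStorageBase

    # A toy backend that keeps everything in dictionaries -- a sketch to
    # show the shape of the API, not production code.
    class InMemoryCrashStorage(CrashStorageBase):
        def __init__(self, config, quit_check_callback=None):
            super(InMemoryCrashStorage, self).__init__(config,
                                                       quit_check_callback)
            self._raw = {}         # crash_id -> (raw_crash, dumps)
            self._processed = {}   # crash_id -> processed_crash

        def save_raw_crash(self, raw_crash, dumps, crash_id):
            self._raw[crash_id] = (raw_crash, dumps)

        def save_processed(self, processed_crash):
            # assumes the processed crash carries its crash_id under 'uuid'
            self._processed[processed_crash['uuid']] = processed_crash

        def get_raw_crash(self, crash_id):
            return self._raw[crash_id][0]

        def get_processed(self, crash_id):
            return self._processed[crash_id]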

All of the “configmanized” Socorro back-end applications employ derivatives of the CrashStorage class hierarchy to save and fetch crashes. Each application has at least one “crashstorage_class” configuration parameter that can be set to any of the crash storage implementations. That means that the Socorro backend can be mixed and matched to implement a system tailored for the scale of the operation. A small installation of Socorro could use a file system crash storage implementation as primary storage. A large installation, such as Mozilla's, eschews the filesystem storage in favor of using HBase.

At the end of 2012, Socorro has implementations for Postgres, HBase, ElasticSearch, HTTP_Post, and three flavors of file system storage. It is easy to imagine that a wide variety of crash storage schemes could be implemented with any imaginable underlying store: MySQL, Mongo, a queuing system, pipes, etc.

The implementation classes in the hierarchy are not required to implement the entire API. Our current implementation of crash storage in Postgres does not store raw crashes, so PostgresCrashStorage is silent on the implementation of 'save_raw_crash' and 'get_raw_crash'.

The default behavior in the base class for saving operations is to silently ignore the request. This prevents an intentionally unimplemented method from derailing an aggregate storage operation. See the Aggregate Crash Storage section below for more information. Honestly, I think this ought to be revisited. A better behavior may be to raise a NotImplemented exception and let the Aggregate Crash Storage decide whether it should eat the exception or pass it up the call chain.

The default behavior in the base class for fetching operations is to raise a NotImplemented exception. We want this behavior because fetching doesn't currently participate in Aggregate Crash Storage. If, through configuration, we've specified a crash source that doesn't implement fetching, then we clearly want that error to propagate upward and stop the system.

Aggregate Crash Storage 


 The file .../socorro/external/crashstorage_base.py defines two special aggregating classes. These two classes implement the entire crash storage API and serve as a proxy for a collection of other crash storage classes.

The class PolyCrashStorage holds a collection of crash storage implementations. When called with one of the 'save' methods, it forwards the save action to each of the other storage classes in its collection. This class is used primarily in the processor for saving processed crashes to Postgres and whatever primary storage scheme is in use. In Mozilla's case, primary crash storage is HbaseCrashStorage. In the future, the Processor will also save to ElasticSearch. An obvious future enhancement to PolyCrashStorage is to make it multithread hot, so that it saves to each of its subsidiary stores simultaneously.

The second Aggregate Crash Storage is FallbackCrashStorage. This is a storage scheme that holds two other crash storage instances: primary and secondary. The secondary crash storage system is used only if the first one fails. For example, Mozilla has used this idea for an emergency backup. If the primary storage, HBase for example, is unreachable, then this class will fallback to storing in the secondary CrashStorage instance. The secondary crash storage would most likely be a local filesystem crash storage scheme.

The Aggregate Crash Storage classes are on par with all the specific implementations of crash storage. In other words, any of the aggregate crash storage classes can appear anywhere that a Crash Storage class is needed.  Since the aggregates themselves require subsidiary Crash Storage instances, they can be recursive.  In other words, a PolyCrashStorage instance could have a FallbackCrashStorage instance in its collection, which in turn could hold other PolyCrashStorage instances.
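
The gist of both aggregates, heavily simplified (these are sketches with invented names, not the real classes, which are configured through Configman and do careful error handling and logging):

    # Sketches with invented names, showing only the core idea.
    class PolyStorageSketch(object):
        def __init__(self, stores):
            self.stores = stores
        def save_raw_crash(self, raw_crash, dumps, crash_id):
            for store in self.stores:     # fan each save out to every store
                store.save_raw_crash(raw_crash, dumps, crash_id)

    class FallbackStorageSketch(object):
        def __init__(self, primary, secondary):
            self.primary = primary
            self.secondary = secondary
        def save_raw_crash(self, raw_crash, dumps, crash_id):
            try:
                self.primary.save_raw_crash(raw_crash, dumps, crash_id)
            except Exception:
                # the primary store failed: save to the secondary instead
                self.secondary.save_raw_crash(raw_crash, dumps, crash_id)

Because both sketches expose the same save method as any other store, one can hold the other, which is exactly the recursion described above.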

The Crash Storage API Methods


  • save_raw_crash: Socorro receives crashes from the wild via an HTTP POST to a collector.  The collector takes the form data and creates the raw crash.  A raw crash is data in a mapping that can be serialized easily in JSON form.  With the POST to the collector come one or more binary blobs of crash data called 'dumps'.  The collector splits those out of the form data and does not include them in the JSON raw crash.  The dumps each have names as specified by the field names in the originally submitted form.  The collector takes the names and the dumps and makes a mapping from them.  The save_raw_crash method pushes the raw crash mapping and the dumps mapping to the underlying storage medium.
  • save_processed: The processed crash is created by the Processor.  It consists of transformed copies of values from the raw crash as well as the output of MDSW on each of the binary dumps.  There is no binary data in the processed crash.  This method accepts a processed crash in a JSON compatible Mapping and saves it to the underlying storage mechanism.
  • save_raw_and_processed: this is a convenience function that will save both raw and processed crashes in one call.  It is used in apps that need to move crashes from one storage system to another.
  • get_raw_crash: this call accepts a crash_id and returns a raw crash mapping.  It does not fetch binary dump information.  If the crash is not found, then it raises a not found exception.
  • get_raw_dump: this is the method that will return a single named binary dump for a given crash_id.  If no name is specified, the first dump found will be returned.  If the crash is not found, then it raises a not found exception.
  • get_raw_dumps: this method returns a mapping of all the binary dumps associated with the provided crash_id.  If the crash is not found, then it raises a not found exception.
  • get_raw_dumps_as_files: the binary dumps can be large.  Sometimes it is undesirable to load them all into memory at the same time.  This method returns a mapping of dump names to pathnames in a file system.  The user of this method is then responsible for opening and reading the files.  For crash storage schemes that actually store their files in a file system, this function may actually return just the existing pathname.  For crash storage schemes that use other methods for actual storage, temporary files will be written to house the dumps.  The user of this method ought to look in the pathnames for the string "TEMPORARY".  If that exists, then the user of this method is responsible for cleanup of the temporary files when it is done.  This method is employed by the Processor to get the binary dumps.  MDSW is a separate program that expects its input to be in the form of a file. 
  • get_processed: this method just returns the processed crash in a JSON compatible mapping.
  • remove: this method deletes all traces of both the raw and processed crashes from a crash storage scheme.
  • new_crashes:  this method is a generator.  It yields a series of crash_ids for raw crashes that arrived in the storage scheme since the last time this generator was invoked.  Its results are generally not repeatable.  Once a crash_id has been yielded, it will not be seen again unless that crash is removed from storage and then resaved.
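
Pulling a few of those methods together, a caller that wanted to copy crashes from one store to another might look roughly like this (a usage sketch; 'source' and 'destination' stand for any two crash storage instances):

    # A usage sketch of the API described above; not production code.
    def copy_crash(source, destination, crash_id):
        raw_crash = source.get_raw_crash(crash_id)   # JSON-compatible mapping
        dumps = source.get_raw_dumps(crash_id)       # mapping of dump name -> binary blob
        destination.save_raw_crash(raw_crash, dumps, crash_id)

    def copy_new_crashes(source, destination):
        # new_crashes() yields each crash_id only once
        for crash_id in source.new_crashes():
            copy_crash(source, destination, crash_id)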

My  next blog posting will go into detail about each of the crash storage schemes in Socorro.

Thursday, December 27, 2012

Sad Migration

I've been an ardent promoter of Omnifocus for iOS for several years.  That software changed my life by organizing my personal projects.  The iPad version of Omnifocus is in my short list of most useful pieces of software ever.  If you use iOS, I highly recommend it.

Unfortunately, I'm no longer in the ranks of iOS users.  I'll go into the details why I'm walking away from Apple in a post for another day.  So I'm looking for a replacement for Omnifocus that will run as a Webapp and/or  Android application.

I believe that if the folks behind Omnifocus would port their masterwork to platforms other than iOS/OSX they'd own the market.  Sadly, they don't seem to have the slightest interest in doing so (Omnifocus for Android is very unlikely) and so I must walk away from Omnifocus.

So what is the replacement for Omnifocus?  There just doesn't seem to be one.  I've looked into several options like "Remember the Milk" and "Astrid", but they just don't measure up.  I rely on the project management features that just seem to be missing from these alternative projects.

Do you use personal project management / GTD software on Android / Linux and/or as a WebApp?  Can you recommend it?   



Saturday, December 15, 2012

The Tale of the Unstable Connection and the Transaction Class

The Socorro project at Mozilla is perpetually plagued with Thrift connections to HBase that are unstable.  The combined efforts of Dev, IT, and NetOps have only patched over the problem.  For a while we think we've nailed it, but, like the cat in the children's song, it comes back, frequently on the very next day.

The Socorro code base has been reactionary.  It's been hacked to retry connections to HBase that fail.  Then hacks on top of those hacks retry entire transactions when the lower level retries fail.

In the grand Socorro backend refactoring currently in progress, I've formalized and generalized the transactional retry behavior in a set of classes:
  •  TransactionExecutor
  •  TransactionExecutorWithLimitedBackoff
  •  TransactionExecutorWithInfiniteBackoff
These classes implement methods that accept a function, a connection context to some resource, and arbitrary function parameters. When instantiated and invoked, these classes will call the function, passing it the connection and the additional parameters.  The raising of an exception within the function indicates a failure of the transaction: a rollback is automatically issued on the connection context. If the function succeeds and exits normally, then a 'commit' is issued on the connection context.

The first class in the list above is the degenerate single-shot case. It doesn't implement any retry behavior. If the function fails by raising an exception, then a rollback is issued on the connection and program moves on. Success results in a commit and the program moves on.

The latter two classes implement a retry behavior. If the function raises an exception, the Transaction class checks to see if the exception is of a type that is eligible for retry. If it is eligible, then a delay amount is selected and the thread sleeps. When it wakes, it tries to invoke the function again with the same parameters. The time delays are specified by a list of integers representing successive numbers of seconds to wait before trying again. For the class TransactionExecutorWithLimitedBackoff, when the list of time delays is exhausted, the transaction is abandoned and the program moves on. The TransactionExecutorWithInfiniteBackoff will never give up, reusing the last delay in the list over and over until the transaction finally succeeds or somebody kills the program.
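
The core of that retry loop, reduced to a sketch (this is not the real TransactionExecutorWithLimitedBackoff, which is driven by Configman options and a quit-check callback):

    import time

    # A reduced sketch of the limited-backoff retry idea.
    # connection_context is a factory for context managers that yield a
    # connection; is_retriable decides whether an exception is eligible
    # for another attempt.
    def run_transaction(connection_context, function, backoff_delays,
                        is_retriable, *args, **kwargs):
        delays = list(backoff_delays) + [None]   # None marks the final attempt
        for seconds in delays:
            try:
                with connection_context() as connection:
                    try:
                        result = function(connection, *args, **kwargs)
                    except Exception:
                        connection.rollback()     # failure: roll back the transaction
                        raise
                    connection.commit()           # success: commit and return
                    return result
            except Exception as exc:
                if seconds is None or not is_retriable(exc):
                    raise                         # out of retries, or not eligible
                time.sleep(seconds)               # back off, then try again

The infinite-backoff variant would simply keep reusing the last delay instead of raising.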

Recently, while obsessively watching my latest version of the processors run in our staging environment, I caught sight of an unstable Thrift connection and my transaction object waltzing together. Here's how it went down (my commentary appears on the lines beginning with #).

# we start processing a new crash
2012-12-15 13:03:02,068 INFO - Thread-1 - starting job: 5ec59340-7a0c-4b40-a814-ea7092121215
2012-12-15 13:03:02,074 DEBUG - Thread-1 - about to apply rules
2012-12-15 13:03:02,075 DEBUG - Thread-1 - done applying transform rules
# processing is done, try to save to HBase via Thrift
2012-12-15 13:03:10,065 DEBUG - Thread-1 - make_connection, timeout = 5000
2012-12-15 13:03:10,066 DEBUG - Thread-1 - connection successful
# we got a connection, but...
2012-12-15 13:03:21,366 DEBUG - Thread-1 - retry_wrapper: handled exception, timed out
# it failed when we tried to use it.
# Our older HBase client code automatically retries:
2012-12-15 13:03:21,366 DEBUG - Thread-1 - retry_wrapper: about to retry connection
2012-12-15 13:03:21,366 DEBUG - Thread-1 - make_connection, timeout = 5000
2012-12-15 13:03:21,367 DEBUG - Thread-1 - connection successful
# replacement connection successfully established
2012-12-15 13:03:32,390 DEBUG - Thread-1 - retry_wrapper: handled exception, timed out
# that connection fails, too
# the TransactionExecutorWithLimitedBackoff judges that a timeout is an exception that
# is eligible for retry:
2012-12-15 13:03:32,390 CRITICAL - Thread-1 - transaction error eligible for retry
# it pulls the first delay amount off the list of delays
2012-12-15 13:03:32,391 DEBUG - Thread-1 - retry in 10 seconds
# it now sleeps the alloted time, waking to log every 5 seconds (configurable)
2012-12-15 13:03:32,391 DEBUG - Thread-1 - waiting for retry after failure in transaction: 0sec of 10sec
2012-12-15 13:03:37,397 DEBUG - Thread-1 - waiting for retry after failure in transaction: 5sec of 10sec
# done with waiting 10 seconds
# TransactionExecutorWithLimitedBackoff opens a new connection context
2012-12-15 13:03:42,405 DEBUG - Thread-1 - make_connection, timeout = 5000
2012-12-15 13:03:42,405 DEBUG - Thread-1 - connection successful
# we got a connection, but...
2012-12-15 13:03:51,233 DEBUG - Thread-1 - retry_wrapper: handled exception, timed out
# it failed again.
# the old HBase code does its retry:
2012-12-15 13:04:11,321 DEBUG - Thread-1 - retry_wrapper: about to retry connection
2012-12-15 13:04:11,324 DEBUG - Thread-1 - make_connection, timeout = 5000
2012-12-15 13:04:11,324 DEBUG - Thread-1 - connection successful
# replacement connection successfully established
# but no joy, it's #fail every where we look
2012-12-15 13:04:23,133 DEBUG - Thread-1 - retry_wrapper: handled exception, timed out
2012-12-15 13:04:23,166 CRITICAL - Thread-1 - transaction error eligible for retry
# the transaction class pulls the second delay amount off the list of delays
2012-12-15 13:04:23,193 DEBUG - Thread-1 - retry in 30 seconds
# it now sleeps the alloted time
2012-12-15 13:04:23,289 DEBUG - Thread-1 - waiting for retry after failure in transaction: 0sec of 30sec
2012-12-15 13:04:28,485 DEBUG - Thread-1 - waiting for retry after failure in transaction: 5sec of 30sec
2012-12-15 13:04:33,653 DEBUG - Thread-1 - waiting for retry after failure in transaction: 10sec of 30sec
2012-12-15 13:04:38,155 DEBUG - Thread-1 - waiting for retry after failure in transaction: 15sec of 30sec
2012-12-15 13:04:43,774 DEBUG - Thread-1 - waiting for retry after failure in transaction: 20sec of 30sec
2012-12-15 13:04:48,125 DEBUG - Thread-1 - waiting for retry after failure in transaction: 25sec of 30sec
# done with waiting for 30 seconds
# TransactionExecutorWithLimitedBackoff optimistically tries again
2012-12-15 13:04:53,405 DEBUG - Thread-1 - make_connection, timeout = 5000
2012-12-15 13:04:53,422 DEBUG - Thread-1 - connection successful
# woot!
# the function succeeds exactly where it left off with failure before the timeout
2012-12-15 13:04:59,617 INFO - Thread-1 - succeeded and committed: 5ec59340-7a0c-4b40-a814-ea7092121215
# this worker thread is eligible to move on to processing a new crash


This set of transaction classes works for any resource for which there is a connection that can be wrapped in a context. We use the same transaction objects for both Postgres and HBase connections.

Wait a minute, HBase doesn't support transactions or implement commit and rollback. Even though HBase doesn't support a transactional interface, the idea of retrying a failed set of actions is valid. The connection context for HBase just ignores calls to commit and rollback. 
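
In sketch form, that just means giving the same interface a pair of do-nothing methods (an illustration, not the real connection context class):

    # Sketch only: a wrapper whose commit/rollback are no-ops, so the
    # same transaction executor code can drive an HBase connection.
    class NonTransactionalConnection(object):
        def __init__(self, real_connection):
            self._real_connection = real_connection

        def __getattr__(self, name):
            # delegate everything else to the underlying Thrift client
            return getattr(self._real_connection, name)

        def commit(self):
            pass     # HBase has no transactions; nothing to commit

        def rollback(self):
            pass     # likewise, nothing to roll back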

This transactional behavior along with the compound storage classes with automatic fallback allow Socorro to keep working even when its backend data stores are not.

Socorro Modular Design

Nearly all the back-end applications in Socorro follow the same basic form. Data flows from a source stream into a transformative application and is then streamed out to some destination. For most applications, the source stream consists of crashes and their dumps. A good example of a source stream is the inflow of crashes from deployed applications. The destinations are frequently long term storage systems. HBase, Postgres, and file system storage schemes are examples of destinations. The transformation step can be complicated, like the processors applying minidump-stackwalk and exploitability analysis. Transformation can also be the degenerate case (no change at all). The crash mover uses the Null Transform when it moves crashes from one location to another.

Socorro's default implementation of the streaming data flow uses a threaded fetch-transform-save producer consumer system. A typical back-end application consists of a couple managerial threads and a flock of worker threads. One of the managerial threads, called the queuing thread, reads a stream of information telling it what crashes it must act on. It places a reference to the crash and a transformative function into a queue. The workers grab from the queue using the crash reference as a key to request the crash from the source. Once the transformation function has acted on the crash, the worker pushes the crash to the destination.
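
In outline, each worker's loop amounts to something like this (names are illustrative; the real implementation adds threading, error handling, and quit checks):

    # An outline of the fetch-transform-save worker loop; illustrative only.
    def worker_loop(task_queue, source, destination):
        while True:
            crash_id, transform = task_queue.get()   # placed there by the queuing thread
            if crash_id is None:                      # sentinel: time to shut down
                break
            raw_crash = source.get_raw_crash(crash_id)
            dumps = source.get_raw_dumps(crash_id)
            processed = transform(raw_crash, dumps)   # may be the null transform
            destination.save_raw_and_processed(raw_crash, dumps,
                                               processed, crash_id)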

The sources, destinations and transformations are all modular and are loaded at run time during configuration. Creating new sources, destinations and transformations just requires implementing a handful of API calls. The 2013 version of Socorro implements file system, HBase, and Postgres sources and destinations.  Aggregate destinations can save crashes to multiple places or fall back to secondary storage systems in case of failures.

The Sources

  • FileSystem
  • HBase

The Destinations

  • FileSystem
  • HBase
  • Postgres
  • URL POST
  • Elastic Search
  • Aggregate Destinations:
      • Poly Storage
      • Fallback Storage

The Transformations

  • Legacy Processor
  • Null Transformation

Configuring an App From Ground Zero


Hopefully, Socorro will have sensible defaults for all configuration values so it will just work right out of the box. Here's how to set up an app customized for your installation. For this example, the crash mover will be the target app.
If you want to start at ground zero with no prebuilt configuration files, do this first:
    $ cd $SOCORRO_HOME
    
    $ # stash any original config files
    $ mv config config.original

    $ mkdir config
    $ # create a new empty config file
    $ touch config/crashmover.ini

    $ # tell the crash mover to write its own sample ini file
    $ ./socorro/collector/crashmover_app.py --admin.dump_conf=config/crashmover.ini 
          --source.crashstorage_class='' --destination.crashstorage_class=''

If you look at the configuration file at this point, you see something like this:

    # name: application
    # doc: the fully qualified module or class of the application
    # converter: configman.converters.class_converter
    # application='CrashMoverApp'
    
    [destination]

        # name: crashstorage_class
        # doc: the destination storage class
        # converter: configman.converters.class_converter
        crashstorage_class=''

    [logging]

        # section omitted for brevity

    [producer_consumer]

        # section omitted for brevity

    [source]

        # name: crashstorage_class
        # doc: the source storage class
        # converter: configman.converters.class_converter
        crashstorage_class=''


Building an ini file is an iterative process. We're going to invoke the crashmover_app several times, adding configuration values with each iteration and having the crashmover_app write out its own configuration at each step. First we're going to set up the [source] section.

When the crashmover_app is invoked, it first reads any existing configuration file that it can find in the default configuration directory at $SOCORRO_HOME/config. Anything that it finds in the configuration file will override the defaults built into the application. After reading the configuration file, the app searches for any environment variables that match the names of the app's configuration options. The app brings those values in and they supersede both the defaults and the configuration file. Finally, the app brings in the command line arguments; these values supersede any values found using any of the previous means.

We're going to tell the crashmover_app that we want the source to be a file system location. This means that the crashmover_app will look at a file system location to find crashes to send to the configured destination. The class that implements a file system storage scheme is 'socorro.external.filesystem.crashstorage.FileSystemRawCrashStorage'. We'll specify that on the command line and then get the crashmover_app to write out another ini file:

    $ ./socorro/collector/crashmover_app.py 
        --source.crashstorage_class=socorro.external.filesystem.crashstorage
                                     .FileSystemRawCrashStorage 
        --admin.dump_conf=config/crashmover.ini
 
 Now looking at the ini file, we can see that the filesystem class was loaded and it added a bunch of new configuration requirements to the ini file:
 
    # name: application
    # doc: the fully qualified module or class of the application
    # converter: configman.converters.class_converter
    # application='CrashMoverApp'
 
    [destination]
 
        # name: crashstorage_class
        # doc: the destination storage class
        # converter: configman.converters.class_converter
        crashstorage_class=''

    [logging]

        # section omitted for brevity

    [producer_consumer]

        # section omitted for brevity

    [source]

    # name: crashstorage_class
    # doc: the source storage class
    # converter: configman.converters.class_converter
    crashstorage_class='socorro.external.filesystem.crashstorage'
                       '.FileSystemRawCrashStorage'

    # name: dir_permissions
    # doc: a number used for permissions for directories in the local file system
    # converter: int
    dir_permissions='504'

    # name: dump_dir_count
    # doc: the number of dumps to be stored in a single directory in the local file system
    # converter: int
    dump_dir_count='1024'

    # name: dump_file_suffix
    # doc: the suffix used to identify a dump file
    # converter: str
    dump_file_suffix='.dump'

    # name: dump_gid
    # doc: the group ID for saved crashes in local file system (optional)
    # converter: str
    dump_gid=''

    # name: dump_permissions
    # doc: a number used for permissions crash dump files in the local file system
    # converter: int
    dump_permissions='432'

    # name: json_file_suffix
    # doc: the suffix used to identify a json file
    # converter: str
    json_file_suffix='.json'

    # name: std_fs_root
    # doc: a path to a local file system
    # converter: str
    std_fs_root='/home/socorro/primaryCrashStore'
 
In most cases, the defaults will be acceptable. The most common one that might need changing would be the last one, 'std_fs_root'. There are several ways that you can change that value. You could set an environment variable 'source.std_fs_root'. You could go in and directly edit the value in the config file. Or you could get the crashmover_app to rewrite the ini file, specifying the new value on the command line:

    $ ./socorro/collector/crashmover_app.py 
         --source.std_fs_root='/home/lars/my_crash_source'
         --admin.dump_conf=config/crashmover.ini

You can verify that the change was made either by inspecting the config file or by invoking the crashmover_app yet again, this time specifying --help on the command line. The output of help always reflects the values loaded by the stack of configuration value sources (app defaults, config file, environment, command line).

    $ ./socorro/collector/crashmover_app.py --help
    Application: crashmover 2.0
    this app will move crashes from one storage location to another

    Options:
       --admin.conf
       the pathname of the config file (path/filename)
       (default: ./config/crashmover.ini)

       --admin.dump_conf
       a pathname to which to write the current config
       (default: )

      --admin.print_conf

      ... other options omitted for brevity

      --source.std_fs_root
      a path to a local file system
      (default: /home/lars/my_crash_source)

Now we can use the same process to setup the destination storage. This time, our destination will be HBase.
  
    $ ./socorro/collector/crashmover_app.py 
         --destination.crashstorage_class=socorro.external.hbase
                                          .crashstorage.HBaseCrashStorage
         --admin.dump_conf=config/crashmover.ini

This brings in all the configuration dependencies of HBase. We can use --help to see what they are or examine the ini file directly.

    [destination]

        # name: crashstorage_class
        # doc: the destination storage class
        # converter: configman.converters.class_converter
        crashstorage_class='socorro.external.hbase.crashstorage.HBaseCrashStorage'

        # name: dump_file_suffix
        # doc: the suffix used to identify a dump file (for use in temp files)
        # converter: str
        dump_file_suffix='.dump'

        # name: forbidden_keys
        # doc: a comma delimited list of keys banned from the processed crash in HBase
        # converter: socorro.external.hbase.connection_context.<lambda>
        forbidden_keys='email, url, user_id, exploitability'

        # name: hbase_connection_pool_class
        # doc: the class responsible for pooling and giving out HBaseconnections
        # converter: configman.converters.class_converter
        hbase_connection_pool_class=
             'socorro.external.hbase.connection_context.HBaseConnectionContextPooled'

        # name: hbase_host
        # doc: Host to HBase server
        # converter: str
        hbase_host='localhost'

        # name: hbase_port
        # doc: Port to HBase server
        # converter: int
        hbase_port='9090'

        # name: hbase_timeout
        # doc: timeout in milliseconds for an HBase connection
        # converter: int
        hbase_timeout='5000'

        # name: number_of_retries
        # doc: Max. number of retries when fetching from hbaseClient
        # converter: int
        number_of_retries='0'

        # name: temporary_file_system_storage_path
        # doc: a local filesystem path where dumps temporarily during processing
        # converter: str
        temporary_file_system_storage_path='/home/socorro/temp'

        # name: transaction_executor_class
        # doc: a class that will execute transactions
        # converter: configman.converters.class_converter
        transaction_executor_class=
            'socorro.database.transaction_executor.TransactionExecutor'

Again, we can set the proper values by directly editing the config file or continuing this iterative process.  At this point the application is fully configured and ready to run.  Just invoke the app with no command line options.

 Other Class Options For Crash Storage 


When designating HBase or Postgres as destinations, a few more dependencies were brought in that control connection and transactional behaviors. In the case above, the destination.transaction_executor_class is a class that will tell the system how to react to failures.

The default class 'socorro.database.transaction_executor.TransactionExecutor' imbues the HBase storage with the behavior of “give up instantly”. If there is a connection problem with HBase, any exception raised will be immediately passed on to the crashmover_app, which will pass it on to the thread manager, which will log the problem and move on. That's not ideal behavior. It would be nicer if, when the connection to HBase failed, the app fell back to retrying the transaction.

There are three alternatives:
  • TransactionExecutor
  • TransactionExecutorWithLimitedBackoff
  • TransactionExecutorWithInfiniteBackoff. 

The latter two offer retry behaviors with progressively longer delays between retries (completely configurable). The last one will never give up, retrying forever. To use these alternatives, just specify the class in the configuration file.  The latter two classes will bring in a couple more configuration options that specify wait times and logging intervals.

For some more discussion on these classes, see The Tale of the Unstable Connection and the Transaction Class.

Crashstorage Collections


There are two other crashstorage classes that act as duplicating containers for other crash storage classes: PolyCrashStorage and FallbackCrashStorage.  We can use PolyCrashStorage as a destination storage that will save crashes to two locations at the same time:
 
    $ ./socorro/collector/crashmover_app.py 
         --destination.crashstorage_class=
           socorro.external.crashstorage_base.PolyCrashStorage
         --admin.dump_conf=config/crashmover.ini

This will get us an ini file with a destination section that looks like this:

    [destination]
    
        # name: crashstorage_class
        # doc: the destination storage class
        # converter: configman.converters.class_converter
        crashstorage_class='socorro.external.crashstorage_base.PolyCrashStorage'
    
        # name: storage_classes
        # doc: a comma delimited list of storage classes
        # converter: configman.converters.class_list_converter
        storage_classes=''
    
        [[storage0]]
    
            # name: crashstorage_class
            # doc: None
            # converter: configman.converters.class_converter
            crashstorage_class=''

We can get the details filled in for us with this:

    $ ./socorro/collector/crashmover_app.py 
        --destination.storage_classes=
             'socorro.external.hbase.crashstorage.HBaseCrashStorage,
              socorro.external.filesystem.crashstorage.FileSystemRawCrashStorage'
     
After this, both crash storage systems in the destination section will receive exact copies of every crash. The ini file will look like the example below. Edit the values to whatever is appropriate for your local system.

   [destination]
    
        # name: crashstorage_class
        # doc: the destination storage class
        # converter: configman.converters.class_converter
        crashstorage_class='socorro.external.crashstorage_base.PolyCrashStorage'

        # name: storage_classes
        # doc: a comma delimited list of storage classes
        # converter: configman.converters.class_list_converter
        storage_classes='socorro.external.filesystem.crashstorage'
                       '.FileSystemRawCrashStorage,'  
                       'socorro.external.hbase.crashstorage.HBaseCrashStorage'

        [[storage0]]

            # name: crashstorage_class
            # doc: None
            # converter: configman.converters.class_converter
            crashstorage_class='socorro.external.hbase.crashstorage'
                               '.HBaseCrashStorage'

            # name: dump_file_suffix
            # doc: the suffix used to identify a dump file (for use in temp files)
            # converter: str
            dump_file_suffix='.dump'

            # name: forbidden_keys
            # doc: a comma delimited list of keys banned from the processed 
            #      crash in HBase
            # converter: socorro.external.hbase.connection_context.
            forbidden_keys='email, url, user_id, exploitability'

            # name: hbase_connection_pool_class
            # doc: the class responsible for pooling and giving out
            #      HBaseconnections
            # converter: configman.converters.class_converter
            hbase_connection_pool_class='socorro.external.hbase.connection_context'
                                        '.HBaseConnectionContextPooled'

            # name: hbase_host
            # doc: Host to HBase server
            # converter: str
            hbase_host='localhost'

            # name: hbase_port
            # doc: Port to HBase server
            # converter: int
            hbase_port='9090'

            # name: hbase_timeout
            # doc: timeout in milliseconds for an HBase connection
            # converter: int
            hbase_timeout='5000'

            # name: number_of_retries
            # doc: Max. number of retries when fetching from hbaseClient
            # converter: int
            number_of_retries='0'

            # name: temporary_file_system_storage_path
            # doc: a local filesystem path where dumps temporarily 
            #      during processing
            # converter: str
            temporary_file_system_storage_path='/home/socorro/temp'

            # name: transaction_executor_class
            # doc: a class that will execute transactions
            # converter: configman.converters.class_converter
            transaction_executor_class='socorro.database.transaction_executor'
                                       '.TransactionExecutor'

        [[storage1]]

            # name: crashstorage_class
            # doc: None
            # converter: configman.converters.class_converter
            crashstorage_class='socorro.external.filesystem.crashstorage'
                               '.FileSystemRawCrashStorage'

            # name: dir_permissions
            # doc: a number used for permissions for directories in the local 
            #      file system
            # converter: int
            dir_permissions='504'

            # name: dump_dir_count
            # doc: the number of dumps to be stored in a single directory in 
            #      the local file system
            # converter: int
            dump_dir_count='1024'

            # name: dump_file_suffix
            # doc: the suffix used to identify a dump file
            # converter: str
            dump_file_suffix='.dump'

            # name: dump_gid
            # doc: the group ID for saved crashes in local file system (optional)
            # converter: str
            dump_gid=''

            # name: dump_permissions
            # doc: a number used for permissions crash dump files in the local 
            #      file system
            # converter: int
            dump_permissions='432'

            # name: json_file_suffix
            # doc: the suffix used to identify a json file
            # converter: str
            json_file_suffix='.json'

            # name: std_fs_root
            # doc: a path to a local file system
            # converter: str
            std_fs_root='/home/socorro/primaryCrashStore'

The 'FallbackCrashStorage' class is similar, except that it only saves to the second crash store if saving to the first one fails.  In conjunction with the TransactionExecutor's retry behavior, a fault tolerant application can try, for example, to save several times to HBase.  If that ultimately fails, the 'FallbackCrashStorage' will stash the crash in a file system storage for later recovery.

Conclusion

All the Socorro back end applications, the Collector, the Crash Mover, the Submitter, the Monitor, the Processor, the Middleware, the Crontabber and many of the individual cron applications use this same system for modular storage.  This flexibility allows Socorro to scale from tiny installations receiving a handful  of crashes per day to huge installations handling millions of crashes per day.