Monday, December 31, 2012

Socorro File System Storage

In Socorro's first implementation in 2007, the collectors saved crashes to an NFS-mounted file system.  Each time a new crash was saved on NFS, a daemon of some sort (I don't recall the details) would wake and notify the processors that a new crash had arrived.  Each of the processors would then scramble to process the same crash.  Of course they'd collide, using the existence of a record in Postgres, and the subsequent failure of a transaction, as a contention resolution mechanism.  This system collapsed spectacularly under load for several reasons:
  1. So many crashes would come in rapid succession that the daemon's notifications to the processors quickly saturated the network to the point of paralysis.
  2. Once there are more than about a thousand files in any given directory in NFS, many NFS operations start to thrash.  Just getting a listing of such a directory could take several minutes.
  3. The processors were using the database as a locking mechanism.  Multiple processors would start processing the same crash with only one eventually winning and getting to save its results.  The processors quickly became saturated with work that was destined to be thrown away.
Why not just push crashes from the collectors directly into the relational database?  The Socorro requirements state that the collectors must not rely on a service that could go down.  The highest priority is to never lose crashes due to an outage of some sort.  So why NFS?  Isn't that a service that could go down?  Well, yes it is, but at least it is ostensibly more stable than a relational database.

Fixing this broken crash storage / queuing system was my first task when I agreed to take on Socorro as my responsibility.  I quickly worked to replace the existing flat NFS storage system with a new scheme.  The monitor application was created as an arbiter to avoid contention between the processors.  (Credit where credit is due: Frank Griswold was key in assisting with the design and implementation of the File System storage.  It wouldn't have happened so quickly without his knowledge and hard work.)

Still rooted in NFS, this new storage scheme used the file system as the implementation of an old-fashioned hierarchical database.  The scheme uses radix directories and symbolic links as the underlying storage mechanism.  The crash_id, an adulterated UUID, is the index used to locate crashes in the storage system.

The file system storage consists of one file system directory branch for each day of crash storage.  The names of the day directories are of the form YYMMDD.  Given a crash_id, how can we tell which day directory it belongs to?  I lopped off the last six digits of the original UUID and encoded the date in that same format: YYMMDD.  Now, given any crash_id, we know instantly which day directory it can be found in.
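The date lookup described above can be sketched in a few lines of Python (this is an illustrative sketch, not the actual Socorro code, and the crash_id below is a made-up example with the date 2009-05-14 encoded in its tail):

```python
# Sketch: the last six characters of a crash_id encode YYMMDD,
# which is exactly the name of its day directory.
def day_directory(crash_id):
    return crash_id[-6:]

# a hypothetical crash_id ending in the encoded date 090514:
day_directory("38a4f01e-c26a-4bd5-a316-abcde2090514")  # -> "090514"
```

No parsing or database lookup is needed: the crash_id alone tells you where to start looking.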

We knew that we were going to have to access crashes by both date and name, so the next level down from the day directory is a bifurcation: literally the directories “name” and “date” (though, more accurately, the latter ought to have been called “time”).  The “name” branch serves both as storage for the files that comprise a crash and as the indexing mechanism for name lookups.  The “date” branch is an indexing mechanism that allows Socorro to roughly detect the order in which crashes arrived.

The “name” Branch

I employed a radix scheme taking the hex digits in pairs from the start of the crash_id.  For example, a crash_id that looks like “38A4F01E...090514” would be stored in a directory structure like this:

    090514/name/38/A4/
That limits the number of entries within the “name” directory to 16^2 or 256 entries.  We feared we could still end up with too many files in the leaf nodes, so we stretched the scheme out to four pairs of hex digits:

    090514/name/38/A4/F0/1E/
We figured that the leaf nodes would be sufficiently diluted that we wouldn't have the problem of too many files in the leaf directories.  At the time, we didn't know how many crashes per day we were going to get, and we wanted to be as flexible as possible.  So we took another digit from the UUID and encoded it to reflect how deep the hierarchy should go: the seventh digit from the right end signifies how deep the radix naming scheme goes.  In practice, we found that two levels were sufficient to spread the crashes out over all the directories.
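Putting the pieces together, building a name-branch path might look something like this (again a sketch under the assumptions stated in the text: the seventh character from the right is the depth digit, the last six are YYMMDD, and the crash_id used here is hypothetical):

```python
# Sketch: build the "name" branch path for a crash_id whose
# 7th-from-the-right character encodes the radix depth.
def name_branch_path(crash_id):
    depth = int(crash_id[-7])  # how many hex-digit pairs to use
    pairs = [crash_id[i:i + 2] for i in range(0, depth * 2, 2)]
    return "/".join([crash_id[-6:], "name"] + pairs)

# depth digit '2', date 090514:
name_branch_path("38a4f01e-c26a-4bd5-a316-abcde2090514")
# -> "090514/name/38/a4"
```

With a depth digit of 4, the same function would descend four radix levels instead of two.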

The “date” Branch

The date branch implements the indexing scheme so that we know roughly the order in which the crashes arrived.  The next level below the literal “date” directory is the hour of the day in which the crashes arrived, in the form HH.  Below that is a series of buckets (referred to as slots in the code).  The buckets are named either by the arrival minute or, if there are multiple collectors working in the same file system, by the host name of the collector.  The buckets have an additional sequence number that's incremented if the number of entries in a given bucket crosses a configurable threshold.  A crash that came in at 5 minutes after noon on 2009-05-14 would have an entry like this:

    090514/date/12/05/38A4F01E...090514
So what is the file in that directory?  It's a symbolic link over to the directory in the name branch that holds the actual crash.  When the symbolic link is created, a back link is also created, so that leaps can be made back and forth between the two branches.
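The paired-link arrangement might be sketched like this (an illustrative sketch, not the real Socorro code: the `link_branches` helper and the “.back” suffix for the back link are my own hypothetical naming):

```python
import os

def link_branches(day_root, name_dir, date_bucket, crash_id):
    """Create the date->name symlink and the name->date back link."""
    name_path = os.path.join(day_root, "name", name_dir)
    bucket_path = os.path.join(day_root, "date", date_bucket)
    os.makedirs(name_path, exist_ok=True)
    os.makedirs(bucket_path, exist_ok=True)
    # the link's *name* is the crash_id; its target is the name directory
    os.symlink(name_path, os.path.join(bucket_path, crash_id))
    # the back link lets code jump from the name branch to the date branch
    os.symlink(bucket_path, os.path.join(name_path, crash_id + ".back"))
```

Either branch can now reach the other in a single `readlink` call.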

[Figure: the decomposition of a “crash_id” and its relationship to the File System Storage directory structure.  (No, those aren't inductors in the file system.)]
The monitor application was originally structured to read new crashes from the File System Storage and then divvy them out to the processors.  We wanted to process crashes in roughly the same order in which they arrived.  The monitor would walk the date branch looking for symbolic links.  On finding one, it would get the crash_id value (the adulterated UUID) from the name of the link itself, and insert it into the processing queue in the database. Sometime in the future, a processor would get the crash_id and, by its very form, know how to look it up in the name branch of the directory structure.

The monitor wants to see a new crash only once.  So here's where it gets weird.  Once the monitor found a symbolic link and queued the crash for processing, it would destroy the symbolic link as well as the corresponding back link.  This ensures that the next time the monitor walks the tree looking for new crashes, it won't even see crashes that it has seen before.  The “date” branch index is read-once: after it has been read, it'll never be read again.
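A read-once sweep of the date branch could be sketched as follows (a sketch under assumptions, not the monitor's actual code: `sweep_date_branch`, the queueing callback, and the “.back” naming for the back link are hypothetical):

```python
import os

def sweep_date_branch(day_root, queue_for_processing):
    """Walk the date branch once; queue and unlink every crash found."""
    date_root = os.path.join(day_root, "date")
    for dirpath, dirnames, _ in os.walk(date_root):
        for entry in list(dirnames):
            link = os.path.join(dirpath, entry)
            if not os.path.islink(link):
                continue  # an hour or minute bucket, not a crash
            queue_for_processing(entry)  # the link's name is the crash_id
            # remove the back link, then the link itself, so the next
            # sweep never sees this crash again
            back = os.path.join(os.readlink(link), entry + ".back")
            if os.path.islink(back):
                os.remove(back)
            os.remove(link)
```

After one pass, the date branch is empty of links and the queue holds every newly arrived crash_id.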

This seems too complicated.  Why bother with symbolic links when an empty file with just the crash_id as a name would do?  A little empirical study revealed that it is significantly faster to read a symbolic link than it is to read a file.  The symbolic link information is actually written in the underlying file system inode, so creating and deleting symbolic links is much faster than creating and deleting a real file.

What is the purpose of the backlink?  When is it used?  Sometimes the system would get backlogged and the processing of any given crash could be minutes or hours away.  I created a system called “priority processing” that could, when requested, bump a crash up in the queue for immediate processing.

Priority processing is implemented by the monitor.  It finds crash_ids for priority processing in a table in the database.  For each crash_id, it goes to the file system and looks it up on the “name” branch.

If it sees there is no symbolic link from the “name” branch to the “date” branch, the monitor knows that it has already seen that crash_id.  It goes to the crash processing queue in the database and moves the crash to the head of the queue.

If it does find a symbolic link, it uses that link to jump over to the “date” branch, where it deletes the link back to the “name” branch.  Then it deletes the link going the opposite way.  This ensures that the regular queuing mechanism of the monitor will not see the crash and queue it a second time for processing.
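The two priority-processing cases might be sketched like this (again a hypothetical sketch consistent with the description above, not the monitor's real code; `priority_check`, the `move_to_front` callback, and the “.back” naming are my own inventions):

```python
import os

def priority_check(day_root, name_dir, crash_id, move_to_front):
    """Bump a crash for immediate processing, per the two cases above."""
    back = os.path.join(day_root, "name", name_dir, crash_id + ".back")
    if not os.path.islink(back):
        # no back link: the monitor has already seen and queued this
        # crash, so just move it to the head of the queue
        move_to_front(crash_id)
        return
    # links still present: remove both so the regular sweep won't
    # queue this crash a second time
    date_link = os.path.join(os.readlink(back), crash_id)
    if os.path.islink(date_link):
        os.remove(date_link)
    os.remove(back)
```

The existence of the back link is the whole test: it distinguishes "not yet seen" from "already queued" without touching the database.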

Wow, that's a lot of moving parts.  Even though this system is old, crusty, and rather complicated, it remains one of the most stable storage systems that Socorro has.  File systems don't fail very often, and this file system code rarely fails either.  It is still in use as the primary storage for collectors and as an emergency backup when our new primary storage, HBase, fails or is unstable.

This file system storage is destined for refactoring/reimplementation in Q1 2013.  It was built before we instituted coding standards.  More importantly, we need it to have actual useful unit tests.  In 2010, the Mozilla project graduated away from using this scheme as the primary data storage in favor of HBase.  However, we're still using this scheme for temporary storage for the collectors.  They write all their crashes to a local disk using this system.  A separate process spools them off to HBase asynchronously.   I'll write about the HBase role in Socorro crash storage two blog posts from now.  

In my next blog posting, I'll show how variants of the File System Crash Storage comprise “standard storage”, “deferred storage” and “processed storage”.