Saturday, January 17, 2015

The Smoothest Migration

I must say that it was the smoothest migration that I have ever witnessed. The Socorro system data has left our data center and taken up residence at Amazon.

Since 2010, HBase had been our primary storage for Firefox crash data.  Spread across something like 70 machines, it maintained a constant cache of at least six months of crash data.  It was never a pain-free system.  Thrift, the system through which Socorro communicated with HBase, seemed to develop a dislike for us from the beginning.  We fought it and it fought back.

Through the adversity that embodied our relationship with Thrift/HBase, Socorro evolved fault tolerance and self-healing.  All connections to external resources in Socorro are wrapped with our TransactionExecutor.  It's a class that recognizes certain types of failures and executes a backing-off retry when a connection fails.  It's quite generic: it wraps our connections to HBase, PostgreSQL, RabbitMQ, ElasticSearch and now AmazonEC2.  It ensures that if an external resource fails with a temporary problem, Socorro doesn't fail, too.
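The essence of that pattern is small enough to sketch in Python. This is an illustration of the idea, not Socorro's actual code; the class names, exception type, and delay schedule here are invented:

```python
import time


class TransientError(Exception):
    """a failure from which the external resource may recover"""


class TransactionExecutor:
    """Run an operation against a connection, retrying with a
    backing-off delay when a transient failure is recognized.
    (an illustrative sketch, not Socorro's actual implementation)"""

    def __init__(self, connection_factory, backoff_delays=(10, 30, 60, 120)):
        self.connection_factory = connection_factory
        self.backoff_delays = backoff_delays

    def __call__(self, operation):
        # one attempt per delay, plus a final attempt with no retry left
        for delay in self.backoff_delays + (None,):
            connection = None
            try:
                connection = self.connection_factory()
                result = operation(connection)
                connection.commit()
                return result
            except TransientError:
                if connection is not None:
                    connection.rollback()
                if delay is None:
                    raise  # retries exhausted; let the failure propagate
                time.sleep(delay)
```

The point of the wrapper is that the calling code never sees the transient failures at all; it only sees a final success or a final, genuine failure.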

Periodically, HBase would become unavailable. The Socorro system, detecting the problem, would back down, biding its time while waiting for the failed resource to recover.  Eventually, after probing the failed resource, Socorro would detect recovery and pick up where it left off.

Over the years, we realized that one of the major features that originally attracted us to HBase was not giving us the payoff we had hoped for.  We just weren't using the MapReduce capabilities, and the cost of maintaining HBase was no longer worth it.

Thus came the decision that we were to migrate away.  Initially, we considered moving to Ceph and began a Ceph implementation of what we call our CrashStorage API.

Every external resource in Socorro lives encapsulated in a class that implements the CrashStorage API.  Using the Python package Configman, crash storage classes can be loaded at run time, giving us a plugin interface.  Ceph turned out to be a bust when the winds of change directed us to move to AmazonS3. Because we had implemented the Ceph CrashStorage class using the Boto library, we were able to reuse the code for Amazon S3.
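The shape of such an API is easy to sketch. The method names below are illustrative stand-ins, not Socorro's exact interface:

```python
class CrashStorageBase:
    """the interface every storage backend implements
    (method names are illustrative, not Socorro's exact API)"""

    def save_raw_crash(self, raw_crash, dumps, crash_id):
        raise NotImplementedError

    def get_raw_crash(self, crash_id):
        raise NotImplementedError


class InMemoryCrashStorage(CrashStorageBase):
    """a trivial backend; the real ones wrap HBase, Boto/S3, etc."""

    def __init__(self):
        self._store = {}

    def save_raw_crash(self, raw_crash, dumps, crash_id):
        self._store[crash_id] = (raw_crash, dumps)

    def get_raw_crash(self, crash_id):
        return self._store[crash_id][0]
```

Because every backend presents the same interface, a class named in a config file can be loaded and dropped in at run time without the rest of the system caring which one it got.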

Then began the migration.  Rather than just flipping a switch, we migrated gradually.  We started 2014 with HBase as primary storage.

Then, in December, we started running HBase and AmazonS3 together.  We added the new AmazonS3 CrashStorage classes to the Configman-managed Socorro INI files.  While we likely restarted the Socorro services, we could have just sent a SIGHUP, prompting them to reread their config files, load the new CrashStorage modules and continue running as if nothing had happened.

After most of a month, and completing a migration of old data from HBase to Amazon, we were ready to cut HBase loose.

I was amused by the non-event of the severing of Thrift from Socorro.  Again, it was a matter of editing HBase out of the configuration and sending a SIGHUP; Thrift fell silent and Socorro didn't care.  Announced several hours later on the Socorro mailing list, it seemed more like a footnote than an announcement: "oh, by the way, HBase is gone".

Oh, the migration wasn't completely perfect; there were some glitches.  Most of those came from minor cron jobs that were used for special purposes and had been inadvertently neglected.

The primary datastore migration is not the end of the road.  We still have to move the server processes themselves to Amazon's systems.  Because everything is captured in the Socorro configuration, however, we do not anticipate that this will be an onerous process.

I am quite proud of the success of Socorro's modular design.  I think we programmers only ever really just shuffle complexity around from one place to another.  In my design of Socorro's crash storage system, I have swung a pendulum far to one side, moving the complexity into the configuration.  That has disadvantages.  However, in a system that has to rapidly evolve to changing demands and changing environments, we've just demonstrated a spectacular success.

Credit where credit is due: Rob Helmer spearheaded this migration as the DevOps lead. He pressed the buttons and reworked the configuration files.  Credit also goes to Selena Deckelmann, who led the way to Boto for Ceph that gave us Boto for Amazon.  Her contribution in writing the Boto CrashStorage class was invaluable.  Me?  While I wrote most of the Boto CrashStorage class and I'm responsible for the overall design, I was able to mainly just be a witness to this migration.  Kind of like watching my children earn great success, I'm proud of the Socorro team and look forward to the next evolutionary steps for Socorro.

Thursday, November 13, 2014

the World's One Door

Last evening, just before I retired for the night, a coworker, Edna Piranha (not her real name), tweeted something that intrigued me:

the WORLD… their WORLD… the WORLD.. world world, world? world! world. wow that word looks funny. world.
Suddenly my brain shifted into what I can only call poetry mode.  Words and phrases similar to the word, "world" began surfacing and swimming around in my mind. After about twenty minutes, I replied to her tweet with:
 @ednapiranha may your weird wonder ward our wired world for we were old and wandered and whirled from the word of the one door.
I immediately went to bed and began a night woven with those words.  They were in my dreams.  I'd wake up with chants in my head, "world - we're old" and "wonder - one door". It haunted me all night long and now continues into the next day.

Ten years ago, a dear friend, since deceased, came up with a new word, ospid, that perfectly describes what I was experiencing.
n, an object which, for a brief period after its creation, intensely fascinates its creator.  Once the fascination is over, the object is no longer an ospid.
I look forward to the moment, hopefully later today, when this is no longer an ospid. 

Wednesday, October 29, 2014

Judge the Project, Not the Contributors

I recently read a blog posting titled, The 8 Essential Traits of a Great Open Source Contributor. I am disturbed by this posting. While clearly not the intended effect, I feel the posting just told a huge swath of people that they are neither qualified nor welcome to contribute to Open Source. The intent of the posting was to say that there is a wide range of skills needed in Open Source: even if a potential contributor feels they lack an essential technical skill, here is an enumeration of other skills that are helpful.
Over the years, I’ve talked to many people who have wanted to contribute to open source projects, but think that they don’t have what it takes to make a contribution. If you’re in that situation, I hope this post helps you get out of that mindset and start contributing to the projects that matter to you.
See? The author has completely good intentions. My fear is that the posting has the opposite effect. It raises a bar as if it is an ad for a paid technical position. He uses superlatives that say to me, “we are looking for the top people as contributors, not common people”.

Unfortunately, my interpretation of this blog posting is not that a wide range of skills is needed; it communicates that if you contribute, you'd better be great at doing so. In fact, if you do not have all these skills, you cannot be considered great. So where is the incentive to participate? It makes Open Source sound as if it were an invitation to be judged as either great or inadequate.

Ok, I know this interpretation is through my own jaundiced eyes. So, to see if my interpretation was just a reflection of my own bad day, I shared the blog posting with a couple of colleagues.  Both colleagues are women who judge their own skills unnecessarily harshly but, in my judgement, are really quite good. I chose these two specifically because I knew both suffer "imposter syndrome", a largely unshakable feeling of inadequacy that is quite common among technical people.  Both reacted badly to the posting, one saying that it sounded like a job posting for a position she would have no hope of ever landing.

I want to turn this around. Let's not judge the contributors, let's judge the projects instead. In fact, we can take these eight traits and boil them down to one essential trait:
Essential trait of a great open source project:
Leaders & processes that can advance the project while marshalling imperfect contributors gracefully.
That's a really tall order. By that standard, my own Open Source projects are not great. However, I feel much more comfortable saying that the project is not great, rather than sorting the contributors.

If I were paying people to work on my project, I'd have no qualms about judging their performance anywhere along a continuum of "great" to "inadequate". But contributors are NOT employees subject to performance review.  In my projects, if someone contributes, I consider both the contribution and the contributor to be "great". The contribution may not make it into the project, but it was given to me for free, so it is naturally great by that aspect alone.

Contribution: Voluntary Gift

Perhaps if the original posting had said, "these are the eight gifts we need" rather than saying that the gifts are traits of people we consider "great", I would not have been so uncomfortable.

A great Open Source project is one that produces a successful product and is inclusive. An Open Source project that produces a successful product, but is not inclusive, is merely successful.

Monday, June 09, 2014

Crontabber and Postgres

This essay is about Postgres and Crontabber, the we-need-something-more-robust-than-cron job runner that Mozilla uses in Socorro, the crash reporting system.

Sloppy database programming in an environment where autocommit is turned off leads to very sad DBAs. There are a lot of programmers out there who cut their teeth on databases that either had autocommit on by default or didn't even implement transactions.  Programmers who are used to working with relational databases in autocommit mode miss out on one of the most powerful features of relational databases. Worse, bringing the cavalier attitude of autocommit into a transactional world will lead to pain.

In autocommit mode, every statement given to the database is committed as soon as it is done. That isn't always the best way to interact with a database, especially if there are multiple steps to a database related task.

For example, say we've got database tables representing monetary accounts. To move money from one account to another requires two steps: deduct from the first account and add to the other. In autocommit mode, there is a danger that the accounts could get out of sync if some disaster happens between the two steps.

To counter that, transactions allow the two steps to be linked together. If something goes wrong during the two steps, we can roll back any changes and not let the accounts get out of sync. However, managing transactions manually requires the programmer to be more careful: make sure that there is no execution path out of the database code that doesn't pass through either a commit or a rollback. Failing to do so may leave connections idle in transactions. The risk is critical consumption of resources and eventual deadlocks.
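The disciplined version of that two-step transfer looks like this. I've used the standard library's sqlite3 here so the example is self-contained; with psycopg2 the commit/rollback pattern is identical. The table and account numbers are invented for the example:

```python
import sqlite3

# sqlite3 stands in for psycopg2 so this runs anywhere;
# the commit/rollback discipline is the same for Postgres.
connection = sqlite3.connect(":memory:")
connection.execute("create table accounts (acc_num text, total integer)")
connection.execute("insert into accounts values ('1', 100), ('2', 100)")
connection.commit()

try:
    cursor = connection.cursor()
    cursor.execute(
        "update accounts set total = total - 10 where acc_num = '1'")
    cursor.execute(
        "update accounts set total = total + 10 where acc_num = '2'")
    connection.commit()    # both steps become visible at once
except Exception:
    connection.rollback()  # neither step happened; the accounts stay in sync
    raise
```

Every exit path goes through either the commit or the rollback, so the connection is never left idle in a transaction.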

Crontabber provides a feature to help make sure that database transactions get closed properly and still allow the programmer to be lazy.

When writing a Crontabber application that accesses a database, there are a number of helpers. Let's jump directly to the one that guarantees proper transactional behavior.

from crontabber.base import BaseCronApp
from crontabber.mixins import (with_postgres_transactions,
                               with_single_postgres_transaction)

@with_postgres_transactions()        # sets up postgres
@with_single_postgres_transaction()  # tells crontabber to control transactions
class MoveMoneyCronApp(BaseCronApp):
    def run(self, connection):
        # connection is a standard psycopg2 connection instance.
        # use it to do the two steps:
        cursor = connection.cursor()
        cursor.execute(
            "update accounts set total = total - 10 where acc_num = '1'")
        cursor.execute(
            "update accounts set total = total + 10 where acc_num = '2'")

In this contrived example, the class decorators gave the crontabber job a connection to the database and will ensure that if the job runs to completion, the transaction is committed. They also guarantee that if the 'run' method exits abnormally (with an exception), the transaction is rolled back.

Using this class decorator declares that this Crontabber job represents a single database transaction.  Needless to say, if the job takes twenty minutes to run, you may not want it to be a single transaction.

Say you have a collection of periodic database-related scripts that have evolved over years at the hands of Python programmers long gone. Some of the crustier ones from the murky past are really bad about leaving database connections "idle in transaction". In porting such a script to Crontabber, call the ill-behaved function from within the context of a construct like the previous example. Crontabber will take on the responsibility of transactions for that function with these simple rules:

  • If the method ends normally, crontabber will issue the commit on the connection.
  • If an exception escapes from the scope of the function, crontabber will rollback the database connection.
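Those two rules are easy to picture as a decorator. This is my sketch of the idea, not Crontabber's actual implementation:

```python
import functools


def with_transaction(run_method):
    """Commit when the wrapped method ends normally; roll back when
    an exception escapes.  (an illustration of the two rules above,
    not Crontabber's actual code)"""
    @functools.wraps(run_method)
    def wrapper(self, connection, *args, **kwargs):
        try:
            result = run_method(self, connection, *args, **kwargs)
        except Exception:
            connection.rollback()  # rule 2: exception escaped, roll back
            raise
        connection.commit()        # rule 1: normal exit, commit
        return result
    return wrapper
```

With a wrapper like this, even a legacy function that never mentions commit or rollback can't leave a connection idle in a transaction.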

Crontabber provides three dedicated class decorators to assist in handling periodic Postgres tasks. Their documentation can be found here: Read The Docs: Postgres Decorators.  The @with_postgres_connection_as_argument decorator will pass the connection to the run method, but does not handle commit and/or rollback.  Use that decorator if you'd like to manage transactions manually within the Crontabber job.

Transactional behavior contributes to making Crontabber robust.  Crontabber is also robust because of its self-healing behaviors. If a given job fails, dependent jobs will not be run. The next time the periodic job's time to execute comes around, the 'backfill' mechanism will make sure that it makes up for the previous failure. See Read The Docs: Backfill for more details.

The transactional system can also contribute to self healing by retrying failed transactions, if those failures were caused by transient issues. Temporary network glitches can cause failure. If your periodic job runs only once every 24 hours, maybe you'd rather your app retry a few times before giving up and waiting for the next scheduled run time.

Through configuration, the transactional behavior of Postgres, embodied by Crontabber's TransactionExecutor class, can do a "backing off retry". Here's the log of an example backing-off retry; my commentary appears in the comment lines:

# we cannot seem to connect to Postgres
2014-06-08 03:23:53,101 CRITICAL - MainThread - ... transaction error eligible for retry
OperationalError: ERROR: pgbouncer cannot connect to server
# the TransactionExecutor backs off, retrying in 10 seconds
2014-06-08 03:23:53,102 DEBUG - MainThread - retry in 10 seconds
2014-06-08 03:23:53,102 DEBUG - MainThread - waiting for retry ...: 0sec of 10sec
# it fails again, this time scheduling a retry in 30 seconds
2014-06-08 03:24:03,159 CRITICAL - MainThread - ... transaction error eligible for retry
OperationalError: ERROR: pgbouncer cannot connect to server
2014-06-08 03:24:03,160 DEBUG - MainThread - retry in 30 seconds
2014-06-08 03:24:03,160 DEBUG - MainThread - waiting for retry ...: 0sec of 30sec
2014-06-08 03:24:13,211 DEBUG - MainThread - waiting for retry ...: 10sec of 30sec
2014-06-08 03:24:23,262 DEBUG - MainThread - waiting for retry ...: 20sec of 30sec
# it fails a third time, now opting to wait for a minute before retrying
2014-06-08 03:24:33,319 CRITICAL - MainThread - ... transaction error eligible for retry
2014-06-08 03:24:33,320 DEBUG - MainThread - retry in 60 seconds
2014-06-08 03:24:33,320 DEBUG - MainThread - waiting for retry ...: 0sec of 60sec
2014-06-08 03:25:23,576 DEBUG - MainThread - waiting for retry ...: 50sec of 60sec
2014-06-08 03:25:33,633 CRITICAL - MainThread - ... transaction error eligible for retry
2014-06-08 03:25:33,634 DEBUG - MainThread - retry in 120 seconds
2014-06-08 03:25:33,634 DEBUG - MainThread - waiting for retry ...: 0sec of 120sec
2014-06-08 03:27:24,205 DEBUG - MainThread - waiting for retry ...: 110sec of 120sec
# finally it works and the app goes on its way
2014-06-08 03:27:34,989 INFO  - Thread-2 - starting job: 065ade70-d84e-4e5e-9c65-0e9ec2140606
2014-06-08 03:27:35,009 INFO  - Thread-5 - starting job: 800f6100-c097-440d-b9d9-802842140606
2014-06-08 03:27:35,035 INFO  - Thread-1 - starting job: a91870cf-4d66-4a24-a5c2-02d7b2140606
2014-06-08 03:27:35,045 INFO  - Thread-9 - starting job: a9bfe628-9f2e-4d95-8745-887b42140606
2014-06-08 03:27:35,050 INFO  - Thread-7 - starting job: 07c55898-9c64-421f-b1b3-c18b32140606
The TransactionExecutor can be set to retry as many times as you'd like, with retries at whatever interval is desired.  The default is to try only once.  If you'd like the backing-off retry behavior, change TransactionExecutor in the Crontabber config file to TransactionExecutorWithLimitedBackOff or TransactionExecutorWithInfiniteBackOff.
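In the Crontabber INI file, that edit is a one-line change. The option name and dotted module paths below are how I'd expect them to appear, not verified spellings; check your own config file for the exact form:

```ini
# before: each transaction is tried only once
transaction_executor_class=crontabber.transaction_executor.TransactionExecutor

# after: failed transactions are retried with a backing-off delay
transaction_executor_class=crontabber.transaction_executor.TransactionExecutorWithLimitedBackOff
```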

While Crontabber supports Postgres by default, Socorro, the Mozilla Crash Reporter, extends the support of the TransactionExecutor to HBase, RabbitMQ, and Ceph.  It would not be hard to get it to work with MySQL or, really, any connection-based resource.

The TransactionExecutor, coupled with Crontabber's backfilling capabilities, means nobody has to get out of bed at 3am because the crons have failed again.  They can take care of themselves.

On Tuesday, June 10, Peter Bengtsson of Mozilla will give a presentation about Crontabber to the SFPUG.  The presentation will be broadcast on AirMozilla.

SFPUG June: Crontabber manages ALL the tasks

Sunday, May 04, 2014

Crouching Argparse, Hidden Configman

I've discovered that people that persist in being programmers over age fifty do not die.  Wrapped in blankets woven from their own abstractions, they're immune to the forces of the outside world. This is the first posting in a series about a pet hacking project of mine so deep in abstractions that not even light can escape.

I've written about Configman several times over the last couple of years as it applies to the Mozilla Socorro Crash Stats project.  It is unified configuration.  Configman strives to wrap all the different ways that configuration information can be injected into a program.  In doing so, it handily passes the event horizon and becomes a configuration manager, a dependency injection framework, a dessert topping and a floor wax.

In my experimental branch of Configman, I've finally added support for argparse.  That's the canonical Python module for parsing the command line into key/value pairs, presumably as configuration.  It includes its own data definition language in the form of calls to a method named add_argument.  Through this method, you define what information you'll accept from the command line.

argparse only deals with command lines.  It won't help you with environment variables, ini files, json files, etc.  There are other libraries that handle those things.  Unfortunately, they don't integrate at all with argparse and may include their own data definition system or none at all.

Integrating Configman with argparse was tough.  argparse doesn't take kindly to being extended in the manner that I wanted.  Configman employs argparse but resorts to deception to get the work done.  Take a look at this classic first example from the argparse documentation.

from configman import ArgumentParser

parser = ArgumentParser(description='Process some integers.')
parser.add_argument('integers', metavar='N', type=int, nargs='+',
                    help='an integer for the accumulator')
parser.add_argument('--sum', dest='accumulate', action='store_const',
                    const=sum, default=max,
                    help='sum the integers (default: find the max)')

args = parser.parse_args()

Instead of importing argparse from its own module, I import it from Configman.  That just means that we're going to use my subclass of the argparse parser class.  Otherwise it looks, acts and tastes just like argparse: I don't emulate it or try to reimplement anything that it does; I use it to do what it does best.  Only at the command line, running the 'help' option, is the inner Configman revealed.

$ ./ 0 0 --help
    usage: [-h] [--sum] [--admin.print_conf ADMIN.PRINT_CONF]
                 [--admin.dump_conf ADMIN.DUMP_CONF] [--admin.strict]
                 [--admin.conf ADMIN.CONF]
                 N [N ...]

    Process some integers.

    positional arguments:
      N                     an integer for the accumulator

    optional arguments:
      -h, --help            show this help message and exit
      --sum                 sum the integers (default: find the max)
      --admin.print_conf ADMIN.PRINT_CONF
                            write current config to stdout (json, py, ini, conf)
      --admin.dump_conf ADMIN.DUMP_CONF
                            a file system pathname for new config file (types:
                            json, py, ini, conf)
      --admin.strict        mismatched options generate exceptions rather than
                            just warnings
      --admin.conf ADMIN.CONF
                            the pathname of the config file (path/filename)

There's a bunch of options with "admin" in them.  Suddenly, argparse supports all the different configuration libraries that Configman understands: that brings a rainbow of configuration files to the argparse world.  While this little toy program hardly needs them, wouldn't it be nice to have a complete system of "ini" or "json" files with no more work than your original argparse argument definitions? 

using argparse through Configman means getting ConfigObj for free

Let's make our example write out its own ini file:

    $ ./ --admin.dump_conf=x1.ini
    $ cat x1.ini
    # sum the integers (default: find the max)
    #accumulate=max
    # an integer for the accumulator
    #integers=
Then we'll edit that file and make it automatically use the sum function instead of the max function: uncomment the "accumulate" line and replace the "max" with "sum".  Configman will associate an ini file having the same base name as the program file and load it automatically.  From that point on, invoking the program means loading the ini file, so the command line arguments aren't necessary.  Rather not have a secret automatically loaded config file? Give it a different name.

    $ ./ 1 2 3
    $ ./ 4 5 6
I can even make the integer arguments load from the ini file.  Revert the "sum" line change and instead change the "integers" line to a list of numbers of your own choosing.

    $ cat x1.ini
    # sum the integers (default: find the max)
    #accumulate=max
    # an integer for the accumulator
    integers=1 2 3 4 5 6
    $ ./
    $ ./ --sum

By the way, making argparse not have a complete conniption fit over the missing command line arguments was quite the engineering effort.  I didn't change it, I fooled it into thinking that the command line arguments are there.

Ini files are supported in Configman by ConfigObj.  Want json files instead of ini files?  Configman figures out what you want by the file extension and searches for an appropriate handler.  Specify that you want a "py" file and Configman will write a Python module of values.  Maybe I'll write an XML reader/writer next time I'm depressed.

Configman does environment variables, too:
    $ export accumulate=sum
    $ ./ 1 2 3
    $ ./ 1 2 3 4 5 6

There is a hierarchy to all this.  Think of it as layers: at the bottom you have the defaults expressed or implied by the arguments defined for argparse.  Then next layer up is the environment.  Anything that appears in the environment will override the defaults.  The next layer up is the config file.  Values found there will override both the defaults and the environment.  Finally, the arguments supplied on the command line override everything else.
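That layering behaves much like Python's collections.ChainMap, where the first mapping that contains a key wins. This is only an analogy for the lookup order, not Configman's implementation, and the values are invented for illustration:

```python
from collections import ChainMap

defaults = {"accumulate": "max", "integers": None}
environment = {"accumulate": "sum"}   # overrides the defaults
config_file = {"integers": "1 2 3"}   # overrides environment and defaults
command_line = {}                     # would override everything else

# highest-priority mapping goes first; lookup stops at the first hit
values = ChainMap(command_line, config_file, environment, defaults)

values["accumulate"]  # -> "sum"   (found in the environment layer)
values["integers"]    # -> "1 2 3" (found in the config file layer)
```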

This hierarchy is configurable; you can arrange it in any order that you want.  In fact, you can put anything that conforms to the collections.Mapping API into that hierarchy.  However, for this example, as a drop-in augmentation of argparse, the API for adjusting the "values source list" is not exposed.

In the next installment, I'll show a more interesting example where I play around with the type in the definition of the argparse arguments.  By putting a function there that will dynamically load a class, we suddenly have a poor man's dependency injection framework.  That idea is used extensively in Mozilla's Socorro to allow us to switch out storage schemes on the fly.

If you want to play around with this, you can pip install Configman.  However, what I've talked about here today with argparse is not part of the current release.  You can get this version of Configman from my github repo: Configman pretends to be argparse - source from github.  Remember, this branch is not production code.  It is an interesting exercise in wrapping myself in yet another layer of abstraction.

My somewhat outdated previous postings on this topic begin with Configuration, Part 1

Sunday, April 13, 2014

Pegboard Tool Storage - circuit tester & grounding adapter

(this post is part 8 of a longer series.  See  
Pegboard Tool Storage, Part 1 for the beginning,
or the whole series)

These are the most commonly lost tools that I own.  In fact, I finally found the left grounding adapter on the floor behind the shelf in the garage, nowhere near an outlet.  That's the problem to solve: make these tools easy to find.

The solution is to make fake electrical outlets to put on the pegboard, of course.  Now all I have to do is remember to put them away when I'm done with them.  If I can manage to do that, finding them again will be trivial because they'll be in plain sight.

This simple design is available for download at: pegboard outlet storage

Saturday, April 12, 2014

Pegboard Tool Storage - hex bits

(this post is part 7 of a longer series.  See
Pegboard Tool Storage, Part 1 for the beginning,
or the whole series)
I've got lots of hex bits for the electric drill, the electric screw driver and various interchangeable tip tools.  I've been keeping them in a jar on a shelf.  I've gotten sick of dumping them out to search for the one that I want.

This design has given me instant visual access to the whole array of hex bits.  The photo shows only one of the holders; I've actually made a bunch for all the different types.

This design is available for 3D printing at: Pegboard Hex Bit Holder

Next in this series: Circuit Tester and Grounding Adapter