Tuesday, November 20, 2012

404 Flood Not Found

"The flood you expected is no longer in service.  If you think you've reached this message in error, please check your river and then rain again."

Our flood didn't happen and we're not quite sure why.  We wonder if the upstream river gauge was malfunctioning.  Considering how much rain we got, the gauge's behavior certainly seemed plausible.

The valley's main river, the Willamette, was not experiencing high levels. The theory goes that the lower stem of the Mary's, our river, had the capacity to drain very quickly.  The Willamette effortlessly consumed what was a major flood five miles upstream.  Had the Willamette been backed up, the upstream flood would have pooled on top of us.

It's fascinating to have observed these different flood personalities over the years.  Each is unique.  This non-flood was the strangest one yet.

I now may proceed with my regular life.

Monday, November 12, 2012

Socorro as a service

At a Socorro meeting in Washington DC in September, we started a discussion about how Socorro could be offered as a service.  This posting is a sketch of my vision for such a service.

Would the developers with apps in the Mozilla Marketplace benefit from stats and aggregate reports from Socorro? Socorro processes and reports on binary crashes from the Firefox platform itself. Is that useful information to the Web developer? Well, they might want it to just see what Firefox problems they're triggering so that they may recode around our problems.

It might be more valuable for app developers to have Javascript stack information about their unhandled exceptions. Socorro doesn't do this right now. An unhandled Javascript exception is not something that brings down Firefox and triggers a breakpad crash submission. So the first step in getting a version of Socorro to help Web developers is to create and install a hook into our Firefox platform to submit an unhandled Javascript exception with stack information to a waiting instance of Socorro.

Since Socorro is so modular, we could yank out the breakpad binary crash handling and replace it with some Javascript stack analysis code. That implies that these crashes are submitted to a Socorro instance substantially different than the one that we use. I'll save the speculation as to what Javascript stack processing would look like for a future posting.

In this case, offering Socorro as a service means just offering a minimally configured Socorro (no Hadoop/HBase, file system storage only) in, perhaps, a VM. Otherwise, it's up to the web developer to host their own Socorro.  If offering Socorro to Web developers means reporting on Javascript unhandled exceptions, then modifications will be required for both the Firefox platform and Socorro.

Does Mozilla itself have interest in a Javascript-aware Socorro? If Mozilla is dogfooding its own platform and writing Marketplace apps (or the default apps such as the phone app), then I could see a Javascript-aware Socorro as a valuable tool in tracking our own app problems.

Assuming that the regular binary crash information is of interest to app developers, how could Socorro be offered as a service?  Our instance of Socorro shouldn't be burdened with all the aggregation and overhead of maintaining stats on potentially thousands of WebApps.  I'm thinking about expanding Socorro in a different dimension: cooperating multiple instances of Socorro.

Imagine this: an app developer has his own instance of Socorro that only receives crashes associated with his own app.  This instance of Socorro is either hosted by us on a VM, or hosted on some external system arranged by the app developer himself.

Whoa Nelly, can we share binary crash dumps with external developers?  This is a critical question that is likely answered with a resounding, "no".  The binary dump potentially contains "sensitive information" about the user. Privacy concerns would prevent us from sending the binary dump to third parties.  Therefore the rest of this missive should be considered just self-indulgent speculation, the implementation of which will never see the light of day.

How does the privately owned Socorro get crashes? Our Socorro will continue to receive all the crashes from our products.  If our Socorro senses, during its own save-the-crash phase of processing, that a crash happened in a specific WebApp, we just forward the crash on to the registered URL for that WebApp's private instance of Socorro.  Once the crash is thrown over the wall, our Socorro continues with our normal crash processing.  It is up to the WebApp developer to use his private Socorro to the benefit of his own app.

Where does the private Socorro live?  It really doesn't matter.  We could offer the hosting.  They could host it on Amazon or their own servers.  If they want a full blown HBase enabled Socorro, they can host on their own cluster.  If they only need a tiny instance, a fully functional Socorro can be configured sans HBase and run on a VM.

How does our Socorro know about the private Socorro so that it can forward crashes?  This can be implemented in several ways.  If a WebApp has a GUID and that GUID is included in the Breakpad metadata submitted with the crash, we could look up a registered crash submission URL and use it for forwarding the crash in exactly the same manner that our products submit crashes to us.
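The registry approach can be sketched roughly as follows. This is only an illustration, not Socorro code: the metadata key `WebAppGUID`, the `REGISTRY` dictionary, and the function name `lookup_forward_url` are all made up for the example, and a real registry would presumably live in a database rather than in a module-level dict.

```python
# Hypothetical registry mapping WebApp GUIDs to the crash submission URL
# of each app's private Socorro.  Both the GUID and the URL are invented.
REGISTRY = {
    "app-guid-1234": "https://crashes.example.com/submit",
}


def lookup_forward_url(metadata, registry=REGISTRY, guid_key="WebAppGUID"):
    """Return the registered submission URL for the crash's WebApp.

    Returns None when the crash carries no GUID or the GUID is not
    registered, in which case the crash is simply not forwarded.
    """
    guid = metadata.get(guid_key)
    if guid is None:
        return None
    return registry.get(guid)
```

The cost of this design is the one noted above: somebody has to keep the registry current.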

An alternative that I find more attractive is having the private Socorro crash submission URL included in the breakpad crash submission metadata.  If the key is present in the metadata, then our master Socorro would just submit the crash directly to that URL.  That way there are no lookups and we don't have to maintain a registry.  It would be up to the app developer to ensure the URL is up to date and correct.  A failed submission is lost.
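Here is a minimal sketch of that registry-free variant. The metadata key `PrivateSocorroURL` and the function names are assumptions made for illustration. To stay short, the sketch posts only the metadata as a URL-encoded form; a real Breakpad-style submission is a multipart form that also carries the dump file.

```python
import urllib.parse
import urllib.request

FORWARD_URL_KEY = "PrivateSocorroURL"  # hypothetical metadata key


def http_post(url, fields, timeout=10):
    """POST the fields as a URL-encoded form; raises on network or HTTP errors."""
    data = urllib.parse.urlencode(fields).encode("utf-8")
    with urllib.request.urlopen(url, data=data, timeout=timeout) as response:
        return response.status


def forward_if_requested(metadata, post=http_post):
    """Forward the crash if its metadata names a private submission URL.

    Returns True when the crash was forwarded and False when no URL was
    present or the submission failed.
    """
    url = metadata.get(FORWARD_URL_KEY)
    if not url:
        return False
    try:
        post(url, metadata)
    except Exception:
        # No retry and no queue: a failed submission is lost.
        return False
    return True
```

The `post` argument is injectable so the forwarding decision can be exercised without touching the network.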

The changes to Socorro to enable this crash forwarding scheme are minimal.  The Socorro CrashStorage hierarchy already has the ability to do an http submit of crashes (this is used in the testing app, SubmitterApp).  There would be only minor modifications required, likely handled smoothly by inheritance.  Including that class in the CrashMoverApp configuration is all that is needed.
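To show what "handled by inheritance" might look like, here is a sketch using stand-in classes. Neither class is Socorro's actual code: `HTTPSubmitStorage` is a simplified imitation of a storage class that posts every crash to one fixed URL, and `ForwardingStorage`, the `save_raw_crash` signature, and the `PrivateSocorroURL` key are assumptions made for the example.

```python
class HTTPSubmitStorage:
    """Stand-in for an existing storage class that submits every crash
    to one fixed URL.  `post` is any callable(url, raw_crash, dump)."""

    def __init__(self, post, url=None):
        self.post = post
        self.url = url

    def save_raw_crash(self, raw_crash, dump, crash_id):
        self.post(self.url, raw_crash, dump)


class ForwardingStorage(HTTPSubmitStorage):
    """Hypothetical subclass: the target URL comes from the crash's own
    metadata, and crashes without one are silently skipped."""

    url_key = "PrivateSocorroURL"  # hypothetical metadata key

    def save_raw_crash(self, raw_crash, dump, crash_id):
        url = raw_crash.get(self.url_key)
        if url:
            self.post(url, raw_crash, dump)
```

In this sketch the subclass only changes where the URL comes from; the submission mechanics are inherited untouched, which is the sense in which the modification is minor.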

So why would the WebApp developer want our Firefox binary crash reports?
It might also be possible to hook into the Javascript unhandled exception code to do the crash report submission without involving breakpad at all.  In such a case, the submission could bypass our master Socorro entirely and go directly to the private Socorro.

This also leads to the idea that the app developer could skip Socorro entirely and roll their own crash reporting system that just receives crashes using the Socorro-style submission technique.