the stackwalker revolution

What do you do when you fail at starting a revolution? Radical change is hard to carry out smoothly, without risk or disruption, and it frightens management. If you cannot overthrow the existing paradigm, you need to find a way to work within it while influencing its direction. If revolution isn't in the cards, evolution may be the answer.

Okay, perhaps that's a rather dramatic introduction to a posting about Socorro, but it is apropos. It is a hard-won lesson from my life as a software engineer: once I've established a software system, an API, or an application, there will be pushback on radical change.

In Socorro, we take binary crash data from Firefox and feed it through a transformative system that makes the crash human readable and analyzable. A key component in the transformation is a C program called “minidump_stackwalk”. We've been using the same one for years and years. A couple of years ago, a replacement was offered to us: one that would give us better analysis, more information, and a friendlier format. Unfortunately, its output took a completely different form from that of the old version. Supporting it would have required a massive re-implementation of the Socorro crash processors, and too many other systems relied on the output of the old “minidump_stackwalk” to make the switch practical. The revolution was postponed again and again.

Finally fed up with the lack of progress, I conceded that an evolutionary plan would trump the revolutionary one. Since “minidump_stackwalk” was under our control, I persuaded the maintainer to create a hybrid version: when invoked, it spews forth the old style output followed immediately by the new style output. The older components are unaware that anything has changed; they still get their expected data in the old style format. Development of newer components can proceed unhampered by the old style data and can dive directly into the new style. As we have the time and resources, we can convert the older components to use the new style.
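To make the hybrid idea concrete, here is a minimal sketch in Python of how a consumer might separate the two halves of the output. It is not the Socorro implementation. The function name split_hybrid_output is hypothetical, and so is the rule it relies on: this sketch assumes the new style section begins at the first line that starts with "{", whereas the real stackwalker may mark the boundary differently.

```python
import json


def split_hybrid_output(text):
    """Split hybrid output into (old style lines, parsed new style document).

    Assumption for this sketch: the JSON section starts at the first line
    beginning with "{". Old style lines start with a keyword or a thread
    number, so they never match that test.
    """
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if line.lstrip().startswith("{"):
            pipe_lines = [l for l in lines[:index] if l.strip()]
            return pipe_lines, json.loads("\n".join(lines[index:]))
    # No JSON section found: everything is old style output.
    return [l for l in lines if l.strip()], None


sample = (
    "OS|Windows NT|6.1.7601 Service Pack 1\n"
    "0|3|mozjs.dll|js::RunScript(JSContext *,js::RunState &)||419|0x9\n"
    '{"crashing_thread": {"threads_index": 0, "total_frames": 34}}\n'
)
pipe_lines, new_style = split_hybrid_output(sample)
```

An older component would read only pipe_lines and never notice the extra data, while a newer component would ignore them and work from new_style.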

On Monday, November 18, the Socorro configuration was switched. The processors no longer load the LegacyCrashProcessor class at startup. Instead, we've got the HybridCrashProcessor that invokes the split personality “minidump_stackwalk” (now renamed “stackwalker”).

The first beneficiaries of the new JSON style output of the stackwalker will be crash classifiers within the processor and more detailed searching within Elasticsearch. Eventually, the rest of the system will follow along. Retirement of the old style will probably take years. Evolution is a slow process, but patience is a virtue.

Sample old data format (lovingly called the PIPE dump):

OS|Windows NT|6.1.7601 Service Pack 1
CPU|x86|AuthenticAMD family 20 model 2 stepping 0|2
0|0|mozjs.dll|JSObject::getGeneric(JSContext *,JS::Handle,JS::Handle,JS::Handle,JS::MutableHandle)||991|0x1c
0|3|mozjs.dll|js::RunScript(JSContext *,js::RunState &)||419|0x9
0|4|mozjs.dll|js::Invoke(JSContext *,JS::CallArgs,js::MaybeConstruct)||481|0xb
0|5|mozjs.dll|js::CallOrConstructBoundFunction(JSContext *,unsigned int,JS::Value *)||1257|0x16
0|6|mozjs.dll|js::Invoke(JSContext *,JS::CallArgs,js::MaybeConstruct)||462|0xb9
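As a rough illustration of what consuming the old format involves, here is a minimal Python sketch that parses lines like the ones above. The field names are my reading of the sample (thread, frame, module, function, source file, line, offset) and should be taken as assumptions. The names Frame and parse_pipe_dump are hypothetical and do not come from Socorro.

```python
from collections import namedtuple

# Field names are assumptions inferred from the sample lines.
Frame = namedtuple(
    "Frame", "thread frame module function source_file line offset"
)


def parse_pipe_dump(lines):
    """Parse old style pipe-delimited lines into a small dictionary."""
    info = {"os": None, "cpu": None, "frames": []}
    for raw in lines:
        fields = raw.rstrip("\n").split("|")
        if fields[0] == "OS":
            info["os"] = {"name": fields[1], "version": fields[2]}
        elif fields[0] == "CPU":
            info["cpu"] = {
                "arch": fields[1],
                "info": fields[2],
                "count": int(fields[3]),
            }
        elif fields[0].isdigit() and len(fields) == 7:
            thread, frame, module, function, source_file, line_no, offset = fields
            info["frames"].append(
                Frame(
                    int(thread),
                    int(frame),
                    module,
                    function,
                    source_file,
                    int(line_no) if line_no else None,  # line may be empty
                    offset,
                )
            )
    return info


parsed = parse_pipe_dump([
    "OS|Windows NT|6.1.7601 Service Pack 1",
    "CPU|x86|AuthenticAMD family 20 model 2 stepping 0|2",
    "0|0|mozjs.dll|JSObject::getGeneric(JSContext *,JS::Handle,JS::Handle,JS::Handle,JS::MutableHandle)||991|0x1c",
    "0|3|mozjs.dll|js::RunScript(JSContext *,js::RunState &)||419|0x9",
])
```

Every consumer has to know the position of each field, which is part of why the self-describing new format is friendlier.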

Sample new data format (json):

    "java_stack_trace" : null,
    "distributor_version" : null,
    "uuid" : "46621f4b-03ec-4c3a-afe5-279da2131119",
    "startedDateTime" : "2013-11-19 13:29:56.613620",
    "truncated" : false,
    "os_version" : "6.1.7601 Service Pack 1",
    "hangid" : null,
    "addons" : [
    "addons_checked" : true,
    "uptime" : 9162,
    "address" : "0x5",
    "date_processed" : "2013-11-19 13:29:47.544118",
    "success" : true,
    "install_age" : 234007,
    "cpu_info" : "AuthenticAMD family 20 model 2 stepping 0 | 2",
    "distributor" : null,
    "pluginName" : null,
    "signature" : "JSObject::getGeneric(JSContext*, JS::Handle, JS::Handle, JS::Handle, JS::MutableHandle)",
    "crashedThread" : 0,
    "client_crash_date" : "2013-11-19 13:27:57.000000",
    "completeddatetime" : "2013-11-19 13:30:02.020154",
    "release_channel" : "aurora",
    "crashing_thread" : {
    "threads_index" : 0,
    "total_frames" : 34,
    "frames" : [
        {
            "function_offset" : "0x165",
            "function" : "JSObject::getGeneric(JSContext *,JS::Handle,JS::Handle,JS::Handle,JS::MutableHandle)",
            "trust" : "context",
            "file" : "",
            "frame" : 0,
            "module_offset" : "0xc2145",
            "module" : "mozjs.dll",
            "offset" : "0x61472145",
            "line" : 991
        },
        {
            "function_offset" : "0xab",
            "function" : "GetPropertyOperation",
            "trust" : "cfi",
            "file" : "",
            "frame" : 1,
            "module_offset" : "0xc19fb",
            "module" : "mozjs.dll",
            "offset" : "0x614719fb",
            "line" : 263