Tuesday, November 19, 2013

the stackwalker revolution

What do you do when you fail at starting a revolution? Major radical change is hard to do smoothly with no risk and disruption: it frightens management. If you cannot overthrow the existing paradigm, you need to find a way to work with it, but influence its direction. If revolution isn't in the cards, evolution may be the answer.

Okay, perhaps that's a rather dramatic introduction to a posting about Socorro, but it is apropos. It is a hard-won lesson in my life as a software engineer: once I've established a software system, an API, or an application, there will be pushback on radical change.

In Socorro, we take binary crash data from Firefox and feed it through a transformative system that makes the crash human readable and analyzable. A key component in the transformation is a C program called “minidump_stackwalk”. We've been using the same one for years and years. A couple of years ago, a replacement was offered to us: one that would allow us to achieve better analysis, more information, and a friendlier format. Unfortunately, it produced a completely different form of output than the old version. Supporting it would have required a massive re-implementation of the Socorro crash processors. Too many other systems relied on the output of the old “minidump_stackwalk” to make the switch practical. The revolution was postponed again and again.

Finally fed up with the lack of progress, I conceded that an evolutionary plan would trump the revolutionary one. Since “minidump_stackwalk” was under our control, I persuaded the maintainer to create a hybrid version: when invoked, it spews forth the old style output followed immediately by the new style output. The older components are unaware that anything has changed; they still get their expected data in the old style format. Development of newer components can proceed unhampered by the old style data and can dive directly into the new style. As we have the time and resources, we can convert the older components to use the new style.

On Monday, November 18, the Socorro configuration was switched. The processors no longer load the LegacyCrashProcessor class at startup. Instead, we've got the HybridCrashProcessor that invokes the split personality “minidump_stackwalk” (now renamed “stackwalker”).

The first beneficiaries of the new JSON style output of the stackwalker will be crash classifiers within the processor and more detailed searching within Elasticsearch. Eventually, the rest of the system will follow along. Retirement of the old style will probably take years. Evolution is a slow process, but patience is a virtue.

Sample old data format (lovingly called the PIPE dump):


OS|Windows NT|6.1.7601 Service Pack 1 
CPU|x86|AuthenticAMD family 20 model 2 stepping 0|2
Crash|EXCEPTION_ACCESS_VIOLATION_READ|0x5|0 
Module|firefox.exe|27.0.0.5066|firefox.pdb|B5A3AC6191AE4A499688FA936AD082612|0x01340000|0x0138efff|1  
Module|ETDApix.dll|7.0.6.1|ETDApix.pdb|154584BB765A4AFFAD640A32176C90361|0x10000000|0x10056fff|0  
Module|devenum.dll|6.6.7600.16385|devenum.pdb|728AEF77CC244D8BADC3F6255CE396B31|0x5d220000|0x5d233fff|0 
Module|msdmo.dll|6.6.7601.17514|msdmo.pdb|7E91458399E34CF99F0C993D9128BB301|0x5d2c0000|0x5d2cafff|0  
Module|mf.dll|12.0.7601.17514|mf.pdb|E53583973043441DA81867AB565AD9372|0x5d550000|0x5d861fff|0  
Module|icm32.dll|6.1.7600.16385|icm32.pdb|189BAB60FB414E258DC5E4996497EEA91|0x5d870000|0x5d8a7fff|0  
Module|StructuredQuery.dll|7.0.7601.17514|StructuredQuery.pdb|75F9CA3991244B738784DD5E65739D461|0x5f2a0000|0x5f2fbfff|0 
Module|davhlpr.dll|6.1.7600.16385|davhlpr.pdb|59C1CF63A3964C2CB96E
...
0|0|mozjs.dll|JSObject::getGeneric(JSContext *,JS::Handle,JS::Handle,JS::Handle,JS::MutableHandle)|hg:hg.mozilla.org/releases/mozilla-aurora:js/src/jsobj.h:b353e78ee8e7|991|0x1c 
0|1|mozjs.dll|GetPropertyOperation|hg:hg.mozilla.org/releases/mozilla-aurora:js/src/vm/Interpreter.cpp:b353e78ee8e7|263|0x1c 
0|2|mozjs.dll|Interpret|hg:hg.mozilla.org/releases/mozilla-aurora:js/src/vm/Interpreter.cpp:b353e78ee8e7|2249|0x1e 
0|3|mozjs.dll|js::RunScript(JSContext *,js::RunState &)|hg:hg.mozilla.org/releases/mozilla-aurora:js/src/vm/Interpreter.cpp:b353e78ee8e7|419|0x9 
0|4|mozjs.dll|js::Invoke(JSContext *,JS::CallArgs,js::MaybeConstruct)|hg:hg.mozilla.org/releases/mozilla-aurora:js/src/vm/Interpreter.cpp:b353e78ee8e7|481|0xb
0|5|mozjs.dll|js::CallOrConstructBoundFunction(JSContext *,unsigned int,JS::Value *)|hg:hg.mozilla.org/releases/mozilla-aurora:js/src/jsfun.cpp:b353e78ee8e7|1257|0x16
0|6|mozjs.dll|js::Invoke(JSContext *,JS::CallArgs,js::MaybeConstruct)|hg:hg.mozilla.org/releases/mozilla-aurora:js/src/vm/Interpreter.cpp:b353e78ee8e7|462|0xb9 
...
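To give a feel for what older components have to do with the PIPE dump, here is a minimal Python sketch that splits it into header records and stack frames. The field names (thread, frame, module, function, file, line, offset) are my reading of the sample above, and parse_pipe_dump is a hypothetical helper, not Socorro's actual parser.

```python
def parse_pipe_dump(text):
    """Split a PIPE dump into header records and stack frames.

    Assumption: frame lines have exactly seven pipe-separated fields and
    start with a numeric thread index; everything else is a header record.
    """
    headers, frames = [], []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line == "...":
            continue  # skip blanks and the elision markers used in the sample
        fields = line.split("|")
        if len(fields) == 7 and fields[0].isdigit():
            thread, frame, module, function, source, line_no, offset = fields
            frames.append({
                "thread": int(thread),
                "frame": int(frame),
                "module": module,
                "function": function,
                "file": source,
                "line": int(line_no) if line_no else None,
                "offset": offset,
            })
        else:
            headers.append(fields)
    return headers, frames


sample = """\
OS|Windows NT|6.1.7601 Service Pack 1
Crash|EXCEPTION_ACCESS_VIOLATION_READ|0x5|0
0|0|mozjs.dll|JSObject::getGeneric(JSContext *,JS::Handle,JS::Handle,JS::Handle,JS::MutableHandle)|hg:hg.mozilla.org/releases/mozilla-aurora:js/src/jsobj.h:b353e78ee8e7|991|0x1c
0|1|mozjs.dll|GetPropertyOperation|hg:hg.mozilla.org/releases/mozilla-aurora:js/src/vm/Interpreter.cpp:b353e78ee8e7|263|0x1c
"""

headers, frames = parse_pipe_dump(sample)
```

Every consumer has to know the position of each field, which is exactly the kind of knowledge that made the old format hard to walk away from.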

Sample new data format (json):

{
   "java_stack_trace" : null,
   "distributor_version" : null,
   "uuid" : "46621f4b-03ec-4c3a-afe5-279da2131119",
   "startedDateTime" : "2013-11-19 13:29:56.613620",
   "truncated" : false,
   "os_version" : "6.1.7601 Service Pack 1",
   "hangid" : null,
   "addons" : [
      [
         "{0303e6fc-c062-47f1-825d-73e5e97d1d43}",
         "1.133"
      ],
      [
         "{972ce4c6-7e08-4474-a285-3208198ce6fd}",
         "27.0a2"
      ]
   ],
   "addons_checked" : true,
   "uptime" : 9162,
   "address" : "0x5",
   "date_processed" : "2013-11-19 13:29:47.544118",
   "success" : true,
   "install_age" : 234007,
   "reason" : "EXCEPTION_ACCESS_VIOLATION_READ",
   "cpu_info" : "AuthenticAMD family 20 model 2 stepping 0 | 2",
   "distributor" : null,
   "pluginName" : null,
   "signature" : "JSObject::getGeneric(JSContext*, JS::Handle, JS::Handle, JS::Handle, JS::MutableHandle)",
   "crashedThread" : 0,
   "client_crash_date" : "2013-11-19 13:27:57.000000",
   "completeddatetime" : "2013-11-19 13:30:02.020154",
   "release_channel" : "aurora",
...
      "crashing_thread" : {
         "threads_index" : 0,
         "total_frames" : 34,
         "frames" : [
            {
               "function_offset" : "0x165",
               "function" : "JSObject::getGeneric(JSContext *,JS::Handle,JS::Handle,JS::Handle,JS::MutableHandle)",
               "trust" : "context",
               "file" : "hg:hg.mozilla.org/releases/mozilla-aurora:js/src/jsobj.h:b353e78ee8e7",
               "frame" : 0,
               "module_offset" : "0xc2145",
               "module" : "mozjs.dll",
               "offset" : "0x61472145",
               "line" : 991
            },
            {
               "function_offset" : "0xab",
               "function" : "GetPropertyOperation",
               "trust" : "cfi",
               "file" : "hg:hg.mozilla.org/releases/mozilla-aurora:js/src/vm/Interpreter.cpp:b353e78ee8e7",
               "frame" : 1,
               "module_offset" : "0xc19fb",
               "module" : "mozjs.dll",
               "offset" : "0x614719fb",
               "line" : 263
            },
...
}
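By contrast, the JSON form can be read with nothing more than the standard library. The sketch below copies two frames from the sample above (with a few keys dropped for brevity) and prints a one-line summary per frame. summarize_frames is a hypothetical helper, and it assumes every frame carries the keys it uses.

```python
import json

# Two frames taken from the sample above; the enclosing document is omitted.
crashing_thread = json.loads("""
{
  "threads_index": 0,
  "total_frames": 34,
  "frames": [
    {"frame": 0, "module": "mozjs.dll", "trust": "context", "line": 991,
     "function": "JSObject::getGeneric(JSContext *,JS::Handle,JS::Handle,JS::Handle,JS::MutableHandle)"},
    {"frame": 1, "module": "mozjs.dll", "trust": "cfi", "line": 263,
     "function": "GetPropertyOperation"}
  ]
}
""")


def summarize_frames(thread):
    """Return one readable line per frame, addressed by key rather than position."""
    return [
        "{frame}: {module}!{function} line {line} [{trust}]".format(**frame)
        for frame in thread["frames"]
    ]


summary = summarize_frames(crashing_thread)
for entry in summary:
    print(entry)
```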

Wednesday, November 13, 2013

Sleep solves everything

Will I ever learn this lesson? Throughout my programming career, I have ruminated on tricky problems for hours at a time to no avail. Banging my head against a programming problem doesn't help. Time and time again, I find, the next morning, that the problem is trivial and I fix it in short order.

Learning and problem solving are related to sleep. I see the same thing in playing music. I play woodwinds and really enjoy the tricky fast passages common in Baroque music. Sometimes I will struggle for hours over some awkward fingering that I just can't seem to get. In music, it's called woodshedding: sitting somewhere and playing the same passage over and over until you get it right. Sometimes that works, sometimes it doesn't.

What does work every time? Practicing for a while, moving on to something else and then sleeping. The next morning, I find that performing the complicated or tricky fingering is much easier. If I hack at it too much, I reinforce the errors instead of the correct fingering.

This is so true in programming. Case in point yesterday and today with a problem in the Socorro Middleware. We discovered the problem and I stepped up to fix it. Forty minutes later I submitted my patch only to find that it made the Middleware explode in a completely unrelated place. There seemed to be no logical connection between the work that I did and the failure. I banged my head on that problem for hours and hours pushing myself into a fourteen hour work day.

This morning, I looked at the problematic code and said, “I wonder...”, then spotted the problem, made the trivial fix, pushed the code to GitHub, watched the Jenkins job feed it through the battery of tests and voilà, it passed.

 It is a lesson that is hard to learn. Maybe if I were to sleep on it, I'd learn it.

Friday, November 08, 2013

Configuration is eating my brain


I've created a monster and it has come back to eat my brain. I've made several blog posts about Configman, my universal configuration manager that encapsulates command line, configuration file and environment configuration systems. It is a powerful system that gave Socorro a flexible dependency injection framework. It has enabled us to swap out storage schemes and processing algorithms using configuration.

In Socorro, we've chosen to use INI files for configuration. Configman is able to create the canonical INI file for any app that employs Configman. Applications are composed of components that declare what external resources they need. For example, a processor may need an HBase crash storage source, an HBase crash storage destination and a RabbitMQ queue. The processor code for each of these three components declares its needs in a Configman compatible manner. In turn, Configman will create an INI file for the processor that has three sections: source, destination and queue. Within each of these sections will be the configuration requirements for the external resources:

[source]
    storage_class=socorro.external.hb.crashstorage.HBaseCrashStorage
    host=localhost
    port=9090
[destination]         
    storage_class=socorro.external.hb.crashstorage.HBaseCrashStorage
    host=localhost
    port=9090
[queue]         
    queue_class=socorro.external.rabbitmq.new_crash_source
    host=rabbitmqHost
    user=rabbitmqUser
    password=rabbitmqPassword
Notice that the source and destination sections both have the same requirements. It is inconvenient to have to specify the HBase connection information twice. To solve that problem, we've chosen to extend the INI file syntax with an +include directive:
[source]
    +include common_hbase.ini
[destination]         
    +include common_hbase.ini
[queue]         
    queue_class=socorro.external.rabbitmq.new_crash_source
    host=rabbitmqHost
    user=rabbitmqUser
    password=rabbitmqPassword
Then we create the file common_hbase.ini with the HBase connection requirements and the information only has to be specified once.
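
The effect of the directive can be pictured as plain text substitution. The following Python sketch is only an illustration of that idea: expand_includes is a hypothetical function, not how Configman or ConfigObj actually implement +include, and it assumes include paths are relative to the including file.

```python
import os
import tempfile


def expand_includes(path):
    """Return the lines of an INI file with each '+include' line replaced
    by the lines of the named file, indented to match the directive."""
    base_dir = os.path.dirname(os.path.abspath(path))
    expanded = []
    with open(path) as ini_file:
        for line in ini_file:
            stripped = line.strip()
            if stripped.startswith("+include "):
                include_name = stripped[len("+include "):].strip()
                indent = line[: len(line) - len(line.lstrip())]
                include_path = os.path.join(base_dir, include_name)
                for included in expand_includes(include_path):
                    expanded.append(indent + included.lstrip())
            else:
                expanded.append(line.rstrip("\n"))
    return expanded


with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "common_hbase.ini"), "w") as f:
        f.write(
            "storage_class=socorro.external.hb.crashstorage.HBaseCrashStorage\n"
            "host=localhost\n"
            "port=9090\n"
        )
    with open(os.path.join(tmp, "processor.ini"), "w") as f:
        f.write(
            "[source]\n"
            "    +include common_hbase.ini\n"
            "[destination]\n"
            "    +include common_hbase.ini\n"
        )
    expanded = expand_includes(os.path.join(tmp, "processor.ini"))

print("\n".join(expanded))
```

Because everything in the included file lands in the section, the section gets all of it or none of it.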

This works great until some other component needs some of the same information from the common_hbase.ini file, but not all of it. We cannot use the +include in that case because bringing extra symbols into a section is an error as far as Configman is concerned. To get around this problem, we relaxed the requirements to allow unknown symbols in sections. Unfortunately, this immediately sacrifices important error detection: misspell a symbol and Configman won't know if it is misspelled or just unused. This is not ideal.
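
The trade-off can be shown with a small Python sketch. check_section and its allow_unknown flag are hypothetical names for illustration, not Configman's API; the point is only that once unknown symbols are tolerated, a typo looks identical to an unused value.

```python
def check_section(section_name, values, declared, allow_unknown=False):
    """Return the names in `values` that no component declared.

    In strict mode (the default here) an undeclared name raises instead.
    """
    unknown = sorted(set(values) - set(declared))
    if unknown and not allow_unknown:
        raise KeyError(
            "unknown symbols in [%s]: %s" % (section_name, ", ".join(unknown))
        )
    return unknown


declared = {"queue_class", "host", "user", "password"}
queue_values = {
    "queue_class": "socorro.external.rabbitmq.new_crash_source",
    "host": "rabbitmqHost",
    "user": "rabbitmqUser",
    "pasword": "rabbitmqPassword",  # deliberate typo
}

# Strict checking catches the misspelling.
try:
    check_section("queue", queue_values, declared)
    strict_error = None
except KeyError as error:
    strict_error = error

# Relaxed checking quietly accepts it as an unused symbol.
ignored = check_section("queue", queue_values, declared, allow_unknown=True)
```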

The system of +include also enables multiple applications to share some configuration information. The processor and the crashmover both need to talk to HBase, so we could use one common_hbase.ini file for both applications. That works fine until one application needs different values for one or more of the parameters defined in the include file. This is the case in our production environment, where some applications use different user names to connect with the same resource. We could factor the variable parameters back out of the +include file, or make nested +include files. As we get into it, however, we end up adding a whole new layer of complexity that is hard to manage.

Here is a proposal for getting around the problem. I'm going to mandate that all INI files have a [resources] section. Within that section, each external resource will have its own subsection. Configman will create this resources section automatically when it reads the resource requirements from the loaded application components.
[resources]
    [[hbase]]
        storage_class=socorro.external.hb.crashstorage.HBaseCrashStorage
        host=localhost
        port=9090
    [[rabbitmq]]
        queue_class=socorro.external.rabbitmq.new_crash_source
        host=rabbitmqHost
        user=rabbitmqUser
        password=rabbitmqPassword
[source]
    # storage_class -> resources.hbase.storage_class
    # storage_class=
    # host -> resources.hbase.host
    # host=
    # port -> resources.hbase.port
    # port=
[destination]
    # storage_class -> resources.hbase.storage_class
    # storage_class=
    # host -> resources.hbase.host
    # host=
    # port -> resources.hbase.port
    # port=
[queue]
    # queue_class -> resources.rabbitmq.queue_class
    # queue_class=
    # host -> resources.rabbitmq.host
    # host=
    # user -> resources.rabbitmq.user
    # user=
    # password -> resources.rabbitmq.password
    # password=
For example, when the application wants its configuration value for the source storage_class, it references the configuration object normally: config.source.storage_class. Behind the scenes, Configman knows that this configuration parameter is linked to the resources section and returns the value from there.

In the case where a particular service needs a different value than the one defined in the resource section, it may be overridden in its original location by uncommenting it and providing an alternative value:

[resources]
    [[hbase]]
        storage_class=socorro.external.hb.crashstorage.HBaseCrashStorage
        host=localhost
        port=9090
…
[source]
    # host -> resources.hbase.host
    host=192.168.1.222
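
The lookup rule implied by the proposal (a local value wins, otherwise follow the link into the resources section) can be sketched in a few lines of Python. LinkedSection is a hypothetical class written for this post; it is not part of Configman and says nothing about how Configman would implement the feature.

```python
class LinkedSection:
    """A config section whose values default to entries in a shared
    resources mapping unless the section overrides them locally."""

    def __init__(self, resources, links, overrides=None):
        self._resources = resources        # e.g. {"hbase": {"host": ...}}
        self._links = links                # e.g. {"host": ("hbase", "host")}
        self._overrides = overrides or {}  # uncommented values in the section

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        if name in self._overrides:
            return self._overrides[name]
        try:
            resource_name, key = self._links[name]
        except KeyError:
            raise AttributeError(name)
        return self._resources[resource_name][key]


resources = {
    "hbase": {
        "storage_class": "socorro.external.hb.crashstorage.HBaseCrashStorage",
        "host": "localhost",
        "port": "9090",
    },
}
hbase_links = {name: ("hbase", name) for name in ("storage_class", "host", "port")}

# [source] overrides host; [destination] takes everything from [resources].
source = LinkedSection(resources, hbase_links, overrides={"host": "192.168.1.222"})
destination = LinkedSection(resources, hbase_links)

print(source.host, destination.host, source.port)
```

The application code never sees the indirection: it asks for source.host and gets whichever value applies.
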
This new resource system does not preclude the use of +include files. If several applications were to need the same HBase configuration, a common_hbase.ini file could be created and pulled into the resources section with +include:

[resources]
    [[hbase]]
        +include common_hbase.ini
The values read in from the +include file can be overridden in the original sections, just as in the previous example. However, because Configman employs ConfigObj for INI file processing, an override of a given value within the same section that has the +include is not allowed. This is a restriction imposed by ConfigObj.

How does this resolve the problem that we're having at Mozilla?

It consolidates the resource configuration. Configuration for an app's external resources is done in one place at the top of the INI file for each app. We do not need to maintain the common_*.ini include files. The configuration files for development, staging, and production can be identical except for the resource connection details.

But now we have to repeat the resource connection information in the INI file for each app. Isn't that less convenient?

We can choose to use +include files, but I discourage it. While we may be calling them 'common' files, in our production environment they aren't really common. The processors use a different HBase host than the middleware; the middleware uses a different user and host for Postgres than Crontabber; etc. Coding for exceptions to the common files is a complication. It will be easier to maintain configuration on an app-by-app basis. It minimizes the number of configuration files and completely avoids +includes and their inevitable exceptions.