Thursday, February 07, 2008

The Checklist

Brian Chess wrote about a great article in the New Yorker - "The Checklist." The article is a fantastic read and I highly recommend it, even if you're not interested in medicine. It's a well-written, engaging look at how doctors handle a ridiculously complex topic - intensive care.

Like Brian, I was struck by how closely the article parallels some of the problems we face in trying to develop secure software. I agree with the basic premise of Brian's statement, that a checklist can help in the software development world just like it can in the ICU. I've had great success giving developers checklists of common areas of concern, areas they need to make sure they document, and so on:
  • Document how you handle authentication. If it differs from standard X, get a security review.
  • Document how you're handling input filtering. If you're not using the standard library with its declarative syntax, document your approach and get a security review.
You get the picture. You can do similar things with static analyzers, for example, or even by tweaking the compiler or build environment to prevent the use of easy-to-misuse functions such as strcpy, catch mismatched buffer sizes, and so on.
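As a minimal sketch of that compiler-tweak idea, here's what it can look like with GCC or Clang's #pragma GCC poison (the header name banned.h and the exact set of banned functions are illustrative assumptions, not a standard):

    /* banned.h - force-included in every translation unit, e.g. via
       "-include banned.h", so any use of the listed functions becomes a
       hard compile error rather than a code-review finding. */
    #ifndef BANNED_H
    #define BANNED_H

    #include <string.h>   /* pull in the real declarations before poisoning */
    #include <stdio.h>

    /* GCC/Clang: any later mention of these identifiers fails to compile. */
    #pragma GCC poison strcpy strcat sprintf

    #endif /* BANNED_H */

The appeal is the same as the checklist's: the developer doesn't have to remember anything, because the build itself enforces the rule.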

I want to focus on two other items from the article that are worth noting.
  1. Metrics
  2. Processes
Metrics

In the article, the author talks about how following the checklist reduced deaths. One thing he never mentions is the cost of following the checklist. I found that interesting, but I can only assume, based on the number of lives saved and the cost of even a single infection, that the costs of following the checklist are far outweighed by the savings. Still, it would have been nice to see a cost comparison between the two.

What is also interesting is that in the hospital setting it's generally quite clear what an adverse event is. We generally know when someone has an infection, and we certainly know when someone dies. In many cases (though not all) we do root cause analysis to understand the cause of death, though when there is an infection, for example, we don't always get to root cause.

This sort of tracking also occurs within a regulatory framework: hospitals must report their incident rates publicly, and government agencies are charged with collecting, monitoring, and in some cases even improving on these measurements and results.

As a result of this public tracking, Pronovost, the key doctor in the article, was able to tell pretty clearly whether his process changes were having a positive or negative effect. He had lots of public data to draw from, and the number of incidents at any given hospital is large enough to support valid statistical judgments about the impact of a change.

Contrast this with software, and the differences in both practice and maturity are quite telling. We have no standard measures of success and failure, we rarely perform root cause analysis on adverse events, and we have no public reporting of successes and failures. So we lack a general body of knowledge that would let us get better, or at least measure how we're doing.

Maybe we ought to have something like that. I wrote about this last year, arguing that we ought to have some sort of NTSB for security, or at least for security breaches. Maybe it's time we started taking that idea more seriously.

Processes

I was also struck by one of Pronovost's comments about medicine, which I think is especially relevant to software security. When asked whether we'd ever get to the point where checklists are as common for a doctor as a stethoscope, he replied:

"At the current rate, it will never happen,” he said, as monitors beeped in the background. “The fundamental problem with the quality of American medicine is that we’ve failed to view delivery of health care as a science. The tasks of medical science fall into three buckets. One is understanding disease biology. One is finding effective therapies. And one is insuring those therapies are delivered effectively. That third bucket has been almost totally ignored by research funders, government, and academia. It’s viewed as the art of medicine. That’s a mistake, a huge mistake. And from a taxpayer’s perspective it’s outrageous.” We have a thirty-billion-dollar-a-year National Institutes of Health, he pointed out, which has been a remarkable powerhouse of discovery. But we have no billion-dollar National Institute of Health Care Delivery studying how best to incorporate those discoveries into daily practice.
I was reminded of Gunnar's response to the Spaf piece, "Solving the Wrong Problems." I think Gunnar hit the nail on the head with his criticism, and the situation in software is quite similar to the one Pronovost describes in medicine.

For the most part, we fail to treat the delivery and creation of software as a science. We do lots of research on languages and lots of work on theories of security, and then it all breaks down at the point where people actually implement the processes, because we spend almost no time on that. At least, not in proportion to what we spend on all sorts of other efforts whose results we don't measure and aren't sure achieve anything.

We know a lot about how to secure things in theory, but we don't know much about how to get large software development organizations to produce consistently high-quality/"secure" software. Heck, we don't even know how to do it when we aren't budget-constrained, much less when we are.

To be sure, medicine hasn't solved this problem either, and they aren't dealing with a huge installed base :) They are better at measuring effectiveness, but then again they operate in a life-and-death world, with the added joy of strict liability. Under those conditions they do manage to settle on newer, better techniques pretty quickly: they track how they're doing, lives are on the line, and they're strongly incented to get it right.