Thursday, May 31, 2007

Analyzing Software Failures

A few months ago I wrote a small piece called "Most Web Security is Like Finding New Ways to Spend the Loot."

I've been thinking about this topic again: how best to contribute to the security world, and how much good finding new attacks does compared with finding new ways to defend against them.

I'm reminded of accident investigations in the real world. Airplanes rarely crash. When they do, it's something noteworthy. We have an organization, the NTSB, set up to investigate serious transportation accidents, and it investigates every plane crash. Why investigate all airplane crashes and not all car crashes? Because airplane crashes are pretty rare and we believe we probably have an anomaly and something interesting to learn by investigating an airplane crash. The same cannot be said about car crashes. They are simply too frequent and too "routine" to bother investigating.

When we think about civil engineering and failure analysis, we don't generally spend a lot of time on every roof that caves in at a cheap, poorly constructed strip mall. We spend a lot of time investigating why bridges fail, why skywalks fail, and so on. These are things that were presumably engineered to tight tolerances, where a lot of effort was spent, or should have been spent, to ensure safety, and where something nevertheless went wrong. We start with the premise that by examining the anomaly we can learn something that will teach us how to build or design better the next time around.

In software security it's pretty amazing how much time we spend doing the opposite: analyzing applications that were poorly designed and never meant to resist attacks or prevent disasters. We seem to never tire of finding a new security flaw in MySpace, Yahoo, Google Mail, and the like.

What do we learn from finding these vulnerabilities? That we don't, in general, design very good software? That we have fundamental flaws in our design, architecture, and tools that cause these failures? We've known that for years. No new analysis is going to tell me it's hard to design a high assurance application in PHP. We already know that.
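For concreteness, here's a minimal Python sketch of the sort of "routine" flaw this kind of research keeps rediscovering: a classic reflected cross-site scripting bug, where untrusted input is echoed into a page without escaping. The function names and the payload are illustrative, not taken from any of the sites above.

import html

def render_greeting(name: str) -> str:
    # Vulnerable: attacker-controlled input goes straight into the markup,
    # so a payload like "<script>...</script>" executes in the victim's browser.
    return "<p>Hello, " + name + "</p>"

def render_greeting_escaped(name: str) -> str:
    # The routine fix: escape untrusted input before it reaches the page.
    return "<p>Hello, " + html.escape(name) + "</p>"

if __name__ == "__main__":
    payload = "<script>alert(document.cookie)</script>"
    print(render_greeting(payload))          # script tag survives intact
    print(render_greeting_escaped(payload))  # rendered inert: &lt;script&gt;...

Nothing about that bug is news; finding the thousandth instance of it teaches us nothing we didn't already know.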

The type of vulnerability and attack research that interests me is research that shows a brand new class of attack against a piece of software that was well written, had good threat analysis done, and whose developers, architects, and designers really were focused on security, yet somehow still left themselves vulnerable.

It isn't necessarily that these are rare situations. Windows Vista has had a number of security vulnerabilities since its launch despite Microsoft's best efforts to eradicate them. Analysis like Michael Howard's write-up of the ANI bug in Vista is what we need more of. We need more companies with mature secure software development practices to tell us why bugs occur despite their best efforts. We need more people to be as open as Microsoft is being about how even well-designed processes fail, and we need to learn from that to do better the next time around.

In the transportation world this is the role the NTSB plays. The NTSB shows us how, despite our best efforts, we still have accidents, and it does so without partisanship or prejudice. The focus of the investigation is root cause analysis and how we can get it right the next time around.

I read this same idea on someone's blog recently, as applied to security breaches. They proposed an NTSB-like arrangement for investigating security breaches so that we get the same sort of learning out of those situations. If anyone can point me to the piece I'll properly attribute it here.

Along these same lines we ought to have an NTSB equivalent for software security so that we can learn from the mistakes of the past. Bryan Cantrill at Sun and I had a brief exchange on his blog about this topic as it relates to pathological systems, and I referred him to Henry Petroski and his work on failure analysis in civil engineering. Peter Neumann's Risks Digest is the closest we come to a general forum for this sort of thing in the software world, and I'm always amazed (and depressed) by how few software engineers have ever heard of Risks, much less read it.

Why is it that so few companies are willing to be as public as Microsoft has been about its SDL (Security Development Lifecycle), and how do we encourage more companies to participate in failure analysis so that we can collectively learn to develop better software?

1 comment:

Richard Brown said...

"Why investigate all airplane crashes and not all car crashes? Because airplane crashes are pretty rare and we believe we probably have an anomaly and something interesting to learn by investigating an airplane crash. The same cannot be said about car crashes. They are simply too frequent and too "routine" to bother investigating."

It should also be added that another reason plane crashes are investigated is that they're highly visible and lots of people die at one time, whereas the drip, drip, drip of dead motorists is less noticeable.

For much the same reason we spend *far* more making railways safe in the UK than we do making roads safe. It's utterly irrational and is due, in no small part, to media scaremongering and self-serving agitation from various unions.

If politicians really cared about saving lives, rather than giving out signals that *suggest* they do, their spending priorities would be quite different.