Back-to-Basics Weekend Reading: Why Do Computers Stop and What Can Be Done About It?
“Everything fails, all the time,” a humble computer scientist once said. With all the resources we have today, it is far easier to achieve fault tolerance than it was decades ago, when computers began playing a role in critical systems such as health care, air traffic control, and financial markets. In the early days, the thinking was to achieve fault tolerance through hardware; it was not until the mid-nineties that software fault tolerance became widely accepted.
Tandem Computers was one of the pioneers in building these fault-tolerant, mission-critical systems. They used a shared-nothing multi-CPU approach, in which each CPU had its own memory and I/O bus, and all CPUs were connected through a replicated shared bus over which the independent OS instances could communicate and run in lockstep. In the late seventies and early eighties, this was considered the state of the art in fault tolerance.
Jim Gray, the father of concepts like the transaction, worked for Tandem on software fault tolerance. To be able to build better systems, he dug deep into the kinds of failures Tandem customers were experiencing and wrote up his findings in his “Why Do Computers Stop” report. For a very long time, this would remain the only available study of reliability in production computer systems.
As important as the failure study is, the paper also covers “what can be done about it”: Jim introduces, for the first time, concepts like process-pairs and transactions as the basis for software fault tolerance. This is one of the foundational papers on fault tolerance in distributed systems, and I am going to enjoy reading it this weekend. I hope you will, too.
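For a flavor of the process-pair idea, here is a minimal sketch in Python: a primary periodically checkpoints its state, which doubles as a heartbeat, and a backup takes over from the last checkpoint once the heartbeats stop. The shared in-memory state, the timeout values, and the simulated crash are all illustrative assumptions; a real Tandem process-pair ran the two halves on separate CPUs and shipped checkpoint messages over the interprocessor bus.

```python
import time
import threading

HEARTBEAT_INTERVAL = 0.5   # seconds between primary checkpoints (assumed value)
TAKEOVER_TIMEOUT = 2.0     # backup declares the primary dead after this much silence

state = {"counter": 0}          # application state, checkpointed to the backup
last_checkpoint = time.time()   # when the backup last heard from the primary
lock = threading.Lock()

def primary():
    """Primary does the work and checkpoints state + heartbeat to the backup."""
    global last_checkpoint
    for _ in range(5):
        with lock:
            state["counter"] += 1
            last_checkpoint = time.time()   # checkpoint doubles as "I'm alive"
        print(f"primary: counter={state['counter']}")
        time.sleep(HEARTBEAT_INTERVAL)
    # Simulated fault: the primary simply stops sending heartbeats.
    print("primary: crashing (simulated fault)")

def backup():
    """Backup watches for heartbeats and takes over from the last checkpoint."""
    while True:
        time.sleep(HEARTBEAT_INTERVAL)
        with lock:
            silent_for = time.time() - last_checkpoint
        if silent_for > TAKEOVER_TIMEOUT:
            print(f"backup: no heartbeat for {silent_for:.1f}s, "
                  f"taking over at counter={state['counter']}")
            break

threading.Thread(target=primary).start()
backup()
```

Because the backup resumes from the last checkpoint rather than from scratch, a single software fault takes out only one half of the pair, which is exactly the availability argument the paper develops.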
“Why Do Computers Stop and What Can Be Done About It?”, Jim Gray, June 1985, Tandem Technical Report 85.7