Back-to-Basics Weekend Reading - Distributed Snapshots: Determining Global States of a Distributed System
Several problems in Distributed Systems can be seen as the challenge to determine a global state. In the classical “Time, Clocks and the Ordering of Events in a Distributed System” Lamport had laid out the principles and mechanisms to solve such problems, and the Distributed Snapshots algorithm, popularly know as the Chandy-Lamport algorithm, is an application of that work. The fundamental techniques in the Distributed Snapshot paper are the secret sauce in many distributed algorithms for deadlock detection, termination detection, consistent checkpointing for fault tolerance, global predicate detection for debugging and monitoring, and distributed simulation.
An interesting anecdote about the algorithm is told by Lamport: “The distributed snapshot algorithm described here came about when I visited Chandy, who was then at the University of Texas in Austin. He posed the problem to me over dinner, but we had both had too much wine to think about it right then. The next morning, in the shower, I came up with the solution. When I arrived at Chandy’s office, he was waiting for me with the same solution.”
Distributed Snapshots: Determining Global States of a Distributed System K. Mani Chandy and Leslie Lamport, ACM Transactions on Computer Systems 3(1), February 1985.