When I first encountered formal methods, I was baffled by how the experts specified system correctness. Their specifications seemed unnecessarily convoluted, as if they were speaking in riddles when simple statements would do.
Many specifications state correctness indirectly and then make you do the math yourself. It's like making you traverse the family tree with "your parent's sibling's child" instead of just saying "cousin".
Let’s crack a riddle.
Distributed Transactions
Take distributed transactions as an example. A distributed transaction (also referred to as a global transaction) consists of two or more non-distributed transactions (also referred to as local transactions) executing on participants referred to as resource managers.
For a distributed transaction to be correct, the rule seems straightforward: either all participants commit or all participants abort.
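The simple rule can be written directly as a predicate over the participants' final states. The function name and the state strings below are my own illustrative choices, not part of any standard API:

```python
# Hypothetical sketch: the "simple" atomicity rule as a predicate
# over the final states of all participants.

def atomic_outcome(states):
    """True iff every participant committed or every participant aborted."""
    return (all(s == "commit" for s in states)
            or all(s == "abort" for s in states))

print(atomic_outcome(["commit", "commit"]))  # True
print(atomic_outcome(["commit", "abort"]))   # False
```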
Most explanations would stop there. But formal methods experts tell a different story, one that seems overcomplicated but ultimately reveals profound insights about building reliable systems.
The "Riddle" Unveiled
Instead of the simple rule above, formal specifications split correctness of distributed transactions into two separate statements:
Statement #1
There is no pair of resource managers such that one is in the commit state and the other is in the abort state.
∄ rm₁, rm₂: state[rm₁] = commit ∧ state[rm₂] = abort
Statement #2
For each resource manager, the state eventually becomes and remains either committed or aborted.
∀ rm: (◊□ state[rm] = commit) ∨ (◊□ state[rm] = abort)
Where
∀ means for all
∄ means there does not exist
∧ means and
∨ means or
◊ means eventually
□ means always from this point forward
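The two statements can be sketched as checks over an execution trace, modeled here as a list of snapshots mapping each resource manager to its state. This is a toy illustration under my own naming, not a real model checker; note that the safety check is decisive on any finite trace, while the eventually-always liveness property can only be refuted, never proved, on a finite prefix:

```python
# Hypothetical sketch: Statement #1 (safety) checked over a trace,
# where a trace is a list of snapshots {resource_manager: state}.

def safety_holds(trace):
    """No snapshot may contain both a committed and an aborted RM."""
    for snapshot in trace:
        states = set(snapshot.values())
        if "commit" in states and "abort" in states:
            return False
    return True

def liveness_holds_so_far(trace):
    """Liveness (eventually-always) cannot be proved on a finite trace;
    as an approximation, check that the final snapshot has every RM
    settled in commit or abort."""
    return all(s in ("commit", "abort") for s in trace[-1].values())

trace = [
    {"rm1": "working",  "rm2": "working"},
    {"rm1": "prepared", "rm2": "prepared"},
    {"rm1": "commit",   "rm2": "commit"},
]
print(safety_holds(trace))           # True
print(liveness_holds_so_far(trace))  # True
```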
At first glance, this seems like academic overengineering. Why split one simple rule into two complex statements?
Safety and Liveness: The Foundation of Correctness
These two statements above represent fundamentally different types of guarantees:
Safety Properties
Statement #1 is a Safety property. A Safety property states that nothing bad must ever happen.
Liveness Properties
Statement #2 is a Liveness property. A liveness property states that something good will eventually happen.
Alpern and Schneider proved that every correctness property can be expressed as the intersection of a safety property and a liveness property.
However, this is not simply mathematical elegance but the key to reasoning about systems under failure.
The Failure Scenario
Here is where the riddle turns into rigor: In a perfect world, both safety and liveness guarantees hold. But when processes crash and networks partition, we may have to compromise. And we can’t afford vague correctness.
We can define failure tolerance in terms of safety and liveness properties:
Masking Failure Tolerance
The system guarantees both Safety and Liveness in the presence of failure.
Non-Masking Failure Tolerance
The system guarantees Liveness but not Safety in the presence of failure. Informally speaking, the system makes no promise to avoid mistakes, but it does guarantee to make progress.
Fail-Safe Failure Tolerance
The system guarantees Safety but not Liveness in the presence of failure. Informally speaking, the system guarantees not to make any mistakes, but it does not guarantee to make progress.
Neither Safe nor Live?!
That’s just sad 😭
Database systems and distributed transactions prioritize consistency over availability and choose Fail-Safe Failure Tolerance: participants may stall (something good will not happen) but they will never disagree (something bad will never happen).
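This fail-safe stance can be sketched as the behavior of a single two-phase-commit participant. The function below is an illustrative toy, not a real protocol implementation: once prepared, the participant never decides on its own, so when the coordinator is unreachable it stalls (giving up liveness) rather than risking disagreement (preserving safety):

```python
# Hypothetical sketch of fail-safe behavior in a 2PC participant:
# once prepared, it follows only the coordinator's global decision.

def participant_step(state, coordinator_decision):
    """Next state of a participant, given the coordinator's decision,
    or None if the coordinator is unreachable."""
    if state == "prepared":
        if coordinator_decision in ("commit", "abort"):
            return coordinator_decision  # follow the global decision
        return "prepared"  # stall: never guess unilaterally
    return state  # commit/abort are terminal

print(participant_step("prepared", None))      # prepared (stalls)
print(participant_step("prepared", "commit"))  # commit
```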
Conclusion
Formal methods don't decompose correctness for academic pleasure alone. They decompose properties because:
Safety and Liveness are the fundamental building blocks of correctness properties.
In the presence of failure, Safety and Liveness allow us to clearly specify normal and abnormal correctness guarantees.
What at first sounds like a riddle turns out to be a precise language for discussing the hard choices we must make when building reliable systems in an unreliable world. Once you learn the language, you'll wonder how you ever reasoned about systems without it.
Next time you read a formal specification that feels overly complicated, remember, it’s probably just separating what must never go wrong from what must eventually go right.
Want to go deeper?
If you want to read more about distributed systems, correctness, failure tolerance, and much more, please consider my book Think Distributed Systems