Jepsen & TigerBeetle

Avoiding Safety Violations in Distributed Systems with Systems Engineering

Jul 07, 2025

Recently, Jepsen released a report on TigerBeetle, a database for financial transactions. In its report, Jepsen criticized a design decision regarding the interaction of the TigerBeetle Client with the TigerBeetle Cluster: Clients do not surface networking failures. Instead, clients will continuously retry a request until receiving a reply.

As the founder of Resonate HQ, the company behind Distributed Async Await, I spend most of my time working on distributed systems. So when Jepsen turned its attention to TigerBeetle, both champions of safe system design, I was especially interested in what the report would uncover.

The report highlights a philosophical divide in system design: Should a system expose more information, enabling more code paths but increasing the risk of safety violations? Or should a system expose less information, enabling fewer code paths but decreasing the risk of safety violations?

In this post, I’ll discuss why exposing a definite failure may only offer the illusion of control while increasing risk—and why TigerBeetle’s failure handling strategy is the safe way to achieve safety.

TigerBeetle

TigerBeetle is a special-purpose, Online Transaction Processing (OLTP) database for financial transactions (double-entry accounting). Unlike general-purpose databases, TigerBeetle does not manage data in a user-defined schema, but handles accounts and transfers.

TigerBeetle is a distributed database that prioritizes safety: ensuring that nothing bad happens. TigerBeetle implements the Viewstamped Replication consensus protocol to guarantee strong serializability, even under adverse conditions such as process failures, network failures, and storage failures.

Jepsen

Jepsen is a framework for testing distributed systems. Jepsen generates concurrent operations, injects failures such as process crashes and network partitions, and verifies safety by comparing the system’s observed behavior to its expected behavior (black box testing).

Jepsen models each operation as a pair of events: an invocation and a completion.
Jepsen distinguishes three possible outcomes (Figure 1):

Definite success
The invocation is followed by a completion indicating success.
Definite failure
The invocation is followed by a completion indicating failure.
Indefinite failure
The invocation is not followed by a completion, introducing uncertainty whether the operation succeeded or failed.

Figure 1. Jepsen distinguishes three possible outcomes: Definite Success, Definite Failure, and Indefinite Failure
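
As a rough illustration of this model, the sketch below classifies the operations of a small, made-up history. The event shape (`op`, `type`) and the function name `classifyOutcomes` are assumptions chosen for this example, not Jepsen's actual API.

```javascript
// Hypothetical history format: each event names an operation and is either
// an invocation ("invoke") or a completion ("ok" or "fail").
function classifyOutcomes(history) {
  const outcomes = new Map();
  for (const event of history) {
    if (event.type === "invoke") {
      // Until a completion arrives, the outcome is unknown.
      outcomes.set(event.op, "indefinite failure");
    } else if (event.type === "ok") {
      outcomes.set(event.op, "definite success");
    } else if (event.type === "fail") {
      outcomes.set(event.op, "definite failure");
    }
  }
  return outcomes;
}

const history = [
  { op: 1, type: "invoke" },
  { op: 2, type: "invoke" },
  { op: 1, type: "ok" },
  { op: 3, type: "invoke" },
  { op: 2, type: "fail" },
  // Operation 3 never completes.
];

const outcomes = classifyOutcomes(history);
console.log(outcomes.get(3)); // indefinite failure
```

An operation that is invoked but never completed stays in the indefinite bucket, which is exactly the case a checker cannot resolve on its own.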

TigerBeetle & Jepsen

Jepsen built a test suite using the Jepsen testing library, which combines testing with failure injection. The Jepsen Client issues operations to the TigerBeetle Client, which in turn forwards them across the network to the TigerBeetle Cluster. Jepsen then compares the observed responses with the expected responses to assess whether TigerBeetle adheres to its specification (Figure 2).

Figure 2. The Jepsen Client issues operations to the TigerBeetle Client, which in turn forwards them across the network to the TigerBeetle Cluster

Jepsen Report

Jepsen tested TigerBeetle 0.16.11 through 0.16.30 and TigerBeetle appeared to meet all of its safety guarantees. As of 0.16.45, TigerBeetle had addressed every issue the report uncovered, with the exception of indefinite retries.

The Jepsen report challenges a specific design decision in the TigerBeetle Client: the client does not time out and does not surface network failures to the application.

Indeed, TigerBeetle’s documentation states:

Requests do not time out [1]. Clients will continuously retry requests until they receive a reply from the cluster. This is because in the case of a network partition, a lack of response from the cluster could either indicate that the request was dropped before it was processed or that the reply was dropped after the request was processed.
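
The ambiguity described in this quote can be sketched with a toy model. Everything here (`createCluster`, `sendOnce`, the fault names) is hypothetical and only stands in for a real cluster and network; it is not TigerBeetle code.

```javascript
// Hypothetical in-memory cluster: remembers which transfer ids it has applied.
function createCluster() {
  const applied = new Set();
  return {
    applied,
    handle(transfer) {
      applied.add(transfer.id);
      return { id: transfer.id, status: "ok" };
    },
  };
}

// Hypothetical network: can drop the request on the way in or the reply on
// the way out. A dropped message is modeled as `null` (no reply observed).
function sendOnce(cluster, transfer, fault) {
  if (fault === "drop-request") return null; // never reaches the cluster
  const reply = cluster.handle(transfer);
  if (fault === "drop-reply") return null; // processed, but the reply is lost
  return reply;
}

const clusterA = createCluster();
const clusterB = createCluster();

const seenA = sendOnce(clusterA, { id: 1 }, "drop-request");
const seenB = sendOnce(clusterB, { id: 1 }, "drop-reply");

// The client observes the same thing in both cases...
console.log(seenA === seenB); // true (both are null)
// ...although the cluster state differs.
console.log(clusterA.applied.has(1), clusterB.applied.has(1)); // false true
```

From the client's side, both runs look identical, so a missing reply alone cannot tell it whether the transfer was applied.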

The Jepsen report criticizes the TigerBeetle Client's retry mechanism for converting definite failures into indefinite failures. For example, when an application issues a transfer request and the TigerBeetle client receives an ECONNREFUSED, the client knows that this request cannot possibly have been executed. Yet, since the client continues to retry and this information is not exposed to the application, the Jepsen report considers the application unnecessarily stalled (Figure 3).

Figure 3. In case of a failure, the Jepsen report considers the application unnecessarily stalled

The report suggests that TigerBeetle should return definite errors to the application:

Jepsen recommend[s] that TigerBeetle develop a first-class representation for definite and indefinite errors, and return those errors to callers when problems occur

On the surface, this seems like a reasonable suggestion. However, TigerBeetle argues that implementing this suggestion makes systems less safe.

A Systems Engineering Perspective

To understand TigerBeetle’s position, we need to shift from component-level thinking to systems-level thinking. TigerBeetle’s documentation elaborates:

With TigerBeetle’s strict consistency model, surfacing these errors at the client or application level would be misleading. An error would imply that a request did not execute, when that is not known.

Why does TigerBeetle claim that surfacing a definite failure such as ECONNREFUSED is misleading? After all, receiving that failure guarantees we didn’t send any data.

But there’s a subtle yet critical distinction between saying “my attempt at sending the message failed” and saying “sending the message failed.” Mistaking one for the other blurs the line between component-level knowledge and system-level knowledge.

This confusion between component and system knowledge is the perfect breeding ground for safety violations.

Systems-Level Thinking

Systems engineering focuses on complex systems and constitutes a paradigm shift from component-level thinking to systems-level thinking: Systems engineering is about understanding and controlling the system as a whole to ensure the desired properties such as end-to-end correctness.

The composition of safe components does not automatically yield a safe system

Let’s examine this through a representative system architecture for a payment system using TigerBeetle as the system of record shown in Figure 4.

Figure 4. A Systems Engineering Perspective

The system follows a common frontend–queue–backend–cluster structure, where multiple backend processes concurrently compete to dequeue and process each request from an at-least-once delivery queue.

Here, we focus on a transfer request. The happy path through the system is straightforward (Figure 5). When a user initiates a transfer through a frontend process, e.g., a web or mobile app:

The frontend process enqueues the transfer request with a newly generated idempotence key onto the request queue.
A backend process dequeues the transfer request and forwards the request via the TigerBeetle Client to the TigerBeetle Cluster.
The TigerBeetle Cluster processes the transfer request, accepting or rejecting the transfer.
The TigerBeetle Cluster returns the result to the backend process, which returns the result to the frontend, which returns the result to the user.

Figure 5. Happy path of a transfer request

System guarantees frequently break at the interaction points between components, so let’s highlight the interaction between message queue and backend process:

queue.onMessage(request => {
  // Forward request to TigerBeetle
  tigerbeetleClient.process(request)
  // Acknowledge to prevent redelivery
  queue.ack(request)
  // Inform the user about success
  ...
})

The backend process dequeues a request, forwards the request using the TigerBeetle Client to the TigerBeetle Cluster, and acknowledges the request to prevent redelivery. However, even in case of redelivery, the idempotence key allows the TigerBeetle Cluster to deduplicate the request.
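
To make the role of the idempotence key concrete, here is a minimal sketch of deduplication in a toy ledger. The ledger, its starting balance, and the result strings are assumptions for illustration; they do not reflect TigerBeetle's actual implementation or result codes.

```javascript
// Hypothetical ledger that remembers the first result per idempotence key.
function createLedger() {
  const results = new Map(); // idempotence key -> first result
  let balance = 100;
  return {
    balance: () => balance,
    transfer(key, amount) {
      // Redelivery of a known key has no additional effect.
      if (results.has(key)) return results.get(key);
      const result = amount <= balance ? "accepted" : "rejected";
      if (result === "accepted") balance -= amount;
      results.set(key, result);
      return result;
    },
  };
}

const ledger = createLedger();
const first = ledger.transfer("key-1", 30);
const redelivered = ledger.transfer("key-1", 30); // same key: deduplicated
console.log(first, redelivered, ledger.balance()); // accepted accepted 70

const withNewKey = ledger.transfer("key-2", 30); // new key: a new transfer
console.log(withNewKey, ledger.balance()); // accepted 40
```

Deduplication only protects requests that share a key. A request that arrives under a new key is, from the ledger's point of view, a different transfer.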

The Temptation to Surface Failures

Now imagine the TigerBeetle Client surfaces an ECONNREFUSED to the backend. The backend knows the request was not sent and concludes it’s safe to inform the user that the transfer was not processed:

queue.onMessage(request => {
  try {
    // Forward request to TigerBeetle
    tigerbeetleClient.process(request)
    // Acknowledge to prevent redelivery
    queue.ack(request)
    // Inform the user about success
    ...
  } catch (err) {
    // Only handle a refused connection (assumes the error carries a
    // Node.js-style `code`); rethrow anything else
    if (err.code !== 'ECONNREFUSED') throw err
    // Acknowledge to prevent redelivery
    queue.ack(request)
    // Inform the user about failure
    ...
  }
})

Since ECONNREFUSED guarantees that the request was definitely not sent and acknowledging the request prevents redelivery, we conclude the transfer did not happen and never will.

But this conclusion is incorrect.

A Safety Violation

Figure 6 illustrates how a seemingly correct component-level decision causes a system-level safety violation:

Figure 6. A seemingly correct component-level decision causes a system-level safety violation

The frontend enqueues a transfer request.
Backend A dequeues the transfer request.
Backend A attempts to forward the transfer request but stalls.
The queue makes the transfer request visible again.
Backend B dequeues the transfer request.
Backend B attempts to forward the transfer request and succeeds.
Meanwhile, backend A receives an ECONNREFUSED.
Backend A informs the user: The transfer was not processed.

The user, believing the transfer did not happen, retries. The frontend issues a new transfer request with a new idempotence key. Result: A duplicate transfer.
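
The sequence above can be replayed as a small deterministic script. This is only a sketch of the scenario: the key names and the in-memory set standing in for the cluster are assumptions for illustration.

```javascript
// Hypothetical cluster state: applies each idempotence key at most once.
const appliedKeys = new Set();
const applyTransfer = (key) => appliedKeys.add(key);

const messagesToUser = [];

// Steps 1-3: backend A dequeues "key-1" and stalls before reaching the cluster.
// Steps 4-6: the queue redelivers "key-1"; backend B forwards it successfully.
applyTransfer("key-1");

// Steps 7-8: backend A receives ECONNREFUSED for its own attempt and
// (incorrectly) treats it as the outcome of the whole request.
const errorSeenByA = { code: "ECONNREFUSED" };
if (errorSeenByA.code === "ECONNREFUSED") {
  messagesToUser.push({ from: "backend A", key: "key-1", message: "not processed" });
}

// The user trusts backend A's message and retries; the frontend generates
// a new idempotence key, so the cluster cannot deduplicate.
if (messagesToUser[0].message === "not processed") {
  applyTransfer("key-2");
}

console.log(appliedKeys.size); // 2: one intended transfer, applied twice
```

Every individual step is locally reasonable; the duplicate only appears once the steps are composed.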

Component-Level vs System-Level

We fell victim to a fundamental misunderstanding: In complex systems, we must not elevate component-level knowledge to system-level knowledge. Here, we confused the statement “a component did not process the transfer request” with “the system did not process the transfer request” and allowed the component to inform the user of a definite outcome, although the component could not know it.

In this system, when a request enters the system, no single component may assume the request did or did not succeed. Only the authoritative source, the TigerBeetle Cluster, can confirm the outcome.

In this system, retrying is not optional. Retrying until the TigerBeetle Cluster returns a result is the only safe way to inform the user.
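
A minimal sketch of that rule, assuming a hypothetical synchronous transport that refuses the first few attempts. The names `createFlakyTransport` and `submitUntilReply` are made up for this example and are not part of the TigerBeetle client.

```javascript
// Hypothetical transport: refuses the first `failures` attempts, then
// delivers the request and returns the cluster's result.
function createFlakyTransport(failures) {
  let attempts = 0;
  return {
    attempts: () => attempts,
    send(request) {
      attempts += 1;
      if (attempts <= failures) {
        const err = new Error("connection refused");
        err.code = "ECONNREFUSED";
        throw err;
      }
      return { id: request.id, result: "accepted" };
    },
  };
}

// Keep retrying until the cluster replies. Failures of individual attempts
// are never returned to the caller.
function submitUntilReply(transport, request) {
  for (;;) {
    try {
      return transport.send(request);
    } catch (err) {
      // An attempt failed; the outcome of the request is still unknown.
    }
  }
}

const transport = createFlakyTransport(2);
const reply = submitUntilReply(transport, { id: "key-1" });
console.log(reply.result, transport.attempts()); // accepted 3
```

The caller only ever sees the cluster's answer. A real loop would also wait between attempts rather than spin.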

Dimensionality Reduction

TigerBeetle has a design philosophy called TigerStyle, and one of its core principles is Dimensionality Reduction:

Minimize dimensionality. Keep function signatures and return types simple to reduce the number of cases a developer has to handle. For example, prefer void over bool, bool over u64, and so on, when it suits the function's purpose.

Exposing the definite failure only offers an illusion of choice: Now, instead of the TigerBeetle Client, either the backend process or the frontend process has to implement the retry—yet retrying remains the only safe choice.

TigerBeetle’s design decision contributes to the safety of the system: As soon as the TigerBeetle Client learns of the request, the client will keep trying to forward the request until receiving a response from the TigerBeetle Cluster.

Stalled Forever?

A common objection: If the TigerBeetle Client retries forever and the system is stalled, how do we know something is wrong?

This is a valid concern but ignores monitoring. The system can and should log failures and alert operators. After all, a persistent ECONNREFUSED failure hints at a larger issue. The system just must not make control-flow decisions and branch on information it cannot safely know.

However, we do have to pay attention to resource consumption or potential resource exhaustion in the presence of any retry loop.
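
One way to combine both points is sketched below: every failed attempt is reported to a monitoring hook, and attempts are spaced with capped exponential backoff. The function `submitWithMonitoring` and its injected `send`, `onFailure`, and `sleep` dependencies are hypothetical, as are the default delays.

```javascript
// Retry until a reply arrives. Each failed attempt is reported to a
// monitoring hook and followed by a capped exponential backoff.
// `send`, `onFailure`, and `sleep` are injected, hypothetical dependencies.
function submitWithMonitoring(send, request, { onFailure, sleep, baseMs = 10, maxMs = 1000 }) {
  let attempt = 0;
  for (;;) {
    try {
      return send(request);
    } catch (err) {
      attempt += 1;
      // Observe the failure, but do not branch the control flow on it.
      onFailure({ attempt, code: err.code });
      sleep(Math.min(maxMs, baseMs * 2 ** (attempt - 1)));
    }
  }
}

// Usage with fakes so the example runs instantly.
const failuresLogged = [];
const delays = [];
let calls = 0;
const fakeSend = (request) => {
  calls += 1;
  if (calls <= 3) throw Object.assign(new Error("refused"), { code: "ECONNREFUSED" });
  return { id: request.id, result: "accepted" };
};

const result = submitWithMonitoring(fakeSend, { id: "key-1" }, {
  onFailure: (event) => failuresLogged.push(event),
  sleep: (ms) => delays.push(ms),
});
console.log(result.result, failuresLogged.length); // accepted 3
```

Operators get a signal on every failed attempt, while the application still learns only the outcome reported by the cluster. The cap on the delay bounds how slowly the loop reacts once the cluster is reachable again.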

Conclusion

Even if the TigerBeetle Client surfaced definite failures, most applications would still need to implement the same retry loop. Just in user code rather than the client library. TigerBeetle makes the safe choice: The TigerBeetle Client retains control of the retry logic and does not surface definite or indefinite failures. That’s not a lack of information.

That’s systems engineering.

If you are interested in distributed systems or Distributed Async Await, a concurrent, distributed programming model, join the Resonate Discord and say hello.

[1] Every runtime provides mechanisms for a caller to interrupt waiting on a callee. E.g., Java’s Future.get(timeout) enables a caller to interrupt waiting by timing out. Here, we’re talking about the callee: the TigerBeetle client does not time out.

parth desai

Is this right? I think there's a flaw in the assumption behind Figure 6. You should only make the transfer request visible again, i.e., re-enqueue it with an "attempt" number, once backend A gets an ECONNREFUSED error. Why should backend A make it visible while the request is still in flight?

Son

Thanks for the post. Is it just me or fig 6 appears twice by mistake?