Jepsen & TigerBeetle

Jul 7

Avoiding Safety Violations in Distributed Systems with Systems Engineering

7 Comments

Is this right? I think there's a flaw in the assumption in figure 6. You should only make the transfer request visible again, i.e. re-enqueue it with an "attempt" number, once backend A gets an ECONNREFUSED error. Why should backend A make it visible while it is still in the middle of processing request A?

Dominik Tornow

In a distributed system, at-least-once queues have to time out the lock (make the message visible again).

My favorite blog post covering distributed locks (and one of my all-time favorites in general) is from Martin Kleppmann:

https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html
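
To make the lock timeout concrete, here is a minimal in-memory sketch of an at-least-once queue. It is an illustration only, not the queue from the post: the names `AtLeastOnceQueue`, `visibility_timeout`, `receive`, and `ack` are assumptions chosen for this example, and the clock is injectable so the behavior can be exercised without waiting.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Message:
    id: int
    body: str
    attempts: int = 0             # how many times the message has been delivered
    invisible_until: float = 0.0  # the message is locked until this instant


class AtLeastOnceQueue:
    """Hypothetical queue: a received message is hidden, not deleted."""

    def __init__(self, visibility_timeout: float,
                 clock: Callable[[], float] = time.monotonic):
        self.visibility_timeout = visibility_timeout
        self.clock = clock
        self.messages: Dict[int, Message] = {}
        self.next_id = 0

    def enqueue(self, body: str) -> int:
        msg = Message(id=self.next_id, body=body)
        self.messages[msg.id] = msg
        self.next_id += 1
        return msg.id

    def receive(self) -> Optional[Message]:
        """Deliver the first visible message and lock it for the timeout."""
        now = self.clock()
        for msg in self.messages.values():
            if msg.invisible_until <= now:
                msg.attempts += 1
                msg.invisible_until = now + self.visibility_timeout
                return msg
        return None

    def ack(self, msg_id: int) -> None:
        """Only an explicit acknowledgement removes the message."""
        self.messages.pop(msg_id, None)
```

The queue cannot tell a crashed consumer from a slow one, so if no `ack` arrives before the lock expires, the same message is delivered again with a higher attempt count, even if the first consumer is still working on it.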

parth desai

Yeah, that makes sense, but I would configure the lock timeout for the queue to be greater than the timeout of the transfer request. This way, you can definitely catch definite failures, which this would

The point about indefinite failure still remains true, but you can at least guarantee the state on a subset of errors.
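
The distinction between definite and indefinite failures can be sketched roughly as follows. This is an illustrative assumption, not code from the post: `Outcome` and `classify` are hypothetical names, and the sketch assumes the refusal happens at connect time, before any part of the request was sent.

```python
import errno
from enum import Enum


class Outcome(Enum):
    DEFINITE_FAILURE = "definite_failure"  # the request never reached the server
    INDEFINITE = "indefinite"              # the request may or may not have been applied


def classify(exc: OSError) -> Outcome:
    """Classify a network error raised while sending a transfer request.

    Assumption: a refused connection means nothing was sent, so the
    transfer was certainly not applied. Every other error (timeout,
    reset, ...) leaves the outcome unknown.
    """
    if isinstance(exc, ConnectionRefusedError) or exc.errno == errno.ECONNREFUSED:
        return Outcome.DEFINITE_FAILURE
    return Outcome.INDEFINITE
```

Under this sketch, only the `DEFINITE_FAILURE` case tells the caller that the transfer was not applied. A timeout stays `INDEFINITE`, which is the case the rest of the thread is about.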

Dominik Tornow

Yes, that is possible, but comes with its own risks. First, distributed systems should not rely on physical time (except true time). Second, more importantly, that will lead to a brittle architecture: A configuration change in *the queue* leads to faulty behavior in the *backend process*. In practice, that is hard to track, especially if the queue is under different administrative control than the backend process.

parth desai

I think it's debatable which is more brittle: a system in a complete deadlock, or one that provides a way to move forward when feasible.

Having correct configuration across the system is part of modern software development IMO, and should be treated as such.

> This is a valid concern but ignores monitoring. The system can and should log failures and alert operators.

As a way to avoid it, you're asking to rely on operators and human intervention. How is that not *more* brittle?

A thought experiment: suppose TigerBeetle were being used by banks and you were at an ATM. As an end user, would you rather have the system in a complete deadlock because a connection was refused, or get feedback as soon as the system has it available?

Son

Thanks for the post. Is it just me, or does fig 6 appear twice by mistake?

Dominik Tornow

Thank you. Fixed.
