
Distributed Transactions

A transaction is a change of state. Because a change of state can be a critical point in the processes of a software system, we usually require transactions to comply with certain guarantees. For example, we can require that a user never sees data mid-change. Nothing comes free: the more guarantees we require from transactions, the harder they are to maintain. Many guarantees result in a complex implementation and a performance penalty.

ACID in distributed systems

A - atomicity - the changes in a transaction are applied all together or not at all, so a half-applied change is never visible. Atomicity can hold at different scales: the whole set of changes in a transaction can be atomic, or only node-local changes, or tables, or rows, or even only columns.

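C - consistency - all users see the same committed state. (In classic ACID, consistency also means that a transaction moves the data from one valid state to another.) The opposite is when one user commits a change and immediately sees it on read, while another user is still reading the old data.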

I - isolation - users don't see the uncommitted, and sometimes even the committed, changes of other users. This is a very expensive requirement, so it is usually implemented only partially. Example: other users can see newly inserted rows, but don't see changes to existing rows.

D - durability - committed changes do not disappear. The opposite can happen if, on commit, we change data only in an in-memory cache and flush it to storage asynchronously.
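Of the four properties, atomicity is the easiest to show in a few lines. The sketch below is a single-node illustration using Python's built-in sqlite3 module; the `accounts` table and the `make_db`, `transfer` and `balances` helpers are hypothetical names invented for this example. A distributed system still needs a commit protocol on top of such local transactions.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
    conn.commit()
    return conn


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> bool:
    """Move `amount` from src to dst; both updates happen or neither does."""
    try:
        # `with conn` commits on success and rolls back if an exception escapes.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            cur = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            )
            if cur.fetchone()[0] < amount:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except ValueError:
        return False


def balances(conn: sqlite3.Connection) -> dict:
    """Return {name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

In the failing case the credit to `dst` has already been executed when the check fails, yet the rollback removes it, so no reader ever observes money created out of nothing.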

Two-phase commit (2PC) protocol

An old technique for committing changes on several nodes simultaneously.

The process looks like this:

  1. Prepare: the master sends the change request and waits until all nodes apply it and respond "ready".
  2. Commit: after receiving confirmation from all nodes that the change is applied and they are ready to commit, the master sends the commit command to all nodes. It then waits for success responses from the nodes.
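The two phases can be sketched in-process as follows. This is a minimal illustration, not a production protocol: `Node`, `can_prepare` and `two_phase_commit` are hypothetical names, there is no networking, logging or timeout handling, and the rollback path for a node that does not answer "ready" is an assumption based on how the protocol is usually described.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Hypothetical participant that stages a change before committing it."""
    name: str
    can_prepare: bool = True              # False simulates a "not ready" answer
    committed: dict = field(default_factory=dict)
    staged: Optional[dict] = None

    def prepare(self, change: dict) -> bool:
        if not self.can_prepare:
            return False
        self.staged = dict(change)        # applied, but not yet visible
        return True

    def commit(self) -> None:
        self.committed.update(self.staged or {})
        self.staged = None

    def rollback(self) -> None:
        self.staged = None


def two_phase_commit(nodes: list, change: dict) -> bool:
    """Return True if the change was committed on every node."""
    # Phase 1: prepare. Ask every node to apply the change and answer.
    votes = [node.prepare(change) for node in nodes]
    if not all(votes):
        # At least one node is not ready: discard the staged change everywhere.
        for node in nodes:
            node.rollback()
        return False
    # Phase 2: commit. Every node answered "ready".
    for node in nodes:
        node.commit()
    return True
```

Note that the sketch assumes `commit()` cannot fail once a node has answered "ready"; real nodes can, which is exactly the kind of case that needs tracking and repair.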

The algorithm has several flaws:

  • The master is a single point of failure.
  • A node can apply the changes and report that it is ready to commit, but then fail on commit. This and similar cases must be tracked and fixed as soon as possible. Usually, the system should go into a "not_ready" state until the cluster is fixed.

Three-phase commit (3PC) protocol

An advanced version of the two-phase commit protocol. It consists of three steps:

  1. Prepare: the master sends the change request and waits until all nodes apply it, without committing.
  2. Pre-commit: the master asks the nodes to prepare for commit and waits until all nodes reply "ready".
  3. Commit: after receiving confirmation from all nodes that the change is applied and they are ready to commit, the master sends the commit command to all nodes. It then waits for success responses from the nodes.
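The three steps above can be sketched as a small state machine per node. Everything here is hypothetical and simplified: `State`, `Participant`, `fail_at` and `three_phase_commit` are illustrative names, the nodes live in one process, and the timeouts that real three-phase commit relies on are left out.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    APPLIED = "applied"                # change applied but not committed
    PRE_COMMITTED = "pre-committed"    # node promised it is ready to commit
    COMMITTED = "committed"
    ABORTED = "aborted"


class Participant:
    """Hypothetical node; `fail_at` simulates refusing a given phase."""

    def __init__(self, name, fail_at=None):
        self.name = name
        self.fail_at = fail_at
        self.state = State.IDLE

    def prepare(self, change):
        if self.fail_at == "prepare":
            return False
        self.state = State.APPLIED
        return True

    def pre_commit(self):
        if self.fail_at == "pre_commit":
            return False
        self.state = State.PRE_COMMITTED
        return True

    def commit(self):
        self.state = State.COMMITTED

    def abort(self):
        self.state = State.ABORTED


def three_phase_commit(nodes, change):
    """Run prepare, pre-commit and commit; abort everywhere on any refusal."""
    phases = (
        lambda n: n.prepare(change),   # step 1: apply without committing
        lambda n: n.pre_commit(),      # step 2: get ready to commit
    )
    for phase in phases:
        # Build the full list first so every node is asked in each phase.
        if not all([phase(n) for n in nodes]):
            for n in nodes:
                n.abort()
            return False
    for n in nodes:                    # step 3: commit everywhere
        n.commit()
    return True
```

The extra pre-commit round is visible as the second entry in `phases`: it costs one more message exchange with every node before anything is committed.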

The algorithm is supposed to be a better version of the two-phase commit protocol, but it still comes with downsides:

  • The master is still a single point of failure.
  • It adds a lot of network overhead: one more round of messages to every node.
  • A node can apply the changes and report that it is ready to commit, but then fail on commit. This and similar cases must be tracked and fixed as soon as possible. Usually, the system should go into a "not_ready" state until the cluster is fixed.

Distributed Transaction Patterns

Sagas

The main idea of sagas is to split one complex transaction into several smaller local transactions.

For example, the transaction BuyItem can probably be split into several smaller ones: BookItem, BookDelivery, ProcessPayment.

If any sub-transaction fails, compensating transactions must be performed to undo the sub-transactions that already completed. For the example above these would be UnBookItem, UnBookDelivery, and RefundPayment.
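A minimal sketch of this compensation logic, using the BuyItem example. `run_saga`, `log` and `fail_payment` are hypothetical helpers; the sub-transactions only append their names to a list, and the payment step is made to fail on purpose.

```python
def run_saga(steps):
    """Run (name, action, compensation) steps in order.

    If an action raises, run the compensations of the steps that already
    succeeded, in reverse order, and return False.
    """
    done = []
    for name, action, compensation in steps:
        try:
            action()
        except Exception:
            for _, undo in reversed(done):
                undo()
            return False
        done.append((name, compensation))
    return True


log = []  # records what happened, for illustration only


def fail_payment():
    """Simulated ProcessPayment that always fails."""
    raise RuntimeError("payment declined")


buy_item = [
    ("BookItem", lambda: log.append("BookItem"), lambda: log.append("UnBookItem")),
    ("BookDelivery", lambda: log.append("BookDelivery"), lambda: log.append("UnBookDelivery")),
    ("ProcessPayment", fail_payment, lambda: log.append("RefundPayment")),
]
```

Calling `run_saga(buy_item)` books the item and the delivery, fails on payment, and then undoes the delivery and the item in that order. RefundPayment is not run, because the payment never succeeded.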

Orchestration Based Sagas

All sub-transactions are initiated by a master service, the orchestrator.

Choreography Based Sagas

Sub-transactions are initiated either explicitly, by the service responsible for the previous sub-transaction (BookItem -> BookDelivery -> ProcessPayment), or implicitly, by emitting events that the next service reacts to.
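The event-driven variant can be sketched with a tiny in-process event bus standing in for a message broker. `EventBus`, `wire_services`, the `payment_ok` flag and the event names (OrderPlaced, ItemBooked, and so on) are all assumptions made for this illustration; no service knows the whole flow, each one only reacts to the event before it.

```python
from collections import defaultdict


class EventBus:
    """Minimal in-process, synchronous event bus (a stand-in for a broker)."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.history = []              # every published event, in order

    def subscribe(self, event, handler):
        self.handlers[event].append(handler)

    def publish(self, event, payload):
        self.history.append(event)
        for handler in self.handlers[event]:
            handler(payload)


def wire_services(bus, payment_ok=True):
    """Each hypothetical service reacts to the previous service's event."""
    bus.subscribe("OrderPlaced", lambda o: bus.publish("ItemBooked", o))
    bus.subscribe("ItemBooked", lambda o: bus.publish("DeliveryBooked", o))
    bus.subscribe(
        "DeliveryBooked",
        lambda o: bus.publish(
            "PaymentProcessed" if payment_ok else "PaymentFailed", o
        ),
    )
    # Compensations are triggered by events too, walking back the chain.
    bus.subscribe("PaymentFailed", lambda o: bus.publish("DeliveryUnbooked", o))
    bus.subscribe("DeliveryUnbooked", lambda o: bus.publish("ItemUnbooked", o))
```

After `wire_services(bus)` and `bus.publish("OrderPlaced", {"id": 1})`, `bus.history` holds the chain of events that the services emitted, which makes the implicit flow easy to inspect.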

TCC (Try-Confirm/Cancel)

The pattern consists of several steps:

  1. Try to execute the transaction.
  2. Confirm, in a second request, that the transaction was successful.
  3. Cancel if the transaction failed.
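One common way to read these steps is that Try makes a tentative reservation, Confirm makes it final, and Cancel releases it. The sketch below follows that reading, which is an assumption and not something stated above; `Inventory`, `try_`, `confirm`, `cancel` and `run_tcc` are hypothetical names.

```python
class Inventory:
    """Hypothetical TCC participant that reserves stock before confirming."""

    def __init__(self, stock):
        self.stock = stock
        self.reserved = {}             # transaction id -> reserved quantity

    def try_(self, tx_id, qty):
        """Tentatively reserve `qty`; return False if there is not enough."""
        if qty > self.stock:
            return False
        self.stock -= qty
        self.reserved[tx_id] = qty
        return True

    def confirm(self, tx_id):
        # The reservation becomes final; nothing is returned to stock.
        self.reserved.pop(tx_id, None)

    def cancel(self, tx_id):
        # Release the reservation; safe to call even if try_ never succeeded.
        self.stock += self.reserved.pop(tx_id, 0)


def run_tcc(tx_id, requests):
    """requests: list of (participant, quantity) pairs."""
    tried = [(p, p.try_(tx_id, qty)) for p, qty in requests]
    if all(ok for _, ok in tried):
        for p, _ in tried:
            p.confirm(tx_id)
        return True
    for p, _ in tried:
        p.cancel(tx_id)
    return False
```

Making `cancel` harmless for a transaction that was never reserved keeps the coordinator simple: it can cancel every participant without tracking which Try calls succeeded.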

Transactional Outbox Pattern

Put the transaction request into an outbox table or queue first. That makes it possible to explicitly control the success of transactions, react to failures, and ensure sequential calls when needed.
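A minimal sketch of an outbox table, again with Python's built-in sqlite3 module. It assumes the usual form of the pattern, where the outbox row is written in the same local transaction as the business change and a separate relay delivers it later. The `orders` and `outbox` tables, the `OrderPlaced` payload and the `make_outbox_db`, `place_order` and `relay` functions are hypothetical.

```python
import json
import sqlite3


def make_outbox_db() -> sqlite3.Connection:
    """Create an in-memory database with an orders table and an outbox table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "payload TEXT NOT NULL, "
        "sent INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def place_order(conn, order_id, item):
    """Write the order and its outgoing message in one local transaction."""
    with conn:  # commits both inserts together, or rolls both back
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, item))
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"event": "OrderPlaced", "order_id": order_id}),),
        )


def relay(conn, send):
    """Deliver unsent messages in insertion order; mark each one after sending."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        send(json.loads(payload))  # if this raises, the row stays unsent
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

Because a row is marked as sent only after `send` returns, a crash between the two steps causes the message to be delivered again on the next run, so receivers should tolerate duplicates.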