
Distributed Systems

Replication

Master-Slave Replication

  • All writes go to the master node.
  • Changes are then propagated from the master to the slave nodes, either synchronously or asynchronously.
  • Reads can be served from any node.

Master-Master Replication

  • Can be imagined as a cluster of Master-Slave clusters.
  • Very bug-prone; ideally should be avoided if possible.
  • Can be useful for splitting a cluster into a multi-zone setup.
  • Also useful when some master nodes can go offline for a long period of time.
  • Masters can be organized in several topologies: circular, star, all-to-all. Circular and star topologies are more network-efficient, but prone to availability issues. All-to-all is more robust, but can produce more conflicts.

Leaderless Replication

  • There is no leader, so clients read from and write to all nodes responsible for the record key.
  • Usually, there is a minimal number of nodes that must be available for reads and writes to succeed. These numbers are called read and write quorums.
  • Usually, the requirement on quorum sizes is r + w > n, where r is the read quorum size, w is the write quorum size, and n is the total number of nodes in the shard (see the sketch after this list).
  • Different values of r and w allow flexible behaviour of the system. For example, with w = n, r = 1 it is possible to ensure fast non-blocking reads and slow, but highly consistent writes. However, that configuration is vulnerable to node unavailability on the write path.
  • Although high values of r and w improve the chances of reading consistent data, there is no guarantee that consistency is 100%.
  • Sometimes, during a write, the nodes responsible for the key might not be accessible. Some DBs allow other nodes to temporarily accept the write; this option is called a sloppy quorum. After the nodes responsible for keeping the data come back online, the writes are forwarded to them, which is called hinted handoff.
  • Discrepancies in data consistency are usually fixed by 2 processes:
    • Read Repair -- when a client reads data from several nodes, it can detect inconsistencies, pick the most recent record, and fix the data on the outdated nodes.
    • Anti-Entropy -- a background process which checks that data is still consistent between nodes and repairs it if needed.
  • Leaderless DBs usually can work in a multi-datacenter setup without much of a problem.
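
A minimal sketch of quorum reads and writes with last-version-wins read repair, assuming a hypothetical list of replica clients exposing put(key, value, version) and get(key); it only illustrates the r + w > n rule, not a production client.

```python
import time

class QuorumClient:
    """Toy leaderless client: n replicas, write quorum w, read quorum r, with r + w > n."""

    def __init__(self, nodes, w, r):
        assert r + w > len(nodes), "quorums must overlap: r + w > n"
        self.nodes, self.w, self.r = nodes, w, r

    def write(self, key, value):
        version = time.time_ns()                          # toy per-write version tag
        acks = sum(1 for node in self.nodes if node.put(key, value, version))
        if acks < self.w:
            raise RuntimeError("write quorum not reached")

    def read(self, key):
        replies = [node.get(key) for node in self.nodes]           # (value, version) or None
        replies = [reply for reply in replies if reply is not None]
        if len(replies) < self.r:
            raise RuntimeError("read quorum not reached")
        value, version = max(replies, key=lambda reply: reply[1])  # newest version wins
        for node in self.nodes:
            node.put(key, value, version)                 # read repair: push newest value back
        return value
```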

Synchronous and Asynchronous Replication

Fully Synchronous Replication -- changes must be applied on all nodes synchronously before the user transaction is committed. Even though this approach guarantees maximum data consistency, the loss of even one slave node might make the whole cluster unavailable, thus making the whole architecture more fragile. Unavailability of nodes should be taken into consideration.

Chained Replication -- a special kind of Synchronous Replication where nodes are connected into a chain and data is passed from one node to the next. Read performance can be improved by the CRAQ variation of the pattern.

Follow-up: Chain Replication by Jordan has no life; Chain Replication for Supporting High Throughput and Availability by Robbert van Renesse and Fred B. Schneider

Semi-Synchronous Replication -- changes from the master node go to one slave node synchronously and to the others asynchronously. In this case the cluster is protected from losing the master node, since the synchronously connected node has all the changes and can be promoted to "master" status. Such a setup has lower consistency guarantees, which should be tackled separately.

Follow-up: Semi-Synchronous Replication at Facebook by Yoshinori Matsunobu
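
A minimal sketch of the commit path under semi-synchronous replication, assuming hypothetical replica objects with an apply method; the asynchronous fan-out is modelled as a plain queue drained by a background thread.

```python
import queue
import threading

class SemiSyncMaster:
    """Toy master: commit waits for one synchronous replica, the rest catch up asynchronously."""

    def __init__(self, sync_replica, async_replicas):
        self.sync_replica = sync_replica
        self.async_replicas = async_replicas
        self.backlog = queue.Queue()
        threading.Thread(target=self._ship_async, daemon=True).start()

    def commit(self, change):
        self.sync_replica.apply(change)   # block until the sync replica confirms the change
        self.backlog.put(change)          # async replicas will receive the change later
        return "committed"

    def _ship_async(self):
        while True:
            change = self.backlog.get()
            for replica in self.async_replicas:
                replica.apply(change)     # replication lag lives here
```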

Fully Asynchronous Replication -- changes from the master node go to all slave nodes asynchronously.

Types of Replication Logs

  • Statement Replication - based on a log of DML statements applied over time. It has one major disadvantage: many statements are non-deterministic. For example, NOW() will produce a different value every time. To make this replication work, all such statements should be pre-executed, and the results of execution preserved in the log.
  • Write-Ahead Log (WAL) Shipping - based on sharing the WAL over the network. It is extremely space-efficient. Disadvantage: the WAL has a binary format, so it can be non-backwards compatible between different versions of the DB. That prevents using the existing high-availability mechanism for a zero-downtime upgrade of the DB version on the cluster.
  • Logical (Row-Based) Replication - based on sharing changes to the data in each row, with unique row identifiers if possible. The most flexible approach, but also potentially the most verbose.
  • Trigger-Based - custom replication written at the application level and invoked by DB triggers. The most fragile method, but may be necessary in certain cases.
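
A minimal sketch of what a logical (row-based) log entry might carry and how a replica could replay it; the table, columns, and SQL placeholders are illustrative assumptions, not a real DB's wire format.

```python
# One logical-log entry describes a row change, not the SQL statement that produced it.
entry = {
    "op": "update",                               # insert | update | delete
    "table": "users",                             # hypothetical table
    "key": {"id": 42},                            # primary key identifying the row
    "new_values": {"email": "new@example.com"},   # columns that changed
}

def apply_entry(conn, entry):
    """Replay a row-level change on a replica (sketch, not dialect-accurate SQL)."""
    if entry["op"] == "update":
        sets = ", ".join(f"{col} = ?" for col in entry["new_values"])
        where = " AND ".join(f"{col} = ?" for col in entry["key"])
        conn.execute(
            f"UPDATE {entry['table']} SET {sets} WHERE {where}",
            [*entry["new_values"].values(), *entry["key"].values()],
        )
```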

Distributed Transactions

A transaction is a change of state. Since a change of state can be a critical point in a software system, we usually require transactions to comply with certain guarantees. For example, we can require that a user is never able to see data mid-change. Nothing comes for free: the more guarantees are required from transactions, the more difficult it is to maintain them. Many guarantees result in a complex implementation and a performance penalty.

ACID in distributed systems

A - atomicity - it is impossible to see changes while they are happening. It can apply at different scales: the whole set of changes in a transaction can be atomic, or only node-local changes, or tables, or rows, or even only individual columns.

C - consistency - all users see the same committed state. The opposite would be when one user commits a change and can immediately see it on read, while others still read the old data.

I - isolation - different users don't see uncommitted (and sometimes committed) changes of other users. A very expensive requirement, thus usually implemented only partially. Example: other users can see new rows, but don't see changed rows.

D - durability - committed changes are not going to disappear. The opposite can happen if, on commit, we change the data in an in-memory cache and flush it asynchronously.

Distributed Writes

If a write can be handled by multiple nodes, there will always be a problem with determining the consistent state of the data.

  • Last Write Wins - one of the methods of achieving data consistency in case of concurrent writes. To make it work, all changes must have a way to determine which one happened earlier and which later, for example using a timestamp or row version. However, such an algorithm causes the DB to violate its durability promise, because it can get into a situation where a transaction was reported as successful to a user, but was then silently overwritten by another change.
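
A minimal sketch of last-write-wins conflict resolution keyed on a write timestamp; the (value, timestamp) tuple layout is an assumption made for illustration.

```python
def lww_merge(replica_values):
    """Pick the winning version among conflicting copies of the same key.

    replica_values: list of (value, timestamp) pairs collected from replicas.
    The highest timestamp wins; every other write is silently discarded,
    which is exactly how LWW can drop a write that was already acknowledged.
    """
    return max(replica_values, key=lambda pair: pair[1])

# Two concurrent, acknowledged writes: the earlier one is lost on merge.
print(lww_merge([("cart=[book]", 1700000000), ("cart=[pen]", 1700000005)]))
```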

Two-phase commit (2PC) protocol

An old technique for committing changes on several nodes simultaneously.

The process looks like this:

  1. Prepare. The master sends the change request and waits until all nodes apply it and respond "ready".
  2. Commit. After receiving confirmation from all nodes that the change is applied and they are ready to commit, the master sends the commit command to all nodes, then waits for success responses.
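
A minimal sketch of the coordinator side of 2PC, assuming hypothetical participant objects exposing prepare, commit, and abort; failure handling is reduced to aborting everyone after any "no" vote.

```python
def two_phase_commit(participants, change):
    """Coordinator: phase 1 collects 'ready' votes, phase 2 commits or aborts everywhere."""
    # Phase 1: prepare -- every participant must apply the change and vote 'ready'.
    for node in participants:
        if not node.prepare(change):
            for voted in participants:
                voted.abort(change)        # any 'no' vote aborts the whole transaction
            return "aborted"

    # Phase 2: commit -- after unanimous 'ready', tell everyone to commit.
    for node in participants:
        node.commit(change)                # a coordinator crash here leaves nodes in doubt
    return "committed"
```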

The algorithm has several flaws:

  • The master is a single point of failure.
  • A node can apply the changes and report itself ready, but then fail on commit. This and similar cases should be tracked and fixed ASAP. Usually, the system goes into a "not_ready" state until the cluster is repaired.

Three-phase commit (3PC) protocol

An advanced version of the two-phase commit protocol. It consists of 3 steps:

  1. Prepare. The master sends the change request and waits until all nodes apply it, but do not commit.
  2. Pre-commit. The master asks the nodes to prepare for commit and waits until all of them reply "ready".
  3. Commit. After receiving confirmation from all nodes that the change is applied and they are ready to commit, the master sends the commit command to all nodes, then waits for success responses.

The algorithm is meant to be an improved version of the two-phase commit protocol, but it still comes with downsides:

  • The master is still a single point of failure.
  • A lot of network overhead.
  • A node can apply the changes and report itself ready, but then fail on commit. This and similar cases should be tracked and fixed ASAP. Usually, the system goes into a "not_ready" state until the cluster is repaired.

Levels of Consistency

Eventual Consistency means changes spread over the cluster nodes with a delay, the replication lag.

Read Your Own Writes - a slightly stricter level of consistency. If a user made a write, we want them to see the change immediately; for other users there is no such requirement. This level of consistency keeps replication relaxed, but makes the system appear strongly consistent to the user. There are multiple ways to implement it: serve certain endpoints only from the master node, remember the timestamp of the user's write and read recent data from the master node, etc.
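
A minimal sketch of the timestamp-based approach: remember when the user last wrote and route their reads to the master while replicas might still lag behind that write. The master/replica objects with put/get methods and the one-second lag window are assumptions.

```python
import time

last_write_at = {}   # user_id -> timestamp of that user's most recent write

def write(master, user_id, key, value):
    master.put(key, value)
    last_write_at[user_id] = time.time()

def read(master, replica, user_id, key, max_lag_seconds=1.0):
    # Read from the master while a replica might still be behind this user's write.
    recently_wrote = time.time() - last_write_at.get(user_id, 0) < max_lag_seconds
    node = master if recently_wrote else replica
    return node.get(key)
```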

Monotonic Reads - an even stricter version of eventual consistency. A user should not "travel back in time" when querying the data. That can happen if the first request was served by a node with more recent data, and the second request by a node which has not yet received the recent replication data. This requirement can be satisfied by, for example, making sure that a given user always reads from one assigned node. Obviously, that breaks if the node is not accessible.
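
A minimal sketch of that pinning approach: hash the user id to one replica so successive reads never jump between nodes at different replication positions. The replicas list of clients with a get method is an assumption.

```python
import hashlib

def replica_for(user_id, replicas):
    """Always pick the same replica for a given user (breaks if that replica is down)."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return replicas[int(digest, 16) % len(replicas)]

def read(user_id, key, replicas):
    return replica_for(user_id, replicas).get(key)
```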

Consistent Prefix Reads - writes that happened in a certain order should keep that order when queried. This can be a problem in sharded systems: one write goes to the first node but gets delayed due to replication lag, while the second write goes to the second node and is processed immediately, making it appear that the second event happened earlier than the first.

Partitioning

Strategies for Determining Partitions

  • By Range of the Keys - similar to partitioning a dictionary: each letter can be a partition. A very simple algorithm that allows range queries (imagine a query from A to C), but it requires knowledge of the full range of keys and is very prone to hot spots.
  • By Hash of the Key - distributes keys evenly across partitions, reducing the risk of hot spots (but not eliminating it fully). As a downside, it no longer supports range queries.
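
A minimal sketch contrasting the two strategies; the partition boundaries, hash function, and partition count are illustrative assumptions.

```python
import bisect
import hashlib

# Key-range partitioning: explicit boundaries keep keys ordered, so range scans work.
RANGE_BOUNDS = ["g", "n", "t"]           # partitions: [..g), [g..n), [n..t), [t..]

def range_partition(key):
    return bisect.bisect_right(RANGE_BOUNDS, key[0].lower())

# Hash partitioning: even spread, but neighbouring keys land on unrelated partitions.
NUM_PARTITIONS = 8

def hash_partition(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_PARTITIONS

print(range_partition("apple"), range_partition("banana"))   # adjacent keys stay together
print(hash_partition("apple"), hash_partition("banana"))     # adjacent keys scatter
```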

Secondary key partitions

  • Local (document-partitioned) - each partition also stores an index of secondary key values, but only for the data in that partition. As a result, writes are quick, but a read by secondary index requires a lookup in every partition (scatter/gather) and can be quite slow.
  • Global (term-partitioned) - index entries are stored separately from the base data and partitioned by the indexed value. Unlike local secondary indexes, which are co-located with the data, global secondary indexes enable flexible cross-partition and range queries, but come with higher write costs, potential consistency issues, and added complexity.

Rebalancing of partitions

  • Fixed number of partitions - the key-to-partition mapping never changes; rebalancing only moves whole partitions between nodes. The main caveat: the number of partitions should be chosen much higher than what is initially needed, to accommodate future growth of the data (see the sketch after this list).
  • Dynamic rebalancing by partition size - if a certain partition grows beyond a certain size, it is split in two.
  • Dynamic rebalancing by number of nodes - the number of partitions is kept in sync with the number of nodes. When new nodes are introduced to the cluster, they split some existing partitions and take over one half of each split.
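
A minimal sketch of the fixed-number-of-partitions scheme: keys always hash into the same fixed set of partitions, and adding a node only changes which node owns each partition. The node names and the "steal from the busiest node" policy are illustrative assumptions.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 64   # fixed up front, much larger than the expected node count

def partition_of(key):
    # Keys map to partitions forever; rebalancing never re-hashes keys.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_PARTITIONS

def add_node(assignment, new_node):
    """Steal whole partitions from the busiest nodes until the new node holds a fair share."""
    fair_share = NUM_PARTITIONS // (len(set(assignment.values())) + 1)
    for _ in range(fair_share):
        busiest, _count = Counter(assignment.values()).most_common(1)[0]
        victim = next(p for p, owner in assignment.items() if owner == busiest)
        assignment[victim] = new_node
    return assignment

assignment = {p: f"node-{p % 3}" for p in range(NUM_PARTITIONS)}   # 3 initial nodes
add_node(assignment, "node-3")
print(Counter(assignment.values()))   # roughly even: only 16 of 64 partitions moved
```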

Distributed Transaction Patterns

Sagas

The main idea of sagas is to split one complex transaction into several smaller local transactions.

For example, a BuyItem transaction can probably be split into several smaller ones: BookItem, BookDelivery, ProcessPayment.

If any sub-transaction fails, compensating transactions should be performed instead: UnBookItem, UnBookDelivery, RefundPayment.

Orchestration Based Sagas

All sub-transactions are initiated by a master service, the orchestrator.
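
A minimal sketch of an orchestration-based saga using the BuyItem example above; the step and compensation functions are hypothetical placeholders, and the pairing of each step with its compensation is the point being illustrated.

```python
def run_saga(steps):
    """Run local transactions in order; on failure, compensate completed steps in reverse."""
    completed = []
    for action, compensation in steps:
        try:
            action()
            completed.append(compensation)
        except Exception:
            for undo in reversed(completed):
                undo()                     # compensating transactions for committed steps
            return "rolled back"
    return "committed"

# The BuyItem saga from the example above; every function is a placeholder.
buy_item_saga = [
    (lambda: print("BookItem"),       lambda: print("UnBookItem")),
    (lambda: print("BookDelivery"),   lambda: print("UnBookDelivery")),
    (lambda: print("ProcessPayment"), lambda: print("RefundPayment")),
]
print(run_saga(buy_item_saga))
```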

Choreography Based Sagas

Sub-transactions are initiated either explicitly, by the service responsible for the previous sub-transaction (BookItem -> BookDelivery -> ProcessPayment), or implicitly, by emitting events.

TCC (Try-Confirm/Cancel)

The pattern consists of several steps:

  1. Try to execute the transaction, reserving the required resources.
  2. Confirm in a second request that the transaction was successful.
  3. Cancel if the transaction failed, releasing the reservation.
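
A minimal sketch of a TCC coordinator over hypothetical participant services exposing try_, confirm, and cancel calls; the reservation semantics are an assumption.

```python
def tcc_transaction(participants, request):
    """Reserve on every participant first; then confirm all, or cancel every reservation made."""
    reserved = []
    for service in participants:
        if service.try_(request):          # tentative step, e.g. hold the funds or the item
            reserved.append(service)
        else:
            for held in reserved:
                held.cancel(request)       # release the reservations made so far
            return "cancelled"

    for service in participants:
        service.confirm(request)           # make all reservations final
    return "confirmed"
```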

Transactional Outbox Pattern

Put the transaction request first into an outbox table or queue. That makes it possible to explicitly control the success of transactions, react to failures, and ensure sequential calls when needed.
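
A minimal sketch of the transactional outbox pattern using sqlite3 from the standard library: the business change and the outbox row are written in one local transaction, and a separate relay later publishes unprocessed outbox rows. The table layout and the relay are illustrative assumptions.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT, published INTEGER DEFAULT 0)")

def place_order(item):
    # The business write and the outbox write commit atomically in one local transaction.
    with conn:
        cur = conn.execute("INSERT INTO orders (item) VALUES (?)", (item,))
        event = {"type": "OrderPlaced", "order_id": cur.lastrowid, "item": item}
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))

def relay(publish):
    # A background process reads unpublished events and forwards them to the message broker.
    rows = conn.execute("SELECT id, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, payload in rows:
        publish(payload)
        conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    conn.commit()

place_order("book")
relay(print)   # in a real system, publish would send the event to a broker
```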