= Horizontal Scaling

When a single server is insufficient to meet the uptime and reliability requirements of a production application, it becomes necessary to scale out. In TypeDB, this is achieved by creating a database cluster that provides high availability and fault tolerance through data replication. This chapter explains the architecture and mechanics of a TypeDB cluster, with a focus on how replication enables fault tolerance and how transactions are processed across different nodes.

== Introduction to High Availability

For many applications, continuous availability is critical. A single server represents a single point of failure; if that server goes down due to a hardware failure or network issue, or is overwhelmed by the volume of requests, the application becomes unavailable. A TypeDB cluster mitigates this risk by deploying the database across multiple servers, or nodes.

TypeDB achieves high availability and fault tolerance through data replication. Replication ensures that every node in the cluster maintains a complete copy of the entire database. This redundancy means that if one node fails, the other nodes can continue to serve requests without interruption or data loss, ensuring the database remains online.

A TypeDB cluster operates on a leader-follower model, which is managed by the RAFT consensus algorithm. This is the core technology that makes the cluster fault-tolerant and consistent.

- *Leader Node:* at any given time, the cluster elects a single node as the leader _separately for each database_. The leader is exclusively responsible for processing all schema and data writes. This design centralizes writes, which simplifies consistency and eliminates the complexities of distributed transactions that would arise if data were partitioned.

- *Follower Nodes:* all other nodes in the cluster act as followers. They passively receive a stream of committed transactions from the leader's log and apply them to their own local copy of the database. This keeps them in sync with the leader.

- *Leader Election:* if the leader node fails or becomes unreachable, the RAFT algorithm automatically initiates a new election among the remaining follower nodes. A new leader is chosen from the followers that have the most up-to-date log, and the cluster can resume write operations with minimal downtime, typically within seconds. The election rule is sketched in code after this list.

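The following is a minimal, self-contained sketch of the election rule described above. It is not TypeDB's implementation: the node names, log indexes, and `elect_leader` helper are hypothetical and only illustrate how a reachable majority picks the candidate with the most up-to-date log.

[source,python]
----
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    term: int               # last term recorded in this node's log
    last_log_index: int     # how far this node's log has advanced
    reachable: bool = True  # False once the node has failed or is partitioned

def elect_leader(nodes: list[Node]) -> Node | None:
    """Hypothetical illustration: a leader can only be elected if a majority
    of the cluster is reachable, and the most up-to-date log wins."""
    reachable = [n for n in nodes if n.reachable]
    quorum = len(nodes) // 2 + 1
    if len(reachable) < quorum:
        return None  # no majority, so no new leader can be elected
    return max(reachable, key=lambda n: (n.term, n.last_log_index))

cluster = [
    Node("node-1", term=4, last_log_index=120, reachable=False),  # failed leader
    Node("node-2", term=4, last_log_index=120),
    Node("node-3", term=4, last_log_index=118),
]

new_leader = elect_leader(cluster)
print(new_leader.name if new_leader else "no quorum")  # -> node-2
----

A real election also involves randomized timeouts and vote requests between nodes; the sketch only captures the "most up-to-date log among a reachable majority" rule.
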
Since all write operations must go through a single leader, the write throughput of the cluster is equivalent to the write throughput of a single node. To scale write performance, you must scale the leader node vertically (i.e., provide it with more powerful hardware).

Read performance, however, can be scaled horizontally. Because every node in the cluster holds a complete copy of the data, read-only transactions can be executed on any node, whether it's the leader or a follower. By directing read queries to follower nodes, you can distribute the read load across the entire cluster. This allows the system to handle a much higher volume of concurrent read requests than a single server could, significantly improving read scalability.

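As a rough illustration of this asymmetry, the sketch below sends hypothetical write operations to a single leader while spreading reads round-robin over every node. The node addresses and operations are made up for the example; real routing is handled by the client driver, as described in the next section.

[source,python]
----
from itertools import cycle

# Hypothetical cluster: every node holds a full replica of the data.
nodes = ["node-1:1729", "node-2:1729", "node-3:1729"]
leader = nodes[0]            # writes are only ever routed here
read_targets = cycle(nodes)  # reads may be served by any replica

def route(operation: str) -> str:
    """Return the node that should serve this operation."""
    if operation == "write":
        return leader
    return next(read_targets)

for op in ["read", "read", "write", "read", "read", "write"]:
    print(f"{op:>5} -> {route(op)}")
# Writes always hit node-1; reads rotate across all three nodes, so read
# capacity grows with cluster size while write capacity does not.
----
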
== Interacting with a Cluster

Interacting with a cluster is very similar to interacting with a single server. The key difference is that the client driver must be configured with the network addresses of all nodes in the cluster.

The driver uses this list to intelligently manage connections. It automatically discovers which node is the current leader for the database and routes all write transactions to it. For read transactions, the driver can be configured to distribute the load across all available nodes (both leader and followers), effectively using the entire cluster's capacity for reads. This routing is handled internally, so your application code for opening sessions and running transactions remains the same whether you are connecting to a single node or a full cluster.

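As a sketch of what this looks like in practice, the snippet below assumes the TypeDB Python driver with a `TypeDB.cloud_driver` entry point that accepts a list of node addresses and credentials. The addresses, credentials, database name, and queries are placeholders, and exact class and method names vary between driver versions, so treat this as an outline rather than a verified snippet.

[source,python]
----
# Sketch only: assumes the Python driver's cloud/cluster entry point;
# addresses, credentials, database name, and queries are placeholders.
from typedb.driver import TypeDB, TypeDBCredential, SessionType, TransactionType

ADDRESSES = ["node-1:1729", "node-2:1729", "node-3:1729"]  # all cluster nodes
credential = TypeDBCredential("admin", "password", tls_enabled=True)

with TypeDB.cloud_driver(ADDRESSES, credential) as driver:
    with driver.session("my-database", SessionType.DATA) as session:
        # Write transactions are routed to the current leader by the driver.
        with session.transaction(TransactionType.WRITE) as tx:
            tx.query.insert('insert $p isa person, has name "Alice";')
            tx.commit()
        # Read transactions may be served by any replica.
        with session.transaction(TransactionType.READ) as tx:
            answers = tx.query.get("match $p isa person; get $p;")
            print(sum(1 for _ in answers))
----

Because leader discovery and failover are handled inside the driver, this code does not change if a node goes down and a new leader is elected.
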
== Consistency and Durability in a Cluster

TypeDB's replication model, managed by RAFT, provides strong consistency guarantees. When a client sends a write transaction to the leader, the following steps ensure its durability and consistency (a simplified sketch of the quorum logic appears after the list):

- The leader appends the transaction to its internal, on-disk log.

- The leader sends this new log entry to all follower nodes.

- The leader waits until a quorum (a majority of the nodes in the cluster, including itself) has acknowledged that they have successfully written the entry to their own logs.

- Only after reaching this quorum does the leader apply the transaction to its state machine and confirm the commit to the client.

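The following sketch illustrates the quorum rule in these steps. The `commit_with_quorum` function and its acknowledgement counting are a hypothetical illustration of majority-based commit, not TypeDB's actual replication code.

[source,python]
----
def commit_with_quorum(entry: str, followers: list[str], ack) -> bool:
    """Hypothetical sketch of a RAFT-style commit: count the leader's own log
    write plus follower acknowledgements, and commit once a strict majority of
    the whole cluster (leader + followers) has persisted the entry."""
    cluster_size = len(followers) + 1  # followers plus the leader itself
    quorum = cluster_size // 2 + 1     # strict majority
    acks = 1                           # the leader's own on-disk log write
    for follower in followers:
        if ack(follower, entry):       # follower persisted the entry to its log
            acks += 1
        if acks >= quorum:
            return True                # safe to apply and confirm to the client
    return False                       # no quorum; the commit cannot be confirmed

# Example: a 5-node cluster with one follower down still reaches quorum (3 of 5).
alive = {"node-2", "node-3", "node-5"}
ok = commit_with_quorum(
    "insert $p isa person;",
    followers=["node-2", "node-3", "node-4", "node-5"],
    ack=lambda node, entry: node in alive,
)
print(ok)  # -> True
----
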
This process guarantees that once a transaction is committed, it is durably stored on a majority of the cluster's nodes and will survive the failure of any minority of nodes: for example, a five-node cluster tolerates the loss of any two nodes. This ensures that the database remains in a consistent state, and no committed data is ever lost.