Commit 391729b

horizontal_scaling
1 parent f276e66 commit 391729b

File tree

1 file changed: +64 -0 lines changed

= Horizontal Scaling

When a single server is insufficient to meet the uptime and reliability requirements of a production application, it becomes necessary to scale out. In TypeDB, this is achieved by creating a database cluster that provides high availability and fault tolerance through data replication. This chapter explains the architecture and mechanics of a TypeDB cluster, with a focus on how replication enables fault tolerance and how transactions are processed across different nodes.

== Introduction to High Availability

For many applications, continuous availability is critical. A single server represents a single point of failure; if that server goes down due to a hardware failure or network issue, or is overwhelmed by the volume of requests, the application becomes unavailable. A TypeDB cluster mitigates this risk by deploying the database across multiple servers, or nodes.

TypeDB achieves high availability and fault tolerance through data replication. Replication ensures that every node in the cluster maintains a complete copy of the entire database. This redundancy means that if one node fails, the other nodes can continue to serve requests without interruption or data loss, ensuring the database remains online.

A TypeDB cluster operates on a leader-follower model, which is managed by the RAFT consensus algorithm. This is the core technology that makes the cluster fault-tolerant and consistent.

- *Leader Node:* at any given time, the cluster elects a single node as the leader _separately for each database_. The leader is exclusively responsible for processing all schema and data writes. This design centralizes writes, which simplifies consistency and eliminates the complexities of distributed transactions that would arise if data were partitioned.

- *Follower Nodes:* all other nodes in the cluster act as followers. They passively receive a stream of committed transactions from the leader's log and apply them to their own local copy of the database. This keeps them in sync with the leader.

- *Leader Election:* if the leader node fails or becomes unreachable, the RAFT algorithm automatically initiates a new election among the remaining follower nodes. A new leader is chosen from the followers that have the most up-to-date log, and the cluster can resume write operations with minimal downtime, typically within seconds.
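
Because RAFT requires a majority of nodes both to elect a leader and to acknowledge writes, the practical fault tolerance of a cluster follows directly from its size. The short sketch below is not TypeDB-specific; it simply illustrates the general RAFT quorum arithmetic, which is also why clusters are typically deployed with an odd number of nodes (3, 5, 7).

[source,python]
----
# General RAFT quorum arithmetic (illustrative only, not TypeDB driver code).

def quorum_size(cluster_size: int) -> int:
    """Smallest majority of nodes that must agree."""
    return cluster_size // 2 + 1

def tolerated_failures(cluster_size: int) -> int:
    """Number of nodes that can fail while a majority remains reachable."""
    return cluster_size - quorum_size(cluster_size)

for n in (1, 3, 5, 7):
    print(f"{n} nodes: quorum = {quorum_size(n)}, tolerates {tolerated_failures(n)} failure(s)")
# 1 nodes: quorum = 1, tolerates 0 failure(s)
# 3 nodes: quorum = 2, tolerates 1 failure(s)
# 5 nodes: quorum = 3, tolerates 2 failure(s)
# 7 nodes: quorum = 4, tolerates 3 failure(s)
----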

Since all write operations must go through a single leader, the write throughput of the cluster is equivalent to the write throughput of a single node. To scale write performance, you must scale the leader node vertically (i.e., provide it with more powerful hardware).

Read performance, however, can be scaled horizontally. Because every node in the cluster holds a complete copy of the data, read-only transactions can be executed on any node, whether it's the leader or a follower. By directing read queries to follower nodes, you can distribute the read load across the entire cluster. This allows the system to handle a much higher volume of concurrent read requests than a single server could, significantly improving read scalability.
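
As a rough illustration of this routing idea (a conceptual sketch, not the driver's actual internals; the addresses and helper names are made up), writes always target the leader while reads rotate across every node in the cluster:

[source,python]
----
import itertools

# Hypothetical three-node cluster: one leader, two followers (placeholder addresses).
LEADER = "node1.example.com:1729"
ALL_NODES = [LEADER, "node2.example.com:1729", "node3.example.com:1729"]

# Reads are spread round-robin over the whole cluster; writes must go to the leader.
_read_targets = itertools.cycle(ALL_NODES)

def route(operation: str) -> str:
    """Pick the node that should serve this operation (illustrative only)."""
    return LEADER if operation == "write" else next(_read_targets)

for op in ("write", "read", "read", "read"):
    print(op, "->", route(op))
----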

== Interacting with a Cluster

Interacting with a cluster is very similar to interacting with a single server. The key difference is that the client driver must be configured with the network addresses of all nodes in the cluster.

The driver uses this list to intelligently manage connections. It automatically discovers which node is the current leader for the database and routes all write transactions to it. For read transactions, the driver can be configured to distribute the load across all available nodes (both leader and followers), effectively using the entire cluster's capacity for reads. This routing is handled internally, so your application code for opening sessions and running transactions remains the same whether you are connecting to a single node or a full cluster.
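
As a minimal sketch, assuming the 2.x-era Python driver's `TypeDB.cloud_driver` entry point and `TypeDBCredential` (entry-point names have varied between driver generations, and the addresses, database name, and schema below are placeholders, so check the documentation for your driver version), connecting to a cluster looks roughly like this:

[source,python]
----
from typedb.driver import TypeDB, TypeDBCredential, SessionType, TransactionType

# Placeholder addresses for a three-node cluster; replace with your own.
ADDRESSES = [
    "node1.example.com:1729",
    "node2.example.com:1729",
    "node3.example.com:1729",
]

# The driver receives every node's address and discovers leader/follower roles itself.
credential = TypeDBCredential("admin", "password", tls_enabled=True)

with TypeDB.cloud_driver(ADDRESSES, credential) as driver:
    with driver.session("my-database", SessionType.DATA) as session:
        # Write transactions are routed to the current leader automatically.
        with session.transaction(TransactionType.WRITE) as tx:
            tx.query.insert('insert $p isa person, has name "Alice";')  # assumes a 'person' type in the schema
            tx.commit()
        # Read transactions can be served by any node holding a replica.
        with session.transaction(TransactionType.READ) as tx:
            people = list(tx.query.get("match $p isa person; get $p;"))
            print(len(people), "people found")
----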

== Consistency and Durability in a Cluster

TypeDB's replication model, managed by RAFT, provides strong consistency guarantees. When a client sends a write transaction to the leader, the following steps ensure its durability and consistency:

- The leader appends the transaction to its internal, on-disk log.

- The leader sends this new log entry to all follower nodes.

- The leader waits until a quorum (a majority of the nodes in the cluster, including itself) has acknowledged that they have successfully written the entry to their own logs.

- Only after reaching this quorum does the leader apply the transaction to its state machine and confirm the commit to the client.

This process guarantees that once a transaction is committed, it is durably stored on a majority of the cluster's nodes and will survive the failure of any minority of nodes. This ensures that the database remains in a consistent state, and no committed data is ever lost.
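
To make this commit flow concrete, here is a minimal, self-contained sketch of a RAFT-style quorum commit. It models the steps above in plain Python and is not TypeDB's actual implementation; the node names and log representation are invented for illustration.

[source,python]
----
from dataclasses import dataclass, field

@dataclass
class Node:
    """A cluster member with its own durable log (modelled as a simple list)."""
    name: str
    log: list = field(default_factory=list)

    def append(self, entry: str) -> bool:
        self.log.append(entry)  # in a real system this would be an fsync'd disk write
        return True             # acknowledgement returned to the leader

def commit(leader: Node, followers: list, entry: str) -> bool:
    cluster_size = 1 + len(followers)
    quorum = cluster_size // 2 + 1

    # Step 1: the leader appends the entry to its own on-disk log.
    acks = 1 if leader.append(entry) else 0

    # Steps 2-3: send the log entry to every follower and count acknowledgements.
    for follower in followers:
        if follower.append(entry):
            acks += 1

    # Step 4: only once a quorum has acknowledged the entry is the transaction
    # applied and the commit confirmed to the client.
    return acks >= quorum

leader = Node("node1")
followers = [Node("node2"), Node("node3")]
print(commit(leader, followers, "tx-001"))  # True: 3 acknowledgements >= quorum of 2
----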
