= Horizontal Scaling

When a single server is insufficient to meet the uptime and reliability requirements of a production application, it becomes necessary to scale out. In TypeDB, this is achieved by creating a database cluster that provides high availability and fault tolerance through data replication. This chapter explains the architecture and mechanics of a TypeDB cluster, with a focus on how replication enables fault tolerance and how transactions are processed across different nodes.

== Introduction to High Availability

For many applications, continuous availability is critical. A single server represents a single point of failure; if that server goes down due to a hardware failure or network issue, or is overwhelmed by the volume of requests, the application becomes unavailable. A TypeDB cluster mitigates this risk by deploying the database across multiple servers, or nodes.

TypeDB achieves high availability and fault tolerance through data replication. Replication ensures that every node in the cluster maintains a complete copy of the entire database. This redundancy means that if one node fails, the other nodes can continue to serve requests without interruption or data loss, ensuring the database remains online.

A TypeDB cluster operates on a leader-follower model, which is managed by the RAFT consensus algorithm. This is the core technology that makes the cluster fault-tolerant and consistent.

- *Leader Node:* at any given time, the cluster elects a single node as the leader _separately for each database_. The leader is exclusively responsible for processing all schema and data writes. This design centralizes writes, which simplifies consistency and eliminates the complexities of distributed transactions that would arise if data were partitioned.

- *Follower Nodes:* all other nodes in the cluster act as followers. They passively receive a stream of committed transactions from the leader's log and apply them to their own local copy of the database. This keeps them in sync with the leader.

- *Leader Election:* if the leader node fails or becomes unreachable, the RAFT algorithm automatically initiates a new election among the remaining follower nodes. A new leader is chosen from the followers that have the most up-to-date log, and the cluster can resume write operations with minimal downtime, typically within seconds. The election rule is sketched in code after this list.

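The following is a minimal, self-contained sketch of the election rule described above. It is not TypeDB's implementation: the node names, log indexes, and `elect_leader` helper are hypothetical and only illustrate how a reachable majority picks the candidate with the most up-to-date log.

[source,python]
----
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    term: int               # last term recorded in this node's log
    last_log_index: int     # how far this node's log has advanced
    reachable: bool = True  # False once the node has failed or is partitioned

def elect_leader(nodes: list[Node]) -> Node | None:
    """Hypothetical illustration: a leader can only be elected if a majority
    of the cluster is reachable, and the most up-to-date log wins."""
    reachable = [n for n in nodes if n.reachable]
    quorum = len(nodes) // 2 + 1
    if len(reachable) < quorum:
        return None  # no majority, so no new leader can be elected
    return max(reachable, key=lambda n: (n.term, n.last_log_index))

cluster = [
    Node("node-1", term=4, last_log_index=120, reachable=False),  # failed leader
    Node("node-2", term=4, last_log_index=120),
    Node("node-3", term=4, last_log_index=118),
]

new_leader = elect_leader(cluster)
print(new_leader.name if new_leader else "no quorum")  # -> node-2
----

A real election also involves randomized timeouts and vote requests between nodes; the sketch only captures the "most up-to-date log among a reachable majority" rule.
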
Since all write operations must go through a single leader, the write throughput of the cluster is equivalent to the write throughput of a single node. To scale write performance, you must scale the leader node vertically (i.e., provide it with more powerful hardware).

Read performance, however, can be scaled horizontally. Because every node in the cluster holds a complete copy of the data, read-only transactions can be executed on any node, whether it's the leader or a follower. By directing read queries to follower nodes, you can distribute the read load across the entire cluster. This allows the system to handle a much higher volume of concurrent read requests than a single server could, significantly improving read scalability.

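As a rough illustration of this asymmetry, the sketch below sends hypothetical write operations to a single leader while spreading reads round-robin over every node. The node addresses and operations are made up for the example; real routing is handled by the client driver, as described in the next section.

[source,python]
----
from itertools import cycle

# Hypothetical cluster: every node holds a full replica of the data.
nodes = ["node-1:1729", "node-2:1729", "node-3:1729"]
leader = nodes[0]            # writes are only ever routed here
read_targets = cycle(nodes)  # reads may be served by any replica

def route(operation: str) -> str:
    """Return the node that should serve this operation."""
    if operation == "write":
        return leader
    return next(read_targets)

for op in ["read", "read", "write", "read", "read", "write"]:
    print(f"{op:>5} -> {route(op)}")
# Writes always hit node-1; reads rotate across all three nodes, so read
# capacity grows with cluster size while write capacity does not.
----
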
== Interacting with a Cluster

Interacting with a cluster is very similar to interacting with a single server. The key difference is that the client driver must be configured with the network addresses of all nodes in the cluster.

The driver uses this list to intelligently manage connections. It automatically discovers which node is the current leader for the database and routes all write transactions to it. For read transactions, the driver can be configured to distribute the load across all available nodes (both leader and followers), effectively using the entire cluster's capacity for reads. This routing is handled internally, so your application code for opening sessions and running transactions remains the same whether you are connecting to a single node or a full cluster.

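As a sketch of what this looks like in practice, the snippet below assumes the TypeDB Python driver with a `TypeDB.cloud_driver` entry point that accepts a list of node addresses and credentials. The addresses, credentials, database name, and queries are placeholders, and exact class and method names vary between driver versions, so treat this as an outline rather than a verified snippet.

[source,python]
----
# Sketch only: assumes the Python driver's cloud/cluster entry point;
# addresses, credentials, database name, and queries are placeholders.
from typedb.driver import TypeDB, TypeDBCredential, SessionType, TransactionType

ADDRESSES = ["node-1:1729", "node-2:1729", "node-3:1729"]  # all cluster nodes
credential = TypeDBCredential("admin", "password", tls_enabled=True)

with TypeDB.cloud_driver(ADDRESSES, credential) as driver:
    with driver.session("my-database", SessionType.DATA) as session:
        # Write transactions are routed to the current leader by the driver.
        with session.transaction(TransactionType.WRITE) as tx:
            tx.query.insert('insert $p isa person, has name "Alice";')
            tx.commit()
        # Read transactions may be served by any replica.
        with session.transaction(TransactionType.READ) as tx:
            answers = tx.query.get("match $p isa person; get $p;")
            print(sum(1 for _ in answers))
----

Because leader discovery and failover are handled inside the driver, this code does not change if a node goes down and a new leader is elected.
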
== Consistency and Durability in a Cluster

TypeDB's replication model, managed by RAFT, provides strong consistency guarantees. When a client sends a write transaction to the leader, the following steps ensure its durability and consistency (a simplified sketch of the quorum logic appears after the list):

- The leader appends the transaction to its internal, on-disk log.

- The leader sends this new log entry to all follower nodes.

- The leader waits until a quorum (a majority of the nodes in the cluster, including itself) has acknowledged that they have successfully written the entry to their own logs.

- Only after reaching this quorum does the leader apply the transaction to its state machine and confirm the commit to the client.

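The following sketch illustrates the quorum rule in these steps. The `commit_with_quorum` function and its acknowledgement counting are a hypothetical illustration of majority-based commit, not TypeDB's actual replication code.

[source,python]
----
def commit_with_quorum(entry: str, followers: list[str], ack) -> bool:
    """Hypothetical sketch of a RAFT-style commit: count the leader's own log
    write plus follower acknowledgements, and commit once a strict majority of
    the whole cluster (leader + followers) has persisted the entry."""
    cluster_size = len(followers) + 1  # followers plus the leader itself
    quorum = cluster_size // 2 + 1     # strict majority
    acks = 1                           # the leader's own on-disk log write
    for follower in followers:
        if ack(follower, entry):       # follower persisted the entry to its log
            acks += 1
        if acks >= quorum:
            return True                # safe to apply and confirm to the client
    return False                       # no quorum; the commit cannot be confirmed

# Example: a 5-node cluster with one follower down still reaches quorum (3 of 5).
alive = {"node-2", "node-3", "node-5"}
ok = commit_with_quorum(
    "insert $p isa person;",
    followers=["node-2", "node-3", "node-4", "node-5"],
    ack=lambda node, entry: node in alive,
)
print(ok)  # -> True
----
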
This process guarantees that once a transaction is committed, it is durably stored on a majority of the cluster's nodes and will survive the failure of any minority of nodes: for example, a five-node cluster tolerates the loss of any two nodes. This ensures that the database remains in a consistent state, and no committed data is ever lost.