From 9a5af7726f46edf209eaff1abc6e043d1751176d Mon Sep 17 00:00:00 2001
From: Andi Skrgat
Date: Wed, 23 Jul 2025 11:24:22 +0200
Subject: [PATCH 1/2] Add docs to 2PC

---
 pages/clustering/high-availability.mdx | 186 +++++++++++++------------
 pages/clustering/replication.mdx       |  48 +++++--
 2 files changed, 134 insertions(+), 100 deletions(-)

diff --git a/pages/clustering/high-availability.mdx b/pages/clustering/high-availability.mdx
index 3629f2c7..3b5e518d 100644
--- a/pages/clustering/high-availability.mdx
+++ b/pages/clustering/high-availability.mdx
@@ -12,12 +12,12 @@ import {CommunityLinks} from '/components/social-card/CommunityLinks'

 A cluster is considered highly available if, at any point, there is some instance that can respond to a
 user query. Our high availability relies on replication. The cluster consists of:
-- The MAIN instance on which the user can execute write queries
-- REPLICA instances that can only respond to read queries
+- The main instance on which the user can execute write queries
+- Replica instances that can only respond to read queries
 - COORDINATOR instances that manage the cluster state.

 Depending on how configuration flags are set, Memgraph can run as a data instance or coordinator instance.
-The coordinator instance is a new addition to enable the high availability feature and orchestrates data instances to ensure that there is always one MAIN instance in the cluster.
+The coordinator instance is a new addition to enable the high availability feature and orchestrates data instances to ensure that there is always one main instance in the cluster.

 ## Cluster management

@@ -25,10 +25,10 @@ For achieving high availability, Memgraph uses Raft consensus protocol, which
 is a significant advantage that it is much easier to understand. It's important to say that Raft isn't a Byzantine fault-tolerant algorithm. You can learn
 more about Raft in the paper [In Search of an Understandable Consensus Algorithm](https://raft.github.io/raft.pdf).

-Typical Memgraph's highly available cluster consists of 3 data instances (1 MAIN and 2 REPLICAS) and 3 coordinator instances backed up by Raft protocol.
+A typical Memgraph highly available cluster consists of 3 data instances (1 main and 2 replicas) and 3 coordinator instances backed up by the Raft protocol.
 Users can create more than 3 coordinators, but the replication factor (RF) of 3 is a de facto standard in distributed databases.

-One coordinator instance is the leader whose job is to always ensure one writeable data instance (MAIN). The other two coordinator instances replicate
+One coordinator instance is the leader whose job is to always ensure one writeable data instance (main). The other two coordinator instances replicate
 changes the leader coordinator did in its own Raft log. Operations saved into the Raft log are those that are related to cluster management.
 Memgraph doesn't have its implementation of the Raft protocol. For this task, Memgraph uses an industry-proven library [NuRaft](https://github.com/eBay/NuRaft).

@@ -37,7 +37,7 @@ You can start the coordinator instance by specifying `--coordinator-id`,
 queries related to high availability, so you cannot execute any data-oriented query on it. The coordinator port is used for the Raft protocol, which
 all coordinators use to ensure the consistency of the cluster's state. Data instances are distinguished from coordinator instances by specifying only
 `--management-port` flag. This port is used for RPC network communication between the coordinator and data
-instances. 
When started by default, the data instance is MAIN. The coordinator will ensure that no data inconsistency can happen during and after the instance's +instances. When started by default, the data instance is main. The coordinator will ensure that no data inconsistency can happen during and after the instance's restart. Once all instances are started, the user can start adding data instances to the cluster. @@ -70,19 +70,19 @@ but from the availability perspective, it is better to separate them physically. ## Bolt+routing -Directly connecting to the MAIN instance isn't preferred in the HA cluster since the MAIN instance changes due to various failures. Because of that, users +Directly connecting to the main instance isn't preferred in the HA cluster since the main instance changes due to various failures. Because of that, users can use bolt+routing so that write queries can always be sent to the correct data instance. This will prevent a split-brain issue since clients, when writing, won't be routed to the old main but rather to the new main instance on which failover got performed. This protocol works in a way that the client first sends a ROUTE bolt message to any coordinator instance. The coordinator replies to the message by returning the routing table with three entries specifying -from which instance can the data be read, to which instance data can be written to and which instances behave as routers. In the Memgraph HA cluster, the MAIN -data instance is the only writeable instance, REPLICAs are readable instances, and COORDINATORs behave as routers. However, the cluster can be configured in such a way -that MAIN can also be used for reading. Check this [paragraph](#setting-config-for-highly-available-cluster) for more info. +from which instance can the data be read, to which instance data can be written to and which instances behave as routers. In the Memgraph HA cluster, the main +data instance is the only writeable instance, replicas are readable instances, and COORDINATORs behave as routers. However, the cluster can be configured in such a way +that main can also be used for reading. Check this [paragraph](#setting-config-for-highly-available-cluster) for more info. Bolt+routing is the client-side routing protocol, meaning network endpoint resolution happens inside drivers. For more details about the Bolt messages involved in the communication, check [the following link](https://neo4j.com/docs/bolt/current/bolt/message/#messages-route). Users only need to change the scheme they use for connecting to coordinators. This means instead of using `bolt://,` you should -use `neo4j://` to get an active connection to the current MAIN instance in the cluster. You can find examples of how to +use `neo4j://` to get an active connection to the current main instance in the cluster. You can find examples of how to use bolt+routing in different programming languages [here](https://github.com/memgraph/memgraph/tree/master/tests/drivers). It is important to note that setting up the cluster on one coordinator (registration of data instances and coordinators, setting main) must be done using bolt connection @@ -217,17 +217,17 @@ Registering instances should be done on a single coordinator. The chosen coordin Register instance query will result in several actions: 1. The coordinator instance will connect to the data instance on the `management_server` network address. 2. 
The coordinator instance will start pinging the data instance every `--instance-health-check-frequency-sec` seconds to check its status.
-3. Data instance will be demoted from MAIN to REPLICA.
+3. Data instance will be demoted from main to replica.
 4. Data instance will start the replication server on `replication_server`.

 ```plaintext
-REGISTER INSTANCE instanceName ( AS ASYNC ) WITH CONFIG {"bolt_server": boltServer, "management_server": managementServer, "replication_server": replicationServer};
+REGISTER INSTANCE instanceName ( AS ASYNC | AS STRICT_SYNC ) ? WITH CONFIG {"bolt_server": boltServer, "management_server": managementServer, "replication_server": replicationServer};
 ```

 This operation will result in writing to the Raft log.

-In case the MAIN instance already exists in the cluster, a replica instance will be automatically connected to the MAIN. You can specify whether the replica should behave
-synchronously or asynchronously by using `AS ASYNC` construct after `instanceName`.
+In case the main instance already exists in the cluster, a replica instance will be automatically connected to the main. The `AS ASYNC | AS STRICT_SYNC` construct specifies the
+instance's replication mode when the instance acts as a replica.

 ### Add coordinator instance

@@ -262,23 +262,23 @@
 REMOVE COORDINATOR ;
 ```

-### Set instance to MAIN
+### Set instance to main

-Once all data instances are registered, one data instance should be promoted to MAIN. This can be achieved by using the following query:
+Once all data instances are registered, one data instance should be promoted to main. This can be achieved by using the following query:

 ```plaintext
-SET INSTANCE instanceName to MAIN;
+SET INSTANCE instanceName to main;
 ```

-This query will register all other instances as REPLICAs to the new MAIN. If one of the instances is unavailable, setting the instance to MAIN will not succeed.
-If there is already a MAIN instance in the cluster, this query will fail.
+This query will register all other instances as replicas to the new main. If one of the instances is unavailable, setting the instance to main will not succeed.
+If there is already a main instance in the cluster, this query will fail.

 This operation will result in writing to the Raft log.

 ### Demote instance

 Demote instance query can be used by an admin to demote the current main to replica. In this case, the leader coordinator won't perform a failover, but as a user,
-you should choose promote one of the data instances to MAIN using the `SET INSTANCE `instance` TO MAIN` query.
+you should promote one of the data instances to main using the `SET INSTANCE instanceName TO main` query.

 ```plaintext
 DEMOTE INSTANCE instanceName;
@@ -288,14 +288,13 @@ This operation will result in writing to the Raft log.

-By combining the functionalities of queries `DEMOTE INSTANCE instanceName` and `SET INSTANCE instanceName TO MAIN` you get the manual failover capability. This can be useful
-e.g during a maintenance work on the instance where the current MAIN is deployed.
+By combining the functionalities of the queries `DEMOTE INSTANCE instanceName` and `SET INSTANCE instanceName TO main`, you get the manual failover capability. This can be useful,
+e.g., during maintenance work on the instance where the current main is deployed.

-
 ### Unregister instance

 There are various reasons which could lead to the decision that an instance needs to be removed from the cluster. 
The hardware can be broken,
@@ -306,21 +305,21 @@ UNREGISTER INSTANCE instanceName;
 ```

 When unregistering an instance, ensure that the instance being unregistered is
-**not** the MAIN instance. Unregistering MAIN can lead to an inconsistent
-cluster state. Additionally, the cluster must have an **alive** MAIN instance
-during the unregistration process. If no MAIN instance is available, the
+**not** the main instance. Unregistering main can lead to an inconsistent
+cluster state. Additionally, the cluster must have an **alive** main instance
+during the unregistration process. If no main instance is available, the
 operation cannot be guaranteed to succeed.

-The instance requested to be unregistered will also be unregistered from the current MAIN's REPLICA set.
+The instance requested to be unregistered will also be unregistered from the current main's replica set.

 ### Force reset cluster state

 In case the cluster gets stuck there is an option to do the force reset of the cluster. You need to execute a command on the leader coordinator.
 This command will result in the following actions:

-1. The coordinator instance will demote each alive instance to REPLICA.
-2. From the alive instance it will choose a new MAIN instance.
-3. Instances that are down will be demoted to REPLICAs once they come back up.
+1. The coordinator instance will demote each alive instance to replica.
+2. From the alive instances, it will choose a new main instance.
+3. Instances that are down will be demoted to replicas once they come back up.

 ```plaintext
 FORCE RESET CLUSTER STATE;
@@ -334,7 +333,7 @@ You can check the state of the whole cluster using the `SHOW INSTANCES` query. T
 each server you can see the following information:
 1. Network endpoints they are using for managing cluster state
 2. Health state of server
- 3. Role - MAIN, REPLICA, LEADER, FOLLOWER or unknown if not alive
+ 3. Role - main, replica, LEADER, FOLLOWER or unknown if not alive
 4. The time passed since the last response time to the leader's health ping

 This query can be run on either the leader or followers. Since only the leader knows the exact status of the health state and last response time,
@@ -439,13 +438,14 @@ for which the timeout is used is the following:
 - TimestampReq -> main sending to replica
 - SystemHeartbeatReq -> main sending to replica
 - ForceResetStorageReq -> main sending to replica. The timeout is set to 60s.
-- SystemRecoveryReq -> main sending to replica. The timeout set to 5s.
+- SystemRecoveryReq -> main sending to replica. The timeout is set to 5s.
+- FinalizeCommitReq -> main sending to replica. The timeout is set to 10s.

-For replication-related RPC messages — AppendDeltasRpc, CurrentWalRpc, and
+For RPC messages that send a variable number of storage deltas — PrepareCommitRpc, CurrentWalRpc, and
 WalFilesRpc — it is not practical to set a strict execution timeout. The
-processing time on the replica side is directly proportional to the amount of
-data being transferred. To handle this, the replica sends periodic progress
+processing time on the replica side is directly proportional to the number of
+deltas being transferred. To handle this, the replica sends periodic progress
 updates to the main instance after processing every 100,000 deltas. Since
 processing 100,000 deltas is expected to take a relatively consistent amount of
 time, we can enforce a timeout based on this interval. 
The default timeout for
 these RPC messages is 30 seconds, though in practice, processing 100,000 deltas
 typically takes less than 3 seconds.

 SnapshotRpc is also a replication-related RPC message, but its execution time
-is tracked differently. The replica sends an update to the main instance after
+is tracked a bit differently from the RPC messages that ship deltas. The replica sends an update to the main instance after
 completing 1,000,000 units of work. The work units are assigned as follows:

 - Processing nodes, edges, or indexed entities (label index, label-property index,
@@ -483,93 +483,93 @@ a multiplier of `--instance-health-check-frequency-sec`. Set the multiplier coef
 For example, set `--instance-down-timeout-sec=5` and `--instance-health-check-frequency-sec=1` which will result in coordinator contacting each instance every second and the instance
 is considered dead after it doesn't respond 5 times (5 seconds / 1 second).

-In case a REPLICA doesn't respond to a health check, the leader coordinator will try to contact it again every `--instance-health-check-frequency-sec`.
-When the REPLICA instance rejoins the cluster (comes back up), it always rejoins as REPLICA. For MAIN instance, there are two options.
-If it is down for less than `--instance-down-timeout-sec`, it will rejoin as MAIN because it is still considered alive. If it is down for more than `--instance-down-timeout-sec`,
-the failover procedure is initiated. Whether MAIN will rejoin as MAIN depends on the success of the failover procedure. If the failover procedure succeeds, now old MAIN
-will rejoin as REPLICA. If failover doesn't succeed, MAIN will rejoin as MAIN once it comes back up.
+In case a replica doesn't respond to a health check, the leader coordinator will try to contact it again every `--instance-health-check-frequency-sec`.
+When the replica instance rejoins the cluster (comes back up), it always rejoins as a replica. For the main instance, there are two options.
+If it is down for less than `--instance-down-timeout-sec`, it will rejoin as main because it is still considered alive. If it is down for more than `--instance-down-timeout-sec`,
+the failover procedure is initiated. Whether the main will rejoin as main depends on the success of the failover procedure. If the failover procedure succeeds, the old main
+will rejoin as a replica. If failover doesn't succeed, the main will rejoin as main once it comes back up.

 ### Failover procedure - high level description

-From alive REPLICAs coordinator chooses a new potential MAIN.
-This instance is only potentially new MAIN as the failover procedure can still fail due to various factors (networking issues, promote to MAIN fails, any alive REPLICA failing to
-accept an RPC message, etc). The coordinator sends an RPC request to the potential new MAIN, which is still in REPLICA state, to promote itself to the MAIN instance with info
-about other REPLICAs to which it will replicate data. Once that request succeeds, the new MAIN can start replication to the other instances and accept write queries.
-
+From the alive replicas, the coordinator chooses a new potential main and writes a log entry about the new main to Raft storage. On the next leader's ping to that instance,
+the leader sends an RPC request to that instance, which is still in the replica state, instructing it to promote itself to main, together with info
+about the other replicas to which it will replicate data. Once that request succeeds, the new main can start replication to the other instances and accept write queries. 
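+
+As an illustration of what this means for clients, the following minimal Python sketch (the coordinator address, empty credentials, and query are assumptions, not cluster defaults) connects through a coordinator with the `neo4j://` scheme, so a write retried around the moment of failover is routed to the newly promoted main:
+
+```python
+from neo4j import GraphDatabase
+
+# Assumption: one of the coordinators exposes its Bolt server on localhost:7690
+# and authentication is disabled. Adjust both for your own deployment.
+URI = "neo4j://localhost:7690"
+
+def create_node(tx, node_id):
+    # A plain write query; the routing table sends it to the current main.
+    tx.run("CREATE (:Node {id: $id})", id=node_id)
+
+with GraphDatabase.driver(URI, auth=("", "")) as driver:
+    with driver.session() as session:
+        # Managed transactions are retried on transient routing errors, so a
+        # write issued during failover is re-attempted against the new main.
+        session.execute_write(create_node, node_id=1)
+```
+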
-### Choosing new MAIN from available REPLICAs +### Choosing new main from available replicas -When failover is happening, some REPLICAs can also be down. From the list of alive REPLICAs, a new MAIN is chosen. First, the leader coordinator contacts each alive REPLICA +When failover is happening, some replicas can also be down. From the list of alive replicas, a new main is chosen. First, the leader coordinator contacts each alive replica to get info about each database's last commit timestamp. In the case of enabled multi-tenancy, from each instance coordinator will get info on all databases and their last commit -timestamp. Currently, the coordinator chooses an instance to become a new MAIN by comparing the latest commit timestamps of all databases. The instance which is newest on most -databases is considered the best candidate for the new MAIN. If there are multiple instances which have the same number of newest databases, we sum timestamps of all databases +timestamp. Currently, the coordinator chooses an instance to become a new main by comparing the latest commit timestamps of all databases. The instance which is newest on most +databases is considered the best candidate for the new main. If there are multiple instances which have the same number of newest databases, we sum timestamps of all databases and consider instance with a larger sum as the better candidate. -### Old MAIN rejoining to the cluster +### Old main rejoining to the cluster -Once the old MAIN gets back up, the coordinator sends an RPC request to demote the old MAIN to REPLICA. The coordinator tracks at all times which instance was the last MAIN. +Once the old main gets back up, the coordinator sends an RPC request to demote the old main to replica. The coordinator tracks at all times which instance was the last main. -The leader coordinator sends two RPC requests in the given order to demote old MAIN to REPLICA: -1. Demote MAIN to REPLICA RPC request -2. A request to store the UUID of the current MAIN, which the old MAIN, now acting as a REPLICA instance, must listen to. +The leader coordinator sends two RPC requests in the given order to demote old main to replica: +1. Demote main to replica RPC request +2. A request to store the UUID of the current main, which the old main, now acting as a replica instance, must listen to. -### How REPLICA knows which MAIN to listen +### How replica knows which main to listen -Each REPLICA has a UUID of MAIN it listens to. If a network partition happens where MAIN can talk to a REPLICA but the coordinator can't talk to the MAIN, from the coordinator's -point of view that MAIN is down. From REPLICA's point of view, the MAIN instance is still alive. The coordinator will start the failover procedure, and we can end up with multiple MAINs -where REPLICAs can listen to both MAINs. To prevent such an issue, each REPLICA gets a new UUID that no current MAIN has. The coordinator generates the new UUID, -which the new MAIN will get to use on its promotion to MAIN. +Each replica has a UUID of main it listens to. If a network partition happens where main can talk to a replica but the coordinator can't talk to the main, from the coordinator's +point of view that main is down. From replica's point of view, the main instance is still alive. The coordinator will start the failover procedure, and we can end up with multiple mains +where replicas can listen to both mains. To prevent such an issue, each replica gets a new UUID that no current main has. 
The coordinator generates the new UUID,
+which the new main will get to use on its promotion to main.

-If REPLICA was down at one point, MAIN could have changed. When REPLICA gets back up, it doesn't listen to any MAIN until the coordinator sends an RPC request to REPLICA to start
-listening to MAIN with the given UUID.
+If a replica was down at one point, the main could have changed. When the replica gets back up, it doesn't listen to any main until the coordinator sends an RPC request to the replica to start
+listening to the main with the given UUID.

 ### Replication concerns

 #### Force sync of data

 During a failover event, Memgraph selects the most up-to-date, alive instance to
-become the new MAIN. The selection process works as follows:
-1. From the list of available REPLICA instances, Memgraph chooses the one with
+become the new main. The selection process works as follows:
+1. From the list of available replica instances, Memgraph chooses the one with
 the latest commit timestamp for the default database.
 2. If an instance that had more recent data was down during this selection
-process, it will not be considered for promotion to MAIN.
+process, it will not be considered for promotion to main.

 If a previously down instance had more up-to-date data but was unavailable
 during failover, it will go through a specific recovery process upon rejoining
 the cluster:
-- The new MAIN will clear the returning replica’s storage.
-- The returning replica will then receive all commits from the new MAIN to
+- The replica will reset its storage.
+- The replica will receive all commits from the new main to
 synchronize its state.
 - The replica's old durability files will be preserved in a `.old` directory
 in `data_directory/snapshots` and `data_directory/wal` folders, allowing
 admins to manually recover data if needed.

-Memgraph prioritizes availability over strict consistency (leaning towards AP in
-the CAP theorem). While it aims to maintain consistency as much as possible, the
-current failover logic can result in a non-zero Recovery Point Objective (RPO),
-that is, the loss of committed data, because:
-- The promoted MAIN might not have received all commits from the previous MAIN
+Depending on the replication mode used, there are different levels of data loss
+that can happen upon failover. With the default `SYNC` replication mode,
+Memgraph prioritizes availability over strict consistency, which can result in
+a non-zero Recovery Point Objective (RPO), that is, the loss of committed data, because:
+- The promoted main might not have received all commits from the previous main
 before the failure.
-- This design ensures that the MAIN remains writable for the maximum possible
+- This design ensures that the main remains writable for the maximum possible
 time.

-If your environment requires strong consistency and can tolerate write
-unavailability, [reach out to
-us](https://github.com/memgraph/memgraph/discussions). We are actively exploring
-support for a fully synchronous mode.
+With the `ASYNC` replication mode, you also risk losing some data upon failover because the
+main can freely continue committing regardless of the status of ASYNC replicas.
+
+The `STRICT_SYNC` replication mode allows users to experience a failover without any data loss
+in all situations. It comes with reduced throughput because of the cost of running the two-phase commit protocol.

 ## Actions on follower coordinators

-From follower coordinators you can only execute `SHOW INSTANCES`. 
Registration of data instance, unregistration of data instances, demoting instance, setting instance to MAIN and +From follower coordinators you can only execute `SHOW INSTANCES`. Registration of data instance, unregistration of data instances, demoting instance, setting instance to main and force resetting cluster state are all disabled. ## Instances' restart ### Data instances' restart -Data instances can fail both as MAIN and as REPLICA. When an instance that was REPLICA comes back, it won't accept updates from any instance until the coordinator updates its -responsible peer. This should happen automatically when the coordinator's ping to the instance passes. When the MAIN instance comes back, any writing to the MAIN instance will be + +Data instances can fail both as main and as replica. When an instance that was replica comes back, it won't accept updates from any instance until the coordinator updates its +responsible peer. This should happen automatically when the coordinator's ping to the instance passes. When the main instance comes back, any writing to the main instance will be forbidden until a ping from the coordinator passes. ### Coordinator instances restart @@ -612,7 +612,7 @@ It will also recover the following server config information: The following information will be recovered from a common RocksDB `logs` instance: - current version of `logs` durability store - snapshots found with `snapshot_id_` prefix in database: - - coordinator cluster state - all data instances with their role (MAIN or REPLICA), all coordinator instances and UUID of MAIN instance which REPLICA is listening to + - coordinator cluster state - all data instances with their role (main or replica), all coordinator instances and UUID of main instance which replica is listening to - last log idx - last log term - last cluster config @@ -645,12 +645,16 @@ Raft is a quorum-based protocol and it needs a majority of instances alive in or the cluster stays available. With 2+ coordinator instances down (in a cluster with RF = 3), the RTO depends on the time needed for instances to come back. -Failure of REPLICA data instance isn't very harmful since users can continue writing to MAIN data instance while reading from MAIN or other -REPLICAs. The most important thing to analyze is what happens when MAIN gets down. In that case, the leader coordinator uses +Depending on the replica's replication mode, its failure can lead to different situations. If the replica was registered with STRICT_SYNC mode, then on its failure, writing +on main will be disabled. On the other hand, if replica was registered as ASYNC or SYNC, further writes on main are still allowed. In both cases, reads are still allowed from +main and other replicas. + + +The most important thing to analyze is what happens when main gets down. In that case, the leader coordinator uses user-controllable parameters related to the frequency of health checks from the leader to replication instances (`--instance-health-check-frequency-sec`) -and the time needed to realize the instance is down (`--instance-down-timeout-sec`). After collecting enough evidence, the leader concludes the MAIN is down and performs failover +and the time needed to realize the instance is down (`--instance-down-timeout-sec`). After collecting enough evidence, the leader concludes the main is down and performs failover using just a handful of RPC messages (correct time depends on the distance between instances). 
It is important to mention that the whole failover is performed without the loss of committed data -if the newly chosen MAIN (previously REPLICA) had all up-to-date data. +if the newly chosen main (previously replica) had all up-to-date data. ## Raft configuration parameters @@ -695,7 +699,7 @@ ADD COORDINATOR 3 WITH CONFIG {"bolt_server": "localhost:7693", "coordinator_ser REGISTER INSTANCE instance_1 WITH CONFIG {"bolt_server": "localhost:7687", "management_server": "instance1:13011", "replication_server": "instance1:10001"}; REGISTER INSTANCE instance_2 WITH CONFIG {"bolt_server": "localhost:7688", "management_server": "instance2:13012", "replication_server": "instance2:10002"}; REGISTER INSTANCE instance_3 WITH CONFIG {"bolt_server": "localhost:7689", "management_server": "instance3:13013", "replication_server": "instance3:10003"}; -SET INSTANCE instance_3 TO MAIN; +SET INSTANCE instance_3 TO main; ``` @@ -898,10 +902,10 @@ REGISTER INSTANCE instance_2 WITH CONFIG {"bolt_server": "localhost:7688", "mana REGISTER INSTANCE instance_3 WITH CONFIG {"bolt_server": "localhost:7689", "management_server": "localhost:13013", "replication_server": "localhost:10003"}; ``` -4. Set instance_3 as MAIN: +4. Set instance_3 as main: ```plaintext -SET INSTANCE instance_3 TO MAIN; +SET INSTANCE instance_3 TO main; ``` 5. Connect to the leader coordinator and check cluster state with `SHOW INSTANCES`; @@ -917,8 +921,8 @@ SET INSTANCE instance_3 TO MAIN; ### Check automatic failover -Let's say that the current MAIN instance is down for some reason. After `--instance-down-timeout-sec` seconds, the coordinator will realize -that and automatically promote the first alive REPLICA to become the new MAIN. The output of running `SHOW INSTANCES` on the leader coordinator could then look like: +Let's say that the current main instance is down for some reason. After `--instance-down-timeout-sec` seconds, the coordinator will realize +that and automatically promote the first alive replica to become the new main. The output of running `SHOW INSTANCES` on the leader coordinator could then look like: | name | bolt_server | coordinator_server | management_server | health | role | last_succ_resp_ms | | ------------- | -------------- | ------------------ | ----------------- | ------ | -------- | ------------------| diff --git a/pages/clustering/replication.mdx b/pages/clustering/replication.mdx index efd1817b..9ce205d6 100644 --- a/pages/clustering/replication.mdx +++ b/pages/clustering/replication.mdx @@ -71,20 +71,29 @@ cluster. Once demoted to REPLICA instances, they will no longer accept write queries. In order to start the replication, each REPLICA instance needs to be registered -from the MAIN instance by setting a replication mode (SYNC or ASYNC) and +from the MAIN instance by setting a replication mode (SYNC, ASYNC or STRICT_SYNC) and specifying the REPLICA instance's socket address. The replication mode defines the terms by which the MAIN instance can commit the changes to the database, thus modifying the system to prioritize either consistency or availability: -- **SYNC** - After committing a transaction, the MAIN instance will communicate -the changes to all REPLICA instances running in SYNC mode and wait until it -receives a response or information that a timeout is reached. SYNC mode ensures + +- **STRICT_SYNC** - After committing a transaction, the MAIN instance will communicate +the changes to all REPLICA instances and wait until it +receives a response or information that a timeout is reached. 
The STRICT_SYNC mode ensures
 consistency and partition tolerance (CP), but not availability for writes. If
 the primary database has multiple replicas, the system is highly available for
 reads. But, when a replica fails, the MAIN instance can't process the write due
-to the nature of synchronous replication.
+to the nature of synchronous replication. It is implemented as a two-phase commit protocol.
+
+
+- **SYNC** - After committing a transaction, the MAIN instance will communicate
+the changes to all REPLICA instances and wait until it
+receives a response or information that a timeout is reached. It differs from the
+**STRICT_SYNC** mode in that the MAIN can continue committing even in situations
+when a **SYNC** replica is down.
+

 - **ASYNC** - The MAIN instance will commit a transaction without receiving
 confirmation from REPLICA instances that they have received the same
@@ -234,6 +243,14 @@ the following query:
 REGISTER REPLICA name ASYNC TO ;
 ```

+
+If you want to register a REPLICA instance with the STRICT_SYNC replication mode, run
+the following query:
+
+```plaintext
+REGISTER REPLICA name STRICT_SYNC TO ;
+```
+
 The socket address must be a string value as follows:

 ```plaintext
@@ -282,8 +299,7 @@ If you set REPLICA roles using port `10000`, you can define the socket address s

 When a REPLICA instance is registered, it will start replication in ASYNC
 mode until it synchronizes to the current state of the database. Upon
-synchronization, REPLICA instances will either continue working in the ASYNC
-mode or reset to SYNC mode.
+synchronization, REPLICA instances will continue working in the mode in which they were registered: ASYNC, STRICT_SYNC, or SYNC.

 ### Listing all registered REPLICA instances

@@ -493,10 +509,12 @@ accepts read and write queries to the database and REPLICA instances accept
 only read queries.

 The changes or state of the MAIN instance are replicated to the REPLICA
-instances in a SYNC or ASYNC mode. The SYNC mode ensures consistency and
+instances in a SYNC, STRICT_SYNC or ASYNC mode. The STRICT_SYNC mode ensures consistency and
 partition tolerance (CP), but not availability for writes. The ASYNC mode
 ensures system availability and partition tolerance (AP), while data can only be
-eventually consistent.
+eventually consistent. The SYNC mode sits in between: it waits
+for writes to be accepted on replicas, but the MAIN can still commit even in situations
+when one of the REPLICAs is down.

 By using the timestamp, the MAIN instance knows the current state of the
 REPLICA. If the REPLICA is not synchronized with the MAIN instance, the MAIN
@@ -552,6 +570,17 @@ SYNC REPLICA doesn't answer within the expected timeout.

 ![](/pages/clustering/replication/workflow_diagram_data_manipulation.drawio.png)

+
+#### STRICT_SYNC replication mode
+
+The STRICT_SYNC replication mode behaves very similarly to the
+SYNC mode, except that the MAIN won't commit a transaction locally if
+one of the STRICT_SYNC replicas is down. To achieve that, all instances run
+a two-phase commit protocol together, which provides this synchronization. This
+reduces throughput, but such a mode is very useful in a high-availability
+scenario in which failover is the most important operation to support. It
+allows a failover without the fear of experiencing data loss.
+
 #### ASYN replication mode

 In the ASYNC replication mode, the MAIN instance will commit a transaction
@@ -571,6 +600,7 @@ instance. ASYNC mode ensures system availability and partition tolerance. 
+

 ### Synchronizing instances

 By comparing timestamps, the MAIN instance knows when a REPLICA instance is not

From 8d78d63825057999f4d5128f5f3a86dbf7daa8bc Mon Sep 17 00:00:00 2001
From: Andi Skrgat
Date: Wed, 23 Jul 2025 15:36:14 +0200
Subject: [PATCH 2/2] Update types of cluster

---
 pages/clustering/high-availability.mdx | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/pages/clustering/high-availability.mdx b/pages/clustering/high-availability.mdx
index 3b5e518d..7cfb9da2 100644
--- a/pages/clustering/high-availability.mdx
+++ b/pages/clustering/high-availability.mdx
@@ -227,7 +227,8 @@ REGISTER INSTANCE instanceName ( AS ASYNC | AS STRICT_SYNC ) ? WITH CONFIG {"bol
 This operation will result in writing to the Raft log.

 In case the main instance already exists in the cluster, a replica instance will be automatically connected to the main. The `AS ASYNC | AS STRICT_SYNC` construct specifies the
-instance's replication mode when the instance acts as a replica.
+instance's replication mode when the instance acts as a replica. You can only have `STRICT_SYNC` and `ASYNC` or `SYNC` and `ASYNC` replicas together in the cluster. Combining `STRICT_SYNC`
+and `SYNC` replicas in the same cluster has no well-defined semantics, so it is forbidden.

 ### Add coordinator instance
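
To make the semantics of the `STRICT_SYNC` mode described above concrete, here is a minimal, self-contained Python sketch of the prepare/commit decision behind a two-phase commit protocol. It is only an illustration of the idea (the class, function, and instance names are made up for the example) and not Memgraph's implementation or API:

```python
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    mode: str    # "STRICT_SYNC", "SYNC", or "ASYNC"
    alive: bool

def can_finalize_commit(replicas: list[Replica]) -> bool:
    """Phase 1 (prepare): the main may finalize a commit only if every
    STRICT_SYNC replica acknowledged the prepared transaction. SYNC and
    ASYNC replicas never block the commit decision."""
    return all(r.alive for r in replicas if r.mode == "STRICT_SYNC")

replicas = [
    Replica("instance_1", "STRICT_SYNC", alive=True),
    Replica("instance_2", "ASYNC", alive=False),
]

if can_finalize_commit(replicas):
    print("Phase 2: finalize the commit on the main and on all prepared replicas.")
else:
    print("Abort: a STRICT_SYNC replica is down, so the main does not commit.")
```

If `instance_1` were down instead, `can_finalize_commit` would return `False` and the write would be rejected, which is the availability cost `STRICT_SYNC` pays in exchange for a failover without data loss.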