A C++20 distributed key-value store with Raft consensus across a 3-node
cluster. Implements leader election with randomized election timeouts, log
replication via AppendEntries, snapshot compaction via InstallSnapshot,
and a gRPC Put/Get/Delete client API.
The point of this project is to study the Raft consensus algorithm (Ongaro and Ousterhout, "In Search of an Understandable Consensus Algorithm", USENIX ATC 2014) by writing it from scratch in C++ and proving it survives randomized faults. The load-bearing correctness claim is a 500-scenario chaos suite that partitions, kills, restarts, and adds [10ms, 500ms] random per-RPC delays to nodes under continuous client load and asserts no committed Put is lost and all alive nodes end with byte-identical state. A property-test layer separately walks random op sequences and asserts the four Raft Figure-2 log invariants hold after every step.
- The five Figure-2 safety invariants of Raft, written out as a
RaftNodestate machine in C++. - Randomized chaos as a correctness check: 500 seeded scenarios across partition, node-loss, mixed, and per-RPC random-delay fault modes (see docs/chaos-methodology.md).
- Property tests on Raft Figure-2 log invariants (Election Safety,
Log Matching, Leader Append-Only, State Machine Safety) evaluated
after every step of a randomized op sequence
(
tests/unit/log_invariants_test.cpp). - Snapshot compaction trade-offs (see docs/snapshot-design.md).
- Dynamic membership changes via joint consensus (Section 6): the
C_old -> C_old,new -> C_newtransition with double-majority quorum (see the "Membership changes" section of ARCHITECTURE.md). - The cost of linearizable reads via heartbeat-majority leader confirmation (see docs/linearizability.md).
- The Figure-2 to file mapping (see docs/raft-spec-mapping.md).
The committed run is in bench/results/chaos_local.json.
This was a 500-scenario run with a 540 s wall-clock budget; the
budget cap was hit after 184 scenarios (mix is partition / kill_restart
/ mixed / 10-500 ms random per-RPC delay, with delay scenarios dominating
wall-clock cost). Every attempted scenario passed.
{
"planned_scenarios": 500,
"attempted_scenarios": 184,
"passed": 184,
"failed": 0,
"budget_seconds": 540,
"budget_hit": true,
"elapsed_ms": 540991,
"base_seed": 1
}Wall-clock honesty: 500 scenarios at this fault-kind mix takes roughly
24 minutes locally and would exceed the 10-minute CI ceiling. CI runs
under the 540 s budget; make chaos-full runs the full 500 with no
cap.
Reproduce locally:
make chaos-full # all 500, no cap
make chaos-budget BUDGET=540 # CI behaviorThe single-cluster result is in
bench/results/bench_local.json (N=3,
2000 ops).
The bench drives the in-process cluster with one Put outstanding at a time
and ticks the cluster back-to-back, so the measured per-commit latency is
pure consensus CPU work: serializing N-1 AppendEntries at the leader,
routing them through the in-process injector, running N-1 follower
handlers, and draining N-1 reply callbacks. No gRPC, no network RTT, no
timer floor.
make bench-scaling runs the bench at 3, 5, and 7 nodes and writes one
JSON file per size under bench/results/. As N grows the leader must
fan out a wider replication round per commit, so throughput drops and
latency grows. Measured (3000 ops per size, laptop):
| Nodes | Puts/sec | Latency p50 (ms) | p95 (ms) | p99 (ms) |
|---|---|---|---|---|
| 3 | 956.6 | 1.000 | 1.046 | 2.040 |
| 5 | 608.6 | 1.007 | 2.809 | 2.891 |
| 7 | 422.7 | 2.284 | 4.944 | 5.245 |
Per-N JSON: bench_n3.json,
bench_n5.json,
bench_n7.json.
Throughput falls roughly in proportion to the per-commit fan-out: going 3 -> 7 nodes more than doubles the AppendEntries work per commit and roughly halves throughput. P99 latency grows in step. This matches the Raft expectation that a larger cluster trades write performance for fault tolerance.
make bench-regress re-runs the scaling bench into a temp directory and
diffs it against the committed per-N baselines. It fails the build if
throughput drops or P99 latency rises by more than 30% (tunable with
DRIFT=). CI runs this gate on every push to main.
make bench # single N=3 run, refreshes bench_local.json
make bench-scaling # N=3,5,7, refreshes the per-N baselines
make bench-regress # gate a fresh run against the baselines +-------------------+
+---->+ raft0 (leader) |<-----+
| | KvService :9100 | | AppendEntries / RequestVote /
| | RaftService :9000| | InstallSnapshot over gRPC
| +---------+---------+ |
gRPC Put/Get/Delete /|\ |
| / | \ |
+------+----+ +---+--+--+-------+ |
| client | | raft1 | raft2|<----+
| (Get | | follower| follower
| follows | +---------+--------+
| hint) |
+-----------+
Three processes form one Raft cluster. Each process runs:
- a Raft RPC server (peer-to-peer),
- a KV RPC server (client-facing),
- a single
RaftNodedriving the consensus state machine, - a
StateMachine(in-memorymap<string,string>) applied from committed log entries, - a
Persisterwriting(currentTerm, votedFor, log[], snapshot)to disk.
Layout (top-level only; see directory listings for the full tree):
proto/ raft.proto, kv.proto
src/core/ log, state_machine, persister, snapshot
src/raft/ node (the FSM), election (helpers), replication (helpers), peers
src/rpc/ gRPC server impls + client transport
src/chaos/ in-process transport + scenarios for the randomized test
tests/ unit + integration + chaos
bench/ throughput benchmark + committed result JSON
docs/ spec mapping, chaos methodology, snapshot, linearizability
Dependencies:
- CMake >= 3.20, a C++20 compiler.
protobuf,grpc(withgrpc_cpp_plugin),leveldb(optional; falls back to a flat-file persister when absent).
macOS:
brew install cmake protobuf grpc leveldbUbuntu:
sudo apt-get install -y build-essential cmake \
libprotobuf-dev protobuf-compiler protobuf-compiler-grpc \
libgrpc++-dev libleveldb-devBuild + run all tests (incl. property tests, but excluding the 500-run chaos):
make testRun only the chaos smoke (20 scenarios, ~25 s):
make chaos-smokeRun the full 500-scenario chaos suite:
make chaos-fullRun the chaos suite with a wall-clock budget (CI behavior; default 540 s, target 500 scenarios):
BUDGET=540 make chaos-budgetRun the ThreadSanitizer stress matrix (every unit + integration test
under TSan, --gtest_repeat=3):
make tsan-stressRun the throughput benchmark:
make benchmake build
PEERS=0@127.0.0.1:9000,1@127.0.0.1:9010,2@127.0.0.1:9020
./build/raftkv --node-id 0 --peers $PEERS \
--raft-listen 127.0.0.1:9000 --kv-listen 127.0.0.1:9100 --data-dir data/n0 &
./build/raftkv --node-id 1 --peers $PEERS \
--raft-listen 127.0.0.1:9010 --kv-listen 127.0.0.1:9110 --data-dir data/n1 &
./build/raftkv --node-id 2 --peers $PEERS \
--raft-listen 127.0.0.1:9020 --kv-listen 127.0.0.1:9120 --data-dir data/n2 &Talk to it with grpcurl:
grpcurl -plaintext -d '{"key":"foo","value":"bar"}' \
127.0.0.1:9100 raftkv.proto.KvService/Put
# Followers respond: {"ok":false, "leader_hint":"127.0.0.1:9010", "error":"not leader"}
# Send to leader_hint to commit.docker compose up --buildThree containers (raftkv-0, raftkv-1, raftkv-2) on a private bridge
network; client ports are 9100/9101/9102 on the host.
- Unit (
tests/unit/): log indexing, state-machine apply/snapshot, persister round-trip, election helpers (14 cases for theRequestVotereceiver rules), replication helpers. - Integration (
tests/integration/): in-process 3-node cluster covering leader election (5 cases), replication and log matching (5 cases), snapshot + InstallSnapshot catchup (2 cases), client ops (4 cases). - Chaos (
tests/chaos/): 200-scenario randomized partition + node-loss suite, seeded for reproducibility.
CI runs the full unit + integration set, the 20-scenario chaos smoke, a 500-op bench smoke, a three-process gRPC smoke against the built binary, and a Docker image build. The full 200-scenario chaos is run locally and the JSON committed.
Be precise about what this project does and does not claim to be:
- It is not a production-grade replicated store. For that, run etcd, Consul, or Zookeeper. This is a study project.
- No leader leases for read latency. Every linearizable read costs one heartbeat round-trip. Deferred to v4.
- No read-only replicas. Followers redirect reads to the leader.
- No multi-Raft sharding. One key space, one consensus group.
- No TLS / authentication on RPCs.
- No client-side retry beyond the leader-redirect hint. Clients must retry on their own.
- No Byzantine fault tolerance. Raft is crash-fault-tolerant only (assumes nodes either follow protocol or fail silently).
This project is part of a series of distributed-systems studies:
- SAY-5/inference-router: TCP-level inference request routing.
- SAY-5/job-controller: Docker job lifecycle with WAL recovery.
- SAY-5/orderbook-sim: single-process limit-order book matching.
- raftkv: multi-process consensus across 3 nodes.
MIT. See LICENSE.