Skip to content

SAY-5/raftkv

Repository files navigation

raftkv

A C++20 distributed key-value store with Raft consensus across a 3-node cluster. Implements leader election with randomized election timeouts, log replication via AppendEntries, snapshot compaction via InstallSnapshot, and a gRPC Put/Get/Delete client API.

The point of this project is to study the Raft consensus algorithm (Ongaro and Ousterhout, "In Search of an Understandable Consensus Algorithm", USENIX ATC 2014) by writing it from scratch in C++ and proving it survives randomized faults. The load-bearing correctness claim is a 500-scenario chaos suite that partitions, kills, restarts, and adds [10ms, 500ms] random per-RPC delays to nodes under continuous client load and asserts no committed Put is lost and all alive nodes end with byte-identical state. A property-test layer separately walks random op sequences and asserts the four Raft Figure-2 log invariants hold after every step.

What this studies

  • The five Figure-2 safety invariants of Raft, written out as a RaftNode state machine in C++.
  • Randomized chaos as a correctness check: 500 seeded scenarios across partition, node-loss, mixed, and per-RPC random-delay fault modes (see docs/chaos-methodology.md).
  • Property tests on Raft Figure-2 log invariants (Election Safety, Log Matching, Leader Append-Only, State Machine Safety) evaluated after every step of a randomized op sequence (tests/unit/log_invariants_test.cpp).
  • Snapshot compaction trade-offs (see docs/snapshot-design.md).
  • Dynamic membership changes via joint consensus (Section 6): the C_old -> C_old,new -> C_new transition with double-majority quorum (see the "Membership changes" section of ARCHITECTURE.md).
  • The cost of linearizable reads via heartbeat-majority leader confirmation (see docs/linearizability.md).
  • The Figure-2 to file mapping (see docs/raft-spec-mapping.md).

Results

Chaos suite: 184/184 passed in 540 s budget

The committed run is in bench/results/chaos_local.json. This was a 500-scenario run with a 540 s wall-clock budget; the budget cap was hit after 184 scenarios (mix is partition / kill_restart / mixed / 10-500 ms random per-RPC delay, with delay scenarios dominating wall-clock cost). Every attempted scenario passed.

{
  "planned_scenarios": 500,
  "attempted_scenarios": 184,
  "passed": 184,
  "failed": 0,
  "budget_seconds": 540,
  "budget_hit": true,
  "elapsed_ms": 540991,
  "base_seed": 1
}

Wall-clock honesty: 500 scenarios at this fault-kind mix takes roughly 24 minutes locally and would exceed the 10-minute CI ceiling. CI runs under the 540 s budget; make chaos-full runs the full 500 with no cap.

Reproduce locally:

make chaos-full   # all 500, no cap
make chaos-budget BUDGET=540   # CI behavior

Throughput and latency (in-process, no wire)

The single-cluster result is in bench/results/bench_local.json (N=3, 2000 ops).

The bench drives the in-process cluster with one Put outstanding at a time and ticks the cluster back-to-back, so the measured per-commit latency is pure consensus CPU work: serializing N-1 AppendEntries at the leader, routing them through the in-process injector, running N-1 follower handlers, and draining N-1 reply callbacks. No gRPC, no network RTT, no timer floor.

Scaling across cluster sizes

make bench-scaling runs the bench at 3, 5, and 7 nodes and writes one JSON file per size under bench/results/. As N grows the leader must fan out a wider replication round per commit, so throughput drops and latency grows. Measured (3000 ops per size, laptop):

Nodes Puts/sec Latency p50 (ms) p95 (ms) p99 (ms)
3 956.6 1.000 1.046 2.040
5 608.6 1.007 2.809 2.891
7 422.7 2.284 4.944 5.245

Per-N JSON: bench_n3.json, bench_n5.json, bench_n7.json.

Throughput falls roughly in proportion to the per-commit fan-out: going 3 -> 7 nodes more than doubles the AppendEntries work per commit and roughly halves throughput. P99 latency grows in step. This matches the Raft expectation that a larger cluster trades write performance for fault tolerance.

Regression gate

make bench-regress re-runs the scaling bench into a temp directory and diffs it against the committed per-N baselines. It fails the build if throughput drops or P99 latency rises by more than 30% (tunable with DRIFT=). CI runs this gate on every push to main.

make bench           # single N=3 run, refreshes bench_local.json
make bench-scaling   # N=3,5,7, refreshes the per-N baselines
make bench-regress   # gate a fresh run against the baselines

Architecture

                    +-------------------+
              +---->+ raft0 (leader)    |<-----+
              |     |  KvService :9100  |      |  AppendEntries / RequestVote /
              |     |  RaftService :9000|      |  InstallSnapshot over gRPC
              |     +---------+---------+      |
       gRPC Put/Get/Delete   /|\               |
              |             / | \              |
       +------+----+   +---+--+--+-------+     |
       | client    |   |  raft1  |  raft2|<----+
       | (Get      |   | follower| follower
       |  follows  |   +---------+--------+
       |  hint)    |
       +-----------+

Three processes form one Raft cluster. Each process runs:

  • a Raft RPC server (peer-to-peer),
  • a KV RPC server (client-facing),
  • a single RaftNode driving the consensus state machine,
  • a StateMachine (in-memory map<string,string>) applied from committed log entries,
  • a Persister writing (currentTerm, votedFor, log[], snapshot) to disk.

Layout (top-level only; see directory listings for the full tree):

proto/      raft.proto, kv.proto
src/core/   log, state_machine, persister, snapshot
src/raft/   node (the FSM), election (helpers), replication (helpers), peers
src/rpc/    gRPC server impls + client transport
src/chaos/  in-process transport + scenarios for the randomized test
tests/      unit + integration + chaos
bench/      throughput benchmark + committed result JSON
docs/       spec mapping, chaos methodology, snapshot, linearizability

Build and run

Local (macOS via brew, or Linux via apt)

Dependencies:

  • CMake >= 3.20, a C++20 compiler.
  • protobuf, grpc (with grpc_cpp_plugin), leveldb (optional; falls back to a flat-file persister when absent).

macOS:

brew install cmake protobuf grpc leveldb

Ubuntu:

sudo apt-get install -y build-essential cmake \
  libprotobuf-dev protobuf-compiler protobuf-compiler-grpc \
  libgrpc++-dev libleveldb-dev

Build + run all tests (incl. property tests, but excluding the 500-run chaos):

make test

Run only the chaos smoke (20 scenarios, ~25 s):

make chaos-smoke

Run the full 500-scenario chaos suite:

make chaos-full

Run the chaos suite with a wall-clock budget (CI behavior; default 540 s, target 500 scenarios):

BUDGET=540 make chaos-budget

Run the ThreadSanitizer stress matrix (every unit + integration test under TSan, --gtest_repeat=3):

make tsan-stress

Run the throughput benchmark:

make bench

Three-node cluster on localhost

make build
PEERS=0@127.0.0.1:9000,1@127.0.0.1:9010,2@127.0.0.1:9020
./build/raftkv --node-id 0 --peers $PEERS \
  --raft-listen 127.0.0.1:9000 --kv-listen 127.0.0.1:9100 --data-dir data/n0 &
./build/raftkv --node-id 1 --peers $PEERS \
  --raft-listen 127.0.0.1:9010 --kv-listen 127.0.0.1:9110 --data-dir data/n1 &
./build/raftkv --node-id 2 --peers $PEERS \
  --raft-listen 127.0.0.1:9020 --kv-listen 127.0.0.1:9120 --data-dir data/n2 &

Talk to it with grpcurl:

grpcurl -plaintext -d '{"key":"foo","value":"bar"}' \
  127.0.0.1:9100 raftkv.proto.KvService/Put
# Followers respond: {"ok":false, "leader_hint":"127.0.0.1:9010", "error":"not leader"}
# Send to leader_hint to commit.

Docker

docker compose up --build

Three containers (raftkv-0, raftkv-1, raftkv-2) on a private bridge network; client ports are 9100/9101/9102 on the host.

Tests

  • Unit (tests/unit/): log indexing, state-machine apply/snapshot, persister round-trip, election helpers (14 cases for the RequestVote receiver rules), replication helpers.
  • Integration (tests/integration/): in-process 3-node cluster covering leader election (5 cases), replication and log matching (5 cases), snapshot + InstallSnapshot catchup (2 cases), client ops (4 cases).
  • Chaos (tests/chaos/): 200-scenario randomized partition + node-loss suite, seeded for reproducibility.

CI runs the full unit + integration set, the 20-scenario chaos smoke, a 500-op bench smoke, a three-process gRPC smoke against the built binary, and a Docker image build. The full 200-scenario chaos is run locally and the JSON committed.

What this is not

Be precise about what this project does and does not claim to be:

  • It is not a production-grade replicated store. For that, run etcd, Consul, or Zookeeper. This is a study project.
  • No leader leases for read latency. Every linearizable read costs one heartbeat round-trip. Deferred to v4.
  • No read-only replicas. Followers redirect reads to the leader.
  • No multi-Raft sharding. One key space, one consensus group.
  • No TLS / authentication on RPCs.
  • No client-side retry beyond the leader-redirect hint. Clients must retry on their own.
  • No Byzantine fault tolerance. Raft is crash-fault-tolerant only (assumes nodes either follow protocol or fail silently).

Repository

This project is part of a series of distributed-systems studies:

License

MIT. See LICENSE.

About

C++20 Raft consensus distributed KV: 3-node cluster, leader election, log replication, snapshot compaction, 200-scenario chaos suite

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors