@WHOIM1205

Fix Raft ConfState Divergence After Crash During ConfChange Application

Summary

This PR fixes a critical crash-recovery bug in etcd where Raft’s in-memory ConfState can diverge from the backend-persisted ConfState if the process crashes while applying a membership change.

The issue occurs because ConfState persistence is not atomic with ApplyConfChange(). On restart, etcd trusts the backend ConfState, which may be stale, leading to incorrect cluster membership after recovery.

This PR makes the WAL the single source of truth for rebuilding ConfState during bootstrap, eliminating this inconsistency.


Problem Description

When applying a ConfChange, etcd performs the following steps:

  1. Updates Raft’s in-memory ConfState via ApplyConfChange()
  2. Marks backend ConfState as dirty using SetConfState()
  3. Persists it later during the backend transaction commit

If etcd crashes after step (1) but before the backend transaction commits, the system enters an inconsistent state:

  • Raft state (rebuilt from WAL) reflects the membership change
  • Backend ConfState remains stale
  • Bootstrap logic trusts the backend ConfState

This violates the invariant that Raft state and persisted cluster membership metadata must be consistent after recovery.
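
The crash window described above can be illustrated with a small toy model. This is only a sketch, not etcd code: the `node` type, its fields, and the `applyRemove`, `commit`, and `crash` helpers are hypothetical names invented for illustration, and `ConfState` here is a simplified stand-in that tracks voters only.

```go
package main

import "fmt"

// ConfState is a simplified stand-in for Raft's ConfState (voters only).
type ConfState struct {
	Voters []uint64
}

// node is a hypothetical model of one member. It does not mirror real etcd structs.
type node struct {
	wal     []uint64   // IDs removed by ConfChange entries already durable in the WAL
	memory  ConfState  // Raft's in-memory ConfState, lost on crash
	pending *ConfState // backend ConfState marked dirty but not yet committed
	backend ConfState  // backend ConfState as of the last committed transaction
}

func newNode(voters ...uint64) *node {
	return &node{
		memory:  ConfState{Voters: append([]uint64(nil), voters...)},
		backend: ConfState{Voters: append([]uint64(nil), voters...)},
	}
}

// without returns a copy of voters with id removed.
func without(voters []uint64, id uint64) []uint64 {
	out := make([]uint64, 0, len(voters))
	for _, v := range voters {
		if v != id {
			out = append(out, v)
		}
	}
	return out
}

// applyRemove models steps 1 and 2: update memory, then mark the backend dirty.
func (n *node) applyRemove(id uint64) {
	n.wal = append(n.wal, id) // the entry is already in the WAL before it is applied
	n.memory.Voters = without(n.memory.Voters, id)
	dirty := n.memory
	n.pending = &dirty
}

// commit models step 3: the backend transaction persists the dirty ConfState.
func (n *node) commit() {
	if n.pending != nil {
		n.backend = *n.pending
		n.pending = nil
	}
}

// crash discards everything that was not durable.
func (n *node) crash() {
	n.memory = ConfState{}
	n.pending = nil
}

func main() {
	n := newNode(1, 2, 3)
	n.applyRemove(3) // steps 1 and 2 run
	n.crash()        // crash before commit() (step 3)
	fmt.Println("removals recorded in WAL:", n.wal)
	fmt.Println("backend ConfState voters:", n.backend.Voters)
}
```

In this model the WAL records the removal of member 3 while the backend still lists voters 1, 2, and 3, which is the divergence a bootstrap path that trusts the backend would inherit.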


Why This Is Critical

  • Can cause incorrect cluster membership after restart
  • Leads to broken quorum calculations and failed leader elections
  • Can block all writes indefinitely
  • Impacts production Kubernetes control planes
  • Failure mode is silent and difficult to diagnose
  • Often requires manual intervention to recover

Root Cause

  • Backend ConfState is treated as authoritative during bootstrap
  • WAL already contains all committed ConfChange entries
  • No reconciliation exists between WAL-derived state and backend metadata
  • A crash between ApplyConfChange() and backend commit leaves persisted state stale

Fix Overview

Rebuild ConfState from WAL during bootstrap and reconcile backend state if necessary.

Key changes

  • Reconstruct ConfState by replaying committed ConfChange entries from WAL
  • Compare the rebuilt ConfState with the backend ConfState
  • If a mismatch is detected:
    • Log a warning
    • Persist the corrected ConfState to the backend before starting Raft

This guarantees crash-safe and deterministic recovery without changing Raft semantics.
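
A minimal sketch of the rebuild-and-reconcile idea follows. It is not the code in this PR: `ConfChange`, `rebuildConfState`, `sameVoters`, and `reconcile` are hypothetical names, the entry type is a simplified stand-in for WAL entries, and the sketch assumes entries are sorted by index and that only voters are tracked.

```go
package main

import (
	"fmt"
	"log"
	"sort"
)

type ConfChangeType int

const (
	AddNode ConfChangeType = iota
	RemoveNode
)

// ConfChange is a simplified stand-in for a ConfChange entry read from the WAL.
type ConfChange struct {
	Index  uint64 // log index of the entry
	Type   ConfChangeType
	NodeID uint64
}

// rebuildConfState replays ConfChange entries up to the commit index on top of
// a base voter set (for example, the one stored in the latest snapshot).
// Assumes changes is sorted by Index.
func rebuildConfState(base []uint64, changes []ConfChange, commit uint64) []uint64 {
	set := make(map[uint64]bool, len(base))
	for _, id := range base {
		set[id] = true
	}
	for _, cc := range changes {
		if cc.Index > commit {
			break // uncommitted entries must not affect membership
		}
		switch cc.Type {
		case AddNode:
			set[cc.NodeID] = true
		case RemoveNode:
			delete(set, cc.NodeID)
		}
	}
	voters := make([]uint64, 0, len(set))
	for id := range set {
		voters = append(voters, id)
	}
	sort.Slice(voters, func(i, j int) bool { return voters[i] < voters[j] })
	return voters
}

// sameVoters compares two voter lists, ignoring order.
func sameVoters(a, b []uint64) bool {
	if len(a) != len(b) {
		return false
	}
	x := append([]uint64(nil), a...)
	y := append([]uint64(nil), b...)
	sort.Slice(x, func(i, j int) bool { return x[i] < x[j] })
	sort.Slice(y, func(i, j int) bool { return y[i] < y[j] })
	for i := range x {
		if x[i] != y[i] {
			return false
		}
	}
	return true
}

// reconcile returns the voter set to persist and whether a correction was needed.
func reconcile(backend, rebuilt []uint64) ([]uint64, bool) {
	if sameVoters(backend, rebuilt) {
		return backend, false
	}
	log.Printf("warning: backend ConfState %v differs from WAL-derived %v; correcting", backend, rebuilt)
	return rebuilt, true
}

func main() {
	changes := []ConfChange{
		{Index: 5, Type: RemoveNode, NodeID: 3}, // committed
		{Index: 9, Type: AddNode, NodeID: 4},    // not yet committed
	}
	rebuilt := rebuildConfState([]uint64{1, 2, 3}, changes, 5)
	fixed, corrected := reconcile([]uint64{1, 2, 3}, rebuilt)
	fmt.Println(fixed, corrected)
}
```

The commit-index bound is the important design choice in this sketch: replaying entries beyond the commit index would apply membership changes that the cluster never agreed on.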


Steps to Reproduce (Before Fix)

  1. Start a 3-node etcd cluster
  2. Remove a member using a ConfChangeRemoveNode
  3. Crash etcd after ApplyConfChange() but before backend commit
  4. Restart the node
  5. Observe:
    • Removed member reappears, or
    • Leader election fails due to incorrect quorum size

Verification (After Fix)

  • ConfState is rebuilt from WAL on startup
  • Backend ConfState is corrected automatically
  • Cluster membership is consistent
  • Leader election succeeds and writes are accepted

Tests Added

  • Unit test for WAL-based ConfState reconstruction
  • Integration test simulating crash during ConfChange using gofail
  • Assertions that backend and Raft ConfState match after recovery

Impact

  • Prevents cluster membership divergence after crashes
  • Eliminates quorum deadlocks and split-brain scenarios
  • Improves etcd reliability under failure
  • No behavior change during normal operation

Notes for Reviewers

  • WAL is treated as the authoritative source of Raft state
  • Fix is isolated to bootstrap logic
  • No changes to the Raft state machine or apply path
  • Safe for backporting

Signed-off-by: WHOIM1205 <rathourprateek8@gmail.com>
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: WHOIM1205
Once this PR has been reviewed and has the lgtm label, please assign fuweid for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Hi @WHOIM1205. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@WHOIM1205
Author

hey @serathius
This fixes a crash-recovery edge case where ConfState could diverge if etcd crashes between ApplyConfChange() and backend commit. The fix makes WAL authoritative during bootstrap and reconciles backend state if needed.
Happy to adjust or add more tests if required.

@serathius serathius closed this Jan 26, 2026

3 participants