
Conversation

@WHOIM1205

Overview

Fixed a critical race condition in the MVCC watcher lifecycle where concurrent watcher cancellation and victim processing could trigger a fatal panic in the etcd server.

Under realistic high-load scenarios with slow watchers, a watcher may transiently exist outside of all watcher groups due to non-atomic state transitions. The existing code treated this state as impossible and crashed the server.

This PR makes the watcher cancellation logic resilient to this transient state by retrying until lifecycle convergence, preventing process termination.


The Issue

In server/storage/mvcc/watchable_store.go, the cancelWatcher function assumes the following invariant:

If a watcher is not present in synced or unsynced, it must be present in a victim batch.
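To make the invariant concrete, here is a minimal toy model of the check (stand-in types and maps, not the etcd source; the real function has more branches and metric bookkeeping):

package mvcctoy

import "sync"

// Toy model of the watcher lifecycle; illustrative only.
type watcher struct{ victim bool }

type store struct {
    mu       sync.Mutex
    synced   map[*watcher]bool // watchers caught up with the store
    unsynced map[*watcher]bool // watchers still catching up
    victims  map[*watcher]bool // slow watchers parked for later delivery
}

// cancelWatcher encodes the invariant the way the current code does:
// a watcher absent from both groups is assumed to be a victim,
// on pain of panic.
func (s *store) cancelWatcher(w *watcher) {
    s.mu.Lock()
    defer s.mu.Unlock()
    if s.unsynced[w] {
        delete(s.unsynced, w)
        return
    }
    if s.synced[w] {
        delete(s.synced, w)
        return
    }
    if !w.victim {
        panic("watcher not victim but not in watch groups")
    }
    delete(s.victims, w)
}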

During concurrent execution of:

  • victim processing (syncWatchers / moveVictims)
  • watcher cancellation (cancelWatcher)

this invariant does not always hold.

A watcher may be removed from the victim batch and have wa.victim cleared while a concurrent cancellation is in progress. When cancelWatcher observes this transient state, it panics with:

panic: watcher not victim but not in watch groups

This panic crashes the etcd process and disrupts clients such as Kubernetes API servers.
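One possible interleaving that opens the window (illustrative; goroutine A is victim processing, goroutine B is the canceling client):

  1. A pops the victim batch containing the watcher off the victims list.
  2. A clears wa.victim before re-adding the watcher to synced or unsynced.
  3. B runs cancelWatcher: the watcher is in neither group and wa.victim is false.
  4. The invariant check fires and the process panics, even though A was about to restore the watcher to a group.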


Fix Applied

  • File: server/storage/mvcc/watchable_store.go
  • Updated cancelWatcher to retry when a watcher is transiently in-flight between lifecycle states instead of panicking.
  • The fix preserves existing watcher semantics and only affects the invalid error path.

This aligns cancelWatcher behavior with other watcher lifecycle code paths that already tolerate concurrent state transitions.
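In terms of the toy model sketched above under The Issue, the change has roughly this shape (a sketch of the approach, not the actual diff; it replaces the earlier toy cancelWatcher and additionally needs the time import):

// Retry-based cancellation: on the transient "in no group, not a
// victim" state, drop the lock, back off briefly, and re-check
// instead of panicking.
func (s *store) cancelWatcher(w *watcher) {
    for {
        s.mu.Lock()
        if s.unsynced[w] || s.synced[w] {
            delete(s.unsynced, w)
            delete(s.synced, w)
            s.mu.Unlock()
            return // normal cancellation path
        }
        if w.victim {
            delete(s.victims, w)
            s.mu.Unlock()
            return // cancelled straight out of its victim batch
        }
        // Transient state: the watcher is in flight between lifecycle
        // states; wait for victim processing to finish moving it.
        s.mu.Unlock()
        time.Sleep(time.Millisecond)
    }
}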


Testing & Verification

  • Added a new regression test TestCancelWatcherDuringVictimProcessing in:
    • server/storage/mvcc/watchable_store_test.go
  • The test exercises concurrent watcher cancellation and victim processing to reproduce the race condition.
  • Verified that the etcd server no longer panics under this scenario.

Verified locally with:

go test ./server/storage/mvcc -run TestCancelWatcherDuringVictimProcessing -count=20
go test ./server/storage/mvcc
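The test itself is not reproduced in this thread; against the toy model above, a regression test of the same shape (hypothetical, with testing and sync imported) might look like:

// Move a watcher out of its victim batch while concurrently canceling
// it. With the retry loop in place this must terminate without panicking.
func TestCancelWatcherDuringVictimProcessing(t *testing.T) {
    for i := 0; i < 1000; i++ {
        w := &watcher{victim: true}
        s := &store{
            synced:   map[*watcher]bool{},
            unsynced: map[*watcher]bool{},
            victims:  map[*watcher]bool{w: true},
        }

        var wg sync.WaitGroup
        wg.Add(2)
        go func() { // stand-in for victim processing (moveVictims)
            defer wg.Done()
            s.mu.Lock()
            delete(s.victims, w)
            w.victim = false // the transient window opens here...
            s.mu.Unlock()
            s.mu.Lock()
            s.synced[w] = true // ...and closes here
            s.mu.Unlock()
        }()
        go func() {
            defer wg.Done()
            s.cancelWatcher(w) // must not panic inside the window
        }()
        wg.Wait()
    }
}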


Signed-off-by: WHOIM1205 <rathourprateek8@gmail.com>
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: WHOIM1205
Once this PR has been reviewed and has the lgtm label, please assign spzala for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Hi @WHOIM1205. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@WHOIM1205
Author

hey @serathius

This fixes a real race in the watcher lifecycle where cancelWatcher could observe a transient state during concurrent victim processing and panic.

The panic relies on an invariant that does not always hold under concurrent execution. Retrying until the watcher lifecycle converges avoids crashing the server while preserving existing behavior.

A regression test is included to exercise the concurrent cancellation path.

@serathius serathius closed this Jan 26, 2026
