
Conversation

@WHOIM1205

Fix snapshot transfer context leak causing BoltDB read transaction starvation

Description

This PR fixes a critical resource leak in the Maintenance.Snapshot gRPC API where a client disconnect during snapshot transfer could leave a snapshot writer goroutine running indefinitely.

When a snapshot stream is canceled (e.g. client timeout, network drop, TCP RST), the server-side goroutine writing snapshot data to an io.Pipe continues running and blocks forever. This prevents the underlying BoltDB read transaction from being closed, leading to:

  • leaked goroutines
  • unbounded BoltDB file growth
  • eventual write failures and cluster unavailability

The fix ensures snapshot generation respects gRPC context cancellation and always releases backend resources.


Type of change

  • Bug fix (non-breaking change which fixes a critical issue)

The Problem

The snapshot writer goroutine does not observe srv.Context().Done().

If the client disconnects while snap.WriteTo(pw) is running:

  • the reader exits early
  • the pipe buffer fills up
  • WriteTo() blocks forever
  • snap.Close() is never called
  • the BoltDB read transaction remains open indefinitely

This is a silent failure that slowly degrades the cluster until writes fail.


Root Cause

  • Snapshot transfer uses io.Pipe
  • Snapshot writing happens in a background goroutine
  • gRPC stream cancellation is not propagated to the snapshot writer
  • BoltDB read transactions are held for the lifetime of the snapshot

The Fix

  • Monitor srv.Context().Done() during snapshot streaming
  • Force snapshot cleanup on cancellation:
    • close the snapshot
    • close the pipe writer to unblock WriteTo
  • Ensure cleanup also occurs on send errors

This guarantees snapshot resources are released in all exit paths.


Reproduction Steps

  1. Start a multi-node etcd cluster with a non-trivial DB size (hundreds of MB)
  2. Start a snapshot transfer with a short timeout so the client disconnects mid-stream:

     etcdctl snapshot save snapshot.db --command-timeout=2s

Test

  • All tests pass locally.

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: WHOIM1205
Once this PR has been reviewed and has the lgtm label, please assign fuweid for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Hi @WHOIM1205. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@WHOIM1205
Author

hey @serathius
Fixes a snapshot streaming edge case where client disconnects could leak a goroutine and BoltDB read transaction, leading to resource exhaustion over time.

@serathius serathius closed this Jan 26, 2026
