
Conversation

@WHOIM1205

Fix snapshot transfer context leak causing BoltDB read transaction starvation

Description

This PR fixes a critical resource leak in the Maintenance.Snapshot gRPC API where a client disconnect during snapshot transfer could leave a snapshot writer goroutine running indefinitely.

When a snapshot stream is canceled (e.g. client timeout, network drop, TCP RST), the server-side goroutine writing snapshot data to an io.Pipe continues running and blocks forever. This prevents the underlying BoltDB read transaction from being closed, leading to:

  • leaked goroutines
  • unbounded BoltDB file growth
  • eventual write failures and cluster unavailability

The fix ensures snapshot generation respects gRPC context cancellation and always releases backend resources.


Type of change

  • Bug fix (non-breaking change which fixes a critical issue)

The Problem

The snapshot writer goroutine does not observe srv.Context().Done().

If the client disconnects while snap.WriteTo(pw) is running:

  • the reader exits early
  • the pipe buffer fills up
  • WriteTo() blocks forever
  • snap.Close() is never called
  • the BoltDB read transaction remains open indefinitely

This is a silent failure that slowly degrades the cluster until writes fail.


Root Cause

  • Snapshot transfer uses io.Pipe
  • Snapshot writing happens in a background goroutine
  • gRPC stream cancellation is not propagated to the snapshot writer
  • BoltDB read transactions are held for the lifetime of the snapshot

The Fix

  • Monitor srv.Context().Done() during snapshot streaming
  • Force snapshot cleanup on cancellation:
    • close the snapshot
    • close the pipe writer to unblock WriteTo
  • Ensure cleanup also occurs on send errors

This guarantees snapshot resources are released in all exit paths.


Reproduction Steps

  1. Start a multi-node etcd cluster with a non-trivial DB size (hundreds of MB)
  2. Start a snapshot transfer with a short timeout so the client disconnects mid-stream:

     etcdctl snapshot save snapshot.db --command-timeout=2s

Test

  • All tests pass locally.

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: WHOIM1205
Once this PR has been reviewed and has the lgtm label, please assign fuweid for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Hi @WHOIM1205. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@WHOIM1205
Author

hey @serathius
Fixes a snapshot streaming edge case where client disconnects could leak a goroutine and BoltDB read transaction, leading to resource exhaustion over time.

@serathius serathius closed this Jan 26, 2026
