lease: Fix incorrect gRPC Unavailable on client cancel during LeaseKeepAlive forwarding #21122

zhijun42 · 2026-01-13T09:42:06Z

Follow-up to the previous reproduction PR #21050. Refer there for full context.

k8s-ci-robot · 2026-01-13T09:42:17Z

Hi @zhijun42. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

serathius · 2026-01-13T10:46:31Z

/ok-to-test

serathius · 2026-01-13T10:48:07Z

Have you read the original motivation for returning NoLeader? #7630 #7275

Could we add a test for WithRequireLeader?

Maybe we could ask @heyitsanthony @xiang90

codecov · 2026-01-13T11:19:02Z

Codecov Report

❌ Patch coverage is 77.77778% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.43%. Comparing base (c492848) to head (3102816).
⚠️ Report is 13 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| server/etcdserver/v3_server.go | 75.00% | 2 Missing ⚠️ |

Additional details and impacted files

| Files with missing lines | Coverage Δ | |
|---|---|---|
| server/etcdserver/api/v3rpc/lease.go | `81.81% <ø> (-0.47%)` | ⬇️ |
| server/etcdserver/api/v3rpc/util.go | `67.74% <ø> (ø)` | |
| server/lease/leasehttp/http.go | `62.16% <100.00%> (ø)` | |
| server/etcdserver/v3_server.go | `75.59% <75.00%> (+0.06%)` | ⬆️ |

... and 20 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #21122      +/-   ##
==========================================
+ Coverage   68.40%   68.43%   +0.02%     
==========================================
  Files         429      429              
  Lines       35242    35271      +29     
==========================================
+ Hits        24109    24137      +28     
  Misses       9737     9737              
- Partials     1396     1397       +1

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c492848...3102816.

zhijun42 · 2026-01-14T07:40:24Z

@serathius I think this is the story: in the original PR, when the server stream is canceled because there is no leader, the author converts the error code from Canceled to NoLeader. He assumed that the error code coming from the stream is Canceled, which is not true: it is already NoLeader.

We can prove this by commenting out these lines on the main branch; the relevant tests still pass.

func (ls *LeaseServer) LeaseKeepAlive(stream pb.Lease_LeaseKeepAliveServer) (err error) {
	errc := make(chan error, 1)
	go func() {
		errc <- ls.leaseKeepAlive(stream)
	}()
	select {
	case err = <-errc:
	case <-stream.Context().Done():
		err = stream.Context().Err()
		//if errors.Is(err, context.Canceled) {
		//	err = rpctypes.ErrGRPCNoLeader
		//}
	}
	return err
}

As for WithRequireLeader, it's already covered by tests TestLeaseWithRequireLeader and TestV3LeaseRequireLeader in that original PR.

But you make a good point, so I added a new test case almost identical to TestV3LeaseRequireLeader (except that it doesn't use WithRequireLeader) to show that we always receive a NoLeader error regardless of WithRequireLeader.

zhijun42 · 2026-01-14T09:37:18Z

On a separate but related note, I found an opportunity for improvement. If a lease client sets WithRequireLeader and closes its receive channel because there is no leader, it should also stop sending keepAlive requests to the server (it no longer has a channel to receive responses anyway). Opened a new PR #21126 for this.

tests/integration/v3_lease_test.go

zhijun42 · 2026-01-15T06:33:28Z

Note about grpcproxy behavior: During testing, I found that the grpcproxy doesn't propagate NoLeader errors when WithRequireLeader is not used. The keepAliveLoop in server/proxy/grpcproxy/lease.go (lines 263-265) discards errors and only calls cancel(), resulting in context.Canceled instead of the actual server error.

I'm not sure whether this is intentional design or an actual bug, but since this code has been stable since 2017, I decided not to touch it. Instead, I modified my test cases to handle this difference.

tests/integration/clientv3/watch/v3_watch_test.go

server/etcdserver/api/v3rpc/lease.go

server/etcdserver/v3_server.go

tests/integration/v3_lease_test.go

zhijun42 · 2026-01-16T01:50:32Z

Rebased. Great idea moving the SkipGoFail and additional test case into separate PRs!

serathius · 2026-01-16T08:37:38Z

The tests for the grpc client look good. Have you looked into testing how changing the error here impacts etcd client LeaseKeepAlive?

cc @fuweid @ahrtr

zhijun42 · 2026-01-16T09:52:03Z

how changing the error here impacts etcd client LeaseKeepAlive?

Good call. There's no impact on the etcd client LeaseKeepAlive. Opened a separate PR #21147 to prove that the implementation today is already returning NoLeader error as expected.

And since the etcd client wrapper (c.KeepAlive()) doesn't expose the underlying error to callers (it only closes channels), the existing test TestLeaseWithRequireLeader already covers that.

These two PRs are independent; either one can be merged first.

serathius · 2026-01-16T10:19:44Z

Good call. There's no impact on the etcd client LeaseKeepAlive.

Do we have a test for that?

server/etcdserver/v3_server.go

ahrtr · 2026-01-16T12:49:05Z

server/etcdserver/v3_server.go

+	switch {
+	case errorspkg.Is(err, context.DeadlineExceeded):
+		return -1, errors.ErrTimeout
+	case errorspkg.Is(err, context.Canceled):
+		return -1, errors.ErrCanceled
+	// Should be unreachable, but we keep it defensive.
+	default:
+		s.Logger().Warn("Unexpected lease renew context error", zap.Error(err))


What's the exact issue you are fixing? If there is no real issue, I suggest not changing this. It might have an impact on the client side.

The issue is LeaseKeepAlive Unavailable #13632, and it might have multiple root causes (we don't have enough info from the issue). I identified one: when the client cancels the LeaseKeepAlive stream, the server returns a gRPC NoLeader error, but it should have generated the Canceled error code instead.

The fix is done in the server/etcdserver/v3_server.go file.

The change here is a refactoring (no behavior change). The previous implementation says "if the error is not ErrTimeout, we return ErrCanceled by default", but in fact the error is always either ErrTimeout or ErrCanceled. It's impossible to get any other error code (I have TestV3LeaseKeepAliveForwardingCatchError to prove it), so I refactored it to make this fact more obvious to readers.

fuweid · 2026-01-16T19:56:26Z

server/etcdserver/v3_server.go

+		}
 	}

-	if errorspkg.Is(cctx.Err(), context.DeadlineExceeded) {


Regarding line 404, could we just keep it as it is right now? I don’t think it’s related to this fix.

time.Sleep(50 * time.Millisecond) is a short delay, and the loop will already break if any error occurs.

cctx is created using context.WithTimeout, not WithCancelCause, WithDeadlineCause, or WithTimeoutCause. If I understand correctly, the error should only be context.DeadlineExceeded or context.Canceled.

Yep, you're right: it's not related to the fix; it was just a minor speedup to exit earlier when an error occurs. Reverted this.

fuweid

LGTM

Left comment for changes in LeaseRenew function

fuweid · 2026-01-16T21:26:55Z

server/etcdserver/api/v3rpc/lease.go

+		// 2. Server cancellation: the client ctx is wrapped with WithRequireLeader,
+		//		monitorLeader() detects no leader and thus cancels this stream with ErrGRPCNoLeader.
+		// 3. Server cancellation: the server is shutting down.
 		err = stream.Context().Err()


This change looks good. We don’t need to adjust that error handling. If the issue is caused by monitorLeader, RenewLeader will be canceled and will return no-leader to the client as well. At least we don’t log canceled events for the no-leader case.

etcd/server/etcdserver/v3_server.go

Lines 543 to 561 in 6d5e199

func (s *EtcdServer) waitLeader(ctx context.Context) (*membership.Member, error) {
	leader := s.cluster.Member(s.Leader())
	for leader == nil {
		// wait an election
		dur := time.Duration(s.Cfg.ElectionTicks) * time.Duration(s.Cfg.TickMs) * time.Millisecond
		select {
		case <-time.After(dur):
			leader = s.cluster.Member(s.Leader())
		case <-s.stopping:
			return nil, errors.ErrStopped
		case <-ctx.Done():
			return nil, errors.ErrNoLeader
		}
	}
	if len(leader.PeerURLs) == 0 {
		return nil, errors.ErrNoLeader
	}
	return leader, nil
}

If the context is done because the client cancelled it, the client won't receive a response from the server.

If the context is done because of monitorLeader, the client should receive the no-leader error. This patch doesn't change this behaviour.

(Line 554)

Yep! Your analysis is exactly correct!

k8s-ci-robot · 2026-01-16T21:27:52Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fuweid, zhijun42

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [fuweid]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: Zhijun <[email protected]>

k8s-ci-robot · 2026-01-17T12:31:54Z

@zhijun42: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-etcd-coverage-report	`3102816`	link	true	`/test pull-etcd-coverage-report`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

zhijun42 · 2026-01-17T12:45:31Z

Good call. There's no impact on the etcd client LeaseKeepAlive.

Do we have a test for that?

Yes, it is TestLeaseWithRequireLeader.

I didn't explain this clearly. Recall that there are only 3 scenarios in which the stream context is Done:

case <-stream.Context().Done():
	// We end up here due to:
	// 1. Client cancellation
	// 2. Server cancellation: the client ctx is wrapped with WithRequireLeader,
	//		monitorLeader() detects no leader and thus cancels this stream with ErrGRPCNoLeader.
	// 3. Server cancellation: the server is shutting down.
-		if errors.Is(err, context.Canceled) {
-			err = rpctypes.ErrGRPCNoLeader
-		}

1. When the client cancels, the client immediately receives gRPC Canceled regardless of what the server returns. So this scenario is unchanged.
2. This scenario is already covered by the test case TestLeaseWithRequireLeader.
3. When the server shuts down, the client receives rpc error: code = Unavailable desc = error reading from server: EOF regardless of what the server actually returns. So this scenario is also unchanged, for the same reason as scenario 1 above.

So there's no behavior impact on the etcd client.

k8s-ci-robot added area/testing needs-ok-to-test labels Jan 13, 2026

k8s-ci-robot added the size/M label Jan 13, 2026

zhijun42 force-pushed the fix-lease-keep-alive-unavailable branch from 430f664 to c1b8ad3 Compare January 13, 2026 09:45

zhijun42 mentioned this pull request Jan 13, 2026

lease: Reproduce incorrect gRPC Unavailable on client cancel during LeaseKeepAlive forwarding #21050

Merged

k8s-ci-robot added ok-to-test and removed needs-ok-to-test labels Jan 13, 2026

k8s-ci-robot added size/L size/M and removed size/M size/L labels Jan 14, 2026

serathius reviewed Jan 14, 2026

tests/integration/v3_lease_test.go Outdated

zhijun42 mentioned this pull request Jan 14, 2026

lease: Migrate TestV3LeaseRequireLeader into TestV3LeaseKeepAliveForwardingCatchError #21127

Merged

k8s-ci-robot added the needs-rebase label Jan 15, 2026

zhijun42 force-pushed the fix-lease-keep-alive-unavailable branch from b74031b to 7ba114f Compare January 15, 2026 03:13

k8s-ci-robot added size/L and removed needs-rebase size/M labels Jan 15, 2026

serathius reviewed Jan 15, 2026

tests/integration/clientv3/watch/v3_watch_test.go Outdated

serathius reviewed Jan 15, 2026

server/etcdserver/api/v3rpc/lease.go

serathius reviewed Jan 15, 2026

server/etcdserver/v3_server.go Outdated

serathius reviewed Jan 15, 2026

tests/integration/v3_lease_test.go

zhijun42 mentioned this pull request Jan 15, 2026

lease: Add new test to catch NoLeader error when client not using WithRequireLeader #21135

Merged

zhijun42 mentioned this pull request Jan 15, 2026

Consolidate gofail check in integration tests #21136

Merged

zhijun42 force-pushed the fix-lease-keep-alive-unavailable branch from 21dc2a9 to c9f280e Compare January 15, 2026 12:10

k8s-ci-robot added size/M and removed size/L labels Jan 15, 2026

zhijun42 force-pushed the fix-lease-keep-alive-unavailable branch from c9f280e to d899fa7 Compare January 16, 2026 00:33

zhijun42 mentioned this pull request Jan 16, 2026

Proposal: Disable Github bot auto-closing stale issues #21051

Open

ahrtr reviewed Jan 16, 2026

fuweid reviewed Jan 16, 2026

fuweid approved these changes Jan 16, 2026

k8s-ci-robot added the approved label Jan 16, 2026

zhijun42 force-pushed the fix-lease-keep-alive-unavailable branch from d899fa7 to b4f0d00 Compare January 17, 2026 11:45

Helps solve the leaseKeepAlive grpc Unavailable issue

3102816

Signed-off-by: Zhijun <[email protected]>

zhijun42 force-pushed the fix-lease-keep-alive-unavailable branch from b4f0d00 to 3102816 Compare January 17, 2026 11:48

