-
Notifications
You must be signed in to change notification settings - Fork 10.3k
lease: Fix incorrect gRPC Unavailable on client cancel during LeaseKeepAlive forwarding #21122
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Hi @zhijun42. Thanks for your PR. I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
430f664 to
c1b8ad3
Compare
|
/ok-to-test |
|
Have you read the original motivation for returning NoLeader? #7630 #7275 Could we add a test for Maybe we could ask @heyitsanthony @xiang90 |
Codecov Report❌ Patch coverage is
Additional details and impacted files
... and 20 files with indirect coverage changes @@ Coverage Diff @@
## main #21122 +/- ##
==========================================
+ Coverage 68.40% 68.43% +0.02%
==========================================
Files 429 429
Lines 35242 35271 +29
==========================================
+ Hits 24109 24137 +28
Misses 9737 9737
- Partials 1396 1397 +1 Continue to review full report in Codecov by Sentry.
🚀 New features to boost your workflow:
|
|
@serathius I think this is the story: In the original PR, when the server stream is canceled due to no leader, the author converts the error code from Canceled to NoLeader because he incorrectly assumed that the error code from stream is Canceled, which is in fact not true - it's already NoLeader. We can prove this by commenting out these lines on main branch and will see that relevant tests still pass. func (ls *LeaseServer) LeaseKeepAlive(stream pb.Lease_LeaseKeepAliveServer) (err error) {
errc := make(chan error, 1)
go func() {
errc <- ls.leaseKeepAlive(stream)
}()
select {
case err = <-errc:
case <-stream.Context().Done():
err = stream.Context().Err()
//if errors.Is(err, context.Canceled) {
// err = rpctypes.ErrGRPCNoLeader
//}
}
return err
}As for But yeah you make a good point ,so I do add a new test case almost identical to |
|
On a separate yet related note, I find an opportunity for improvement. If a lease client sets this |
b74031b to
7ba114f
Compare
|
Note about grpcproxy behavior: During testing, I found that the grpcproxy doesn't propagate NoLeader errors when Not sure if this is intentional design or an actual bug, but since this code has been stable since 2017, I decided not to touch it. Instead, I modified my test cases to handle such difference. |
21dc2a9 to
c9f280e
Compare
c9f280e to
d899fa7
Compare
|
Rebased. Great idea moving the SkipGoFail and additional test case into separate PRs! |
Good call. There's no impact on the etcd client LeaseKeepAlive. Opened a separate PR #21147 to prove that the implementation today is already returning NoLeader error as expected. And since the etcd client wrapper ( These two PRs are independent, either one can be merged first. |
Do we have a test for that? |
| switch { | ||
| case errorspkg.Is(err, context.DeadlineExceeded): | ||
| return -1, errors.ErrTimeout | ||
| case errorspkg.Is(err, context.Canceled): | ||
| return -1, errors.ErrCanceled | ||
| // Should be unreachable, but we keep it defensive. | ||
| default: | ||
| s.Logger().Warn("Unexpected lease renew context error", zap.Error(err)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the exact issue you are fixing? If there is no a real issue, suggest not to change this. It might have impact on client side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- The issue is LeaseKeepAlive Unavailable #13632 and it might (we don't have enough info from the issue) have multiple root causes. I identified one: When the client cancels the LeaseKeepAlive stream, the server will return gRPC NoLeader error, but it should have generated Canceled error code instead.
- The fix is done in
server/etcdserver/v3_server.gofile. - The change here is refactoring (no behavior change). The previous implementation is saying "if the error is not ErrTimeout, we return ErrCanceled by default", but the fact is "The error is always either ErrTimeout or ErrCanceled". It's impossible to get other error code (I have
TestV3LeaseKeepAliveForwardingCatchErrorto prove it), so I refactor it to make this fact more obvious to readers.
| } | ||
| } | ||
|
|
||
| if errorspkg.Is(cctx.Err(), context.DeadlineExceeded) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding line 404, could we just keep it as it is right now? I don’t think it’s related to this fix.
-
time.Sleep(50 * time.Millisecond) is a short delay, and the loop will already break if any error occurs.
-
cctxis created usingcontext.WithTimeout, notWithCancelCause,WithDeadlineCause, orWithTimeoutCause. If I understand correctly, the error should only becontext.DeadlineExceededorcontext.Canceled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep you're right it's not related to the fix and it's just a mirror speedup to exit earlier when error occurs. Reverted this.
fuweid
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Left comment for changes in LeaseRenew function
| // 2. Server cancellation: the client ctx is wrapped with WithRequireLeader, | ||
| // monitorLeader() detects no leader and thus cancels this stream with ErrGRPCNoLeader. | ||
| // 3. Server cancellation: the server is shutting down. | ||
| err = stream.Context().Err() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change looks good. We don’t need to adjust that error handling. If the issue is caused by monitorLeader, RenewLeader will be canceled and will return no-leader to the client as well. At least we don’t log canceled events for the no-leader case.
etcd/server/etcdserver/v3_server.go
Lines 543 to 561 in 6d5e199
| func (s *EtcdServer) waitLeader(ctx context.Context) (*membership.Member, error) { | |
| leader := s.cluster.Member(s.Leader()) | |
| for leader == nil { | |
| // wait an election | |
| dur := time.Duration(s.Cfg.ElectionTicks) * time.Duration(s.Cfg.TickMs) * time.Millisecond | |
| select { | |
| case <-time.After(dur): | |
| leader = s.cluster.Member(s.Leader()) | |
| case <-s.stopping: | |
| return nil, errors.ErrStopped | |
| case <-ctx.Done(): | |
| return nil, errors.ErrNoLeader | |
| } | |
| } | |
| if len(leader.PeerURLs) == 0 { | |
| return nil, errors.ErrNoLeader | |
| } | |
| return leader, nil | |
| } |
- If it's timeout and client cancels it, client won't receive response from server
- If it's timeout and it's caused by monitorLeader, client should receive no-leader error. This patch doesn't change this behaviour.
(Line 554)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep! Your analysis is exactly correct!
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: fuweid, zhijun42 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
d899fa7 to
b4f0d00
Compare
Signed-off-by: Zhijun <[email protected]>
b4f0d00 to
3102816
Compare
|
@zhijun42: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Yes, I didn't explain clearly. Recall that we only have 3 scenarios when the stream context is Done:
So there's no behavior impact on the etcd client. |
Follow-up to the previous reproduction PR #21050. Refer there for full context.