-
Notifications
You must be signed in to change notification settings - Fork 4.6k
xds/resolver_test: fix flaky test ResolverBadServiceUpdate_NACKedWithoutCache #8521
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #8521 +/- ##
==========================================
- Coverage 80.91% 80.70% -0.21%
==========================================
Files 413 413
Lines 40751 40773 +22
==========================================
- Hits 32972 32907 -65
- Misses 6155 6224 +69
- Partials 1624 1642 +18 🚀 New features to boost your workflow:
|
ef9f9cb
to
9352248
Compare
Hi @hugehoo I have a few questions/requests to help me reviewing this fix:
|
@arjan-bal i updated PR comment as you mentioned for 1, 2. but can't reproduce the flakiness yet. |
I was able to repro the failure by running the test around 1000 times and adding logs. The problem is indeed a write to the grpc-go/internal/xds/resolver/helpers_test.go Lines 122 to 125 in 6524c7b
Once the xDS client receives an invalid LDS resource, it informs the resolver. This causes 1 write to the errCh and stateCh as the xDS resolver sends an empty service config before reporting an error here: grpc-go/internal/xds/resolver/xds_resolver.go Lines 294 to 308 in 6524c7b
Then the xDS client sends a NACK for the invalid LDS resource and the xDS management server re-sends the same resource again, causing a second write to both the channels. The write to the stateCh blocks the xds resolver forever. I think increasing the buffer capacity will just make the deadlock more unlikely, but not impossible. Dropping state updates if the buffered channel is full may cause races if a test wants to verify every state update. I'm trying to think of better ways to resolve this, potentially by allowing the test functions to pass callbacks for handling the error and state updates, instead of using a channel to pass them. |
There are 19 usages of grpc-go/internal/xds/resolver/helpers_test.go Line 100 in 6524c7b
Solution 1 (Simplest)Don't use Solution 2We could change type resolverOpts struct {
target resolver.Target
bootstrapContents []byte
onError func()
onResolverState func()
}
func buildResolverForTarget(t *testing.T, opts resolverOpts) resolver.Resolver This would allow callers to get the previous behaviour by calling the function similar to the following: stateCh := make(chan resolver.State, 1)
updateStateF := func(s resolver.State) error {
stateCh <- s
return nil
}
errCh := make(chan error, 1)
reportErrorF := func(err error) {
select {
case errCh <- err:
default:
}
}
opts := resolverOpts{
target: target,
bootstrapContents: bootstrapContents,
onError: reportErrorF,
onResolverState: updateStateF,
}
r := buildResolverForTarget(t, opts) This would also allow callers to ignore state updates if they want. Solution 3To avoid updating 19 tests and adding redundant code, we can make changes in option 2, but make a new function with this signature, say I'm leaning towards option 1, @easwars wanted to get a second opinion on this. |
Scenarios that involve NACKs aren't great with the go-control-plane management server because it will continuously keep re-sending the same resource that was NACKed. This means that if we are writing the resource/update/error to channel based on information from these responses, this will always lead to blocked writes on them. I'm fine with option 1, since it seems to be the least invasive. Another thing that could be added is the use of the |
Hi @hugehoo are you still working on this fix? Based on the last two comments above, we can avoid using the |
yes, i will update as to copy the |
1a06368
to
9352248
Compare
9352248
to
2bae182
Compare
@arjan-bal I updated flaky tests following option1.
if err := waitForErrorFromResolver(ctx, errCh, "no RouteSpecifier", nodeID); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks mostly good, left a couple of minor comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for the fix!
LGTM |
Fixes: #8435
root cause of issue:
Changes
goroutine leaks
during test cleanup.RELEASE NOTES: N/A