hugehoo
Contributor

@hugehoo hugehoo commented Aug 18, 2025

Fixes: #8435

Root cause of the issue:

  • I think there was a race condition in the channel communication between the xDS resolver and the test infrastructure:
    • Insufficient buffer size: the original channels (stateCh and errCh) had a buffer size of only 1.
    • Blocking sends: when the buffer is full, the resolver blocks trying to send the next update.
    • Test deadlock: the test infra might be waiting for a specific update while the resolver is blocked trying to send a different one, creating a deadlock.

Changes

  1. Increased the buffer size (1 → 10):
  stateCh := make(chan resolver.State, 10)
  errCh := make(chan error, 10)
  2. Non-blocking send pattern:
  select {
  case stateCh <- s: // the resolver tries to send the update
  default: // if the channel is full, drain the oldest message and retry
      select {
      case <-stateCh:
          stateCh <- s
      default:
      }
  }
  • This drains old messages so that the resolver never blocks and only the latest update is kept.
  3. Cleanup with draining goroutines:
  go func() {
      for range stateCh {} // drain any remaining messages
  }()
  • This ensures the resolver never blocks on sends and prevents goroutine leaks during test cleanup.
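
The drain-and-retry idea from item 2 can be sketched in isolation. This is a minimal, self-contained illustration and not the code from this PR: sendLatest is a hypothetical helper name, the int payload stands in for resolver.State, and the sketch assumes a single sender. Unlike the snippet above, the retry is itself non-blocking, so the helper reports a drop instead of blocking if another sender refills the buffer first.

```go
package main

import "fmt"

// sendLatest is a hypothetical helper: it tries to send v without blocking.
// If the buffer is full, it discards one queued value and tries once more.
// It reports whether v was queued. It assumes a single sender.
func sendLatest[T any](ch chan T, v T) bool {
	select {
	case ch <- v:
		return true
	default:
	}
	// Buffer full: drop the oldest queued value to make room.
	select {
	case <-ch:
	default:
	}
	// Retry without blocking; give up if the buffer was refilled meanwhile.
	select {
	case ch <- v:
		return true
	default:
		return false
	}
}

func main() {
	ch := make(chan int, 1)
	for i := 1; i <= 3; i++ {
		sendLatest(ch, i) // never blocks, even though nobody is receiving
	}
	fmt.Println(<-ch) // prints 3
}
```

The trade-off is that intermediate updates are silently discarded, so a test that needs to observe every update cannot rely on this pattern.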

RELEASE NOTES: N/A


codecov bot commented Aug 18, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 80.70%. Comparing base (8420f3f) to head (657c28a).
⚠️ Report is 3 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8521      +/-   ##
==========================================
- Coverage   80.91%   80.70%   -0.21%     
==========================================
  Files         413      413              
  Lines       40751    40773      +22     
==========================================
- Hits        32972    32907      -65     
- Misses       6155     6224      +69     
- Partials     1624     1642      +18     

see 17 files with indirect coverage changes

@hugehoo hugehoo force-pushed the flaky-test-nackedWithoutCache branch from ef9f9cb to 9352248 Compare August 18, 2025 16:29
@hugehoo hugehoo marked this pull request as ready for review August 18, 2025 16:37
@arjan-bal
Contributor

Hi @hugehoo, I have a few questions/requests to help me review this fix:

  1. Can you describe the root cause in the linked issue?
  2. Can you also explain the fix in the PR description?
  3. Were you able to repro the flakiness? If yes, can you mention the go test command that was used?

@arjan-bal arjan-bal self-requested a review August 21, 2025 06:56
@arjan-bal arjan-bal added this to the 1.76 Release milestone Aug 21, 2025
@hugehoo
Contributor Author

hugehoo commented Aug 23, 2025

@arjan-bal I updated the PR description as you requested for 1 and 2, but I haven't been able to reproduce the flakiness yet.

@arjan-bal arjan-bal assigned arjan-bal and unassigned hugehoo Aug 26, 2025
@arjan-bal
Contributor

arjan-bal commented Sep 4, 2025

I was able to repro the failure by running the test around 1000 times and adding logs. The problem is indeed a write to the stateCh blocking indefinitely.

updateStateF := func(s resolver.State) error {
	stateCh <- s
	return nil
}

Once the xDS client receives an invalid LDS resource, it informs the resolver. This causes one write each to errCh and stateCh, because the xDS resolver sends an empty service config before reporting an error here:

if ok && len(r.activeClusters) == 0 {
	// There are no clusters and we are sending a failing configSelector.
	// Send an empty config, which picks pick-first, with no address, and
	// puts the ClientConn into transient failure.
	//
	// This call to UpdateState is expected to return ErrBadResolverState
	// since pick_first doesn't like an update with no addresses.
	r.cc.UpdateState(resolver.State{ServiceConfig: r.cc.ParseServiceConfig("{}")})
	// Send a resolver error to pick_first so that RPCs will fail with a
	// more meaningful error, as opposed to one that says that pick_first
	// received no addresses.
	r.cc.ReportError(errCS.err)
	return true
}

Then the xDS client sends a NACK for the invalid LDS resource and the xDS management server re-sends the same resource, causing a second write to both channels. The write to stateCh blocks the xDS resolver forever.

I think increasing the buffer capacity will just make the deadlock more unlikely, but not impossible. Dropping state updates if the buffered channel is full may cause races if a test wants to verify every state update. I'm trying to think of better ways to resolve this, potentially by allowing the test functions to pass callbacks for handling the error and state updates, instead of using a channel to pass them.

@arjan-bal
Contributor

arjan-bal commented Sep 4, 2025

There are 19 usages of buildResolverForTarget which may need to be updated.

func buildResolverForTarget(t *testing.T, target resolver.Target, bootstrapContents []byte) (chan resolver.State, chan error, resolver.Resolver) {

Solution 1 (Simplest)

Don't use buildResolverForTarget in the flaking test. Duplicate its internals in the flaking test and get rid of the stateCh. This would result in around 25 lines of duplicated code.

Solution 2

We could change buildResolverForTarget to accept callbacks for error and state updates, making the signature similar to the following:

type resolverOpts struct {
	target            resolver.Target
	bootstrapContents []byte
	onError           func(error)
	onResolverState   func(resolver.State) error
}

func buildResolverForTarget(t *testing.T, opts resolverOpts) resolver.Resolver

This would allow callers to get the previous behaviour by calling the function similar to the following:

stateCh := make(chan resolver.State, 1)
updateStateF := func(s resolver.State) error {
    stateCh <- s
    return nil
}

errCh := make(chan error, 1)
reportErrorF := func(err error) {
    select {
    case errCh <- err:
    default:
    }
}

opts := resolverOpts{
    target:            target,
    bootstrapContents: bootstrapContents,
    onError:           reportErrorF,
    onResolverState:   updateStateF,
}
r := buildResolverForTarget(t, opts)

This would also allow callers to ignore state updates if they want.

Solution 3

To avoid updating 19 tests and adding redundant code, we can make the changes from option 2, but in a new function with that signature, say buildResolverForTargetWithOpts. The existing buildResolverForTarget would create the channels and delegate to the new function. This way the existing callers don't need to change, and only the flaking test calls buildResolverForTargetWithOpts directly.

I'm leaning towards option 1, @easwars wanted to get a second opinion on this.

@arjan-bal arjan-bal assigned easwars and unassigned arjan-bal Sep 4, 2025
@easwars
Contributor

easwars commented Sep 8, 2025

Scenarios that involve NACKs aren't great with the go-control-plane management server because it keeps re-sending the same resource that was NACKed. This means that if we write the resource/update/error to a channel based on information from these responses, the writes will eventually block.

I'm fine with option 1, since it seems to be the least invasive.

Another thing that could be added is the use of the testutils.Channel type instead of a vanilla channel. The former provides an API to replace the most recent value written to the channel, and we use this approach in some tests that expect NACKs from the management server. See: https://github.com/grpc/grpc-go/blob/master/internal/testutils/channel.go#L100

@easwars easwars assigned hugehoo and unassigned easwars Sep 8, 2025
@arjan-bal
Contributor

Hi @hugehoo are you still working on this fix? Based on the last two comments above, we can avoid using the buildResolverForTarget and copy its internals to remove the stateCh.

@hugehoo
Contributor Author

hugehoo commented Sep 11, 2025

Hi @hugehoo are you still working on this fix? Based on the last two comments above, we can avoid using the buildResolverForTarget and copy its internals to remove the stateCh.

Yes, I will update the PR to copy the buildResolverForTarget logic into the flaky test.

@hugehoo hugehoo force-pushed the flaky-test-nackedWithoutCache branch from 1a06368 to 9352248 Compare September 11, 2025 14:03
@hugehoo hugehoo force-pushed the flaky-test-nackedWithoutCache branch from 9352248 to 2bae182 Compare September 11, 2025 14:08
@hugehoo
Contributor Author

hugehoo commented Sep 11, 2025

@arjan-bal I updated the flaky test following option 1.

  • Implemented the buildResolverForTarget logic inside TestResolverBadServiceUpdate_NACKedWithoutCache.

  • Instead of calling waitForErrorFromResolver, I also implemented its logic directly because of a type incompatibility:
    waitForErrorFromResolver expects a chan error, but testutils.Channel has a .C field of type chan any.

	if err := waitForErrorFromResolver(ctx, errCh, "no RouteSpecifier", nodeID); err != nil {

Contributor

@arjan-bal arjan-bal left a comment

Looks mostly good, left a couple of minor comments.

@hugehoo hugehoo requested a review from arjan-bal September 12, 2025 08:30
@hugehoo hugehoo requested a review from arjan-bal September 15, 2025 11:07
Contributor

@arjan-bal arjan-bal left a comment

LGTM, thanks for the fix!

@arjan-bal arjan-bal changed the title from "fix: Flaky test ResolverBadServiceUpdate_NACKedWithoutCache" to "xds/resolver_test: fix flaky test ResolverBadServiceUpdate_NACKedWithoutCache" Sep 15, 2025
@arjan-bal arjan-bal assigned arjan-bal and unassigned hugehoo Sep 15, 2025
@arjan-bal arjan-bal merged commit ca78c90 into grpc:master Sep 15, 2025
17 checks passed
@easwars
Contributor

easwars commented Sep 15, 2025

LGTM

Development

Successfully merging this pull request may close these issues.

Flaky test: Test/ResolverBadServiceUpdate_NACKedWithoutCache
3 participants