xds/resolver_test: fix flaky test ResolverBadServiceUpdate_NACKedWithoutCache #8521

hugehoo · 2025-08-18T16:16:46Z

Root cause of the issue:

I think there was a race condition in the channel communication between the xDS resolver and the test infrastructure:
- Insufficient buffer size: the original channels (stateCh and errCh) had a buffer size of only 1.
- Blocking sends: when the buffer is full, the resolver blocks trying to send the next update.
- Test deadlock: the test infrastructure might be waiting for a specific update while the resolver is blocked trying to send a different one, creating a deadlock.

Changes

Increased buffer size (1 → 10):

  stateCh := make(chan resolver.State, 10)
  errCh := make(chan error, 10)

Non-blocking send pattern:

  select {
  case stateCh <- s: // the resolver tries to send the update
  default: // if the channel is full, drain the oldest message and retry
      select {
      case <-stateCh:
          stateCh <- s
      default:
      }
  }

This drains old messages so that the resolver never blocks, keeping only the latest updates.

Cleanup with draining goroutines:

  go func() {
      for range stateCh { }  // Drain any remaining messages
  }()

This ensures the resolver never blocks on sends and prevents goroutine leaks during test cleanup.

RELEASE NOTES: N/A

codecov · 2025-08-18T16:21:43Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 80.70%. Comparing base (8420f3f) to head (657c28a).
⚠️ Report is 3 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #8521      +/-   ##
==========================================
- Coverage   80.91%   80.70%   -0.21%     
==========================================
  Files         413      413              
  Lines       40751    40773      +22     
==========================================
- Hits        32972    32907      -65     
- Misses       6155     6224      +69     
- Partials     1624     1642      +18

see 17 files with indirect coverage changes

arjan-bal · 2025-08-21T06:56:27Z

Hi @hugehoo, I have a few questions/requests to help me review this fix:

1. Can you describe the root cause in the linked issue?
2. Can you also explain the fix in the PR description?
3. Were you able to repro the flakiness? If yes, can you mention the go test command that was used?

hugehoo · 2025-08-23T17:01:46Z

@arjan-bal I updated the PR description as you requested for 1 and 2, but I can't reproduce the flakiness yet.

arjan-bal · 2025-09-04T16:28:54Z

I was able to repro the failure by running the test around 1000 times and adding logs. The problem is indeed a write to the stateCh blocking indefinitely.

grpc-go/internal/xds/resolver/helpers_test.go

Lines 122 to 125 in 6524c7b

    
           updateStateF := func(s resolver.State) error { 
        
           	stateCh <- s 
        
           	return nil 
        
           }

Once the xDS client receives an invalid LDS resource, it informs the resolver. This causes one write each to errCh and stateCh, as the xDS resolver sends an empty service config before reporting an error here:

grpc-go/internal/xds/resolver/xds_resolver.go

Lines 294 to 308 in 6524c7b

    
           if ok && len(r.activeClusters) == 0 { 
        
           	// There are no clusters and we are sending a failing configSelector. 
        
           	// Send an empty config, which picks pick-first, with no address, and 
        
           	// puts the ClientConn into transient failure. 
        
           	// 
        
           	// This call to UpdateState is expected to return ErrBadResolverState 
        
           	// since pick_first doesn't like an update with no addresses. 
        
           	r.cc.UpdateState(resolver.State{ServiceConfig: r.cc.ParseServiceConfig("{}")}) 
        
           	// Send a resolver error to pick_first so that RPCs will fail with a 
        
           	// more meaningful error, as opposed to one that says that pick_first 
        
           	// received no addresses. 
        
           	r.cc.ReportError(errCS.err) 
        
           	return true 
        
           }

Then the xDS client sends a NACK for the invalid LDS resource and the xDS management server re-sends the same resource, causing a second write to both channels. The write to stateCh blocks the xDS resolver forever.

I think increasing the buffer capacity will just make the deadlock more unlikely, but not impossible. Dropping state updates if the buffered channel is full may cause races if a test wants to verify every state update. I'm trying to think of better ways to resolve this, potentially by allowing the test functions to pass callbacks for handling the error and state updates, instead of using a channel to pass them.

arjan-bal · 2025-09-04T17:02:04Z

There are 19 usages of buildResolverForTarget which may need to be updated.

grpc-go/internal/xds/resolver/helpers_test.go

Line 100 in 6524c7b

    
           func buildResolverForTarget(t *testing.T, target resolver.Target, bootstrapContents []byte) (chan resolver.State, chan error, resolver.Resolver) {

Solution 1 (Simplest)

Don't use buildResolverForTarget in the flaking test. Duplicate its internals in the flaking test and get rid of the stateCh. This would result in around 25 lines of duplicated code.

Solution 2

We could change buildResolverForTarget to accept callbacks for error and state updates, making the signature similar to the following:

type resolverOpts struct {
	target            resolver.Target
	bootstrapContents []byte
	onError           func(error)
	onResolverState   func(resolver.State) error
}

func buildResolverForTarget(t *testing.T, opts resolverOpts) resolver.Resolver

This would allow callers to get the previous behaviour by calling the function similar to the following:

stateCh := make(chan resolver.State, 1)
updateStateF := func(s resolver.State) error {
    stateCh <- s
    return nil
}

errCh := make(chan error, 1)
reportErrorF := func(err error) {
    select {
    case errCh <- err:
    default:
    }
}

opts := resolverOpts{
    target:            target,
    bootstrapContents: bootstrapContents,
    onError:           reportErrorF,
    onResolverState:   updateStateF,
}
r := buildResolverForTarget(t, opts)

This would also allow callers to ignore state updates if they want.

Solution 3

To avoid updating 19 tests and adding redundant code, we can make changes in option 2, but make a new function with this signature, say buildResolverForTargetWithOpts. Have the existing buildResolverForTarget create the channels and delegate to the new function. This way the existing callers don't need to be changed and only the flaking test can call buildResolverForTargetWithOpts directly.

I'm leaning towards option 1. @easwars, wanted to get a second opinion on this.

easwars · 2025-09-08T20:07:51Z

Scenarios that involve NACKs aren't great with the go-control-plane management server because it will continuously keep re-sending the same resource that was NACKed. This means that if we are writing the resource/update/error to a channel based on information from these responses, this will always lead to blocked writes on it.

I'm fine with option 1, since it seems to be the least invasive.

Another thing that could be added is the use of the testutils.Channel type instead of a vanilla channel. The former provides an API to replace the most recent value written to the channel, and we use this approach in some tests that expect NACKs from the management server. See: https://github.com/grpc/grpc-go/blob/master/internal/testutils/channel.go#L100

arjan-bal · 2025-09-11T06:38:48Z

Hi @hugehoo are you still working on this fix? Based on the last two comments above, we can avoid using the buildResolverForTarget and copy its internals to remove the stateCh.

hugehoo · 2025-09-11T07:28:46Z

> Hi @hugehoo are you still working on this fix? Based on the last two comments above, we can avoid using the buildResolverForTarget and copy its internals to remove the stateCh.

Yes, I will update the flaky test to copy the buildResolverForTarget logic.

hugehoo · 2025-09-11T16:22:37Z

@arjan-bal I updated the flaky test following option 1.

I implemented the buildResolverForTarget logic inside TestResolverBadServiceUpdate_NACKedWithoutCache.
Instead of calling waitForErrorFromResolver as in buildResolverForTarget, I also implemented the waitForErrorFromResolver logic directly because of a type incompatibility:
waitForErrorFromResolver expects a chan error, but testutils.Channel has a .C field of type chan any.

	if err := waitForErrorFromResolver(ctx, errCh, "no RouteSpecifier", nodeID); err != nil {

arjan-bal

Looks mostly good, left a couple of minor comments.

internal/xds/resolver/xds_resolver_test.go

arjan-bal

LGTM, thanks for the fix!

easwars · 2025-09-15T21:22:09Z

LGTM

hugehoo force-pushed the flaky-test-nackedWithoutCache branch from ef9f9cb to 9352248 Compare August 18, 2025 16:29

hugehoo marked this pull request as ready for review August 18, 2025 16:37

arjan-bal self-requested a review August 21, 2025 06:56

arjan-bal assigned hugehoo Aug 21, 2025

arjan-bal added this to the 1.76 Release milestone Aug 21, 2025

arjan-bal added the Type: Testing label Aug 21, 2025

arjan-bal assigned arjan-bal and unassigned hugehoo Aug 26, 2025

arjan-bal assigned easwars and unassigned arjan-bal Sep 4, 2025

easwars assigned hugehoo and unassigned easwars Sep 8, 2025

hugehoo force-pushed the flaky-test-nackedWithoutCache branch from 1a06368 to 9352248 Compare September 11, 2025 14:03

update buildResolverForTarget

2bae182

hugehoo force-pushed the flaky-test-nackedWithoutCache branch from 9352248 to 2bae182 Compare September 11, 2025 14:08

hugehoo added 4 commits September 12, 2025 00:32

implement buildResolverForTarget internal

43131c6

rollback buildResolverForTarget implements

4d80b6b

rollback auto sorting

98e2aa4

fix code

708a398

arjan-bal reviewed Sep 12, 2025

adapts review

c0770b4

fix defer behind of err check

f39bd67

hugehoo requested a review from arjan-bal September 12, 2025 08:30

arjan-bal reviewed Sep 15, 2025

remove if bc != nil surrounds

657c28a

hugehoo requested a review from arjan-bal September 15, 2025 11:07

arjan-bal approved these changes Sep 15, 2025

arjan-bal changed the title ~~fix: Flaky test ResolverBadServiceUpdate_NACKedWithoutCache~~ xds/resolver_test: fix flaky test ResolverBadServiceUpdate_NACKedWithoutCache Sep 15, 2025

arjan-bal assigned arjan-bal and unassigned hugehoo Sep 15, 2025

arjan-bal merged commit ca78c90 into grpc:master Sep 15, 2025
17 checks passed
