@eshitachandwani
Member

Part of A74 changes.
This PR adds the functions needed for cluster subscriptions, covering both cluster reference counting and dynamic cluster subscription. These functions will be used in subsequent PRs.
RELEASE NOTES: None

@eshitachandwani eshitachandwani added this to the 1.79 Release milestone Dec 26, 2025
@eshitachandwani eshitachandwani added Type: Internal Cleanup Refactors, etc Area: xDS Includes everything xDS related, including LB policies used with xDS. labels Dec 26, 2025
@codecov

codecov bot commented Dec 26, 2025

Codecov Report

❌ Patch coverage is 67.34694% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.32%. Comparing base (4046676) to head (b656507).
⚠️ Report is 17 commits behind head on master.

Files with missing lines Patch % Lines
internal/xds/xdsdepmgr/xds_dependency_manager.go 63.41% 12 Missing and 3 partials ⚠️
internal/xds/resolver/xds_resolver.go 87.50% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8792      +/-   ##
==========================================
- Coverage   83.42%   83.32%   -0.10%     
==========================================
  Files         418      417       -1     
  Lines       32897    32963      +66     
==========================================
+ Hits        27443    27466      +23     
- Misses       4069     4094      +25     
- Partials     1385     1403      +18     
Files with missing lines Coverage Δ
internal/xds/resolver/xds_resolver.go 88.70% <87.50%> (-0.06%) ⬇️
internal/xds/xdsdepmgr/xds_dependency_manager.go 80.17% <63.41%> (-0.71%) ⬇️

... and 55 files with indirect coverage changes

arjan-bal previously approved these changes Jan 7, 2026
Contributor

@arjan-bal arjan-bal left a comment

Some minor comments, otherwise LGTM!

ctx, cancel := context.WithTimeout(context.Background(), defaultTestTimeout)
defer cancel()

configureAllResourcesOnManagementServer(ctx, t, mgmtServer, nodeID, listeners, route, clusters, endpoints)
Contributor

nit: This function name is pretty long. Can it be shortened to something like setupManagementServer? The AllResourcesOn part doesn't seem to add much value.

Member Author

We have a separate setupManagementServerForTest function which starts the management server. And we already had a function called configureResourcesOnManagementServer which configures listener and route resources only.
We can make it configureAllResources if that sounds better.

Contributor

Having multiple helpers that do very similar things is a bit confusing. Could we modify an existing function to support the extra functionality needed for these tests? Or do you think that would make the function too complex?

@arjan-bal arjan-bal dismissed their stale review January 7, 2026 21:05

Questions regarding the cluster subscription API.

if _, ok := m.clustersFromRouteConfig[name]; !ok {
	m.maybeSendUpdateLocked()
}
return m.clusterSubscriptions[name].unsubscribe
Contributor

@arjan-bal arjan-bal Jan 8, 2026

Can you wrap this in a sync.Once so that it's safe for callers to call it multiple times? This would make the API safer. Also mention in the godoc that the returned function is idempotent.

Contributor

In the Close method, can you add an error log if the clusterSubscriptions map is non-empty? This would serve as a leak check and help tests catch subscribers that fail to release their references in time.
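
A rough sketch of what such a leak check could look like, assuming the subscriptions are reduced to a map from cluster name to reference count. The depManager type, leakedSubscriptions, and the logf parameter are hypothetical illustrations, not the code in this PR.

```go
package main

import (
	"log"
	"sort"
)

// depManager is a hypothetical, simplified dependency manager. In this
// sketch clusterSubscriptions maps a cluster name to its reference count.
type depManager struct {
	clusterSubscriptions map[string]int
}

// leakedSubscriptions returns the sorted names of clusters that still hold
// at least one reference.
func (m *depManager) leakedSubscriptions() []string {
	var names []string
	for name, refs := range m.clusterSubscriptions {
		if refs > 0 {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

// close reports any subscriptions that were not released and then drops
// the map. logf is injected so tests can capture the message.
func (m *depManager) close(logf func(format string, args ...any)) {
	if leaked := m.leakedSubscriptions(); len(leaked) > 0 {
		logf("%d cluster subscription(s) not released at close: %v", len(leaked), leaked)
	}
	m.clusterSubscriptions = nil
}

func main() {
	m := &depManager{clusterSubscriptions: map[string]int{"cluster-a": 1}}
	m.close(log.Printf) // logs one message naming cluster-a
}
```

Injecting the logger keeps the check testable: a test can pass a function that records messages and fail if any were emitted.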

Member Author

If there are ongoing RPCs when Close is called, there will still be subscriptions to the cluster. That is an expected case, so I am not sure an error log would be useful.

Contributor

The resolver should only be closed when the channel is closed (or when it enters IDLE, at which point there are no ongoing RPCs). When the channel is closed, any ongoing RPCs should fail. I suspect the issue is that the resolver is being closed before the RPCs fail; consequently, the Unsubscribe call might be arriving too late for the Close method to track it.

grpc-go/clientconn.go

Lines 1213 to 1228 in b3603ab

cc.resolverWrapper.close()
// The order of closing matters here since the balancer wrapper assumes the
// picker is closed before it is closed.
cc.pickerWrapper.close()
cc.balancerWrapper.close()
<-cc.resolverWrapper.serializer.Done()
<-cc.balancerWrapper.serializer.Done()
var wg sync.WaitGroup
for ac := range conns {
	wg.Add(1)
	go func(ac *addrConn) {
		defer wg.Done()
		ac.tearDown(ErrClientConnClosing)
	}(ac)
}

We likely need a leak check to track the execution of the callbacks returned by the ClusterSubscription method. @mbissa recently added something similar for async gauge metrics here, which should be easy to reuse. This can be done in a follow-up PR.

3 participants