xds/clusterresolver: Revise configbuilder childname generator to fix leakage with merge and split scenarios #8531
Conversation
Codecov Report

✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff            @@
##           master    #8531      +/- ##
========================================
  Coverage   81.99%   82.00%
========================================
  Files         412      412
  Lines       40520    40457      -63
========================================
- Hits        33225    33177      -48
+ Misses       5904     5903       -1
+ Partials     1391     1377      -14
The intended behavior for this should be (?) documented by: https://github.com/grpc/proposal/blob/4a8687b77fd19cd6fb445812ee4e612f86d92e5f/A37-xds-aggregate-and-logical-dns-clusters.md#xds_cluster_resolver_experimental-lb-policy

Anything that deviates from that would need a gRFC or at least cross-language agreement.
It's not specifically planned at this time, but there is upcoming work that might make this a priority for us. We're pretty swamped with things this week, so apologies if we are slow to respond to things.
Thanks for the reply and the heads-up. We can certainly start benefiting from the fix using our own fork while waiting, no hurry there. Skimming through the gRFCs, they don't seem to specify how name generation and the reuse pattern should work, so we should be able to legitimately change the namegen logic to prevent unbounded leakage. So there shouldn't be a blocker for the namegen change? However, changing how zone-aware LB is implemented in order to avoid subconn churn could certainly create friction with the gRFC specs.
Also, FYI, the A74 implementation is underway in grpc-go, but could take a few weeks to land completely. And when that happens, the …
Yes, that is right! Since the name generator is there to generate names for priorities, it will still be used, but in the CDS LB policy.
I was looking up the history of the child re-use code. The name generation and child balancer re-use logic was added in #5268. The algorithm used to generate the names is described in an internal doc: go/grpc-xds-policy-refactoring

The reason for incrementing the numbers used in the child name is described in #5268 (comment): it ensures the fallback timer is started when a new child balancer is created. Incorrectly re-using a deactivated child balancer could result in the fallback timer not being restarted and an immediate fallback to a lower priority. This may not be a concern with the change in this PR, since it's guaranteed that there is a common locality in the child that is being re-used. @easwars, can you confirm this?

There may be a concern with storing previously seen locality names in a map and never cleaning them up. I think this should be solvable by restricting the map's size, as sketched below.

It looks to me like the changes to the name generation algorithm in the alternate approach, #8532, are always better than the existing algorithm. @erenboz, can you confirm if this is true? I'm discussing this with other maintainers.
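For illustration only, here is a minimal sketch of such a size-capped map; the boundedNames type, the LRU eviction policy, and all identifiers below are hypothetical and are not from this PR or from grpc-go.

// Hypothetical sketch of a size-capped locality-name map; not from this PR.
package namegen

import "container/list"

// boundedNames remembers the child name last assigned to each locality,
// evicting the least recently used entry once maxEntries is exceeded so
// the map cannot grow without bound.
type boundedNames struct {
	maxEntries int
	byLocality map[string]*list.Element // locality -> LRU list element
	order      *list.List               // front = most recently used
}

type nameEntry struct {
	locality string
	name     string
}

func newBoundedNames(maxEntries int) *boundedNames {
	return &boundedNames{
		maxEntries: maxEntries,
		byLocality: make(map[string]*list.Element),
		order:      list.New(),
	}
}

// put records the name for a locality, evicting the oldest entry if the
// cap is exceeded.
func (b *boundedNames) put(locality, name string) {
	if el, ok := b.byLocality[locality]; ok {
		el.Value.(*nameEntry).name = name
		b.order.MoveToFront(el)
		return
	}
	b.byLocality[locality] = b.order.PushFront(&nameEntry{locality: locality, name: name})
	if b.order.Len() > b.maxEntries {
		oldest := b.order.Back()
		b.order.Remove(oldest)
		delete(b.byLocality, oldest.Value.(*nameEntry).locality)
	}
}

// get returns the remembered name for a locality, if any, marking it as
// recently used.
func (b *boundedNames) get(locality string) (string, bool) {
	el, ok := b.byLocality[locality]
	if !ok {
		return "", false
	}
	b.order.MoveToFront(el)
	return el.Value.(*nameEntry).name, true
}

Capping by LRU means long-gone localities eventually stop pinning names, while recently active ones keep their entries.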
Sure, that'd be an OK defensive addition; it shouldn't affect reasonable environments :).
Yeah @arjan-bal, the alternative does indeed behave a bit better. Since there isn't an established benchmark criterion, I didn't actively spend time picking one over the other. We can build on the alternative, or if there are additional criteria we want to meet we can try to come up with another approach; anything works as long as the number of generated names is bounded by the number of localities.
I feel that we are handling that case explicitly here:
inputs: [][][]xdsresource.Locality{
	{
		{{ID: clients.Locality{Zone: "L0"}}, {ID: clients.Locality{Zone: "L1"}}},
		{{ID: clients.Locality{Zone: "L2"}}},
	},
	{
		{{ID: clients.Locality{Zone: "L0"}}},
		{{ID: clients.Locality{Zone: "L1"}}}, // This gets a newly generated name, since "0-0" was already picked.
		{{ID: clients.Locality{Zone: "L2"}}},
	},
},
want: []string{"priority-0-0", "priority-0-1", "priority-0-2"},
At the end of the first input step, the generated names should be: []string{"priority-0-0", "priority-0-1"}.

When the next step happens, and we are processing the first priority in the list, we will reuse the name priority-0-0. When we process the next priority in the list, and see that L1 was present in the previous config but its name is already used in the new config, we should ideally generate priority-0-2 for it, as priority-0-1 is already used in the previous config. So, the end output should be: []string{"priority-0-0", "priority-0-2", "priority-0-1"}.

Am I missing something?
With the old algorithm, yes. However, in this PR I've taken a different approach by generating names upon seeing a new locality; the alternative does it that way. If you look into this case, the old algorithm generates [priority-0-2 priority-0-3] for it, so when this pattern repeats over many updates we're constantly leaking priority LBs.

So the effort here is to have a new algorithm that does not generate an unbounded number of names, though there will likely be trade-offs in the new algorithm, in that merges in certain cases could be less than "ideal".
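For concreteness, here is a simplified, self-contained model of the counter-based reuse heuristic; it is NOT the actual grpc-go implementation, just enough to reproduce the unbounded growth on a repeated merge/split pattern like the one in the test case above.

// Simplified model of the counter-based reuse heuristic, for illustration
// only; this is NOT the actual grpc-go implementation.
package main

import "fmt"

type counterGen struct {
	nextID   int
	prevName map[string]string // locality -> name assigned in the previous update
}

// update names each priority (a set of localities), reusing the previous
// name of any member locality if that name is still free in this update,
// and minting a fresh "priority-0-<n>" name otherwise.
func (g *counterGen) update(priorities [][]string) []string {
	used := map[string]bool{}
	newNames := map[string]string{}
	var out []string
	for _, locs := range priorities {
		name := ""
		for _, l := range locs {
			if n, ok := g.prevName[l]; ok && !used[n] {
				name = n
				break
			}
		}
		if name == "" {
			name = fmt.Sprintf("priority-0-%d", g.nextID)
			g.nextID++
		}
		used[name] = true
		for _, l := range locs {
			newNames[l] = name
		}
		out = append(out, name)
	}
	g.prevName = newNames
	return out
}

func main() {
	g := &counterGen{prevName: map[string]string{}}
	merged := [][]string{{"L0", "L1"}, {"L2"}} // L0 and L1 share a priority
	split := [][]string{{"L0"}, {"L1"}, {"L2"}}
	for i := 0; i < 4; i++ {
		fmt.Println(g.update(merged)) // re-merging puts L1 back under L0's name...
		fmt.Println(g.update(split))  // ...so every split mints a fresh name for L1
	}
}

Running this prints priority-0-2, priority-0-3, priority-0-4, ... for the split step of each cycle: only three localities exist, yet the counter, and with it the set of child names ever handed out, grows without bound.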
When this happens, do you eventually see the number of connections come back down to normal levels?
@easwars if you run the test cases I added with the old algorithm, you can see that in some merge and split scenarios new names get generated. On certain occasions TD goes into such phases while trying to keep zone-aware balancing, leading to a lot of priority LBs being created and leaked. In practice the leak tends to slow down, and the connection count does come back down every now and then, but that's due to the cache timeout and the removal of those priority LBs.
We had a discussion on this with folks from different gRPC languages. The algorithm that is currently implemented was meant as a heuristic, so there can absolutely be cases where it is sub-optimal. The implementation in this PR and the other one (that is currently closed) might be optimal for your use case, but could prove sub-optimal for other use cases. So, I'm hesitant to change it. It might be time to prioritize the subchannel pooling implementation. @dfawley: What are your thoughts? @erenboz: What is your appetite for maintaining this as a patch in your deployment until we have something better, i.e., subchannel pooling?
@easwars that's doable; it's a fairly isolated patch. Subchannel pooling would certainly be the superior overall fix for the subconn lifecycle, as it would also get rid of subconns being destroyed and re-created whenever a locality is switched over. Nevertheless, the new chain of objects created by the logic of this algorithm still feels like a waste of CPU cycles and might eventually benefit from more stable re-use.
At my workplace, we heavily use the proxyless gRPC setup provided by GCP Cloud Service Mesh / Traffic Director (TD), which provides zone-aware load balancing, and for about two years we've been experiencing sporadic connection leaks that inflate the number of open connections to 5-10x of normal levels.

Recently I got to dig into the matter and found that TD heavily uses a mixture of the priority LB and weighted round-robin LB, switching between sending everything to the same zone via the priority LB and falling back to WRR, e.g. flapping between [L1], [L2, L3] and [L1 w:95, L2 w:5], [L3]. In such merge and split scenarios, the current name generator forgets/overrides the names and constantly generates new ones. In this PR, I implemented a somewhat different approach to name generation, based purely on generating a name upon first seeing a locality and then following the same reuse logic.
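As a rough, hypothetical sketch of that idea (this is NOT the PR's actual code; all identifiers are illustrative): each locality is assigned a stable name the first time it is seen, and a priority reuses the stable name of one of its member localities, so the set of names is bounded by the set of localities.

// Hypothetical sketch of a locality-keyed generator, for illustration only;
// this is NOT the code in this PR.
package main

import "fmt"

type localityGen struct {
	prefix     string
	nextID     int
	byLocality map[string]string // locality -> stable name, assigned once
}

// generate returns one child name per priority. Every locality gets a stable
// name the first time it is seen; a priority reuses the stable name of its
// first member locality whose name is not yet taken in this update.
func (g *localityGen) generate(priorities [][]string) []string {
	used := map[string]bool{}
	var out []string
	for _, locs := range priorities {
		name := ""
		for _, l := range locs {
			n, ok := g.byLocality[l]
			if !ok {
				n = fmt.Sprintf("%s-%d", g.prefix, g.nextID)
				g.nextID++
				g.byLocality[l] = n
			}
			if name == "" && !used[n] {
				name = n
			}
		}
		if name == "" {
			// Defensive fallback; unreachable while localities are distinct
			// across priorities, since each stable name is unique.
			name = fmt.Sprintf("%s-%d", g.prefix, g.nextID)
			g.nextID++
		}
		used[name] = true
		out = append(out, name)
	}
	return out
}

func main() {
	g := &localityGen{prefix: "priority-0", byLocality: map[string]string{}}
	merged := [][]string{{"L0", "L1"}, {"L2"}}
	split := [][]string{{"L0"}, {"L1"}, {"L2"}}
	for i := 0; i < 4; i++ {
		fmt.Println(g.generate(merged))
		fmt.Println(g.generate(split))
	}
	// Names stay within {priority-0-0, priority-0-1, priority-0-2}:
	// bounded by the number of distinct localities ever seen.
}

On the same merge/split flip-flop as before, this generator cycles within {priority-0-0, priority-0-1, priority-0-2} and never mints a fourth name.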
I acknowledge that the new logic reuses somewhat more weakly in cases like [L1, L2], [L3] -> [L1], [L2, L3]; I have made an alternative implementation PR if behavior closer to the current algorithm is desired. Nevertheless, there is not that much value that can be captured purely at the locality level...

This change fixes our priority-LB churn issue and cuts the uncontrolled subconn leaks that go along with the priority-LB leak. Some connection churn still occurs when a locality moves between different priority LBs, but those connections appear to be shut down properly. We'd need further fixes to avoid churn completely, and would appreciate some guidance. Subconn pool reuse is one option, which might already be on the roadmap? Not using the priority LB at all on the xDS server side and sticking to the weighted RR LB is also a quick-fix option.
RELEASE NOTES:
xds: Revise the name generator to fix xDS priority-LB leaks where localities shuffle between priorities