xds/clusterresolver: Revise configbuilder childname generator to fix leakage with merge and split scenarios #8531
Conversation
Codecov Report

✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff            @@
##           master    #8531      +/- ##
========================================
  Coverage   81.99%   82.00%
========================================
  Files         412      412
  Lines       40520    40457      -63
========================================
- Hits        33225    33177      -48
+ Misses       5904     5903       -1
+ Partials     1391     1377      -14
The intended behavior for this should be (?) documented by: https://github.com/grpc/proposal/blob/4a8687b77fd19cd6fb445812ee4e612f86d92e5f/A37-xds-aggregate-and-logical-dns-clusters.md#xds_cluster_resolver_experimental-lb-policy

Anything that deviates from that would need a gRFC or at least cross-language agreement.
It's not specifically planned at this time, but there is upcoming work that might make this a priority for us. We're pretty swamped with things this week, so apologies if we are slow to respond to things.
Thanks for the reply and the heads-up. We can certainly start benefiting from the fix using our own fork while waiting, no hurry there. Skimming through the gRFCs, they don't seem to specify how name generation and the reuse pattern should work, so we should be able to legitimately change the namegen logic to prevent unbounded leakage. So there shouldn't be a blocker for the namegen change? However, changing how zone-aware LB is implemented in order to avoid subconn churn could certainly create friction with the gRFC specs.
Also, FYI, the A74 implementation is underway in grpc-go, but could take a few weeks to land completely. And when that happens, the …
Yes, that is right! Since the name generator is there to generate names for priorities, it will still be used, but in the CDS LB policy.
I was looking up the history of the child re-use code. The name generation and child balancer re-use logic was added in #5268. The algorithm used to generate the names is described in an internal doc: go/grpc-xds-policy-refactoring

The reason for incrementing the numbers used in the child name is described in #5268 (comment): it ensures the fallback timer is started when a new child balancer is created. Incorrectly re-using a deactivated child balancer could result in the fallback timer not being restarted and an immediate fallback to a lower priority. This may not be a concern with the change in this PR, since it's guaranteed that there is a common locality in the child that is being re-used. @easwars, can you confirm this?

There may be a concern with storing previously seen locality names in a map and never cleaning them up. I think this should be solvable by restricting the map's size, as sketched below.

It looks to me like the changes to the name generation algorithm in the alternate approach, #8532, are always better than the existing algorithm. @erenboz, can you confirm if this is true? I'm discussing this with other maintainers.
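For illustration only, here is a minimal sketch of such a size-capped map; the boundedNames type, the LRU eviction policy, and all identifiers below are hypothetical and are not from this PR or from grpc-go.

// Hypothetical sketch of a size-capped locality-name map; not from this PR.
package namegen

import "container/list"

// boundedNames remembers the child name last assigned to each locality,
// evicting the least recently used entry once maxEntries is exceeded so
// the map cannot grow without bound.
type boundedNames struct {
	maxEntries int
	byLocality map[string]*list.Element // locality -> LRU list element
	order      *list.List               // front = most recently used
}

type nameEntry struct {
	locality string
	name     string
}

func newBoundedNames(maxEntries int) *boundedNames {
	return &boundedNames{
		maxEntries: maxEntries,
		byLocality: make(map[string]*list.Element),
		order:      list.New(),
	}
}

// put records the name for a locality, evicting the oldest entry if the
// cap is exceeded.
func (b *boundedNames) put(locality, name string) {
	if el, ok := b.byLocality[locality]; ok {
		el.Value.(*nameEntry).name = name
		b.order.MoveToFront(el)
		return
	}
	b.byLocality[locality] = b.order.PushFront(&nameEntry{locality: locality, name: name})
	if b.order.Len() > b.maxEntries {
		oldest := b.order.Back()
		b.order.Remove(oldest)
		delete(b.byLocality, oldest.Value.(*nameEntry).locality)
	}
}

// get returns the remembered name for a locality, if any, marking it as
// recently used.
func (b *boundedNames) get(locality string) (string, bool) {
	el, ok := b.byLocality[locality]
	if !ok {
		return "", false
	}
	b.order.MoveToFront(el)
	return el.Value.(*nameEntry).name, true
}

Capping by LRU means long-gone localities eventually stop pinning names, while recently active ones keep their entries.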
Sure, that'd be an OK defensive addition; it shouldn't affect reasonable environments :).
Yeah @arjan-bal, the alternative does indeed behave a bit better. Since there isn't an established benchmark criterion, I didn't actively spend time picking one over the other. We can build on the alternative, or if there are additional criteria we want to meet we can try to come up with another approach; anything works as long as the number of generated names is bounded by the number of localities.
I feel that we are handling that case explicitly here:
inputs: [][][]xdsresource.Locality{
	{
		{{ID: clients.Locality{Zone: "L0"}}, {ID: clients.Locality{Zone: "L1"}}},
		{{ID: clients.Locality{Zone: "L2"}}},
	},
	{
		{{ID: clients.Locality{Zone: "L0"}}},
		{{ID: clients.Locality{Zone: "L1"}}}, // This gets a newly generated name, since "0-0" was already picked.
		{{ID: clients.Locality{Zone: "L2"}}},
	},
},
want: []string{"priority-0-0", "priority-0-1", "priority-0-2"},
At the end of the first input step, the generated names should be: []string{"priority-0-0", "priority-0-1"}.

When the next step happens, and we are processing the first priority in the list, we will reuse the name priority-0-0. When we process the next priority in the list, and see that L1 was present in the previous config but its name is already used in the new config, we should ideally generate priority-0-2 for it, as priority-0-1 is already used in the previous config. So, the end output should be: []string{"priority-0-0", "priority-0-2", "priority-0-1"}.

Am I missing something?
With the old algorithm, yes. However, in this PR I've taken a different approach by generating names upon seeing a new locality; the alternative does it that way. If you look into this case, the old algorithm generates [priority-0-2 priority-0-3] for it, so when this pattern repeats over many updates we're constantly leaking priority LBs.

So the effort here is to have a new algorithm that does not generate an unbounded number of names, though there will likely be trade-offs in the new algorithm, in that merges in certain cases could be less than "ideal".
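For concreteness, here is a simplified, self-contained model of the counter-based reuse heuristic; it is NOT the actual grpc-go implementation, just enough to reproduce the unbounded growth on a repeated merge/split pattern like the one in the test case above.

// Simplified model of the counter-based reuse heuristic, for illustration
// only; this is NOT the actual grpc-go implementation.
package main

import "fmt"

type counterGen struct {
	nextID   int
	prevName map[string]string // locality -> name assigned in the previous update
}

// update names each priority (a set of localities), reusing the previous
// name of any member locality if that name is still free in this update,
// and minting a fresh "priority-0-<n>" name otherwise.
func (g *counterGen) update(priorities [][]string) []string {
	used := map[string]bool{}
	newNames := map[string]string{}
	var out []string
	for _, locs := range priorities {
		name := ""
		for _, l := range locs {
			if n, ok := g.prevName[l]; ok && !used[n] {
				name = n
				break
			}
		}
		if name == "" {
			name = fmt.Sprintf("priority-0-%d", g.nextID)
			g.nextID++
		}
		used[name] = true
		for _, l := range locs {
			newNames[l] = name
		}
		out = append(out, name)
	}
	g.prevName = newNames
	return out
}

func main() {
	g := &counterGen{prevName: map[string]string{}}
	merged := [][]string{{"L0", "L1"}, {"L2"}} // L0 and L1 share a priority
	split := [][]string{{"L0"}, {"L1"}, {"L2"}}
	for i := 0; i < 4; i++ {
		fmt.Println(g.update(merged)) // re-merging puts L1 back under L0's name...
		fmt.Println(g.update(split))  // ...so every split mints a fresh name for L1
	}
}

Running this prints priority-0-2, priority-0-3, priority-0-4, ... for the split step of each cycle: only three localities exist, yet the counter, and with it the set of child names ever handed out, grows without bound.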
When this happens, do you eventually see the number of connections come back down to normal levels?
@easwars if you run the test cases I added with the old algorithm, you can see that in some merge and split scenarios new names get generated. On certain occasions TD goes into such phases while trying to keep zone-aware balancing, leading to a lot of priority LBs being created and leaked. In practice the leak tends to slow down, and the connection count does come back down every now and then, but that's due to the cache timeout and the removal of those priority LBs.
We had a discussion on this with folks from different gRPC languages. The algorithm that is currently implemented was meant as a heuristic, so there can absolutely be cases where it is sub-optimal. The implementation in this PR and the other one (that is currently closed) might be optimal for your use case, but could prove sub-optimal for other use cases. So, I'm hesitant to change it. It might be time to prioritize the subchannel pooling implementation. @dfawley: What are your thoughts? @erenboz: What is your appetite for maintaining this as a patch in your deployment until we have something better, i.e., subchannel pooling?
@easwars that's doable; it's a fairly isolated patch. Subchannel pooling would certainly be the superior overall fix for the subconn lifecycle, as it would also get rid of subconns being destroyed and re-created whenever a locality is switched over. Nevertheless, the new chain of objects created by the logic of this algorithm still feels like a waste of CPU cycles and might eventually benefit from more stable re-use.
At my workplace, we heavily use the proxyless gRPC setup provided by GCP Cloud Service Mesh / Traffic Director (TD), which provides zone-aware load balancing, and for about two years we've been experiencing sporadic connection leaks that inflate the number of open connections to 5-10x of normal levels.

Recently I got to dig into the matter and found that TD heavily uses a mixture of the priority LB and weighted round-robin LB, switching between sending everything to the same zone via the priority LB and falling back to WRR, e.g. flapping between [L1], [L2, L3] and [L1 w:95, L2 w:5], [L3]. In such merge and split scenarios, the current name generator forgets/overrides the names and constantly generates new ones. In this PR, I implemented a somewhat different approach to name generation, based purely on generating a name upon first seeing a locality and then following the same reuse logic.
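As a rough, hypothetical sketch of that idea (this is NOT the PR's actual code; all identifiers are illustrative): each locality is assigned a stable name the first time it is seen, and a priority reuses the stable name of one of its member localities, so the set of names is bounded by the set of localities.

// Hypothetical sketch of a locality-keyed generator, for illustration only;
// this is NOT the code in this PR.
package main

import "fmt"

type localityGen struct {
	prefix     string
	nextID     int
	byLocality map[string]string // locality -> stable name, assigned once
}

// generate returns one child name per priority. Every locality gets a stable
// name the first time it is seen; a priority reuses the stable name of its
// first member locality whose name is not yet taken in this update.
func (g *localityGen) generate(priorities [][]string) []string {
	used := map[string]bool{}
	var out []string
	for _, locs := range priorities {
		name := ""
		for _, l := range locs {
			n, ok := g.byLocality[l]
			if !ok {
				n = fmt.Sprintf("%s-%d", g.prefix, g.nextID)
				g.nextID++
				g.byLocality[l] = n
			}
			if name == "" && !used[n] {
				name = n
			}
		}
		if name == "" {
			// Defensive fallback; unreachable while localities are distinct
			// across priorities, since each stable name is unique.
			name = fmt.Sprintf("%s-%d", g.prefix, g.nextID)
			g.nextID++
		}
		used[name] = true
		out = append(out, name)
	}
	return out
}

func main() {
	g := &localityGen{prefix: "priority-0", byLocality: map[string]string{}}
	merged := [][]string{{"L0", "L1"}, {"L2"}}
	split := [][]string{{"L0"}, {"L1"}, {"L2"}}
	for i := 0; i < 4; i++ {
		fmt.Println(g.generate(merged))
		fmt.Println(g.generate(split))
	}
	// Names stay within {priority-0-0, priority-0-1, priority-0-2}:
	// bounded by the number of distinct localities ever seen.
}

On the same merge/split flip-flop as before, this generator cycles within {priority-0-0, priority-0-1, priority-0-2} and never mints a fourth name.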
I acknowledge that the new logic reuses somewhat more weakly in cases like [L1, L2], [L3] -> [L1], [L2, L3]; I have made an alternative implementation PR if behavior closer to the current algorithm is desired. Nevertheless, there is not that much value that can be captured purely at the locality level...

This change fixes our priority-LB churn issue and cuts the uncontrolled subconn leaks that go along with the priority-LB leak. Some connection churn still occurs when a locality moves between different priority LBs, but those connections appear to be shut down properly. We'd need further fixes to avoid churn completely, and would appreciate some guidance. Subconn pool reuse is one option, which might already be on the roadmap? Not using the priority LB at all on the xDS server side and sticking to the weighted RR LB is also a quick-fix option.
RELEASE NOTES:
xds: Revise the name generator to fix xDS priority-LB leaks where localities shuffle between priorities