Hello!
In a bigger k8s cluster we noticed very chaotic behavior when a gRPC client uses kuberesolver. There was no load balancing beyond simple round robin, yet backend pods received a very inconsistent number of requests (which also changed over time). Given that there were many pods making requests (using kuberesolver) and many backend pods, we expected a more or less equal distribution of requests.
Upon investigation we believe the EndpointSlices implementation in kuberesolver is broken: the code watching EndpointSlice objects in k8s seems to assume that whenever a changed EndpointSlice object is received, it contains the list of all endpoints (the whole state).
This is true only when the pod count is low: a single EndpointSlice may contain at most 100 endpoints (configurable in the api-server).
When there are hundreds of pods there are many EndpointSlices, and all of them should be used.
Basically it appears kuberesolver only uses a subset of endpoints at any given time: specifically the endpoints from the EndpointSlice that was modified most recently. This:
- limits the endpoints provided by kuberesolver to 100 (no matter whether there are 200 or 5000 endpoints in reality)
- causes chaotic changes when EndpointSlices are being modified by k8s
This can also be observed in the kuberesolver_endpoints_total metric, which never exceeds 100 and generally reflects the endpoint count of whichever EndpointSlice changed most recently on the k8s API server.
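
For illustration, here is a minimal sketch of what we would expect instead (this is not kuberesolver's actual code; the type and function names are made up): cache every EndpointSlice of the service by name as watch events arrive, and hand the gRPC balancer the union of addresses over all cached slices rather than the contents of the single slice from the latest event.

```go
// Minimal sketch, assuming the goal is to aggregate endpoints across *all*
// EndpointSlices of a service. Names are illustrative, not kuberesolver's API.
package slicewatch

import (
	discoveryv1 "k8s.io/api/discovery/v1"
	"k8s.io/apimachinery/pkg/watch"
)

// sliceCache holds the last seen state of every EndpointSlice of one service,
// keyed by slice name.
type sliceCache map[string]*discoveryv1.EndpointSlice

// apply updates the cache from a single watch event instead of replacing the
// whole endpoint set with the contents of that one slice.
func (c sliceCache) apply(ev watch.Event) {
	slice, ok := ev.Object.(*discoveryv1.EndpointSlice)
	if !ok {
		return
	}
	switch ev.Type {
	case watch.Added, watch.Modified:
		c[slice.Name] = slice
	case watch.Deleted:
		delete(c, slice.Name)
	}
}

// addresses returns the union of ready addresses over all cached slices; this
// is the set that should be passed to the gRPC balancer.
func (c sliceCache) addresses() []string {
	var out []string
	for _, slice := range c {
		for _, ep := range slice.Endpoints {
			if ep.Conditions.Ready != nil && !*ep.Conditions.Ready {
				continue
			}
			out = append(out, ep.Addresses...)
		}
	}
	return out
}
```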
How to reproduce:
- use kuberesolver with a service that has > 100 pods, check the behavior and/or the kuberesolver metrics (a small helper for counting the real endpoints is sketched below)
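
To double-check the discrepancy, something like the following client-go snippet (a hypothetical helper, not part of kuberesolver; the namespace and service name are placeholders) lists all EndpointSlices of the service and sums their endpoints; with more than 100 pods the total exceeds what kuberesolver_endpoints_total ever reports.

```go
// Hypothetical verification helper: count the service's real endpoints across
// all of its EndpointSlices and compare with kuberesolver_endpoints_total.
package main

import (
	"context"
	"fmt"

	discoveryv1 "k8s.io/api/discovery/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig() // or a kubeconfig-based config when running outside the cluster
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// EndpointSlices are linked to their Service via the kubernetes.io/service-name label.
	// "default" and "my-service" are placeholders for the cluster under test.
	slices, err := client.DiscoveryV1().EndpointSlices("default").List(context.Background(),
		metav1.ListOptions{LabelSelector: discoveryv1.LabelServiceName + "=my-service"})
	if err != nil {
		panic(err)
	}

	total := 0
	for _, s := range slices.Items {
		fmt.Printf("slice %s: %d endpoints\n", s.Name, len(s.Endpoints))
		total += len(s.Endpoints)
	}
	// With >100 backend pods this total exceeds 100, while the resolver only ever uses one slice's worth.
	fmt.Printf("total endpoints across %d slices: %d\n", len(slices.Items), total)
}
```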