feat: implement discovery of apiservices #2854
alexandernorth wants to merge 8 commits into kubernetes:main from alexandernorth:feature/discover-aggregation-layer-resources
Conversation
This issue is currently awaiting triage. If kube-state-metrics contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Welcome @alexandernorth!
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: alexandernorth. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
internal/discovery/discovery.go
Outdated
group := serviceSpec["group"].(string)
version := serviceSpec["version"].(string)

resourceList, err := discoveryClient.ServerResourcesForGroupVersion(fmt.Sprintf("%s/%s", group, version))
This call runs on every add/update event for an APIService and can make it quite chatty, especially since such updates can occur frequently. Do we want any dedup or backoff per group/version to avoid repeated discovery calls and potential API churn?
Good point - I will look into this
I have improved the logic so that the APIService must be Available before querying the API server for resources, reducing calls that would return no data (and thus, hopefully, the amount of churn too).
I am under the impression we should always perform this query on every update, since we need to track all available resources and a change to an APIService could add or remove Kinds; that makes dedup/backoff tricky, as we might miss updates. Because resources are queried by group+version, which should be unique within the cluster, I don't see a case where a single update triggers multiple calls for the same group/version combination - please correct me if I am missing something here though.
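For illustration, here is a minimal sketch of the kind of availability gate described above, written against an unstructured APIService object. The function name, package name, and condition handling are assumptions for the example, not the PR's actual code:

```go
package discovery

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// apiServiceAvailable reports whether an APIService (decoded as unstructured)
// has an Available=True condition. Illustrative sketch only.
func apiServiceAvailable(obj *unstructured.Unstructured) bool {
	conditions, found, err := unstructured.NestedSlice(obj.Object, "status", "conditions")
	if err != nil || !found {
		return false
	}
	for _, c := range conditions {
		cond, ok := c.(map[string]interface{})
		if !ok {
			continue
		}
		if cond["type"] == "Available" && cond["status"] == "True" {
			return true
		}
	}
	return false
}
```

Under this scheme, ServerResourcesForGroupVersion would only be called when the gate returns true, which is what keeps unavailable APIServices from generating fruitless discovery requests.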
Thanks, that makes sense. The availability check sounds like a good improvement and should help reduce chattiness.
My original thought was mainly around repeated updates where the APIService object itself changes (like status churn) without a change in group/version, but I agree dedup/backoff gets tricky if we want to avoid missing updates.
Yes, that's true. I also considered this, but I didn't find a "nice" solution that would ensure we receive all updates - although in my (short-term) observations, unless the aggregation service is very unstable, not many updates are triggered.
bhope
left a comment
Since runInformer() is now used for both CRDs and APIServices, the CRDsAddEventsCounter and CRDsCacheCountGauge metrics will also count APIService events. That might be a bit confusing from a metrics perspective. Should we consider renaming them or splitting by source (CRD vs APIService)?
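As a rough illustration of the splitting-by-source suggestion, a single counter vector with a source label could distinguish CRD and APIService events. The metric and label names below are assumptions made up for the example, not what kube-state-metrics actually ships:

```go
package discovery

import "github.com/prometheus/client_golang/prometheus"

// Illustrative only: one informer event counter, partitioned by the kind of
// source that produced the event rather than a CRD-specific metric name.
var discoveryAddEventsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "kube_state_metrics_discovery_add_events_total",
		Help: "Number of add events seen by the discovery informer, by source kind.",
	},
	[]string{"source"}, // e.g. "customresourcedefinition" or "apiservice"
)

// In the event handler the counter would be incremented per source kind:
//   discoveryAddEventsTotal.WithLabelValues("apiservice").Inc()
```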
… be managed in the same way
As APIServices are dynamic, I realised I could not use exactly the same system as was present for CRDs: if an APIService is no longer available, it cannot be queried for its Kinds (to remove them from the cache map). I refactored the cache map so that it is now keyed by the source of the discovered resource, the benefit being that we can handle the case where an APIService becomes unavailable. It does mean that we no longer index by Group/Version, but this could be implemented if it is a requirement. Regarding the generated metrics, I have consolidated and renamed them to apply to both CRDs and APIServices, and I removed the delete metric by folding deletions into updates. The refactor also fixes a missing synchronisation where the cache map could be read outside of the lock.
bhope
left a comment
Thanks for those refactors and detailed walkthrough.
On the metrics side, since these are already released (hence probably consumed), I’d prefer we keep the delete metric rather than folding it into update. Having delete counted separately is still useful to understand churn (resources dropping vs being refreshed). Besides, removing it would be a breaking change for existing dashboards.
That makes sense - I have added back the delete metric.
bhope
left a comment
Thanks for the update and incorporating the feedback. Overall, looks good to me.
dgrisonnet
left a comment
Nice work @alexandernorth 👍
This is looking good, but would you mind splitting this into two PRs?
One to refactor the CRD discovery and another to add the APIService discovery? That would make this work easier to review.
}
if r.GVKToReflectorStopChanMap == nil {
r.GVKToReflectorStopChanMap = map[string]chan struct{}{}
// UpdateSource replaces all resources for a source with new resources.
I don't think this will work. The EventHandler will call UpdateSource with only one CRD or one Aggregated API, not the full list. So with the way you wrote this function, the latest updated resource will always override the others.
I had a look into this, but I think it should work as expected.
The new implementation groups the resources by source (either CRD or APIService): if a CRD/APIService triggers an update, the discovered types under that source are removed and replaced with the new data (from the CRD itself, or the types discovered from the APIService). The intention is that if a source is updated, we need to 'rediscover' any resources owned by it, while other sources are not affected.
r.GVKToReflectorStopChanMap = map[string]chan struct{}{}
// UpdateSource replaces all resources for a source with new resources.
// If resources is nil, this is a noop.
// If resources is empty, all resources for the source are removed.
This shouldn't happen because the Updates and Creations will always pass a valid resource here. The only scenario where sources are removed is on Delete.
This is needed mainly for APIServices.
A noop is for the case where there was an interruption in the discovery (specifically for APIServices) - if ServerResourcesForGroupVersion fails, I thought it would be better to not clear the discovered types, as they might still be available on the cluster.
The removal of a source's resources covers the case where an APIService changes or removes the resources it offers, or becomes not ready without the APIService object itself being deleted. The informer would then only see an update (to the status field, for example), but the offered types might now be empty (e.g. the APIService is not ready) or different (e.g. the aggregation layer was updated).
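To make the nil / empty / replace semantics concrete, here is a minimal, self-contained sketch of a source-keyed cache guarded by a mutex. The type, field, and method names are illustrative assumptions, not the PR's actual code:

```go
package discovery

import (
	"sync"

	"k8s.io/apimachinery/pkg/runtime/schema"
)

// resourceCache is an illustrative stand-in for the discoverer's cache map,
// keyed by the source (a CRD or APIService name) of the discovered resources.
type resourceCache struct {
	sync.Mutex
	bySource map[string][]schema.GroupVersionKind
}

// UpdateSource sketches the semantics discussed above:
//   - resources == nil    -> noop (discovery failed; keep the previously
//     discovered types, they may still be served).
//   - len(resources) == 0 -> the source no longer serves anything (e.g. the
//     APIService became unavailable), so drop everything it contributed.
//   - otherwise           -> replace the source's entry wholesale, leaving
//     other sources untouched.
func (c *resourceCache) UpdateSource(source string, resources []schema.GroupVersionKind) {
	if resources == nil {
		return
	}
	c.Lock()
	defer c.Unlock()
	if c.bySource == nil {
		c.bySource = map[string][]schema.GroupVersionKind{}
	}
	if len(resources) == 0 {
		delete(c.bySource, source)
		return
	}
	c.bySource[source] = resources
}
```

Because the map is only ever mutated per source under the lock, an update from one CRD or APIService cannot clobber the entries contributed by another.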
3d7ac58 to 87f0685 (force-push)
Thanks for the positive first feedback @dgrisonnet. I have split the PR into this one and #2872: this one implements the APIService discovery (the branch name makes more sense for it) and #2872 does the refactor of the discovery logic - I hope this is ok for you. I was unable to change the target branch, so this PR also contains the same commits as the new one, with the APIService changes added back on top, hence the size of the diff. For the APIService changes only, please refer to alexandernorth/kube-state-metrics@feature/refactor-discovery...alexandernorth:kube-state-metrics:feature/discover-aggregation-layer-resources
This reverts commit a0367ca.
87f0685 to b42ed1c (force-push)
What this PR does / why we need it:
Enables discovery and metric collection for Custom Resources managed by aggregated API servers that do not have a local CRD. It does this by querying non-local APIServices for the resources they handle.
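For context, this is the kind of discovery call the approach relies on: given an aggregated API's group and version, ask the API server which resources that group/version serves. The sketch below is illustrative only, not the PR's code; metrics.k8s.io is used simply as a commonly aggregated API:

```go
package main

import (
	"fmt"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

func main() {
	// In-cluster config is assumed here for brevity.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// e.g. an APIService named v1beta1.metrics.k8s.io serves group
	// "metrics.k8s.io", version "v1beta1".
	resourceList, err := dc.ServerResourcesForGroupVersion("metrics.k8s.io/v1beta1")
	if err != nil {
		panic(err)
	}
	for _, r := range resourceList.APIResources {
		fmt.Println(r.Kind, r.Name)
	}
}
```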
How does this change affect the cardinality of KSM: (increases, decreases or does not change cardinality)
By default, no change
Which issue(s) this PR fixes: (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged)
Fixes #2471