feat: implement discovery of apiservices #2854
alexandernorth wants to merge 8 commits into kubernetes:main from alexandernorth:feature/discover-aggregation-layer-resources
Conversation
This issue is currently awaiting triage. If kube-state-metrics contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Welcome @alexandernorth!
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: alexandernorth. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
internal/discovery/discovery.go
Outdated
group := serviceSpec["group"].(string)
version := serviceSpec["version"].(string)

resourceList, err := discoveryClient.ServerResourcesForGroupVersion(fmt.Sprintf("%s/%s", group, version))
This call runs on every add/update event for an APIService and can make it quite chatty, especially since such updates can occur frequently. Do we want any dedup or backoff per group/version to avoid repeated discovery calls and potential API churn?
Good point - I will look into this
I have improved the logic so that the APIService must be Available before querying the API server for resources, reducing calls that would return no data (and thus, hopefully, the amount of churn too).
I am under the impression we should always perform this query on every update, since we need to track all available resources and a change to an APIService could add or remove Kinds; that makes dedup/backoff tricky, as we might miss updates. Because resources are queried by group+version, which should be unique within the cluster, I don't see a case where a single update triggers multiple calls for the same group/version combination - please correct me if I am missing something here though.
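For illustration, here is a minimal sketch of the kind of availability gate described above, written against an unstructured APIService object. The function name, package name, and condition handling are assumptions for the example, not the PR's actual code:

```go
package discovery

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// apiServiceAvailable reports whether an APIService (decoded as unstructured)
// has an Available=True condition. Illustrative sketch only.
func apiServiceAvailable(obj *unstructured.Unstructured) bool {
	conditions, found, err := unstructured.NestedSlice(obj.Object, "status", "conditions")
	if err != nil || !found {
		return false
	}
	for _, c := range conditions {
		cond, ok := c.(map[string]interface{})
		if !ok {
			continue
		}
		if cond["type"] == "Available" && cond["status"] == "True" {
			return true
		}
	}
	return false
}
```

Under this scheme, ServerResourcesForGroupVersion would only be called when the gate returns true, which is what keeps unavailable APIServices from generating fruitless discovery requests.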
Thanks, that makes sense. The availability check sounds like a good improvement and should help reduce chattiness.
My original thought was mainly around repeated updates where the APIService object itself changes (like status churn) without a change in group/version, but I agree dedup/backoff gets tricky if we want to avoid missing updates.
Yes, that's true. I also considered this, but I didn't find a "nice" solution that would ensure we receive all updates - although in my (short-term) observations, unless the aggregation service is very unstable, not many updates are triggered.
bhope
left a comment
Since runInformer() is now used for both CRDs and APIServices, the CRDsAddEventsCounter and CRDsCacheCountGauge metrics will also count APIService events. That might be a bit confusing from a metrics perspective. Should we consider renaming them or splitting by source (CRD vs APIService)?
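As a rough illustration of the splitting-by-source suggestion, a single counter vector with a source label could distinguish CRD and APIService events. The metric and label names below are assumptions made up for the example, not what kube-state-metrics actually ships:

```go
package discovery

import "github.com/prometheus/client_golang/prometheus"

// Illustrative only: one informer event counter, partitioned by the kind of
// source that produced the event rather than a CRD-specific metric name.
var discoveryAddEventsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "kube_state_metrics_discovery_add_events_total",
		Help: "Number of add events seen by the discovery informer, by source kind.",
	},
	[]string{"source"}, // e.g. "customresourcedefinition" or "apiservice"
)

// In the event handler the counter would be incremented per source kind:
//   discoveryAddEventsTotal.WithLabelValues("apiservice").Inc()
```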
… be managed in the same way
As APIServices are dynamic, I realised I could not use exactly the same system as was present for CRDs: if an APIService is no longer available, it cannot be queried for its Kinds (to remove them from the cache map). I refactored the cache map so that it is now keyed by the source of the discovered resource, the benefit being that we can handle the case where an APIService becomes unavailable. It does mean that we no longer index by Group/Version, but this could be implemented if it is a requirement. Regarding the generated metrics, I have consolidated and renamed them to apply to both CRDs and APIServices, and I removed the delete metric by folding deletions into updates. The refactor also fixes a missing synchronisation where the cache map could be read outside of the lock.
bhope
left a comment
Thanks for those refactors and detailed walkthrough.
On the metrics side, since these are already released (hence probably consumed), I’d prefer we keep the delete metric rather than folding it into update. Having delete counted separately is still useful to understand churn (resources dropping vs being refreshed). Besides, removing it would be a breaking change for existing dashboards.
That makes sense - I have added back the delete metric.
bhope
left a comment
Thanks for the update and incorporating the feedback. Overall, looks good to me.
dgrisonnet
left a comment
Nice work @alexandernorth 👍
This is looking good, but would you mind splitting this into two PRs?
One to refactor the CRD discovery and another to add the APIService discovery? That would make this work easier to review.
}
if r.GVKToReflectorStopChanMap == nil {
r.GVKToReflectorStopChanMap = map[string]chan struct{}{}
// UpdateSource replaces all resources for a source with new resources.
I don't think this will work. The EventHandler will call UpdateSource with only one CRD or one Aggregated API, not the full list. So with the way you wrote this function, the latest updated resource will always override the others.
I had a look into this, but I think it should work as expected.
The new implementation groups the resources by source (either CRD or APIService): if a CRD/APIService triggers an update, the discovered types under that source are removed and replaced with the new data (from the CRD itself, or the types discovered from the APIService). The intention is that if a source is updated, we need to 'rediscover' any resources owned by it, while other sources are not affected.
r.GVKToReflectorStopChanMap = map[string]chan struct{}{}
// UpdateSource replaces all resources for a source with new resources.
// If resources is nil, this is a noop.
// If resources is empty, all resources for the source are removed.
This shouldn't happen because the Updates and Creations will always pass a valid resource here. The only scenario where sources are removed is on Delete.
This is needed mainly for APIServices.
A noop is for the case where there was an interruption in the discovery (specifically for APIServices) - if ServerResourcesForGroupVersion fails, I thought it would be better to not clear the discovered types, as they might still be available on the cluster.
The removal of a source's resources covers the case where an APIService changes or removes the resources it offers, or becomes not ready without the APIService object itself being deleted. The informer would then only see an update (to the status field, for example), but the offered types might now be empty (e.g. the APIService is not ready) or different (e.g. the aggregation layer was updated).
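To make the nil / empty / replace semantics concrete, here is a minimal, self-contained sketch of a source-keyed cache guarded by a mutex. The type, field, and method names are illustrative assumptions, not the PR's actual code:

```go
package discovery

import (
	"sync"

	"k8s.io/apimachinery/pkg/runtime/schema"
)

// resourceCache is an illustrative stand-in for the discoverer's cache map,
// keyed by the source (a CRD or APIService name) of the discovered resources.
type resourceCache struct {
	sync.Mutex
	bySource map[string][]schema.GroupVersionKind
}

// UpdateSource sketches the semantics discussed above:
//   - resources == nil    -> noop (discovery failed; keep the previously
//     discovered types, they may still be served).
//   - len(resources) == 0 -> the source no longer serves anything (e.g. the
//     APIService became unavailable), so drop everything it contributed.
//   - otherwise           -> replace the source's entry wholesale, leaving
//     other sources untouched.
func (c *resourceCache) UpdateSource(source string, resources []schema.GroupVersionKind) {
	if resources == nil {
		return
	}
	c.Lock()
	defer c.Unlock()
	if c.bySource == nil {
		c.bySource = map[string][]schema.GroupVersionKind{}
	}
	if len(resources) == 0 {
		delete(c.bySource, source)
		return
	}
	c.bySource[source] = resources
}
```

Because the map is only ever mutated per source under the lock, an update from one CRD or APIService cannot clobber the entries contributed by another.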
3d7ac58 to 87f0685 (force-push)
Thanks for the positive first feedback @dgrisonnet. I have split the PR into this one and #2872: this one implements the APIService discovery (the branch name makes more sense for it) and #2872 does the refactor of the discovery logic - I hope this is ok for you. I was unable to change the target branch, so this PR also contains the same commits as the new one, with the APIService changes added back on top, hence the size of the diff. For the APIService changes only, please refer to alexandernorth/kube-state-metrics@feature/refactor-discovery...alexandernorth:kube-state-metrics:feature/discover-aggregation-layer-resources
This reverts commit a0367ca.
87f0685 to b42ed1c (force-push)
What this PR does / why we need it:
Enables discovery and metric collection for Custom Resources managed by aggregated API servers that do not have a local CRD. It does this by querying non-local APIServices for the resources they handle.
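For context, this is the kind of discovery call the approach relies on: given an aggregated API's group and version, ask the API server which resources that group/version serves. The sketch below is illustrative only, not the PR's code; metrics.k8s.io is used simply as a commonly aggregated API:

```go
package main

import (
	"fmt"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

func main() {
	// In-cluster config is assumed here for brevity.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// e.g. an APIService named v1beta1.metrics.k8s.io serves group
	// "metrics.k8s.io", version "v1beta1".
	resourceList, err := dc.ServerResourcesForGroupVersion("metrics.k8s.io/v1beta1")
	if err != nil {
		panic(err)
	}
	for _, r := range resourceList.APIResources {
		fmt.Println(r.Kind, r.Name)
	}
}
```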
How does this change affect the cardinality of KSM: (increases, decreases or does not change cardinality)
By default, no change
Which issue(s) this PR fixes: (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged)
Fixes #2471