OCPBUGS-57585: CVO protects /metrics with authorization #1215
Conversation
/retest

/retest-required
The cluster bot job:

```console
$ TOKEN=$(oc create token -n openshift-monitoring prometheus-k8s)
$ oc exec -n openshift-monitoring prometheus-k8s-0 -- curl -s -k -I -H "Authorization: Bearer $TOKEN" https://10.0.0.3:9099/metrics
HTTP/1.1 200 OK
Content-Type: text/plain; version=0.0.4; charset=utf-8; escaping=values
Date: Tue, 22 Jul 2025 21:37:03 GMT

$ TOKEN=$(oc create token -n openshift-monitoring default)
$ oc exec -n openshift-monitoring prometheus-k8s-0 -- curl -s -k -I -H "Authorization: Bearer $TOKEN" https://10.0.0.3:9099/metrics
HTTP/1.1 401 Unauthorized
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Tue, 22 Jul 2025 21:38:49 GMT
Content-Length: 20

$ oc debug node/ci-ln-zms0nzb-72292-mq2qr-master-0
Starting pod/ci-ln-zms0nzb-72292-mq2qr-master-0-debug-qhmxg ...
To use host binaries, run `chroot /host`. Instead, if you need to access host namespaces, run `nsenter -a -t 1`.
Pod IP: 10.0.0.3
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# curl -k https://10.0.0.3:9099/metrics
failed to get the Authorization header
```
/test e2e-hypershift-conformance

/label qe-approved
Force-pushed 3b92430 to c5fbda0
/cc
```go
	client tokenReviewInterface
}

func (a *authHandler) authorize(token string) (bool, error) {
```
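For context, a minimal sketch of what a TokenReview-backed `authorize` could look like. The `tokenReviewInterface` name comes from the diff above, but the client wiring and exact field usage here are assumptions, not the PR's actual code:

```go
package main

import (
	"context"
	"fmt"

	authenticationv1 "k8s.io/api/authentication/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	authv1client "k8s.io/client-go/kubernetes/typed/authentication/v1"
)

// authHandler guards /metrics; the client field mirrors the
// tokenReviewInterface seen in the diff above (wiring is assumed).
type authHandler struct {
	client authv1client.TokenReviewInterface
}

// authorize sends the bearer token to the Kubernetes TokenReview API
// and accepts only the openshift-monitoring Prometheus ServiceAccount.
func (a *authHandler) authorize(token string) (bool, error) {
	review, err := a.client.Create(context.TODO(), &authenticationv1.TokenReview{
		Spec: authenticationv1.TokenReviewSpec{Token: token},
	}, metav1.CreateOptions{})
	if err != nil {
		return false, fmt.Errorf("creating TokenReview: %w", err)
	}
	if !review.Status.Authenticated {
		return false, nil
	}
	return review.Status.User.Username == "system:serviceaccount:openshift-monitoring:prometheus-k8s", nil
}
```

Note that each scrape costs one TokenReview call against the Kubernetes API server, which is the load concern raised below.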
I read the code and noticed that we would only support Bearer token auth, but I remember from the handbook that we are supposed to authenticate cert-presenting clients:
As described in the Client certificate scraping enhancement proposal, we recommend that the components rely on client TLS certificates for authentication/authorization. This is more efficient and robust than using bearer tokens because token-based authn/authz add a dependency (and additional load) on the Kubernetes API.
It seems that it is actually us telling the monitoring stack how it should authenticate to us, through the ServiceMonitor manifest's .spec.endpoints[].bearerTokenFile.
In that respect this PR is incomplete, but maybe doing just Bearer token auth is fine as a fast OCPBUGS-57585 bandaid that allows us to start backporting, and we would tackle cert auth separately and only going forward (not necessary to backport). But also:
```console
$ oc explain servicemonitor.spec.endpoints.bearerTokenFile
GROUP:      monitoring.coreos.com
KIND:       ServiceMonitor
VERSION:    v1

FIELD: bearerTokenFile <string>

DESCRIPTION:
    File to read bearer token for scraping the target.
    Deprecated: use `authorization` instead.
```
😬
Shortly after the discussion in our meeting, I realized I have done the same thing in ci-tools for the same reason: we do not want to rely on the Kubernetes API server for scraping because it is too slow and it may put a burden on the API server.
I will create a card to replace the deprecated servicemonitor.spec.endpoints.bearerTokenFile with servicemonitor.spec.endpoints.authorization, and I will argue in the card that with a solution that addresses the concern above we do not need to move to cert-based auth. But that is not in the scope of this pull. I will move the discussion there.
Force-pushed c5fbda0 to 833a491
@hongkailiu: This pull request references Jira Issue OCPBUGS-57585, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.

Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/test e2e-aws-ovn-techpreview

1 similar comment

/test e2e-aws-ovn-techpreview
The test from openshift/origin#30014 passed.
Test from https://github.com/openshift/origin/blob/main/test/extended/prometheus/prometheus.go#L514 which is covering for
wking left a comment:
I'm fine with the Kube API server load of the TokenReview calls for now, with the future off-ramp being investigated in OTA-1594.

/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hongkailiu, wking

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
1 similar comment

tech-preview failure: is unrelated to this pull request.

/override ci/prow/e2e-aws-ovn-techpreview
@wking: Overrode contexts on behalf of wking: ci/prow/e2e-aws-ovn-techpreview

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@hongkailiu: all tests passed! Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
@hongkailiu: Jira Issue OCPBUGS-57585: Some pull requests linked via external trackers have merged. The following pull requests linked via external trackers have not merged:

These pull requests must merge or be unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh. Jira Issue OCPBUGS-57585 has not been moved to the MODIFIED state. In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/jira refresh
@wking: Jira Issue OCPBUGS-57585: All pull requests linked via external trackers have merged. Jira Issue OCPBUGS-57585 has been moved to the MODIFIED state. In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@wking: new pull request created: #1222 In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[ART PR BUILD NOTIFIER] Distgit: cluster-version-operator
…Review OCPBUGS-57585: CVO protects /metrics with authorization
…ile option

In 313f8fb (CVO protects /metrics with authorization, 2025-07-22, openshift#1215) and 833a491 (CVO protects /metrics with authorization, 2025-07-22, openshift#1215), the /metrics endpoint began requiring client auth. The only authentication system was Bearer tokens, and the only authorization system was validating that the token belonged to system:serviceaccount:openshift-monitoring:prometheus-k8s. That worked well for standalone clusters, where the ServiceMonitor scraper is the Prometheus from the openshift-monitoring namespace. But it broke scraping on HyperShift [1], where the ServiceMonitor does not request any client authorization [2]. Getting ServiceAccount tokens (and keeping them fresh [3]) from the hosted cluster into a Prometheus scraper running on the management cluster is hard. But for other ServiceMonitors, HyperShift configures keySecret [4], asking the scraper to get its client certificate from a metrics-client Secret that the HostedControlPlane controller maintains in the management cluster namespace [5].

This commit adds a new --serving-client-certificate-authorities-file option, which HyperShift can set when it configures the CVO Deployment [6], while mounting the root CA ConfigMap into the Pod.

Now that there are three files (serving key, serving cert, client CAs) that we might be watching for changes, I've taken the opportunity to refactor the checksumming and change-tracking to use a map from filename to checksum. That way we can extend more easily if further configuration files are added in the future, without having to pass around a series of paths and checksums as distinct arguments.

I'm a bit sad about AppendCertsFromPEM only returning a boolean [7], leaving us unsure about whether all the certificates in the file were parsed, or, if there were parsing issues, what those issues were. But with HyperShift hopefully reliably setting up CA Secrets that do not cause parsing issues, I'm ok not coding that defensively. If the standard library grows a parser that is more transparent about parsing issues, we can pivot to that once it exists.

[1]: https://issues.redhat.com/browse/OCPBUGS-62858
[2]: https://github.com/openshift/hypershift/blob/2728a4e2899ee58690e3d4e58b319eb5e78643af/control-plane-operator/controllers/hostedcontrolplane/v2/assets/cluster-version-operator/servicemonitor.yaml#L18
[3]: https://kubernetes.io/docs/concepts/configuration/secret/#serviceaccount-token-secrets
[4]: https://github.com/openshift/hypershift/blob/2728a4e2899ee58690e3d4e58b319eb5e78643af/control-plane-operator/controllers/hostedcontrolplane/v2/assets/kube-apiserver/servicemonitor.yaml#L24-L26
[5]: https://github.com/openshift/hypershift/blob/2728a4e2899ee58690e3d4e58b319eb5e78643af/control-plane-operator/controllers/hostedcontrolplane/pki/sa.go#L35-L36
[6]: https://github.com/openshift/hypershift/blob/2728a4e2899ee58690e3d4e58b319eb5e78643af/control-plane-operator/controllers/hostedcontrolplane/v2/assets/cluster-version-operator/deployment.yaml#L25-L35
[7]: https://pkg.go.dev/crypto/x509#CertPool.AppendCertsFromPEM
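For illustration, a minimal sketch of how a client-CA file can be wired into a TLS server config. The helper names here (loadClientCAs, tlsConfigWithClientCAs) are hypothetical, not the commit's actual functions; it shows the AppendCertsFromPEM limitation the message laments:

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"os"
)

// loadClientCAs (hypothetical helper) reads a PEM bundle and builds a
// CertPool for verifying scraper client certificates.
func loadClientCAs(path string) (*x509.CertPool, error) {
	pem, err := os.ReadFile(path)
	if err != nil {
		return nil, fmt.Errorf("reading client CA file %q: %w", path, err)
	}
	pool := x509.NewCertPool()
	// AppendCertsFromPEM only reports success as a boolean, so partial
	// parse failures are invisible, as the commit message notes.
	if !pool.AppendCertsFromPEM(pem) {
		return nil, fmt.Errorf("no certificates parsed from %q", path)
	}
	return pool, nil
}

// tlsConfigWithClientCAs (hypothetical) builds a server TLS config that
// verifies client certs against the pool when a client presents one.
func tlsConfigWithClientCAs(pool *x509.CertPool) *tls.Config {
	return &tls.Config{
		ClientCAs: pool,
		// Request, but do not require, a client cert, so Bearer-token
		// clients can still authenticate at the HTTP layer.
		ClientAuth: tls.VerifyClientCertIfGiven,
	}
}
```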
In 313f8fb (CVO protects /metrics with authorization, 2025-07-22, openshift#1215) and 833a491 (CVO protects /metrics with authorization, 2025-07-22, openshift#1215), the /metrics endpoint began requiring client auth. The only authentication system was Bearer tokens, and the only authorization system was validating that the token belonged to system:serviceaccount:openshift-monitoring:prometheus-k8s. That worked well for standalone clusters, where the ServiceMonitor scraper is the Prometheus from the openshift-monitoring namespace. But it broke scraping on HyperShift [1], where the ServiceMonitor does not request any client authorization [2]. Getting ServiceAccount tokens (and keeping them fresh [3]) from the hosted cluster into a Prometheus scraper running on the management cluster is hard.

This commit buys time to sort out a HyperShift metrics authentication strategy by wiring the existing --hypershift option to code that disables the authentication requirement in that environment. Standalone clusters will continue to require prometheus-k8s ServiceAccount tokens.
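A minimal sketch of the gating this commit message describes. The function and parameter names are assumptions (only the 401 message matches the debug transcript earlier in this conversation):

```go
package main

import (
	"net/http"
	"strings"
)

// authorizeFunc validates a bearer token, e.g. via TokenReview as in
// the earlier sketch.
type authorizeFunc func(token string) (bool, error)

// newMetricsHandler (hypothetical) wraps the /metrics handler with
// bearer-token auth, except on HyperShift, where auth is disabled to
// buy time, per the commit message above.
func newMetricsHandler(inner http.Handler, hypershift bool, authorize authorizeFunc) http.Handler {
	if hypershift {
		return inner // HyperShift: no client auth for now
	}
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		token, ok := strings.CutPrefix(r.Header.Get("Authorization"), "Bearer ")
		if !ok {
			http.Error(w, "failed to get the Authorization header", http.StatusUnauthorized)
			return
		}
		authorized, err := authorize(token)
		if err != nil {
			http.Error(w, "unable to review token", http.StatusInternalServerError)
			return
		}
		if !authorized {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		inner.ServeHTTP(w, r)
	})
}
```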


The `/metrics` endpoint is protected by the `authHandler` introduced in this pull. `authHandler` only allows requests with the bearer token associated with `system:serviceaccount:openshift-monitoring:prometheus-k8s`.