## Summary

Expose more of the client-side metrics offered by client-go in the controller process by default, similar to how the Kubernetes built-in controllers and apiserver do.

Time and time again, the lack of these metrics in our internal controllers has prevented us from monitoring how long we are stuck in the client-side rate limiter, or what latency the controller observes on its REST client requests (without writing our own instrumented REST transport wrapper).

## Details

client-go currently exposes the following hooks that a metrics collector can register to (https://github.com/kubernetes/client-go/blob/v0.33.0/tools/metrics/metrics.go#L114-L127):

| **Metric Name** | **Type** | **Dimensions** | **Description** |
|---|---|---|---|
| `rest_client_request_duration_seconds` | Histogram | `verb`, `host` | Request latency in seconds. <br><br>Buckets: [0.005, 0.025, 0.1, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 15.0, 30.0, 60.0] |
| `rest_client_dns_resolution_duration_seconds` | Histogram | `host` | DNS resolver latency in seconds. <br><br>Buckets: [0.005, 0.025, 0.1, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 15.0, 30.0] |
| `rest_client_request_size_bytes` | Histogram | `verb`, `host` | Request size in bytes. <br><br>Buckets: [64, 256, 512, 1024, 4096, 16384, 65536, 262144, 1048576, 4194304, 16777216] |
| `rest_client_response_size_bytes` | Histogram | `verb`, `host` | Response size in bytes. <br><br>Buckets: [64, 256, 512, 1024, 4096, 16384, 65536, 262144, 1048576, 4194304, 16777216] |
| `rest_client_rate_limiter_duration_seconds` | Histogram | `verb`, `host` | Client-side rate limiter latency in seconds. <br><br>Buckets: [0.005, 0.025, 0.1, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 15.0, 30.0, 60.0] |
| `rest_client_requests_total` | Counter | `code`, `method`, `host` | Number of HTTP requests. |
| `rest_client_request_retries_total` | Counter | `code`, `verb`, `host` | Number of request retries. |
| `rest_client_transport_cache_entries` | Gauge | *(none)* | Number of transport entries in the internal cache. |
| `rest_client_transport_create_calls_total` | Counter | `result` | Number of calls to get a new transport, partitioned by the result of the operation. |

Among these, the only metric currently exposed by controller-runtime is `rest_client_requests_total`. Some other metrics were previously removed (#1587) due to unbounded dimension cardinality; however, with recent overhauls to the metrics, the highest-cardinality dimension left is `host` (which is presumably just however many apiserver `host:port`s you have).

## Proposal

1. controller-runtime starts exposing all of the listed metrics by default (by copying them [from k8s.io/component-base](https://github.com/kubernetes/kubernetes/blob/v1.33.0/staging/src/k8s.io/component-base/metrics/prometheus/restclient/metrics.go#L31-L185)).
2. The existing `rest_client_requests_total` metric should remain unmodified.
3. The `ExecPluginCalls` hook (i.e. the `rest_client_exec_plugin_call_total` metric) should be left out, as it is very rarely, if ever, useful for a controller process.

## Considerations

1. **Stability:** ALL of the metrics listed above are [listed in `ALPHA` stage in component-base](https://github.com/kubernetes/kubernetes/blob/v1.33.0/staging/src/k8s.io/component-base/metrics/prometheus/restclient/metrics.go#L31-L185) and in the [k8s.io Metrics Documentation](https://kubernetes.io/docs/reference/instrumentation/metrics/), presumably for components like `kube-scheduler`, `kube-controller-manager` etc. Do we also offer them as stable? Or do we break users later?
1. **Cardinality:** Some histogram metrics have `10-12 buckets`. In a large cluster setup with `10 apiservers` x `4 verbs`, a single metric can easily reach 400+ time series (still bounded though).
1. **Future improvements:** client-go offers a `url` value in one of the hook functions. This `url` is actually a value that's [free of resource {namespace,name}](https://github.com/kubernetes/kubernetes/blob/c519248e8a865d837f3f40308eaf9559e605306d/staging/src/k8s.io/client-go/rest/request.go#L585-L589) (i.e. it's bounded cardinality for us!) but is available [only in one metric hook](https://github.com/kubernetes/client-go/blob/master/tools/metrics/metrics.go#L42) 😢. `component-base` basically uses that `url.URL` value to find the `host` label. If `client-go` some day starts providing a `url` label for every metric, the metrics would be even more useful, but we'd likely need to break the existing ones to adopt it.

/kind design
/cc @alvaroaleman
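
The cardinality estimate in the Considerations above can be sanity-checked with a short calculation. This is only a rough sketch: it assumes the usual Prometheus exposition, where each label combination of a histogram produces one `_bucket` series per finite bucket, one `+Inf` bucket, plus `_sum` and `_count`. The function name `histogramSeries` is hypothetical, and the bucket counts (11 and 12) are taken from the table in Details.

```go
package main

import "fmt"

// histogramSeries estimates how many time series one Prometheus histogram
// produces. Per label combination there is one _bucket series for each finite
// bucket, one +Inf bucket, and the _sum and _count series.
func histogramSeries(finiteBuckets, hosts, verbs int) int {
	perLabelSet := finiteBuckets + 1 + 2
	return perLabelSet * hosts * verbs
}

func main() {
	// Hypothetical large cluster from the issue text: 10 apiservers, 4 verbs.
	const hosts, verbs = 10, 4
	for _, buckets := range []int{11, 12} {
		fmt.Printf("%d buckets: %d series\n", buckets, histogramSeries(buckets, hosts, verbs))
	}
}
```

Under these assumptions a `verb`, `host` histogram comes out at 560 to 600 series, which is consistent with the "400+" figure and still bounded by the number of apiserver hosts.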