Conversation
Force-pushed bf0024b to aabc548
The current metrics-exporter implementation fetches stats/states from io-engine when it gets queried by Prometheus. We used to have a separate polling period, but not anymore; it is in line with Prometheus scraping. Please refer
We absolutely could; we are aligned with that approach. The inline fetch at handler.rs:18 (`store_resource_data(grpc_client()).await`) is the pattern we will follow. Node status will be fetched the same way, via an inline REST call in `metrics_handler()` right alongside the existing gRPC fetch, so values will always reflect the actual state at scrape time: `store_resource_data(grpc_client()).await; // existing gRPC fetch`. The background poller, cache, and the polling-specific metrics are all going to be removed. Thank you for the pointer; let us know if you agree that we are working in the right direction with these ideas.
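The scrape-time fetch pattern discussed here can be sketched as below. This is a minimal, hedged illustration: the trait and function names are stand-ins (the real exporter is async and calls `store_resource_data(grpc_client()).await` inside its handler), but it shows the key property that both fetches happen inside the handler, so every scrape reflects the state at scrape time and a failed REST call simply omits the gauge.

```rust
/// Illustrative stand-in for anything that can report node status at
/// scrape time (the real code uses the tower-based openapi REST client).
trait StatusSource {
    fn fetch_online(&self) -> Option<bool>;
}

/// Build the text exposition for one scrape. The fetch happens inside the
/// handler, so the emitted value reflects state at scrape time; a failed
/// REST call omits the node-status gauge for that scrape.
fn metrics_handler(rest: &dyn StatusSource) -> String {
    let mut body = String::new();
    if let Some(online) = rest.fetch_online() {
        body.push_str(&format!("mayastor_node_online {}\n", online as u8));
    }
    // ...the existing gRPC-backed pool/nexus/replica metrics would follow here.
    body
}
```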
Hi everyone, just wanted to follow up on the review feedback. The background poller, cache, and polling-specific metrics have been removed; node status is now fetched inline at scrape time in |
Force-pushed a5ac233 to 3e451cb
tiagolobocastro left a comment
LGTM
@IyanekiB could you please squash your commits?
Force-pushed 108b051 to 70d65c9
@tiagolobocastro just finished squashing the commits and applied the changes you suggested. Would like a re-review if you don't mind. Thanks! |
tiagolobocastro left a comment
thanks for contributing this @IyanekiB!
Thanks @tiagolobocastro! Just wanted to ask, is there anything else needed for the OEP that we should finalize? |
Nothing I can think of, just need a few more reviews here, CC @abhilashshetty04 @niladrih @Abhinandan-Purkait |
bors merge |
bors merge |
bors merge |
@pchandra19 any clues why this is stuck? |
Force-pushed 70d65c9 to 17124a1
bors merge |
bors merge |
bors merge |
bors try |
bors cancel |
Expose three Prometheus gauge metrics per io-engine node, fetched
inline at scrape time from the control-plane REST API:
- mayastor_node_online (1 = Online, 0 = Offline)
- mayastor_node_cordoned (1 = cordoned, draining, or drained)
- mayastor_node_draining (1 = draining or drained)
Node data is fetched on demand (GET /v0/nodes/{node_id}) at each
Prometheus scrape, matching the pull-model semantics used by the
existing pool, nexus, and replica metric collectors. No background
polling thread or persistent cache is used, eliminating the risk of
stale data being recorded with a scrape timestamp.
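The encodings of the three gauges above can be sketched as a small pure function. This is an illustrative sketch, not the PR's actual code: `DrainState` is a stand-in for openapi's `CordonDrainState`, and the helper name is an assumption; the 0/1 semantics mirror the commit message.

```rust
/// Illustrative stand-in for openapi::models::CordonDrainState.
#[derive(Clone, Copy)]
enum DrainState {
    NotCordoned,
    Cordoned,
    Draining,
    Drained,
}

/// Returns the (online, cordoned, draining) gauge values for one node.
fn node_gauges(online: bool, drain: DrainState) -> (f64, f64, f64) {
    // mayastor_node_cordoned = 1 when cordoned, draining, or drained.
    let cordoned = matches!(
        drain,
        DrainState::Cordoned | DrainState::Draining | DrainState::Drained
    );
    // mayastor_node_draining = 1 when draining or drained.
    let draining = matches!(drain, DrainState::Draining | DrainState::Drained);
    (
        online as u8 as f64,   // mayastor_node_online
        cordoned as u8 as f64, // mayastor_node_cordoned
        draining as u8 as f64, // mayastor_node_draining
    )
}
```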
Key implementation details:
- REST endpoint configured via MAYASTOR_REST_ENDPOINT env var /
--rest-endpoint CLI flag (optional; metrics omitted if not set)
- Scrape timeout configurable via MAYASTOR_SCRAPE_TIMEOUT (default 10s)
- Uses openapi::models types directly (Node, NodeSpec, NodeState,
CordonDrainState) instead of custom duplicates
- NodeStatusClient fetches a single node by ID using the tower-based
openapi client; endpoint URL parsed via the url crate
- Graceful degradation: if the REST call fails, node status metrics
are omitted for that scrape (consistent with gRPC failure behaviour)
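The configuration lookups listed above can be sketched as follows. The variable names (`MAYASTOR_REST_ENDPOINT`, `MAYASTOR_SCRAPE_TIMEOUT`) come from the commit message; the helper names and parsing details are assumptions for illustration.

```rust
use std::{env, time::Duration};

/// Optional REST endpoint; when unset or empty, node-status metrics are
/// simply omitted for the scrape.
fn rest_endpoint() -> Option<String> {
    env::var("MAYASTOR_REST_ENDPOINT").ok().filter(|s| !s.is_empty())
}

/// Parse a timeout given in whole seconds, falling back to the 10s
/// default when the value is missing or unparsable.
fn parse_timeout(raw: Option<&str>) -> Duration {
    raw.and_then(|s| s.parse::<u64>().ok())
        .map(Duration::from_secs)
        .unwrap_or(Duration::from_secs(10))
}

/// Scrape timeout sourced from the environment.
fn scrape_timeout() -> Duration {
    parse_timeout(env::var("MAYASTOR_SCRAPE_TIMEOUT").ok().as_deref())
}
```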
Also includes upstream develop syncs and CI workflow additions picked
up while the branch was in development.
Signed-off-by: IyanekiB <iyan.n@outlook.com>
Force-pushed 17124a1 to 9482d14
bors ping |
pong |
bors merge |
Implements OEP-4111 to expose Mayastor node status metrics through the metrics-exporter. Aims to resolve #4111.
Description
This PR implements OEP-4111, extending the metrics-exporter with REST client capabilities to expose Mayastor node status metrics. The implementation adds a lightweight REST client that periodically polls the control-plane /v0/nodes endpoint and exposes node state as Prometheus-compatible gauges.

Key changes:
- REST client for the /v0/nodes endpoint with connection pooling and timeout handling
- Five new Prometheus metrics:
  - mayastor_node_online (0/1) - node online status
  - mayastor_node_cordoned (0/1) - node cordoned status
  - mayastor_node_draining (0/1) - node draining status
  - mayastor_node_status_last_fetch_seconds - staleness detection
  - mayastor_node_status_fetch_errors_total - failure tracking
- Periodic polling with jitter (15s interval + 0-5s random jitter)
- In-memory caching with thread-safe RwLock
- Immediate fetch on startup (no initial delay)
- Configuration via CLI flags and environment variables
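The jittered polling interval described above (from the PR's original design, which the review later replaced with inline fetching at scrape time) can be sketched as below. This is a dependency-free illustration: a real implementation would likely draw jitter from the `rand` crate, whereas here it is derived from the clock's subsecond nanoseconds.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

const BASE_INTERVAL: Duration = Duration::from_secs(15);
const MAX_JITTER_SECS: u64 = 5;

/// Next sleep before re-polling /v0/nodes: 15s base plus 0-5s of jitter,
/// so that many exporters do not hit the control plane in lockstep.
fn next_poll_interval() -> Duration {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap_or_default()
        .subsec_nanos() as u64;
    // Cheap stand-in for a random draw in 0..=MAX_JITTER_SECS.
    BASE_INTERVAL + Duration::from_secs(nanos % (MAX_JITTER_SECS + 1))
}
```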
Motivation and Context
This consolidates node-status metrics into the metrics-exporter, eliminating duplicated business logic and fragmented observability surfaces. Based on what we learned from PR #1035, this approach aims to create a maintainable pattern for REST-backed metrics while preserving compatibility with existing Prometheus/Grafana pipelines.
The control-plane remains the authoritative source of node state; the exporter simply polls, caches, and exposes this data in a Prometheus-compatible format.
Regression
No
How Has This Been Tested?
Client tests (wiremock-based):
Collector tests:
Manual validation:
Types of changes
Checklist:
CC: @pjgranieri