Feat: OEP 4111 node status metrics #816

Open
IyanekiB wants to merge 1 commit into openebs:develop from IyanekiB:feat/oep-4111-node-status-metrics

Conversation


@IyanekiB IyanekiB commented Feb 6, 2026

Implements OEP-4111 to expose Mayastor node status metrics through the metrics-exporter. Aims to resolve #4111.

Description

This PR implements OEP-4111, extending the metrics-exporter with REST client capabilities to expose Mayastor node status metrics. The implementation adds a lightweight REST client that periodically polls the control-plane /v0/nodes endpoint and exposes node state as Prometheus-compatible gauges.

Key changes:

  • REST client for /v0/nodes endpoint with connection pooling and timeout handling

  • Five new Prometheus metrics:

    • mayastor_node_online (0/1) - node online status
    • mayastor_node_cordoned (0/1) - node cordoned status
    • mayastor_node_draining (0/1) - node draining status
    • mayastor_node_status_last_fetch_seconds - staleness detection
    • mayastor_node_status_fetch_errors_total - failure tracking
  • Periodic polling with jitter (15s interval + 0-5s random jitter)

  • In-memory caching with thread-safe RwLock

  • Immediate fetch on startup (no initial delay)

  • Configuration via CLI flags and environment variables
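
The caching and jitter items in the list above can be sketched roughly as follows. This is a minimal, std-only illustration and not the PR's code: `NodeStatus`, `NodeCache`, and `next_poll_delay` are hypothetical names, and the random fraction is passed in by the caller (for example from a random-number crate) so the sketch has no external dependencies.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::time::Duration;

/// Hypothetical per-node snapshot; the PR's real types are not shown here.
#[derive(Clone, Debug, PartialEq)]
struct NodeStatus {
    online: bool,
    cordoned: bool,
    draining: bool,
}

/// Shared cache: many scrape handlers may read, one poller writes.
#[derive(Clone, Default)]
struct NodeCache {
    inner: Arc<RwLock<HashMap<String, NodeStatus>>>,
}

impl NodeCache {
    /// Replace the whole snapshot after a successful poll.
    fn store(&self, snapshot: HashMap<String, NodeStatus>) {
        *self.inner.write().expect("cache lock poisoned") = snapshot;
    }

    /// Clone one node's status out so the read lock is released right away.
    fn get(&self, node_id: &str) -> Option<NodeStatus> {
        self.inner
            .read()
            .expect("cache lock poisoned")
            .get(node_id)
            .cloned()
    }
}

/// Base interval plus `fraction` (clamped to [0, 1]) of the maximum jitter.
fn next_poll_delay(base: Duration, max_jitter: Duration, fraction: f64) -> Duration {
    base + max_jitter.mul_f64(fraction.clamp(0.0, 1.0))
}

fn main() {
    let cache = NodeCache::default();
    let mut snapshot = HashMap::new();
    snapshot.insert(
        "node-1".to_string(),
        NodeStatus { online: true, cordoned: false, draining: false },
    );
    cache.store(snapshot);
    println!("{:?}", cache.get("node-1"));

    // 15s interval with up to 5s of jitter, as described in the list above.
    let delay = next_poll_delay(Duration::from_secs(15), Duration::from_secs(5), 0.5);
    println!("next poll in {:?}", delay);
}
```

Replacing the whole map on each poll keeps the write lock held only for a pointer-sized swap, at the cost of rebuilding the snapshot every cycle.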

Motivation and Context

This consolidates node-status metrics into the metrics-exporter, eliminating duplicated business logic and fragmented observability surfaces. Based on what we learned from PR #1035, this approach aims to create a maintainable pattern for REST-backed metrics while preserving compatibility with existing Prometheus/Grafana pipelines.
The control-plane remains the authoritative source of node state; the exporter simply polls, caches, and exposes this data in a Prometheus-compatible format.

Regression

No

How Has This Been Tested?

Client tests (wiremock-based):

  • Successful multi-node fetch
  • Cordoned/draining/offline node detection
  • HTTP error handling (500 responses)
  • Invalid JSON handling
  • Empty response handling
  • Rapid state transitions

Collector tests:

  • Metric descriptor validation (5 metrics)
  • Empty cache behavior
  • Online/offline/cordoned/draining state verification
  • State transition handling
  • OEP-4111 metric naming compliance

Manual validation:

  • Local Prometheus scrape validation confirmed metric format and labels
  • Verified immediate startup fetch (no 15s delay)
  • Tested staleness detection via last_fetch_seconds gauge
  • Validated error counter increments on REST failures

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added unit tests to cover my changes.

CC: @pjgranieri

Comment thread metrics-exporter/src/bin/io_engine/node_status/client.rs Outdated
Comment thread metrics-exporter/src/bin/io_engine/node_status/client.rs Outdated
Comment thread metrics-exporter/src/bin/io_engine/node_status/types.rs Outdated
Comment thread metrics-exporter/src/bin/io_engine/main.rs Outdated
Comment thread metrics-exporter/src/bin/io_engine/main.rs Outdated
Comment thread metrics-exporter/src/bin/io_engine/main.rs Outdated
Comment thread metrics-exporter/src/bin/io_engine/main.rs Outdated
Comment thread metrics-exporter/src/bin/io_engine/main.rs Outdated
Comment thread metrics-exporter/Cargo.toml Outdated
Comment thread metrics-exporter/src/bin/io_engine/main.rs Outdated
@abhilashshetty04
Member

abhilashshetty04 commented Feb 18, 2026

The current metrics-exporter implementation fetches stats/states from io-engine when it is queried by Prometheus on /metrics. Can't we use the same procedure? Prometheus stores data as a time series, so we will record incorrect states with respect to time if we don't fetch when polled by Prometheus.

We used to have a separate polling period, but not anymore. It's now in line with Prometheus scraping. Please refer to:

https://github.com/openebs/mayastor-extensions/blob/develop/metrics-exporter/src/bin/io_engine/serve/handler.rs#L18
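
The timing concern above can be illustrated with a small simulation. It is only a sketch under made-up assumptions: the node goes offline at t = 20s, a background poller runs every 15s starting at t = 0, and all function names are hypothetical. Prometheus stamps each sample with the scrape time, so a value cached at an earlier poll is recorded against a later timestamp.

```rust
/// Hypothetical timeline: the node is online until t = 20s, then offline.
fn node_online_at(t_secs: u64) -> bool {
    t_secs < 20
}

/// Value a background poller would hold at `scrape_t`, given that it polls
/// every `poll_interval` seconds starting at t = 0.
fn cached_value_at(scrape_t: u64, poll_interval: u64) -> bool {
    let last_poll = (scrape_t / poll_interval) * poll_interval;
    node_online_at(last_poll)
}

/// Value obtained by fetching inline, at the moment of the scrape.
fn inline_value_at(scrape_t: u64) -> bool {
    node_online_at(scrape_t)
}

fn main() {
    for scrape_t in [10u64, 25, 40] {
        println!(
            "scrape at t={}s: cached online={} inline online={}",
            scrape_t,
            cached_value_at(scrape_t, 15),
            inline_value_at(scrape_t)
        );
    }
}
```

In this toy timeline the scrape at t = 25s is the interesting one: the cached value still comes from the poll at t = 15s, while the inline fetch sees the node as offline.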

@pjgranieri


We absolutely could, and we are aligned with that approach. The inline fetch at handler.rs:18 (store_resource_data(grpc_client()).await) is the pattern we will follow. Node status will be fetched the same way, via an inline REST call in metrics_handler() right alongside the existing gRPC fetch, so values will always reflect the actual state at scrape time:

store_resource_data(grpc_client()).await; // existing gRPC fetch
let node = fetch_node(&node_id, &client).await.ok(); // new inline REST fetch
let node_status_collector = NodeStatusCollector::new(node); // takes Option

The background poller, cache, and the polling-specific metrics are all going to be removed. Thank you for the pointer, let us know if you agree that we are working in the right direction with these ideas.
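
The `.ok()` conversion in the snippet above turns a failed fetch into `None`, which the collector can treat as "emit nothing for this scrape". A minimal, std-only sketch of that idea follows. The `Node` struct, the synchronous `fetch_node` with its `reachable` switch, and the `node` label name are all hypothetical stand-ins; the real code uses the openapi models and an async REST client.

```rust
/// Hypothetical stand-in for the openapi node model; only what the sketch needs.
struct Node {
    id: String,
    online: bool,
}

/// Hypothetical fetch; a real implementation would perform the REST call.
fn fetch_node(node_id: &str, reachable: bool) -> Result<Node, String> {
    if reachable {
        Ok(Node { id: node_id.to_string(), online: true })
    } else {
        Err("control-plane unreachable".to_string())
    }
}

struct NodeStatusCollector {
    node: Option<Node>,
}

impl NodeStatusCollector {
    fn new(node: Option<Node>) -> Self {
        Self { node }
    }

    /// Render samples in the Prometheus text format; nothing if the fetch failed.
    fn collect(&self) -> Vec<String> {
        match &self.node {
            Some(n) => vec![format!(
                "mayastor_node_online{{node=\"{}\"}} {}",
                n.id, n.online as u8
            )],
            None => Vec::new(),
        }
    }
}

fn main() {
    for reachable in [true, false] {
        // `.ok()` discards the error and keeps only the success value.
        let collector = NodeStatusCollector::new(fetch_node("node-1", reachable).ok());
        println!("reachable={} samples={:?}", reachable, collector.collect());
    }
}
```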

@IyanekiB
Author

Hi everyone, just wanted to follow up on the review feedback. The background poller, cache, and polling-specific metrics have been removed; node status is now fetched inline at scrape time in metrics_handler(), which should be consistent with the existing pattern at handler.rs:18. The custom Node/NodeSpec/NodeState types in types.rs (now deleted) have been replaced with openapi::models equivalents, and the CLI has been cleaned up (polling_interval removed, rest_endpoint typed as Uri, scrape_timeout as humantime::Duration). Would appreciate a re-review when you get a chance.

Comment thread metrics-exporter/src/bin/io_engine/collector/node_status.rs Outdated
Comment thread metrics-exporter/src/bin/io_engine/collector/node_status.rs
Comment thread metrics-exporter/src/bin/io_engine/collector/node_status.rs Outdated
Comment thread metrics-exporter/src/bin/io_engine/collector/node_status.rs Outdated
Comment thread metrics-exporter/src/bin/io_engine/collector/node_status.rs Outdated
Comment thread metrics-exporter/src/bin/io_engine/collector/node_status.rs Outdated
Comment thread metrics-exporter/src/bin/io_engine/node_status/client.rs Outdated
Comment thread metrics-exporter/src/bin/io_engine/node_status/client.rs Outdated
Comment thread metrics-exporter/src/bin/io_engine/main.rs
@pjgranieri pjgranieri force-pushed the feat/oep-4111-node-status-metrics branch from a5ac233 to 3e451cb Compare March 12, 2026 20:48
Member

@tiagolobocastro tiagolobocastro left a comment

LGTM

@IyanekiB could you please squash your commits?

Comment thread metrics-exporter/src/bin/io_engine/collector/node_status.rs
Comment thread metrics-exporter/src/bin/io_engine/node_status/client.rs Outdated
Comment thread metrics-exporter/src/bin/io_engine/main.rs
Comment thread metrics-exporter/src/bin/io_engine/main.rs Outdated
@IyanekiB IyanekiB force-pushed the feat/oep-4111-node-status-metrics branch 2 times, most recently from 108b051 to 70d65c9 Compare March 19, 2026 14:04
@IyanekiB
Author

IyanekiB commented Mar 19, 2026

@tiagolobocastro just finished squashing the commits and applied the changes you suggested. Would like a re-review if you don't mind. Thanks!

Member

@tiagolobocastro tiagolobocastro left a comment

thanks for contributing this @IyanekiB !

@IyanekiB
Author

Thanks @tiagolobocastro! Just wanted to ask, is there anything else needed for the OEP that we should finalize?

@tiagolobocastro
Member

Nothing I can think of, just need a few more reviews here, CC @abhilashshetty04 @niladrih @Abhinandan-Purkait

@tiagolobocastro
Member

bors merge

@tiagolobocastro
Member

bors merge

@tiagolobocastro
Member

bors merge

@tiagolobocastro
Member

@pchandra19 any clues why this is stuck?

@tiagolobocastro tiagolobocastro force-pushed the feat/oep-4111-node-status-metrics branch from 70d65c9 to 17124a1 Compare April 16, 2026 11:00
@tiagolobocastro
Member

bors merge

@pchandra19 pchandra19 closed this Apr 16, 2026
@pchandra19 pchandra19 reopened this Apr 16, 2026
@pchandra19
Contributor

bors merge

@pchandra19
Contributor

bors merge

@pchandra19
Contributor

bors try

@tiagolobocastro
Member

bors cancel

Expose three Prometheus gauge metrics per io-engine node, fetched
inline at scrape time from the control-plane REST API:

  - mayastor_node_online    (1 = Online, 0 = Offline)
  - mayastor_node_cordoned  (1 = cordoned, draining, or drained)
  - mayastor_node_draining  (1 = draining or drained)

Node data is fetched on demand (GET /v0/nodes/{node_id}) at each
Prometheus scrape, matching the pull-model semantics used by the
existing pool, nexus, and replica metric collectors. No background
polling thread or persistent cache is used, eliminating the risk of
stale data being recorded with a scrape timestamp.
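
The gauge semantics listed above can be sketched as a small mapping function. This is an illustration only: `CordonDrain` is a hypothetical simplification of the cordon/drain state rather than the openapi type, `online` is reduced to a boolean, and `Gauges` and `to_gauges` are made-up names.

```rust
/// Hypothetical simplification of the cordon/drain state on the node spec.
#[derive(Clone, Copy, Debug)]
enum CordonDrain {
    Cordoned,
    Draining,
    Drained,
}

#[derive(Debug, PartialEq)]
struct Gauges {
    online: u8,
    cordoned: u8,
    draining: u8,
}

fn to_gauges(online: bool, state: Option<CordonDrain>) -> Gauges {
    // Any cordon/drain state counts as cordoned.
    let cordoned = state.is_some();
    // Draining and drained both set the draining gauge.
    let draining = matches!(
        state,
        Some(CordonDrain::Draining | CordonDrain::Drained)
    );
    Gauges {
        online: online as u8,
        cordoned: cordoned as u8,
        draining: draining as u8,
    }
}

fn main() {
    let states = [
        None,
        Some(CordonDrain::Cordoned),
        Some(CordonDrain::Draining),
        Some(CordonDrain::Drained),
    ];
    for state in states {
        println!("{:?} -> {:?}", state, to_gauges(true, state));
    }
}
```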

Key implementation details:
- REST endpoint configured via MAYASTOR_REST_ENDPOINT env var /
  --rest-endpoint CLI flag (optional; metrics omitted if not set)
- Scrape timeout configurable via MAYASTOR_SCRAPE_TIMEOUT (default 10s)
- Uses openapi::models types directly (Node, NodeSpec, NodeState,
  CordonDrainState) instead of custom duplicates
- NodeStatusClient fetches a single node by ID using the tower-based
  openapi client; endpoint URL parsed via the url crate
- Graceful degradation: if the REST call fails, node status metrics
  are omitted for that scrape (consistent with gRPC failure behaviour)
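
The configuration rules above (optional endpoint, 10s default timeout) can be sketched with the standard library alone. The helper names are hypothetical, the flag-over-environment precedence is an assumption, and the whole-seconds parser is a simplified stand-in for the humantime-based parsing mentioned in the review thread.

```rust
use std::env;
use std::time::Duration;

const DEFAULT_SCRAPE_TIMEOUT: Duration = Duration::from_secs(10);

/// Assumed precedence: the CLI flag wins over the environment variable.
/// `None` (or an empty value) means node status metrics are omitted.
fn resolve_endpoint(flag: Option<String>, env_value: Option<String>) -> Option<String> {
    flag.or(env_value).filter(|s| !s.trim().is_empty())
}

/// Accepts whole seconds such as "10s" or "10"; falls back to the default.
fn parse_timeout(raw: Option<&str>) -> Result<Duration, String> {
    match raw {
        None => Ok(DEFAULT_SCRAPE_TIMEOUT),
        Some(text) => {
            let trimmed = text.trim();
            let digits = trimmed.strip_suffix('s').unwrap_or(trimmed);
            digits
                .parse::<u64>()
                .map(Duration::from_secs)
                .map_err(|e| format!("invalid timeout {:?}: {}", text, e))
        }
    }
}

fn main() {
    // No CLI parsing in this sketch, so the flag value is always None here.
    let endpoint = resolve_endpoint(None, env::var("MAYASTOR_REST_ENDPOINT").ok());
    let raw_timeout = env::var("MAYASTOR_SCRAPE_TIMEOUT").ok();
    match (endpoint, parse_timeout(raw_timeout.as_deref())) {
        (Some(url), Ok(timeout)) => println!("node metrics on: {} (timeout {:?})", url, timeout),
        (None, _) => println!("no REST endpoint set; node status metrics omitted"),
        (_, Err(e)) => eprintln!("{}", e),
    }
}
```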

Also includes upstream develop syncs and CI workflow additions picked
up while the branch was in development.

Signed-off-by: IyanekiB <iyan.n@outlook.com>
@tiagolobocastro tiagolobocastro force-pushed the feat/oep-4111-node-status-metrics branch from 17124a1 to 9482d14 Compare April 16, 2026 13:52
@tiagolobocastro
Member

bors ping

@bors-openebs-mayastor
Contributor

pong

@tiagolobocastro
Member

bors merge

Development

Successfully merging this pull request may close these issues.

[OEP 4111]: Mayastor Node-Status Metrics via Metrics-Exporter REST Integration

6 participants