Feat: OEP 4111 node status metrics by IyanekiB · Pull Request #816 · openebs/mayastor-extensions

IyanekiB · 2026-02-06T02:34:34Z

Implements OEP-4111 to expose Mayastor node status metrics through the metrics-exporter. Aims to resolve #4111.

Description

This PR implements OEP-4111, extending the metrics-exporter with REST client capabilities to expose Mayastor node status metrics. The implementation adds a lightweight REST client that periodically polls the control-plane /v0/nodes endpoint and exposes node state as Prometheus-compatible gauges.

Key changes:

- REST client for /v0/nodes endpoint with connection pooling and timeout handling
- Five new Prometheus metrics:
  - mayastor_node_online (0/1) - node online status
  - mayastor_node_cordoned (0/1) - node cordoned status
  - mayastor_node_draining (0/1) - node draining status
  - mayastor_node_status_last_fetch_seconds - staleness detection
  - mayastor_node_status_fetch_errors_total - failure tracking
- Periodic polling with jitter (15s interval + 0-5s random jitter)
- In-memory caching with thread-safe RwLock
- Immediate fetch on startup (no initial delay)
- Configuration via CLI flags and environment variables
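
For illustration only, the polling schedule in the list above (15s base interval plus 0-5s of jitter) could be computed roughly as follows. This is a minimal sketch, not the code in this PR: `next_poll_delay` and its `unit_random` parameter are hypothetical names, and the random value is passed in by the caller so the sketch needs no external crate.

```rust
use std::time::Duration;

/// Base polling interval from the description (15 seconds).
const BASE_INTERVAL: Duration = Duration::from_secs(15);
/// Upper bound of the random jitter from the description (5 seconds).
const MAX_JITTER: Duration = Duration::from_secs(5);

/// Hypothetical helper: delay to wait before the next poll.
/// `unit_random` is a caller-supplied random value in [0.0, 1.0];
/// out-of-range values are clamped and NaN is treated as 0.0.
fn next_poll_delay(unit_random: f64) -> Duration {
    let fraction = if unit_random.is_nan() {
        0.0
    } else {
        unit_random.clamp(0.0, 1.0)
    };
    // Scale the jitter window by the random fraction and add it to the base.
    BASE_INTERVAL + MAX_JITTER.mul_f64(fraction)
}
```

Spreading the delay over a 15-20s window keeps multiple exporters from hitting the control-plane REST API at the same instant.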

Motivation and Context

This consolidates node-status metrics into the metrics-exporter, eliminating duplicated business logic and fragmented observability surfaces. Based on what we learned from PR #1035, this approach aims to create a maintainable pattern for REST-backed metrics while preserving compatibility with existing Prometheus/Grafana pipelines.
The control-plane remains the authoritative source of node state; the exporter simply polls, caches, and exposes this data in a Prometheus-compatible format.

Regression

No

How Has This Been Tested?

Client tests (wiremock-based):

Successful multi-node fetch
Cordoned/draining/offline node detection
HTTP error handling (500 responses)
Invalid JSON handling
Empty response handling
Rapid state transitions

Collector tests:

Metric descriptor validation (5 metrics)
Empty cache behavior
Online/offline/cordoned/draining state verification
State transition handling
OEP-4111 metric naming compliance

Manual validation:

Local Prometheus scrape validation confirmed metric format and labels
Verified immediate startup fetch (no 15s delay)
Tested staleness detection via last_fetch_seconds gauge
Validated error counter increments on REST failures

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.
I have added unit tests to cover my changes.

CC: @pjgranieri

abhilashshetty04 · 2026-02-18T06:15:52Z

The current metrics-exporter implementation fetches stats/states from io-engine when it is queried by Prometheus on /metrics. Can't we use the same procedure? Prometheus stores the data as a time series. We will introduce incorrect states w.r.t. time if we don't fetch when polled by Prometheus.

We used to have a separate polling period, but not anymore. It's in line with Prometheus scraping. Please refer to:

https://github.com/openebs/mayastor-extensions/blob/develop/metrics-exporter/src/bin/io_engine/serve/handler.rs#L18

pjgranieri · 2026-02-24T02:05:37Z

We absolutely could; we are aligned with that approach. The inline fetch at handler.rs:18 (store_resource_data(grpc_client()).await) is the pattern we will be following. Node status is going to be fetched the same way, via an inline REST call in metrics_handler() right alongside the existing gRPC fetch, so values will always reflect the actual state at scrape time:

store_resource_data(grpc_client()).await; // existing gRPC fetch
let node = fetch_node(&node_id, &client).await.ok(); // new inline REST fetch
let node_status_collector = NodeStatusCollector::new(node); // takes Option

The background poller, cache, and the polling-specific metrics are all going to be removed. Thank you for the pointer, let us know if you agree that we are working in the right direction with these ideas.
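
To make the `.ok()` step above concrete, here is a minimal sketch of how a failed fetch could translate into omitted metrics for that scrape. The `NodeStatus` struct, the `node` label name, and both function names are hypothetical stand-ins, not the types or signatures used in this PR.

```rust
/// Hypothetical stand-in for the node data returned by the REST fetch.
struct NodeStatus {
    id: String,
    online: bool,
}

/// Renders the online gauge in the Prometheus text exposition format.
/// When the fetch failed (`None`), nothing is emitted for this scrape.
fn render_node_metrics(node: Option<&NodeStatus>) -> String {
    match node {
        Some(n) => format!(
            "mayastor_node_online{{node=\"{}\"}} {}\n",
            n.id,
            u8::from(n.online)
        ),
        None => String::new(),
    }
}

/// Mirrors the `.ok()` call in the snippet: the error is discarded and the
/// renderer only sees whether node data is present.
fn collect(fetch: Result<NodeStatus, String>) -> String {
    render_node_metrics(fetch.ok().as_ref())
}
```

With this shape, an unreachable REST endpoint leaves a gap in the series for that scrape instead of recording a stale or invented value.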

IyanekiB · 2026-03-12T04:27:57Z

Hi everyone, just wanted to follow up on the review feedback. The background poller, cache, and polling-specific metrics have been removed; node status is now fetched inline at scrape time in metrics_handler(), which should be consistent with the existing pattern at handler.rs:18. The custom Node/NodeSpec/NodeState types in types.rs (now deleted) have been replaced with openapi::models equivalents, and the CLI has been cleaned up (polling_interval removed, rest_endpoint typed as Uri, scrape_timeout as humantime::Duration). Would appreciate a re-review when you get a chance.

tiagolobocastro

LGTM

@IyanekiB could you please squash your commits?

IyanekiB · 2026-03-19T14:30:39Z

@tiagolobocastro just finished squashing the commits and applied the changes you suggested. Would like a re-review if you don't mind. Thanks!

tiagolobocastro

thanks for contributing this @IyanekiB !

IyanekiB · 2026-03-30T18:39:51Z

Thanks @tiagolobocastro! Just wanted to ask, is there anything else needed for the OEP that we should finalize?

tiagolobocastro · 2026-04-07T08:47:33Z

Nothing I can think of, just need a few more reviews here, CC @abhilashshetty04 @niladrih @Abhinandan-Purkait

tiagolobocastro · 2026-04-13T17:28:01Z

bors merge

tiagolobocastro · 2026-04-16T10:52:52Z

bors merge

tiagolobocastro · 2026-04-16T10:54:19Z

bors merge

tiagolobocastro · 2026-04-16T10:59:24Z

@pchandra19 any clues why this is stuck?

tiagolobocastro · 2026-04-16T11:21:02Z

bors merge

pchandra19 · 2026-04-16T11:57:02Z

bors merge

pchandra19 · 2026-04-16T12:14:08Z

bors merge

pchandra19 · 2026-04-16T12:59:14Z

bors try

tiagolobocastro · 2026-04-16T13:18:57Z

bors cancel

Expose three Prometheus gauge metrics per io-engine node, fetched inline at scrape time from the control-plane REST API:
- mayastor_node_online (1 = Online, 0 = Offline)
- mayastor_node_cordoned (1 = cordoned, draining, or drained)
- mayastor_node_draining (1 = draining or drained)

Node data is fetched on demand (GET /v0/nodes/{node_id}) at each Prometheus scrape, matching the pull-model semantics used by the existing pool, nexus, and replica metric collectors. No background polling thread or persistent cache is used, eliminating the risk of stale data being recorded with a scrape timestamp.

Key implementation details:
- REST endpoint configured via MAYASTOR_REST_ENDPOINT env var / --rest-endpoint CLI flag (optional; metrics omitted if not set)
- Scrape timeout configurable via MAYASTOR_SCRAPE_TIMEOUT (default 10s)
- Uses openapi::models types directly (Node, NodeSpec, NodeState, CordonDrainState) instead of custom duplicates
- NodeStatusClient fetches a single node by ID using the tower-based openapi client; endpoint URL parsed via the url crate
- Graceful degradation: if the REST call fails, node status metrics are omitted for that scrape (consistent with gRPC failure behaviour)

Also includes upstream develop syncs and CI workflow additions picked up while the branch was in development.

Signed-off-by: IyanekiB <iyan.n@outlook.com>
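
The gauge semantics in the commit message above can be sketched as a small mapping. This is an illustration under stated assumptions: `Status`, `CordonDrain`, `NodeGauges`, and `node_gauges` are hypothetical local stand-ins, not the openapi::models types or the collector code from this PR.

```rust
/// Hypothetical stand-in for the node status reported by the control plane.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Status {
    Online,
    Offline,
}

/// Hypothetical stand-in for the cordon/drain state of a node.
#[derive(Clone, Copy, Debug, PartialEq)]
enum CordonDrain {
    Cordoned,
    Draining,
    Drained,
}

/// Values of the three gauges for one node.
#[derive(Debug, PartialEq)]
struct NodeGauges {
    online: i64,
    cordoned: i64,
    draining: i64,
}

/// Maps node state to gauge values:
/// online is 1 only for `Online`, cordoned is 1 for any cordon/drain state,
/// and draining is 1 for `Draining` or `Drained`.
fn node_gauges(status: Status, cordon_drain: Option<CordonDrain>) -> NodeGauges {
    let online = status == Status::Online;
    let cordoned = cordon_drain.is_some();
    let draining = matches!(
        cordon_drain,
        Some(CordonDrain::Draining | CordonDrain::Drained)
    );
    NodeGauges {
        online: i64::from(online),
        cordoned: i64::from(cordoned),
        draining: i64::from(draining),
    }
}
```

Under this mapping a draining node reports both mayastor_node_cordoned and mayastor_node_draining as 1, which matches the "cordoned, draining, or drained" wording.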

tiagolobocastro · 2026-04-16T14:27:53Z

bors ping

bors-openebs-mayastor · 2026-04-16T14:27:57Z

pong

tiagolobocastro · 2026-04-16T14:28:08Z

bors merge

IyanekiB requested a review from a team as a code owner February 6, 2026 02:34

IyanekiB force-pushed the feat/oep-4111-node-status-metrics branch from bf0024b to aabc548 Compare February 6, 2026 02:54

IyanekiB mentioned this pull request Feb 6, 2026

[OEP 4111]: Mayastor Node-Status Metrics via Metrics-Exporter REST Integration openebs/openebs#4111

tiagolobocastro reviewed Feb 9, 2026

tiagolobocastro reviewed Mar 12, 2026

pjgranieri force-pushed the feat/oep-4111-node-status-metrics branch from a5ac233 to 3e451cb Compare March 12, 2026 20:48

IyanekiB requested a review from tiagolobocastro March 18, 2026 21:23

tiagolobocastro reviewed Mar 19, 2026

Comment thread metrics-exporter/src/bin/io_engine/collector/node_status.rs

Comment thread metrics-exporter/src/bin/io_engine/node_status/client.rs Outdated

Comment thread metrics-exporter/src/bin/io_engine/main.rs

Comment thread metrics-exporter/src/bin/io_engine/main.rs Outdated

IyanekiB force-pushed the feat/oep-4111-node-status-metrics branch 2 times, most recently from 108b051 to 70d65c9 Compare March 19, 2026 14:04

IyanekiB requested a review from tiagolobocastro March 19, 2026 15:53

tiagolobocastro approved these changes Mar 27, 2026

Abhinandan-Purkait approved these changes Apr 9, 2026

tiagolobocastro closed this Apr 16, 2026

tiagolobocastro reopened this Apr 16, 2026

tiagolobocastro force-pushed the feat/oep-4111-node-status-metrics branch from 70d65c9 to 17124a1 Compare April 16, 2026 11:00

pchandra19 closed this Apr 16, 2026

pchandra19 reopened this Apr 16, 2026

tiagolobocastro closed this Apr 16, 2026

tiagolobocastro reopened this Apr 16, 2026

tiagolobocastro force-pushed the feat/oep-4111-node-status-metrics branch from 17124a1 to 9482d14 Compare April 16, 2026 13:52
