Skip to content

adding pool alert metrics#833

Open
abhilashshetty04 wants to merge 1 commit intodevelopfrom
pool_alert_exporter
Open

adding pool alert metrics#833
abhilashshetty04 wants to merge 1 commit intodevelopfrom
pool_alert_exporter

Conversation

@abhilashshetty04
Copy link
Copy Markdown
Member

This PR adds additional metrics on the diskpool to expose alert releated information.

@abhilashshetty04 abhilashshetty04 requested a review from a team as a code owner April 3, 2026 08:19
@Abhinandan-Purkait
Copy link
Copy Markdown
Member

Code wise looks good. Although it would be good to see how this will be consumed and mapped to charts?

self.alert_status
}

/// Get the collection of notice alert reasons for the pool.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this mean, what is this number, the enum value, the count.. ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought of concatenating all i32 into one f64 value. Pasted example below. attention reason has
IoError and IoStrallIntermittent both.

./io-engine-client pool list -ojson
[
{
"capacity": 1069547520,
"cluster_size": 4194304,
"committed": 0,
"disk_capacity": 1073741824,
"disk_info": [
{
"errors": {
"alerts": {
"attention": [
4,
2
],
"critical": [
1
],
"notice": [],
"status": 3,
"warning": []
},
"io_error_count": 3,
"io_error_threshold": 8,
"io_stall_transition_count": 2,
"io_stall_transition_threshold": 3,
"io_stalled": true
},
"uri": "uring:///dev/mapper/lvmvg-newlv"
}
],
"disks": [
"uring:///dev/mapper/lvmvg-newlv"
],
"encrypted": false,
"errors": {
"alerts": {
"attention": [
4,
2
],
"critical": [
1
],
"notice": [],
"status": 3,
"warning": []
},
"io_error_count": 3,
"io_error_threshold": 8,
"io_stall_transition_count": 2,
"io_stall_transition_threshold": 3,
"io_stalled": true
},
"max_expandable_size": 137271181312,
"md_info": {
"md_page_size": 4096,
"md_pages": 256,
"md_used_pages": 1
},
"name": "pool-uring",
"page_size": 4096,
"pooltype": 0,
"state": 4,
"used": 0,
"uuid": "03de92fb-1f6b-4121-ba3b-ee31934b9abc"
}
]
/bin # %

image

After discussing with @krishnaGajabi , Im thinking of exposing a variant present or not using labels. Will see how that works.


/// Get the alert status for the pool.
pub(crate) fn alert_status(&self) -> i32 {
self.alert_status
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the enum value?
In the node exporter, online is being created as a gauge, would it be better to do the same here, and create gauges for each status? (I don't know how these are typically exported...)

@abhilashshetty04
Copy link
Copy Markdown
Member Author

./io-engine-client pool list -ojson
[
{
"capacity": 5360320512,
"cluster_size": 4194304,
"committed": 0,
"disk_capacity": 5368709120,
"disk_info": [
{
"errors": {
"alerts": {
"attention": [
4
],
"critical": [],
"notice": [],
"status": 2,
"warning": [
3
]
},
"io_error_count": 3,
"io_error_threshold": 8,
"io_stall_transition_count": 3,
"io_stall_transition_threshold": 3,
"io_stalled": false
},
"uri": "uring:///dev/mapper/lvmvg-new"
}
],
"disks": [
"uring:///dev/mapper/lvmvg-new"
],
"encrypted": false,
"errors": {
"alerts": {
"attention": [
4
],
"critical": [],
"notice": [],
"status": 2,
"warning": [
3
]
},
"io_error_count": 3,
"io_error_threshold": 8,
"io_stall_transition_count": 3,
"io_stall_transition_threshold": 3,
"io_stalled": false
},
"max_expandable_size": 137271181312,
"md_info": {
"md_page_size": 4096,
"md_pages": 1280,
"md_used_pages": 1
},
"name": "pool-uring",
"page_size": 4096,
"pooltype": 0,
"state": 4,
"used": 0,
"uuid": "dba451a1-345d-4659-aa71-b252bf1b94ec"
}
]

This has attention and warning alert reasons. Please see alert with label name pool-uring.

image image

When in stalled state, alert type is Critical:

image

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot wasn't able to review any files in this pull request.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Signed-off-by: Abhilash Shetty <abhilash.shetty@datacore.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot wasn't able to review any files in this pull request.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Member

@tiagolobocastro tiagolobocastro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add some tests as well?

let opts = Opts::new(metric_name, metric_desc)
.subsystem("diskpool_alert")
.variable_labels(vec![
"node".to_string(),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's place these in a vec and reuse it here and below to avoid discrepancies

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

node is part of Collector struct now. Is that what you meant?

};
let cache_deref = cache.deref_mut();
let mut metric_family = Vec::with_capacity(3 * cache_deref.pool_mut().pools.capacity());
let node_name = match get_node_name() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not place the name in the collector self?

}
};
let cache_deref = cache.deref_mut();
let mut metric_family = Vec::with_capacity(3 * cache_deref.pool_mut().pools.capacity());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this doing? creating a vector with 3 x pool byte capacity??

Copy link
Copy Markdown
Member Author

@abhilashshetty04 abhilashshetty04 Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not byte capacity, cache_deref.pool_mut().pools holds Vec of PoolInfo. capacity is number of elements (poolInfo) present in the cache. cache is populated before the collect is called.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be 10 here as this Collector exports 10 metrics per pool.

fn update_alerts(&mut self, pool_alerts: &Vec<i32>) {
for i in pool_alerts {
match i {
0 => self.unknown = 1,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems quite brittle to just use these as raw int... should use the actual types

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We get Vec from gRPC. Here, For each metrics like io_alert_critical_reason or io_alert_attention_reason we have labels (io_error, io_stall_intermittent_exc etc) with value 0 or 1 (unset or set).

Hence keeping everything in i32 seemed straighforward.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants