Commit dedb779

postnati and schmikei authored
chore: Modernize opensearch-mixin (#1528)
* Modernize opensearch-mixin to grafonnet v11 and signals architecture
* Updated signal files to align to dashboards. Reworked signals to better use builder pattern. Updated panels for best practices
* Revert .gitignore to master version
* Ran make fmt
* Fixed linting errors for prometheus datasource.
* Reworked all panels to use commonlib as base.
* Updated README to match modernization changes.
* Update opensearch-mixin/signals/search-and-index-overview.libsonnet
  Co-authored-by: Keith Schmitt <32067685+schmikei@users.noreply.github.com>
* Update opensearch-mixin/signals/search-and-index-overview.libsonnet
  Co-authored-by: Keith Schmitt <32067685+schmikei@users.noreply.github.com>
* Update opensearch-mixin/signals/search-and-index-overview.libsonnet
  Co-authored-by: Keith Schmitt <32067685+schmikei@users.noreply.github.com>
* Update opensearch-mixin/mixin.libsonnet
  Co-authored-by: Keith Schmitt <32067685+schmikei@users.noreply.github.com>
* Update opensearch-mixin/g.libsonnet
  Co-authored-by: Keith Schmitt <32067685+schmikei@users.noreply.github.com>
* Added PR feedback
* Updated with PR feedback
* Fixed lint errors in alerts caused by changing filteringSelector to empty
* Updated with latest output
* Fixed dashboard import path for g.libsonnet
* Removed enableMultiCluster from README since it is no longer used in the mixin

---------

Co-authored-by: Keith Schmitt <32067685+schmikei@users.noreply.github.com>
1 parent 69c6cbc commit dedb779

27 files changed, +7034 -10372 lines changed

opensearch-mixin/.pint.hcl

Lines changed: 4 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,4 @@
1+
//ignore fragile promql selectors for OpenSearch latency alerts
2+
checks {
3+
disabled = ["promql/fragile"]
4+
}

opensearch-mixin/README.md

Lines changed: 10 additions & 27 deletions
@@ -14,31 +14,16 @@ and the following alerts:
1414
- OpenSearchRedCluster
1515
- OpenSearchUnstableShardReallocation
1616
- OpenSearchUnstableShardUnassigned
17-
- OpenSearchModerateNodeDiskUsage
18-
- OpenSearchHighNodeDiskUsage
19-
- OpenSearchModerateNodeCPUUsage
20-
- OpenSearchHighNodeCPUUsage
21-
- OpenSearchModerateNodeMemoryUsage
22-
- OpenSearchHighNodeMemoryUsage
17+
- OpenSearchHighNodeDiskUsage (warning and critical)
18+
- OpenSearchHighNodeCpuUsage (warning and critical)
19+
- OpenSearchHighNodeMemoryUsage (warning and critical)
2320
- OpenSearchModerateRequestLatency
2421
- OpenSearchModerateIndexLatency
2522

2623
>## **Note on the exporter plugin**
2724
>>The Prometheus exporter plugin provides the label `cluster` on the metrics, which represents the name given to the OpenSearch cluster.
2825
The mixin is looking for `opensearch_cluster` and the configuration snippets will include rules for creating the `opensearch_cluster` label and dropping the `cluster` label.
2926

30-
### Kubernetes deployments
31-
32-
By default, the mixin has `enableMultiCluster` set to `false` to account for those running OpenSearch clusters outside of kubernetes. To configure the mixin
33-
to work with deployments in kubernetes, set this to `true` in the `config.libsonnet` file.
34-
```jsonnet
35-
{
36-
_config+:: {
37-
enableMultiCluster: true,
38-
},
39-
}
40-
```
41-
4227
## OpenSearch Cluster Overview
4328

4429
The OpenSearch cluster overview dashboard provides details on cluster, node and shard status as well as cluster search and index summary details on a cluster level.
@@ -85,22 +70,20 @@ The OpenSearch search and index overview dashboard provides details on request p
8570
8671
## Alerts Overview
8772
88-
8973
| Alert | Summary |
9074
|-------------------------------------|---------------------------------------------------------------------------------|
9175
| OpenSearchYellowCluster | At least one of the clusters is reporting a yellow status. |
9276
| OpenSearchRedCluster | At least one of the clusters is reporting a red status. |
9377
| OpenSearchUnstableShardReallocation | A node has gone offline or has been disconnected triggering shard reallocation. |
9478
| OpenSearchUnstableShardUnassigned | There are shards that have been detected as unassigned. |
95-
| OpenSearchModerateNodeDiskUsage | The node disk usage has exceeded the warning threshold. |
96-
| OpenSearchHighNodeDiskUsage | The node disk usage has exceeded the critical threshold. |
97-
| OpenSearchModerateNodeCpuUsage | The node CPU usage has exceeded the warning threshold. |
98-
| OpenSearchHighNodeCpuUsage | The node CPU usage has exceeded the critical threshold. |
99-
| OpenSearchModerateNodeMemoryUsage | The node memory usage has exceeded the warning threshold. |
100-
| OpenSearchHighNodeMemoryUsage | The node memory usage has exceeded the critical threshold. |
79+
| OpenSearchHighNodeDiskUsage | The node disk usage has exceeded the configured threshold (warning or critical). |
80+
| OpenSearchHighNodeCpuUsage | The node CPU usage has exceeded the configured threshold (warning or critical). |
81+
| OpenSearchHighNodeMemoryUsage | The node memory usage has exceeded the configured threshold (warning or critical). |
10182
| OpenSearchModerateRequestLatency | The request latency has exceeded the warning threshold. |
10283
| OpenSearchModerateIndexLatency | The index latency has exceeded the warning threshold. |
10384
85+
Node resource alerts (disk, CPU, memory) use the same alert name for both warning and critical severity levels. This follows the Alertmanager inhibition pattern, allowing warning alerts to be automatically suppressed when critical alerts fire.
86+
10487
Default thresholds can be configured in `config.libsonnet`
10588

10689
```js
@@ -114,8 +97,8 @@ Default thresholds can be configured in `config.libsonnet`
11497
alertsCriticalCPUUsage: 85,
11598
alertsWarningMemoryUsage: 70,
11699
alertsCriticalMemoryUsage: 85,
117-
alertsWarningRequestLatency: 0.5, // seconds
118-
alertsWarningIndexLatency: 0.5, // seconds
100+
alertsWarningRequestLatency: 500, // milliseconds
101+
alertsWarningIndexLatency: 500, // milliseconds
119102
},
120103
}
121104
```
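The README note on inhibition relies on an inhibit rule in the Alertmanager configuration, which is not among the files shown here. A minimal sketch of such a rule, written as a Jsonnet object that renders to the `inhibit_rules` section of an Alertmanager config, could look like the following. The `equal` label list is an assumption about which labels the warning and critical series share; adjust it to the labels actually present on the alerts.

```jsonnet
// Sketch only: suppress a warning alert while a critical alert with the
// same name and the same identifying labels is firing.
{
  inhibit_rules: [
    {
      source_matchers: ['severity="critical"'],
      target_matchers: ['severity="warning"'],
      // Assumed label set; both alerts must agree on every label listed here.
      equal: ['alertname', 'job', 'opensearch_cluster', 'instance'],
    },
  ],
}
```

Because `alertname` is part of `equal`, a critical `OpenSearchHighNodeDiskUsage` only inhibits the warning `OpenSearchHighNodeDiskUsage` for the same node, which is what sharing one alert name across both severities makes possible.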
Lines changed: 32 additions & 32 deletions
@@ -1,14 +1,14 @@
11
{
2-
prometheusAlerts+:: {
2+
new(this): {
33
groups+: [
44
{
5-
name: $._config.uid + '-alerts',
5+
name: this.config.uid + '-alerts',
66
rules: [
77
{
88
alert: 'OpenSearchYellowCluster',
99
expr: |||
1010
opensearch_cluster_status{%(filteringSelector)s} == 1
11-
||| % $._config,
11+
||| % this.config,
1212
'for': '5m',
1313
labels: {
1414
severity: 'warning',
@@ -18,14 +18,14 @@
1818
description:
1919
(
2020
'{{$labels.cluster}} health status is yellow over the last 5 minutes'
21-
) % $._config,
21+
) % this.config,
2222
},
2323
},
2424
{
2525
alert: 'OpenSearchRedCluster',
2626
expr: |||
2727
opensearch_cluster_status{%(filteringSelector)s} == 2
28-
||| % $._config,
28+
||| % this.config,
2929
'for': '5m',
3030
labels: {
3131
severity: 'critical',
@@ -35,14 +35,14 @@
3535
description:
3636
(
3737
'{{$labels.cluster}} health status is red over the last 5 minutes'
38-
) % $._config,
38+
) % this.config,
3939
},
4040
},
4141
{
4242
alert: 'OpenSearchUnstableShardReallocation',
4343
expr: |||
44-
sum without(type) (opensearch_cluster_shards_number{%(filteringSelector)s, type="relocating"}) > %(alertsWarningShardReallocations)s
45-
||| % $._config,
44+
sum without(type) (opensearch_cluster_shards_number{type="relocating", %(filteringSelector)s}) > %(alertsWarningShardReallocations)s
45+
||| % this.config,
4646
'for': '1m',
4747
labels: {
4848
severity: 'warning',
@@ -51,14 +51,14 @@
5151
summary: 'A node has gone offline or has been disconnected triggering shard reallocation.',
5252
description: |||
5353
{{$labels.cluster}} has had {{ printf "%%.0f" $value }} shard reallocation over the last 1m which is above the threshold of %(alertsWarningShardReallocations)s.
54-
||| % $._config,
54+
||| % this.config,
5555
},
5656
},
5757
{
5858
alert: 'OpenSearchUnstableShardUnassigned',
5959
expr: |||
60-
sum without(type) (opensearch_cluster_shards_number{%(filteringSelector)s, type="unassigned"}) > %(alertsWarningShardUnassigned)s
61-
||| % $._config,
60+
sum without(type) (opensearch_cluster_shards_number{type="unassigned", %(filteringSelector)s}) > %(alertsWarningShardUnassigned)s
61+
||| % this.config,
6262
'for': '5m',
6363
labels: {
6464
severity: 'warning',
@@ -67,14 +67,14 @@
6767
summary: 'There are shards that have been detected as unassigned.',
6868
description: |||
6969
{{$labels.cluster}} has had {{ printf "%%.0f" $value }} shard unassigned over the last 5m which is above the threshold of %(alertsWarningShardUnassigned)s.
70-
||| % $._config,
70+
||| % this.config,
7171
},
7272
},
7373
{
7474
alert: 'OpenSearchHighNodeDiskUsage',
7575
expr: |||
7676
100 * sum without(nodeid, path, mount, type) ((opensearch_fs_path_total_bytes{%(filteringSelector)s} - opensearch_fs_path_free_bytes{%(filteringSelector)s}) / opensearch_fs_path_total_bytes{%(filteringSelector)s}) > %(alertsWarningDiskUsage)s
77-
||| % $._config,
77+
||| % this.config,
7878
'for': '5m',
7979
labels: {
8080
severity: 'warning',
@@ -83,14 +83,14 @@
8383
summary: 'The node disk usage has exceeded the warning threshold.',
8484
description: |||
8585
{{$labels.node}} has had {{ printf "%%.0f" $value }} disk usage over the last 5m which is above the threshold of %(alertsWarningDiskUsage)s.
86-
||| % $._config,
86+
||| % this.config,
8787
},
8888
},
8989
{
9090
alert: 'OpenSearchHighNodeDiskUsage',
9191
expr: |||
9292
100 * sum without(nodeid, path, mount, type) ((opensearch_fs_path_total_bytes{%(filteringSelector)s} - opensearch_fs_path_free_bytes{%(filteringSelector)s}) / opensearch_fs_path_total_bytes{%(filteringSelector)s}) > %(alertsCriticalDiskUsage)s
93-
||| % $._config,
93+
||| % this.config,
9494
'for': '5m',
9595
labels: {
9696
severity: 'critical',
@@ -99,14 +99,14 @@
9999
summary: 'The node disk usage has exceeded the critical threshold.',
100100
description: |||
101101
{{$labels.node}} has had {{ printf "%%.0f" $value }}%% disk usage over the last 5m which is above the threshold of %(alertsCriticalDiskUsage)s.
102-
||| % $._config,
102+
||| % this.config,
103103
},
104104
},
105105
{
106106
alert: 'OpenSearchHighNodeCpuUsage',
107107
expr: |||
108108
sum without(nodeid) (opensearch_os_cpu_percent{%(filteringSelector)s}) > %(alertsWarningCPUUsage)s
109-
||| % $._config,
109+
||| % this.config,
110110
'for': '5m',
111111
labels: {
112112
severity: 'warning',
@@ -115,14 +115,14 @@
115115
summary: 'The node CPU usage has exceeded the warning threshold.',
116116
description: |||
117117
{{$labels.node}} has had {{ printf "%%.0f" $value }}%% CPU usage over the last 5m which is above the threshold of %(alertsWarningCPUUsage)s.
118-
||| % $._config,
118+
||| % this.config,
119119
},
120120
},
121121
{
122122
alert: 'OpenSearchHighNodeCpuUsage',
123123
expr: |||
124124
sum without(nodeid) (opensearch_os_cpu_percent{%(filteringSelector)s}) > %(alertsCriticalCPUUsage)s
125-
||| % $._config,
125+
||| % this.config,
126126
'for': '5m',
127127
labels: {
128128
severity: 'critical',
@@ -131,14 +131,14 @@
131131
summary: 'The node CPU usage has exceeded the critical threshold.',
132132
description: |||
133133
{{$labels.node}} has had {{ printf "%%.0f" $value }}%% CPU usage over the last 5m which is above the threshold of %(alertsCriticalCPUUsage)s.
134-
||| % $._config,
134+
||| % this.config,
135135
},
136136
},
137137
{
138138
alert: 'OpenSearchHighNodeMemoryUsage',
139139
expr: |||
140140
sum without(nodeid) (opensearch_os_mem_used_percent{%(filteringSelector)s}) > %(alertsWarningMemoryUsage)s
141-
||| % $._config,
141+
||| % this.config,
142142
'for': '5m',
143143
labels: {
144144
severity: 'warning',
@@ -147,14 +147,14 @@
147147
summary: 'The node memory usage has exceeded the warning threshold.',
148148
description: |||
149149
{{$labels.node}} has had {{ printf "%%.0f" $value }}%% memory usage over the last 5m which is above the threshold of %(alertsWarningMemoryUsage)s.
150-
||| % $._config,
150+
||| % this.config,
151151
},
152152
},
153153
{
154154
alert: 'OpenSearchHighNodeMemoryUsage',
155155
expr: |||
156156
sum without(nodeid) (opensearch_os_mem_used_percent{%(filteringSelector)s}) > %(alertsCriticalMemoryUsage)s
157-
||| % $._config,
157+
||| % this.config,
158158
'for': '5m',
159159
labels: {
160160
severity: 'critical',
@@ -163,39 +163,39 @@
163163
summary: 'The node memory usage has exceeded the critical threshold.',
164164
description: |||
165165
{{$labels.node}} has had {{ printf "%%.0f" $value }}%% memory usage over the last 5m which is above the threshold of %(alertsCriticalMemoryUsage)s.
166-
||| % $._config,
166+
||| % this.config,
167167
},
168168
},
169169
{
170170
alert: 'OpenSearchModerateRequestLatency',
171171
expr: |||
172-
sum without(context) ((increase(opensearch_index_search_fetch_time_seconds{%(filteringSelector)s, context="total"}[5m])+increase(opensearch_index_search_query_time_seconds{context="total"}[5m])+increase(opensearch_index_search_scroll_time_seconds{context="total"}[5m])) / clamp_min(increase(opensearch_index_search_fetch_count{context="total"}[5m])+increase(opensearch_index_search_query_count{context="total"}[5m])+increase(opensearch_index_search_scroll_count{context="total"}[5m]), 1)) > %(alertsWarningRequestLatency)s
173-
||| % $._config,
172+
sum without(context) ((increase(opensearch_index_search_fetch_time_seconds{context="total", %(filteringSelector)s}[5m])+increase(opensearch_index_search_query_time_seconds{context="total"}[5m])+increase(opensearch_index_search_scroll_time_seconds{context="total"}[5m])) / clamp_min(increase(opensearch_index_search_fetch_count{context="total"}[5m])+increase(opensearch_index_search_query_count{context="total"}[5m])+increase(opensearch_index_search_scroll_count{context="total"}[5m]), 1)) > %(alertsWarningRequestLatency)s / 1000
173+
||| % this.config,
174174
'for': '5m',
175175
labels: {
176176
severity: 'warning',
177177
},
178178
annotations: {
179179
summary: 'The request latency has exceeded the warning threshold.',
180180
description: |||
181-
{{$labels.index}} has had {{ printf "%%.0f" $value }}s of request latency over the last 5m which is above the threshold of %(alertsWarningRequestLatency)s.
182-
||| % $._config,
181+
{{$labels.index}} has had {{ printf "%%.0f" $value }}s of request latency over the last 5m which is above the threshold of %(alertsWarningRequestLatency)sms.
182+
||| % this.config,
183183
},
184184
},
185185
{
186186
alert: 'OpenSearchModerateIndexLatency',
187187
expr: |||
188-
sum without(context) (increase(opensearch_index_indexing_index_time_seconds{%(filteringSelector)s, context="total"}[5m]) / clamp_min(increase(opensearch_index_indexing_index_count{context="total"}[5m]), 1)) > %(alertsWarningIndexLatency)s
189-
||| % $._config,
188+
sum without(context) (increase(opensearch_index_indexing_index_time_seconds{context="total", %(filteringSelector)s}[5m]) / clamp_min(increase(opensearch_index_indexing_index_count{context="total"}[5m]), 1)) > %(alertsWarningIndexLatency)s / 1000
189+
||| % this.config,
190190
'for': '5m',
191191
labels: {
192192
severity: 'warning',
193193
},
194194
annotations: {
195195
summary: 'The index latency has exceeded the warning threshold.',
196196
description: |||
197-
{{$labels.index}} has had {{ printf "%%.0f" $value }}s of index latency over the last 5m which is above the threshold of %(alertsWarningIndexLatency)s.
198-
||| % $._config,
197+
{{$labels.index}} has had {{ printf "%%.0f" $value }}s of index latency over the last 5m which is above the threshold of %(alertsWarningIndexLatency)sms.
198+
||| % this.config,
199199
},
200200
},
201201
],
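
The two recurring changes in this file are the move from `$._config` to a `new(this)` function and the reordering of label matchers so that `%(filteringSelector)s` comes last. The following is a minimal sketch of both, reduced to one rule; the `config` object is a hypothetical stand-in for `config.libsonnet`, not the real file.

```jsonnet
// Hypothetical stand-in for config.libsonnet, reduced to the fields used below.
local config = {
  uid: 'opensearch',
  filteringSelector: '',  // empty by default after this commit
  alertsWarningIndexLatency: 500,  // milliseconds
};

// Same shape as the alerts file in this diff: a function of `this`
// rather than a mixin field that reads the global `$._config`.
local alerts = {
  new(this): {
    groups+: [
      {
        name: this.config.uid + '-alerts',
        rules: [
          {
            alert: 'OpenSearchModerateIndexLatency',
            // The *_time_seconds metric is in seconds, so the millisecond
            // threshold is divided by 1000 inside the expression.
            expr: |||
              sum without(context) (increase(opensearch_index_indexing_index_time_seconds{context="total", %(filteringSelector)s}[5m]) / clamp_min(increase(opensearch_index_indexing_index_count{context="total"}[5m]), 1)) > %(alertsWarningIndexLatency)s / 1000
            ||| % this.config,
            'for': '5m',
            labels: { severity: 'warning' },
          },
        ],
      },
    ],
  },
};

// Rendering with the stub config yields `{context="total", }` and `> 500 / 1000`.
alerts.new({ config: config })
```

With an empty `filteringSelector`, the rendered matcher list is `{context="total", }`. PromQL accepts a trailing comma in a matcher list but not a leading one, which is presumably why the static matcher now comes first: the old order would have rendered `{, context="total"}`.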

opensearch-mixin/config.libsonnet

Lines changed: 38 additions & 25 deletions
@@ -1,31 +1,44 @@
11
{
2-
_config+:: {
3-
enableMultiCluster: false,
4-
// extra static selector to apply to all templated variables and alerts
5-
filteringSelector: if self.enableMultiCluster then 'cluster!="",opensearch_cluster!=""' else 'opensearch_cluster!=""',
6-
groupLabels: if self.enableMultiCluster then ['job', 'cluster', 'opensearch_cluster'] else ['job', 'opensearch_cluster'],
7-
instanceLabels: ['node'],
8-
dashboardTags: ['opensearch-mixin'],
9-
dashboardPeriod: 'now-1h',
10-
dashboardTimezone: 'default',
11-
dashboardRefresh: '1m',
12-
dashboardNamePrefix: '',
2+
local this = self,
3+
filteringSelector: '',
4+
groupLabels: ['job', 'cluster', 'opensearch_cluster'],
5+
logLabels: ['job', 'cluster', 'opensearch_cluster'],
6+
instanceLabels: ['instance'],
137

14-
// prefix dashboards uids
15-
uid: 'opensearch',
8+
uid: 'opensearch',
9+
dashboardTags: [self.uid + '-mixin'],
10+
dashboardNamePrefix: 'OpenSearch',
11+
dashboardPeriod: 'now-1h',
12+
dashboardTimezone: 'default',
13+
dashboardRefresh: '1m',
14+
metricsSource: ['prometheus'], // metrics source for signals
1615

17-
// alerts thresholds
18-
alertsWarningShardReallocations: 0,
19-
alertsWarningShardUnassigned: 0,
20-
alertsWarningDiskUsage: 60,
21-
alertsCriticalDiskUsage: 80,
22-
alertsWarningCPUUsage: 70,
23-
alertsCriticalCPUUsage: 85,
24-
alertsWarningMemoryUsage: 70,
25-
alertsCriticalMemoryUsage: 85,
26-
alertsWarningRequestLatency: 0.5, // seconds
27-
alertsWarningIndexLatency: 0.5, // seconds
16+
// Logging configuration
17+
enableLokiLogs: true,
18+
extraLogLabels: ['level', 'severity'], // Required by logs-lib
19+
logsVolumeGroupBy: 'level',
20+
showLogsVolume: true,
2821

29-
enableLokiLogs: true,
22+
// Agg Lists
23+
groupAggList: std.join(',', this.groupLabels),
24+
groupAggListWithInstance: std.join(',', this.groupLabels + this.instanceLabels),
25+
26+
// Alerts configuration
27+
alertsWarningShardReallocations: 0, // count
28+
alertsWarningShardUnassigned: 0, // count
29+
alertsWarningDiskUsage: 60, // %
30+
alertsCriticalDiskUsage: 80, // %
31+
alertsWarningCPUUsage: 70, // %
32+
alertsCriticalCPUUsage: 85, // %
33+
alertsWarningMemoryUsage: 70, // %
34+
alertsCriticalMemoryUsage: 85, // %
35+
alertsWarningRequestLatency: 500, // milliseconds
36+
alertsWarningIndexLatency: 500, // milliseconds
37+
38+
// Signals configuration
39+
signals+: {
40+
clusterOverview: (import './signals/cluster-overview.libsonnet')(this),
41+
nodeOverview: (import './signals/node-overview.libsonnet')(this),
42+
searchAndIndexOverview: (import './signals/search-and-index-overview.libsonnet')(this),
3043
},
3144
}
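
The new config binds `local this = self` and derives `groupAggList` and `groupAggListWithInstance` from the label lists. Because `self` is late-bound in Jsonnet, an override of `groupLabels` or `instanceLabels` also changes the derived strings. The sketch below shows this with a reduced copy of the relevant fields; the override value `['node']` is only an illustration.

```jsonnet
// Reduced copy of the label fields from the new config.libsonnet.
local config = {
  local this = self,
  groupLabels: ['job', 'cluster', 'opensearch_cluster'],
  instanceLabels: ['instance'],
  groupAggList: std.join(',', this.groupLabels),
  groupAggListWithInstance: std.join(',', this.groupLabels + this.instanceLabels),
};

// Overriding instanceLabels: groupAggListWithInstance is recomputed and
// evaluates to "job,cluster,opensearch_cluster,node".
config + { instanceLabels: ['node'] }
```

The same mechanism applies to the alert thresholds: overriding a field such as `alertsWarningDiskUsage` in an object added on top of the config changes every expression that formats it in.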
