chore: Modernize the Apache CouchDB mixin #1522
Conversation
Dasomeone left a comment:
Going to head out in a second so I'll publish my comments so far! Going to pick it back up in the morning.
@aalhour would also appreciate a second pair of eyes on this one :)
* the prometheusWithTotal source is used for backwards compatibility, as some metrics are suffixed with _total in earlier versions of the couchdb-mixin but not in later ones,
* i.e. couchdb_open_os_files_total => couchdb_open_os_files.
* This is to ensure that the signals for metrics suffixed with _total continue to work as expected.
* This was identified as a noticeable change from 3.3.0 to 3.5.0.
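As an illustration of the rename that the prometheusWithTotal source works around, a minimal Python sketch (the helper is hypothetical, not part of the mixin):

```python
# Hypothetical helper: derive the legacy (_total-suffixed) and modern metric
# names, per the rename observed between CouchDB mixin 3.3.0 and 3.5.0,
# e.g. couchdb_open_os_files_total => couchdb_open_os_files.
def metric_name_variants(metric: str) -> tuple[str, str]:
    modern = metric.removesuffix('_total')  # modern name drops the suffix
    legacy = modern + '_total'              # legacy name keeps it
    return legacy, modern

metric_name_variants('couchdb_open_os_files_total')
# ('couchdb_open_os_files_total', 'couchdb_open_os_files')
```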
Thank you! 🚀
Can you call this out in the readme as well please? E.g. just what versions are supported, and what the different metricSources are for
averageRequestLatencyp50: {
  name: 'Average request latency p50',
  nameShort: 'Average request latency p50',
  type: 'raw',
  description: 'The average request latency p50 aggregated across all nodes.',
  unit: 's',
  sources: {
    prometheus: {
      expr: 'avg by(' + groupLabelAggTerm + ', quantile) (couchdb_request_time_seconds{%(queriesSelector)s, quantile="0.5"})',
      legendCustomTemplate: legendCustomTemplate + ' - p50',
    },
  },
},

averageRequestLatencyp75: {
  name: 'Average request latency p75',
  nameShort: 'Average request latency p75',
  type: 'raw',
  description: 'The average request latency p75 aggregated across all nodes.',
  unit: 's',
  sources: {
    prometheus: {
      expr: 'avg by(' + groupLabelAggTerm + ', quantile) (couchdb_request_time_seconds{%(queriesSelector)s, quantile="0.75"})',
      legendCustomTemplate: legendCustomTemplate + ' - p75',
    },
  },
},

averageRequestLatencyp95: {
  name: 'Average request latency p95',
  nameShort: 'Average request latency p95',
  type: 'raw',
  description: 'The average request latency p95 aggregated across all nodes.',
  unit: 's',
  sources: {
    prometheus: {
      expr: 'avg by(' + groupLabelAggTerm + ', quantile) (couchdb_request_time_seconds{%(queriesSelector)s, quantile="0.95"})',
      legendCustomTemplate: legendCustomTemplate + ' - p95',
    },
  },
Same as my recent requests on the other old mixins, I think we could and should use histograms more, either as a replacement or in addition to these signals
Changed these quantiles timeseries to histograms on both dashboards
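For context on why histograms work better here: histogram_quantile() can be applied after summing bucket counters across nodes, whereas averaging pre-computed quantiles is statistically unsound. A rough Python sketch of the bucket interpolation Prometheus performs (simplified; the real function also handles +Inf buckets and other edge cases):

```python
# Rough sketch of what Prometheus's histogram_quantile() computes, to show
# why histogram buckets (unlike pre-computed quantiles) can be aggregated
# across nodes first and quantiled afterwards.
def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound, cumulative_count) pairs."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # linearly interpolate within the bucket, as Prometheus does
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# e.g. 100 requests: 50 under 0.1 s, 90 under 0.5 s, all under 1 s
histogram_quantile(0.95, [(0.1, 50), (0.5, 90), (1.0, 100)])  # ≈ 0.75 s
```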
aalhour left a comment:
I caught some issues, thanks for pushing this.
expr: |||
  sum by(job, instance) (increase(couchdb_httpd_status_codes{code=~"4.*"}[5m])) > %(alertsWarning4xxResponseCodes5m)s
||| % $._config,

  sum by(job, instance) (increase(couchdb_httpd_status_codes{code=~"4.."}[5m])) > %(alertsWarning4xxResponseCodes5m)s
Isn't the 4.* regex more general than the 4..?
https://github.com/grafana/jsonnet-libs/actions/runs/19470908585/job/55717683328?pr=1522
Unfortunately it's considered a messy selector. For some reason the lint passes locally for me but fails within CI.
The reason for this failing is the Pint linter rules, which take a bit more setup to install locally. I've no strong feelings about this either way. On one hand it's good to be specific about the three-digit format; on the other it could be an issue if mixed with text, e.g. 404NotFound as a label.
I doubt we will run into the latter, so for now I think 4.. is perfectly fine.
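For anyone following along: Prometheus label matchers (=~) are fully anchored RE2 regexes, so 4.* matches any label value that merely starts with a 4 (including something like 404NotFound), while 4.. requires exactly three characters. A quick sketch in Python, whose re module behaves the same as RE2 for simple patterns like these:

```python
import re

# Prometheus anchors label regexes at both ends; re.fullmatch gives the
# same anchoring semantics for these simple patterns.
def prom_label_matches(pattern: str, value: str) -> bool:
    return re.fullmatch(pattern, value) is not None

prom_label_matches("4.*", "404")          # True
prom_label_matches("4.*", "404NotFound")  # True  (4.* is the broader match)
prom_label_matches("4..", "404")          # True
prom_label_matches("4..", "404NotFound")  # False (exactly three characters)
```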
apache-couchdb-mixin/mixin.libsonnet
Outdated
cluster+: {
  allValue: '.*',
},
couchb_cluster+: {
couchb --> couchdb, missing d character.
Ah yep thank you for the catch!
{
  alert: 'CouchDBUnhealthyCluster',
  expr: |||
    min by(job, couchdb_cluster) (couchdb_couch_replicator_cluster_is_stable) < %(alertsCriticalClusterIsUnstable5m)s
Should the alerts use the filteringSelector from the config? Just wondering.
expr: |||
min by(job, couchdb_cluster) (couchdb_couch_replicator_cluster_is_stable{%(filteringSelector)s}) < %(alertsCriticalClusterIsUnstable5m)s
||| % this.config,
Went ahead and implemented these!
alertsCriticalClusterIsUnstable5m: 1,  // 1 is stable
alertsWarning4xxResponseCodes5m: 5,
alertsCritical5xxResponseCodes5m: 0,
alertsWarningRequestLatency5m: 500,  // ms
This is in milliseconds, which means that the alert threshold is broken.
{
  alert: 'CouchDBHighRequestLatency',
  expr: |||
    sum by(job, instance) (couchdb_request_time_seconds_sum / couchdb_request_time_seconds_count) > %(alertsCriticalRequestLatency5m)s
I know that you didn't change this, but I think there is a bug in the calculation: sum / count is in seconds, while the config declares the threshold in milliseconds (500), so we should multiply the value before the compare operator by 1000. I think this is the second time I've caught this bug; it's not on you, it's there in the original implementations.
expr: |||
sum by(job, instance) (couchdb_request_time_seconds_sum / couchdb_request_time_seconds_count) * 1000 > %(alertsWarningRequestLatency5m)s
||| % this.config,
Alternatively, we can change the alert threshold in the config. I am fine with either solution.
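To make the unit mismatch concrete, a small Python sketch (the helper names are illustrative, not from the mixin):

```python
# Illustration of the unit mismatch: couchdb_request_time_seconds_sum /
# _count yields seconds, but alertsWarningRequestLatency5m is declared in
# milliseconds (500).
def alert_fires_original(sum_seconds: float, count: int, threshold_ms: float) -> bool:
    # original expression: seconds compared directly against a ms threshold
    return sum_seconds / count > threshold_ms

def alert_fires_fixed(sum_seconds: float, count: int, threshold_ms: float) -> bool:
    # proposed fix: convert the average to milliseconds before comparing
    return (sum_seconds / count) * 1000 > threshold_ms

# An average latency of 0.8 s should trip a 500 ms threshold:
alert_fires_original(80.0, 100, 500)  # False: never fires at realistic latencies
alert_fires_fixed(80.0, 100, 500)     # True: 800 ms > 500 ms
```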
Ah yep didn't really audit the content of the queries too much. Will go ahead and take a stab at a fix
},
{
  alert: 'CouchDBReplicatorConnectionOwnersCrashing',

  alert: 'CouchDBReplicatorOwnersCrashing',
Would changing the alert name affect existing installations? What if someone uninstalls or upgrades an integration, would they get duplicate alerts?
Looping in @Dasomeone for more context.
unit: 'requests',
sources: {
  prometheus: {
    expr: 'sum by(' + groupLabelAggTerm + ') (couchdb_httpd_status_codes{%(queriesSelector)s, code=~"[45].*"})',
Why are we not using increase(...) like the goodResponseStatuses signal above it? There is also neither an interval nor an offset. At least when I read them, they feel complementary, but their expressions are not really consistent.
+ g.query.prometheus.withInstant(true)
+ g.query.prometheus.withFormat('timeseries'),
])
+ g.panel.gauge.queryOptions.withDatasource('prometheus', '${' + this.grafana.variables.datasources.prometheus.name + '}')
Do we really want to specify the datasource here? I think it should be derived from the board config and selectors.
expr: 'couchdb_httpd_view_reads_total{%(queriesSelector)s}',
      },
    },
  },
I see that these signals don't specify a rate(...)[$__rate_interval] like their counterparts in overview.libsonnet; does the function also break the panels for you? At least for me with the Databricks data, a metric can be a gauge type and the rate might break it. But here I see that these are counters 🤔
These are counters so the rate gets auto-added from my experience!
Dasomeone left a comment:
I gave this another pass while checking with the sample-app and I realised the node dashboard is pretty cluttered and the panels hard to read.
I've left a couple comments but I think it's a more overarching issue for the node views so happy to set up a call to discuss between us :)
Still need to call out the version support here like I mentioned in the config file :)
Called out in the readme :)
)
+ g.panel.timeSeries.panelOptions.withDescription('The total number of error response statuses aggregated across all nodes.')
+ g.panel.timeSeries.standardOptions.withUnit('rps'),

+ g.panel.timeSeries.panelOptions.withDescription('The total number of error response statuses (HTTP 4xx-5xx) aggregated across all nodes.')
It just hit me that we're duplicating the description here.
If you're just using a single signal you can just reuse the signal's internal description so long as it still fits.
As far as I recall, it should be used by default, but if not you can access it directly via the description field on the signal API, e.g.
signals.overview.errorResponseStatuses.description
Discussed async but there's currently no good way of avoiding the duplication 😢
+1. Let's park this one for now, we can revisit once properly exposed via the signals api
+ g.panel.timeSeries.panelOptions.withDescription('The total number of bulk requests on a node.')
+ g.panel.timeSeries.standardOptions.withUnit('rps'),

nodeResponseStatusOverviewPanel:
Similar to the histogram view, this pie chart doesn't look great due to the multiple instances and series. It frequently even stops rendering entirely due to the number of values.

I have a few thoughts on how we could solve this, and wanted to discuss because this really is a wider issue with node views:

- We could create a repeat rule, e.g. setting up a full row of all the queries we want, then repeating that dynamically for each value of instance
- Per-visualisation decluttering / summing where it makes sense, e.g. here I mean discarding 0 values
- Heavy utilisation of table panels with sub-mixins, e.g. have a table panel that per-instance has a latency histogram, pie chart, etc. That however also quickly gets overwhelming.

Happy to set up a call to discuss
I think we can discard the zero values 👍
overviewClusterHealthPanel:
  g.panel.stat.new(title='Clusters healthy')
  + g.panel.stat.queryOptions.withTargets([signals.overview.clusterHealth.asTarget()])
This is somewhat expected, as the sample app did not set up replication. We could maybe add a feature request to track setting it up within the sample app, but the base setup got most of the metrics.
This metric in particular is not generated with the single-node approach: couchdb_couch_replicator_cluster_is_stable
Ah good, just wanted to ensure that it was an expected behaviour, should've clicked for me that the sample-app wasn't clustered 😅 Thanks Keith!

This modernizes the CouchDB Mixin to use newer libraries.
Overview: (screenshots)
Nodes: (screenshots)
Logs: (screenshot)