
Conversation

Contributor

@schmikei schmikei commented Oct 29, 2025

This modernizes the CouchDB Mixin to use newer libraries.

Overview:
(screenshots)

Nodes:
(screenshots)

Logs:
(screenshot)

@schmikei schmikei marked this pull request as ready for review October 29, 2025 21:19
@schmikei schmikei requested a review from a team as a code owner October 29, 2025 21:19
Member

@Dasomeone Dasomeone left a comment


Going to head out in a second, so I'll publish my comments so far! Going to pick it back up in the morning.
@aalhour would also appreciate a second pair of eyes on this one :)

* The prometheusWithTotal source is used for backwards compatibility, as some metrics are suffixed with _total in earlier versions but lose the suffix in later versions of CouchDB.
* i.e. couchdb_open_os_files_total => couchdb_open_os_files
* This is to ensure that the signals for the metrics suffixed with _total continue to work as expected.
* This was identified as a noticeable change from 3.3.0 to 3.5.0.
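
A minimal sketch of how such a backwards-compatible expression could look; the signal name and field values here are hypothetical, not taken from the PR:

```jsonnet
// Hypothetical sketch: match both the old _total-suffixed metric and
// the newer unsuffixed one with a single PromQL `or`, so signals keep
// working across CouchDB versions.
openOSFiles: {
  name: 'Open OS files',
  type: 'gauge',
  sources: {
    prometheus: {
      expr: |||
        couchdb_open_os_files{%(queriesSelector)s}
        or
        couchdb_open_os_files_total{%(queriesSelector)s}
      |||,
    },
  },
},
```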
Member


Thank you! 🚀
Can you call this out in the readme as well, please? E.g. just what versions are supported, and what the different metricSources are for.

Comment on lines 159 to 198
averageRequestLatencyp50: {
  name: 'Average request latency p50',
  nameShort: 'Average request latency p50',
  type: 'raw',
  description: 'The average request latency p50 aggregated across all nodes.',
  unit: 's',
  sources: {
    prometheus: {
      expr: 'avg by(' + groupLabelAggTerm + ', quantile) (couchdb_request_time_seconds{%(queriesSelector)s, quantile="0.5"})',
      legendCustomTemplate: legendCustomTemplate + ' - p50',
    },
  },
},

averageRequestLatencyp75: {
  name: 'Average request latency p75',
  nameShort: 'Average request latency p75',
  type: 'raw',
  description: 'The average request latency p75 aggregated across all nodes.',
  unit: 's',
  sources: {
    prometheus: {
      expr: 'avg by(' + groupLabelAggTerm + ', quantile) (couchdb_request_time_seconds{%(queriesSelector)s, quantile="0.75"})',
      legendCustomTemplate: legendCustomTemplate + ' - p75',
    },
  },
},

averageRequestLatencyp95: {
  name: 'Average request latency p95',
  nameShort: 'Average request latency p95',
  type: 'raw',
  description: 'The average request latency p95 aggregated across all nodes.',
  unit: 's',
  sources: {
    prometheus: {
      expr: 'avg by(' + groupLabelAggTerm + ', quantile) (couchdb_request_time_seconds{%(queriesSelector)s, quantile="0.95"})',
      legendCustomTemplate: legendCustomTemplate + ' - p95',
    },
  },
Member


Same as my recent requests on the other old mixins: I think we could and should use histograms more, either as a replacement for or in addition to these signals.

Contributor Author

@schmikei schmikei Nov 13, 2025


Changed these quantile timeseries to histograms on both dashboards.
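
For reference, a percentile derived from histogram buckets typically looks something like the sketch below; the signal name and grouping labels are assumptions, not the PR's actual code:

```jsonnet
// Sketch: p95 computed from couchdb_request_time_seconds_bucket instead
// of the pre-computed quantile series. groupLabelAggTerm is assumed to be
// the same aggregation label string used by the other signals.
requestLatencyP95: {
  name: 'Request latency p95',
  type: 'raw',
  unit: 's',
  sources: {
    prometheus: {
      expr: 'histogram_quantile(0.95, sum by(' + groupLabelAggTerm + ', le) (rate(couchdb_request_time_seconds_bucket{%(queriesSelector)s}[$__rate_interval])))',
    },
  },
},
```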


@aalhour aalhour left a comment


I caught some issues, thanks for pushing this.

expr: |||
sum by(job, instance) (increase(couchdb_httpd_status_codes{code=~"4.*"}[5m])) > %(alertsWarning4xxResponseCodes5m)s
||| % $._config,
sum by(job, instance) (increase(couchdb_httpd_status_codes{code=~"4.."}[5m])) > %(alertsWarning4xxResponseCodes5m)s


Isn't the 4.* regex more general than the 4..?

Contributor Author


https://github.com/grafana/jsonnet-libs/actions/runs/19470908585/job/55717683328?pr=1522

Unfortunately it's considered a messy selector. For some reason the lint passes locally for me, but within CI it fails.

Member


The reason this fails is the Pint linter rules, which take a bit more setup to install locally. I've no strong feelings about this either way. On one hand it's good to be specific about the three-digit format; on the other, it could be an issue if status codes were mixed with text, e.g. 404NotFound as a label.

I doubt we will run into the latter, so for now I think 4.. is perfectly fine
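
For context, Prometheus label matchers are fully anchored, which is why the two regexes only differ at the edges; a quick sketch:

```jsonnet
// Prometheus regex matchers are implicitly anchored (^...$), so:
//   code=~"4.."  matches exactly three characters, e.g. "404" or "499",
//                but not "4" or "404NotFound"
//   code=~"4.*"  matches anything starting with "4", including "4" itself
//                and a hypothetical text label like "404NotFound"
expr: |||
  sum by(job, instance) (increase(couchdb_httpd_status_codes{code=~"4.."}[5m])) > %(alertsWarning4xxResponseCodes5m)s
||| % $._config,
```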

cluster+: {
  allValue: '.*',
},
couchb_cluster+: {


couchb --> couchdb, missing d character.

Contributor Author


Ah yep thank you for the catch!

{
  alert: 'CouchDBUnhealthyCluster',
  expr: |||
    min by(job, couchdb_cluster) (couchdb_couch_replicator_cluster_is_stable) < %(alertsCriticalClusterIsUnstable5m)s


Should the alerts use the filteringSelector from the config? Just wondering.

expr: |||
  min by(job, couchdb_cluster) (couchdb_couch_replicator_cluster_is_stable{%(filteringSelector)s}) < %(alertsCriticalClusterIsUnstable5m)s
||| % this.config,

Contributor Author


Went ahead and implemented these!

alertsCriticalClusterIsUnstable5m: 1,  // 1 is stable
alertsWarning4xxResponseCodes5m: 5,
alertsCritical5xxResponseCodes5m: 0,
alertsWarningRequestLatency5m: 500,  // ms


This is in milliseconds, which means that the alert threshold is broken.

{
  alert: 'CouchDBHighRequestLatency',
  expr: |||
    sum by(job, instance) (couchdb_request_time_seconds_sum / couchdb_request_time_seconds_count) > %(alertsCriticalRequestLatency5m)s


I know that you didn't change this, but I think there is a bug in the calculation: sum / count is in seconds, while the config declares the threshold in milliseconds (500). We should multiply the value before the compare operator by 1000. I think this is the second time I've caught this bug, which is not on you; it's there in the original implementations.

expr: |||
  sum by(job, instance) (couchdb_request_time_seconds_sum / couchdb_request_time_seconds_count) * 1000 > %(alertsWarningRequestLatency5m)s
||| % this.config,

Alternatively, we can change the alert threshold in the config. I am fine with either solution.
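
For the config-side variant, a sketch would be to declare the threshold in seconds so the raw sum/count ratio can be compared directly (value chosen here to keep the intended 500 ms):

```jsonnet
// Sketch: threshold expressed in seconds, matching the unit of
// couchdb_request_time_seconds_sum / couchdb_request_time_seconds_count.
alertsWarningRequestLatency5m: 0.5,  // seconds (equivalent to 500 ms)
```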

Contributor Author


Ah yep, I didn't really audit the content of the queries too closely. I'll go ahead and take a stab at a fix.

},
{
alert: 'CouchDBReplicatorConnectionOwnersCrashing',
alert: 'CouchDBReplicatorOwnersCrashing',


Would changing the alert name affect existing installations? What if someone uninstalls or upgrades an integration, would they get duplicate alerts?

Looping in @Dasomeone for more context.

unit: 'requests',
sources: {
  prometheus: {
    expr: 'sum by(' + groupLabelAggTerm + ') (couchdb_httpd_status_codes{%(queriesSelector)s, code=~"[45].*"})',


Why are we not using increase(...) like the goodResponseStatuses signal above it? There is also neither an interval nor an offset. At least when I read them, I get the feeling that they are complementary, but their expressions are not really consistent.
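
A sketch of what a consistent expression might look like, mirroring the increase(...) form used by goodResponseStatuses; the interval variable and regex form here are assumptions:

```jsonnet
// Hypothetical: align the error-status signal with its counterpart by
// wrapping the counter in increase() over the standard rate interval.
expr: 'sum by(' + groupLabelAggTerm + ') (increase(couchdb_httpd_status_codes{%(queriesSelector)s, code=~"[45].."}[$__rate_interval]))',
```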

+ g.query.prometheus.withInstant(true)
+ g.query.prometheus.withFormat('timeseries'),
])
+ g.panel.gauge.queryOptions.withDatasource('prometheus', '${' + this.grafana.variables.datasources.prometheus.name + '}')


Do we really want to specify the datasource here? I think it should be derived from the board config and selectors.

expr: 'couchdb_httpd_view_reads_total{%(queriesSelector)s}',
},
},
},


I see that these signals don't specify a rate(...)[$__rate_interval] like other counterparts in overview.libsonnet. Does the function also break the panels for you? At least for me with the databricks data, it can be a gauge type and that might break it. But here I see that these are counters 🤔

Contributor Author


These are counters so the rate gets auto-added from my experience!
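
From what the thread suggests, a counter-typed signal leaves the expr as the bare metric and lets the library add the rate; a sketch of that pattern (signal name and field values assumed):

```jsonnet
// Sketch: with type 'counter', the signals library is expected to wrap
// the metric in rate(...[$__rate_interval]) automatically, so the expr
// stays the bare counter.
viewReads: {
  name: 'View reads',
  type: 'counter',
  unit: 'rps',
  sources: {
    prometheus: {
      expr: 'couchdb_httpd_view_reads_total{%(queriesSelector)s}',
    },
  },
},
```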

Member

@Dasomeone Dasomeone left a comment


I gave this another pass while checking with the sample-app, and I realised the node dashboard is pretty cluttered and the panels hard to read.
I've left a couple of comments, but I think it's a more overarching issue for the node views, so happy to set up a call to discuss between us :)

Member


Still need to call out the version support here like I mentioned in the config file :)

Contributor Author


Called out in the readme :)

)
+ g.panel.timeSeries.panelOptions.withDescription('The total number of error response statuses aggregated across all nodes.')
+ g.panel.timeSeries.standardOptions.withUnit('rps'),
+ g.panel.timeSeries.panelOptions.withDescription('The total number of error response statuses (HTTP 4xx-5xx) aggregated across all nodes.')
Member


It just hit me that we're duplicating the description here.

If you're just using a single signal you can just reuse the signal's internal description so long as it still fits.

As far as I recall, it should be used by default, but if not you can access it directly via the description field on the signal API, e.g.
signals.overview.errorResponseStatuses.description

Contributor Author


Discussed async but there's currently no good way of avoiding the duplication 😢

Member


+1. Let's park this one for now, we can revisit once properly exposed via the signals api


+ g.panel.timeSeries.panelOptions.withDescription('The total number of bulk requests on a node.')
+ g.panel.timeSeries.standardOptions.withUnit('rps'),

nodeResponseStatusOverviewPanel:
Member


Similar to the histogram view, this pie chart doesn't look great due to the multiple instances and series. It frequently even stops rendering at all due to the number of values.
(screenshot)

I have a few thoughts on how we could solve this, and wanted to discuss because this really is a wider issue with node views:

  1. We could create a repeat rule, e.g. setting up a full row of all the queries we want, then repeating that for each value of instance dynamically.

  2. Per-visualisation decluttering / summing where it makes sense. E.g. here I mean discarding 0 values:

(screenshot)

  3. Heavy utilisation of table panels with sub-mixins, e.g. a table panel that per-instance has a latency histogram, pie chart, etc. That however also quickly gets overwhelming.

Happy to set up a call to discuss

Contributor Author


I think we can discard the zero values 👍
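
One minimal query-level sketch for discarding the zero values (panel-side filtering would also work; the expression shape here is an assumption, not the PR's code):

```jsonnet
// Hypothetical: append "> 0" so series that are flat zero drop out of
// the pie chart entirely.
expr: 'sum by(' + groupLabelAggTerm + ') (increase(couchdb_httpd_status_codes{%(queriesSelector)s}[$__range])) > 0',
```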

Comment on lines 23 to 25
overviewClusterHealthPanel:
g.panel.stat.new(title='Clusters healthy')
+ g.panel.stat.queryOptions.withTargets([signals.overview.clusterHealth.asTarget()])
Member


Minor nit here, but this isn't populated with metrics from our sample-app

(screenshot)

Contributor Author


This is somewhat expected, as the sample app did not set up replication. We could maybe add a feature request to track setting it up within the sample app, but the base setup covers most of the metrics.

This metric in particular is not generated with the single node approach: couchdb_couch_replicator_cluster_is_stable

Member


Ah good, just wanted to ensure that it was expected behaviour; it should've clicked for me that the sample-app wasn't clustered 😅 Thanks Keith!
