
Conversation

Contributor

@schmikei schmikei commented Oct 29, 2025

This modernizes the CouchDB Mixin to use newer libraries.

Overview:
(screenshots)

Nodes:
(screenshots)

Logs:
(screenshot)

@schmikei schmikei marked this pull request as ready for review October 29, 2025 21:19
@schmikei schmikei requested a review from a team as a code owner October 29, 2025 21:19
Member

@Dasomeone Dasomeone left a comment


Going to head out in a second, so I'll publish my comments so far! Going to pick it back up in the morning.
@aalhour would also appreciate a second pair of eyes on this one :)

* The prometheusWithTotal source is used for backwards compatibility, as some metrics are suffixed with _total in earlier versions but lose the suffix in later versions of CouchDB.
* i.e. couchdb_open_os_files_total => couchdb_open_os_files
* This is to ensure that the signals for the metrics suffixed with _total continue to work as expected.
* This was identified as a noticeable change from 3.3.0 to 3.5.0.
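
A minimal sketch of how such a backwards-compatible expression could look; the signal name and field values here are hypothetical, not taken from the PR:

```jsonnet
// Hypothetical sketch: match both the old _total-suffixed metric and
// the newer unsuffixed one with a single PromQL `or`, so signals keep
// working across CouchDB versions.
openOSFiles: {
  name: 'Open OS files',
  type: 'gauge',
  sources: {
    prometheus: {
      expr: |||
        couchdb_open_os_files{%(queriesSelector)s}
        or
        couchdb_open_os_files_total{%(queriesSelector)s}
      |||,
    },
  },
},
```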
Member


Thank you! 🚀
Can you call this out in the readme as well, please? E.g. just what versions are supported, and what the different metricSources are for.

Comment on lines 159 to 198
averageRequestLatencyp50: {
  name: 'Average request latency p50',
  nameShort: 'Average request latency p50',
  type: 'raw',
  description: 'The average request latency p50 aggregated across all nodes.',
  unit: 's',
  sources: {
    prometheus: {
      expr: 'avg by(' + groupLabelAggTerm + ', quantile) (couchdb_request_time_seconds{%(queriesSelector)s, quantile="0.5"})',
      legendCustomTemplate: legendCustomTemplate + ' - p50',
    },
  },
},

averageRequestLatencyp75: {
  name: 'Average request latency p75',
  nameShort: 'Average request latency p75',
  type: 'raw',
  description: 'The average request latency p75 aggregated across all nodes.',
  unit: 's',
  sources: {
    prometheus: {
      expr: 'avg by(' + groupLabelAggTerm + ', quantile) (couchdb_request_time_seconds{%(queriesSelector)s, quantile="0.75"})',
      legendCustomTemplate: legendCustomTemplate + ' - p75',
    },
  },
},

averageRequestLatencyp95: {
  name: 'Average request latency p95',
  nameShort: 'Average request latency p95',
  type: 'raw',
  description: 'The average request latency p95 aggregated across all nodes.',
  unit: 's',
  sources: {
    prometheus: {
      expr: 'avg by(' + groupLabelAggTerm + ', quantile) (couchdb_request_time_seconds{%(queriesSelector)s, quantile="0.95"})',
      legendCustomTemplate: legendCustomTemplate + ' - p95',
    },
  },
Member


Same as my recent requests on the other old mixins: I think we could and should use histograms more, either as a replacement for or in addition to these signals.

Contributor Author

@schmikei schmikei Nov 13, 2025


Changed these quantile timeseries to histograms on both dashboards.
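
For reference, a percentile derived from histogram buckets typically looks something like the sketch below; the signal name and grouping labels are assumptions, not the PR's actual code:

```jsonnet
// Sketch: p95 computed from couchdb_request_time_seconds_bucket instead
// of the pre-computed quantile series. groupLabelAggTerm is assumed to be
// the same aggregation label string used by the other signals.
requestLatencyP95: {
  name: 'Request latency p95',
  type: 'raw',
  unit: 's',
  sources: {
    prometheus: {
      expr: 'histogram_quantile(0.95, sum by(' + groupLabelAggTerm + ', le) (rate(couchdb_request_time_seconds_bucket{%(queriesSelector)s}[$__rate_interval])))',
    },
  },
},
```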


@aalhour aalhour left a comment


I caught some issues, thanks for pushing this.

expr: |||
sum by(job, instance) (increase(couchdb_httpd_status_codes{code=~"4.*"}[5m])) > %(alertsWarning4xxResponseCodes5m)s
||| % $._config,
sum by(job, instance) (increase(couchdb_httpd_status_codes{code=~"4.."}[5m])) > %(alertsWarning4xxResponseCodes5m)s


Isn't the 4.* regex more general than the 4..?

Contributor Author


https://github.com/grafana/jsonnet-libs/actions/runs/19470908585/job/55717683328?pr=1522

Unfortunately it's considered a messy selector. For some reason the lint passes locally for me, but within CI it fails.

Member


The reason this fails is the Pint linter rules, which take a bit more setup to install locally. I've no strong feelings about this either way. On one hand it's good to be specific about the three-digit format; on the other, it could be an issue if status codes were mixed with text, e.g. 404NotFound as a label.

I doubt we will run into the latter, so for now I think 4.. is perfectly fine
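
For context, Prometheus label matchers are fully anchored, which is why the two regexes only differ at the edges; a quick sketch:

```jsonnet
// Prometheus regex matchers are implicitly anchored (^...$), so:
//   code=~"4.."  matches exactly three characters, e.g. "404" or "499",
//                but not "4" or "404NotFound"
//   code=~"4.*"  matches anything starting with "4", including "4" itself
//                and a hypothetical text label like "404NotFound"
expr: |||
  sum by(job, instance) (increase(couchdb_httpd_status_codes{code=~"4.."}[5m])) > %(alertsWarning4xxResponseCodes5m)s
||| % $._config,
```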

cluster+: {
  allValue: '.*',
},
couchb_cluster+: {


couchb --> couchdb, missing d character.

Contributor Author


Ah yep thank you for the catch!

{
  alert: 'CouchDBUnhealthyCluster',
  expr: |||
    min by(job, couchdb_cluster) (couchdb_couch_replicator_cluster_is_stable) < %(alertsCriticalClusterIsUnstable5m)s


Should the alerts use the filteringSelector from the config? Just wondering.

expr: |||
  min by(job, couchdb_cluster) (couchdb_couch_replicator_cluster_is_stable{%(filteringSelector)s}) < %(alertsCriticalClusterIsUnstable5m)s
||| % this.config,

Contributor Author


Went ahead and implemented these!

alertsCriticalClusterIsUnstable5m: 1,  // 1 is stable
alertsWarning4xxResponseCodes5m: 5,
alertsCritical5xxResponseCodes5m: 0,
alertsWarningRequestLatency5m: 500,  // ms


This is in milliseconds, which means that the alert threshold is broken.

{
  alert: 'CouchDBHighRequestLatency',
  expr: |||
    sum by(job, instance) (couchdb_request_time_seconds_sum / couchdb_request_time_seconds_count) > %(alertsCriticalRequestLatency5m)s


I know that you didn't change this, but I think there is a bug in the calculation: sum / count is in seconds, while the config declares the threshold in milliseconds (500). We should multiply the value before the compare operator by 1000. I think this is the second time I've caught this bug, which is not on you; it's there in the original implementations.

expr: |||
  sum by(job, instance) (couchdb_request_time_seconds_sum / couchdb_request_time_seconds_count) * 1000 > %(alertsWarningRequestLatency5m)s
||| % this.config,

Alternatively, we can change the alert threshold in the config. I am fine with either solution.
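
For the config-side variant, a sketch would be to declare the threshold in seconds so the raw sum/count ratio can be compared directly (value chosen here to keep the intended 500 ms):

```jsonnet
// Sketch: threshold expressed in seconds, matching the unit of
// couchdb_request_time_seconds_sum / couchdb_request_time_seconds_count.
alertsWarningRequestLatency5m: 0.5,  // seconds (equivalent to 500 ms)
```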

Contributor Author


Ah yep, I didn't really audit the content of the queries too closely. I'll go ahead and take a stab at a fix.

},
{
alert: 'CouchDBReplicatorConnectionOwnersCrashing',
alert: 'CouchDBReplicatorOwnersCrashing',


Would changing the alert name affect existing installations? What if someone uninstalls or upgrades an integration, would they get duplicate alerts?

Looping in @Dasomeone for more context.

unit: 'requests',
sources: {
  prometheus: {
    expr: 'sum by(' + groupLabelAggTerm + ') (couchdb_httpd_status_codes{%(queriesSelector)s, code=~"[45].*"})',


Why are we not using increase(...) like the goodResponseStatuses signal above it? There is also neither an interval nor an offset. At least when I read them, I get the feeling that they are complementary, but their expressions are not really consistent.
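
A sketch of what a consistent expression might look like, mirroring the increase(...) form used by goodResponseStatuses; the interval variable and regex form here are assumptions:

```jsonnet
// Hypothetical: align the error-status signal with its counterpart by
// wrapping the counter in increase() over the standard rate interval.
expr: 'sum by(' + groupLabelAggTerm + ') (increase(couchdb_httpd_status_codes{%(queriesSelector)s, code=~"[45].."}[$__rate_interval]))',
```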

+ g.query.prometheus.withInstant(true)
+ g.query.prometheus.withFormat('timeseries'),
])
+ g.panel.gauge.queryOptions.withDatasource('prometheus', '${' + this.grafana.variables.datasources.prometheus.name + '}')


Do we really want to specify the datasource here? I think it should be derived from the board config and selectors.

expr: 'couchdb_httpd_view_reads_total{%(queriesSelector)s}',
},
},
},


I see that these signals don't specify a rate(...)[$__rate_interval] like other counterparts in overview.libsonnet. Does the function also break the panels for you? At least for me with the databricks data, it can be a gauge type and that might break it. But here I see that these are counters 🤔

Contributor Author


These are counters so the rate gets auto-added from my experience!
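
From what the thread suggests, a counter-typed signal leaves the expr as the bare metric and lets the library add the rate; a sketch of that pattern (signal name and field values assumed):

```jsonnet
// Sketch: with type 'counter', the signals library is expected to wrap
// the metric in rate(...[$__rate_interval]) automatically, so the expr
// stays the bare counter.
viewReads: {
  name: 'View reads',
  type: 'counter',
  unit: 'rps',
  sources: {
    prometheus: {
      expr: 'couchdb_httpd_view_reads_total{%(queriesSelector)s}',
    },
  },
},
```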

Member

@Dasomeone Dasomeone left a comment


I gave this another pass while checking with the sample-app, and I realised the node dashboard is pretty cluttered and the panels hard to read.
I've left a couple of comments, but I think it's a more overarching issue for the node views, so happy to set up a call to discuss between us :)

Member


Still need to call out the version support here like I mentioned in the config file :)

Contributor Author


Called out in the readme :)

)
+ g.panel.timeSeries.panelOptions.withDescription('The total number of error response statuses aggregated across all nodes.')
+ g.panel.timeSeries.standardOptions.withUnit('rps'),
+ g.panel.timeSeries.panelOptions.withDescription('The total number of error response statuses (HTTP 4xx-5xx) aggregated across all nodes.')
Member


It just hit me that we're duplicating the description here.

If you're just using a single signal you can just reuse the signal's internal description so long as it still fits.

As far as I recall, it should be used by default, but if not you can access it directly via the description field on the signal API, e.g.
signals.overview.errorResponseStatuses.description

Contributor Author


Discussed async but there's currently no good way of avoiding the duplication 😢

Member


+1. Let's park this one for now, we can revisit once properly exposed via the signals api


+ g.panel.timeSeries.panelOptions.withDescription('The total number of bulk requests on a node.')
+ g.panel.timeSeries.standardOptions.withUnit('rps'),

nodeResponseStatusOverviewPanel:
Member


Similar to the histogram view, this pie chart doesn't look great due to the multiple instances and series. It frequently even stops rendering at all due to the number of values.
(screenshot)

I have a few thoughts on how we could solve this, and wanted to discuss because this really is a wider issue with node views:

  1. We could create a repeat rule, e.g. setting up a full row of all the queries we want, then repeating that for each value of instance dynamically.

  2. Per-visualisation decluttering / summing where it makes sense. E.g. here I mean discarding 0 values:

(screenshot)

  3. Heavy utilisation of table panels with sub-mixins, e.g. a table panel that per-instance has a latency histogram, pie chart, etc. That however also quickly gets overwhelming.

Happy to set up a call to discuss

Contributor Author


I think we can discard the zero values 👍
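
One minimal query-level sketch for discarding the zero values (panel-side filtering would also work; the expression shape here is an assumption, not the PR's code):

```jsonnet
// Hypothetical: append "> 0" so series that are flat zero drop out of
// the pie chart entirely.
expr: 'sum by(' + groupLabelAggTerm + ') (increase(couchdb_httpd_status_codes{%(queriesSelector)s}[$__range])) > 0',
```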

Comment on lines 23 to 25
overviewClusterHealthPanel:
g.panel.stat.new(title='Clusters healthy')
+ g.panel.stat.queryOptions.withTargets([signals.overview.clusterHealth.asTarget()])
Member


Minor nit here, but this isn't populated with metrics from our sample-app

(screenshot)

Contributor Author


This is somewhat expected, as the sample app did not set up replication. We could maybe add a feature request to track setting it up within the sample app, but the base setup covers most of the metrics.

This metric in particular is not generated with the single node approach: couchdb_couch_replicator_cluster_is_stable

Member


Ah good, just wanted to ensure that it was expected behaviour; it should've clicked for me that the sample-app wasn't clustered 😅 Thanks Keith!
