Add JSON-RPC utilization metrics and troubleshooting documentation #10553

julianbrost · 2025-09-09T13:03:55Z

This PR is based on #10266. It reduces that PR to the part I find most relevant, which allows for a simpler implementation, and adds troubleshooting documentation from which users can learn how to actually use the metric.

First of all, the troubleshooting documentation explains that a full TCP receive queue on a JSON-RPC connection socket can be a sign of the connection being overloaded and Icinga 2 processing the messages slower than they are coming in.
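As a rough illustration of that check (not part of this PR), the queue sizes can be read from the output of `ss -tn` on Linux. The sketch below assumes the usual column layout of `ss` (`State Recv-Q Send-Q Local Peer`) and the default Icinga 2 API port 5665. The function name `queued_bytes` and the sample addresses are made up for the example.

```python
def queued_bytes(ss_output: str, port: int = 5665) -> list[tuple[str, str, int, int]]:
    """Return (local, peer, recv_q, send_q) for TCP sockets using the given port.

    Assumes `ss -tn` style lines: State Recv-Q Send-Q Local:Port Peer:Port.
    """
    rows = []
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip the header line and anything that does not look like a socket row.
        if len(fields) < 5 or not fields[1].isdigit() or not fields[2].isdigit():
            continue
        local, peer = fields[3], fields[4]
        ports = {local.rsplit(":", 1)[-1], peer.rsplit(":", 1)[-1]}
        if str(port) in ports:
            rows.append((local, peer, int(fields[1]), int(fields[2])))
    return rows


# Hypothetical sample output, for illustration only.
SAMPLE = """\
State  Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
ESTAB  0       0       192.0.2.10:22       198.51.100.7:50514
ESTAB  425984  0       192.0.2.10:5665     192.0.2.20:41234
ESTAB  0       2048    192.0.2.10:47110    192.0.2.30:5665
"""

for local, peer, recv_q, send_q in queued_bytes(SAMPLE):
    print(f"{local} <-> {peer}: Recv-Q={recv_q} Send-Q={send_q}")
```

A Recv-Q that stays large across repeated runs is the pattern described above. A single non-zero reading is not conclusive on its own.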

Second, it adds the seconds_processing_messages attribute to Endpoint objects. Observing the rate at which this value changes makes it possible to estimate how close the connection is to being saturated, which is also explained in the troubleshooting documentation.
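A minimal sketch of that estimate, assuming seconds_processing_messages is a cumulative counter in seconds that is sampled twice at known wall-clock times. The `Sample` type and the `utilization` helper are hypothetical names for this example, and how the attribute is fetched (for instance via the API) is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    wall_clock: float                   # seconds, time at which the sample was taken
    seconds_processing_messages: float  # cumulative attribute value at that time


def utilization(earlier: Sample, later: Sample) -> float:
    """Fraction of the interval the endpoint spent processing messages."""
    elapsed = later.wall_clock - earlier.wall_clock
    if elapsed <= 0:
        raise ValueError("samples must be taken at increasing times")
    busy = later.seconds_processing_messages - earlier.seconds_processing_messages
    if busy < 0:
        # Assumption: a decreasing counter means it was reset, e.g. by a restart.
        raise ValueError("counter decreased between samples")
    return busy / elapsed


# Two samples taken 60 s apart in which the counter grew by 30 s.
print(utilization(Sample(100.0, 12.0), Sample(160.0, 42.0)))  # 0.5
```

Under these assumptions, a value that stays close to 1.0 suggests the connection is processing messages almost all of the time and is near saturation.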

This PR just uses the total duration that is already measured, which currently includes the time taken by CpuBoundWork. That differs from #10266, which added multiple metrics. I opted for a single metric for two reasons: First, it's enough to derive how busy the connection is. Second, I have ideas to remove CpuBoundWork from here soon (as in this or maybe next week) anyway, so for the moment I don't want to spend time adding and documenting metrics here that will probably become obsolete soon.

Al2Klimov · 2025-09-15T08:49:10Z

In other words, this PR closes #10266?

julianbrost · 2025-09-23T08:38:13Z

> In other words, this PR closes #10266?

My motivation for creating this PR was that I find it more useful to provide a metric together with an explanation of what it says and how it can be interpreted than to just dump a bunch of undocumented metrics where you need to read the source to understand them. So I wrote that documentation for the metric that was the reason for this task in the first place. If you want to provide something similar for the other metrics, feel free to do so. Note that I found writing that documentation useful, as it made me think more about the details, especially about what can already be derived from the TCP send/receive queue sizes. For example, your PR has a "time spent reading" metric. What can I actually learn from it? It would allow me to figure out whether the connection itself is the bottleneck, as in the link between the machines doesn't provide enough throughput. But this would already show in the TCP buffers: the send buffer would be full while the read buffer is empty.

Co-authored-by: Alexander A. Klimov <[email protected]>

yhabteab

LGTM!

julianbrost added area/distributed Distributed monitoring (master, satellites, clients) area/documentation End-user or developer help area/api REST API labels Sep 9, 2025

cla-bot bot added the cla/signed label Sep 9, 2025

julianbrost requested a review from yhabteab September 23, 2025 08:22

Al2Klimov and others added 4 commits September 23, 2025 11:04

Introduce AtomicDuration

4b2b45c

Measure and store message processing time per endpoint

e3ee07b

Co-authored-by: Alexander A. Klimov <[email protected]>

Endpoint expose seconds_processing_messages attribute

be2b1a8

Co-authored-by: Alexander A. Klimov <[email protected]>

Docs: add some troubleshooting for overloaded JSON-RPC connections

2a7fb13

julianbrost force-pushed the jsonrpc-utilization-metrics branch from e6ecc02 to 2a7fb13 Compare September 23, 2025 09:24

yhabteab approved these changes Sep 23, 2025

yhabteab added the consider backporting Should be considered for inclusion in a bugfix release label Sep 23, 2025

julianbrost merged commit 3674af7 into master Sep 23, 2025
28 checks passed

julianbrost deleted the jsonrpc-utilization-metrics branch September 23, 2025 14:05
