julianbrost
Contributor

This PR is based on #10266, reducing it to the part of it I find most relevant, allowing for a simpler implementation, and adding some troubleshooting documentation from which users can learn how to actually use it.

First of all, the troubleshooting documentation explains that a full TCP receive queue on a JSON-RPC connection socket can be a sign of the connection being overloaded and Icinga 2 processing the messages slower than they are coming in.

Second, it adds the seconds_processing_messages attribute to Endpoint objects. Observing the rate at which this value changes makes it possible to estimate how close the connection is to being saturated, which is also explained in the troubleshooting documentation.
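To illustrate the idea of watching the rate of change, here is a minimal sketch. It assumes seconds_processing_messages behaves like a cumulative counter of seconds and that two samples were taken a known interval apart. The function name, its parameters, and the sample values are hypothetical and not part of Icinga 2.

```python
def connection_utilization(prev_seconds, cur_seconds, interval_seconds):
    """Estimate the fraction of the sampling interval spent processing messages.

    prev_seconds / cur_seconds: two samples of seconds_processing_messages
    (assumed here to be a cumulative counter of seconds).
    interval_seconds: wall-clock time between the two samples.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = cur_seconds - prev_seconds
    if delta < 0:
        # Assumption: the counter may start over, e.g. after a restart.
        # Such a sample pair is not usable, so signal that with None.
        return None
    return delta / interval_seconds


# Hypothetical samples taken 60 seconds apart.
print(connection_utilization(120.0, 165.0, 60.0))  # 0.75
```

Under these assumptions, a value close to 1 would suggest that the connection spends nearly all of its time processing messages and has little headroom left.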

This PR just uses the already measured total duration, which currently includes the time taken by CpuBoundWork. This differs from #10266, which added multiple metrics. I opted for a single metric for two reasons. First, it's enough to derive how busy the connection is. Second, I have ideas to remove CpuBoundWork from here soon (as in this or maybe next week) anyway, so for the moment I don't want to spend time adding and documenting metrics here that will probably become obsolete soon.

@julianbrost julianbrost added area/distributed Distributed monitoring (master, satellites, clients) area/documentation End-user or developer help area/api REST API labels Sep 9, 2025
@cla-bot cla-bot bot added the cla/signed label Sep 9, 2025
@Al2Klimov
Member

In other words, this PR closes #10266?

@julianbrost
Contributor Author

In other words, this PR closes #10266?

My motivation for creating this PR was that I find it more useful to provide a metric with an explanation of what it says and how it can be interpreted than to just dump a bunch of metrics without documentation, where you need to read the source to understand them. So I wrote that documentation for the metric that was the reason for this task in the first place. If you want to provide something similar for the other metrics, feel free to do so. Note that I found writing that documentation useful, as it made me think more about the details, especially what can already be derived from the TCP send/receive queue sizes. For example, your PR has a "time spent reading" metric: what can I actually learn from it? It would allow me to figure out whether the connection is the bottleneck, as in the link between the machines not providing enough throughput, but this would already show in the TCP buffers, as the send buffer would be full but the read buffer empty.
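The reasoning about TCP queues can be written down as a small decision helper. This is only a simplified sketch of the two cases discussed in this PR (a full receive queue pointing at slow message processing, a full send buffer with an empty read buffer pointing at the link). The function name and its boolean inputs are hypothetical, and deciding when a queue counts as "full" is left to the caller.

```python
def likely_bottleneck(sender_send_queue_full, receiver_recv_queue_full):
    """Rough interpretation of TCP queue states on a JSON-RPC connection.

    sender_send_queue_full: the sending side's TCP send buffer is full.
    receiver_recv_queue_full: the receiving side's TCP receive queue is full.
    """
    if receiver_recv_queue_full:
        # Messages arrive faster than the receiver processes them.
        return "receiver processing"
    if sender_send_queue_full:
        # Data piles up at the sender while the receiver keeps its queue
        # drained, so the link itself is not delivering enough throughput.
        return "network link"
    return "none apparent"


print(likely_bottleneck(True, False))   # network link
print(likely_bottleneck(False, True))   # receiver processing
```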

@julianbrost julianbrost force-pushed the jsonrpc-utilization-metrics branch from e6ecc02 to 2a7fb13 Compare September 23, 2025 09:24
Member

@yhabteab yhabteab left a comment

LGTM!

@yhabteab yhabteab added the consider backporting Should be considered for inclusion in a bugfix release label Sep 23, 2025
@julianbrost julianbrost merged commit 3674af7 into master Sep 23, 2025
28 checks passed
@julianbrost julianbrost deleted the jsonrpc-utilization-metrics branch September 23, 2025 14:05