Add JSON-RPC utilization metrics and troubleshooting documentation #10553
This PR is based on #10266, reducing it to the part of it I find most relevant, allowing for a simpler implementation, and adding some troubleshooting documentation from which users can learn how to actually use it.
First of all, the troubleshooting documentation explains that a full TCP receive queue on a JSON-RPC connection socket can be a sign of the connection being overloaded and Icinga 2 processing the messages slower than they are coming in.
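One way to look at receive queues on Linux is to read `/proc/net/tcp`, where each socket line carries a hexadecimal `tx_queue:rx_queue` pair. The following is only a rough sketch, not the method from the troubleshooting documentation: the helper names are hypothetical, port 5665 is assumed as the JSON-RPC port, and only IPv4 sockets are covered.

```python
# Sketch: find established connections on the assumed JSON-RPC port
# whose receive queue is non-empty, based on /proc/net/tcp lines.
# Function names are hypothetical and chosen for illustration only.

TCP_ESTABLISHED = 0x01  # state code used in /proc/net/tcp


def parse_proc_net_tcp_line(line):
    """Parse one socket line of /proc/net/tcp into a small dict."""
    fields = line.split()
    _, local_port_hex = fields[1].split(":")
    _, remote_port_hex = fields[2].split(":")
    tx_hex, rx_hex = fields[4].split(":")
    return {
        "local_port": int(local_port_hex, 16),
        "remote_port": int(remote_port_hex, 16),
        "state": int(fields[3], 16),
        "tx_queue": int(tx_hex, 16),
        "rx_queue": int(rx_hex, 16),  # bytes received but not yet read
    }


def pending_receive_queues(lines, port=5665):
    """Return parsed entries for established connections involving
    `port` that still have unread bytes in their receive queue."""
    result = []
    for line in lines:
        fields = line.split()
        # Skip the header line and anything that is not a socket entry.
        if len(fields) < 5 or not fields[0].endswith(":"):
            continue
        entry = parse_proc_net_tcp_line(line)
        if entry["state"] != TCP_ESTABLISHED:
            continue
        if port not in (entry["local_port"], entry["remote_port"]):
            continue
        if entry["rx_queue"] > 0:
            result.append(entry)
    return result


def read_proc_net_tcp(path="/proc/net/tcp"):
    """Read the socket table from the given path (Linux only)."""
    with open(path) as f:
        return f.readlines()
```

A single non-zero reading is not conclusive; a receive queue that stays large over repeated checks is the pattern of interest here.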
Second, it adds the `seconds_processing_messages` attribute to `Endpoint` objects. Observing the rate at which this value changes allows estimating how close the connection is to being saturated, which is also explained in the troubleshooting documentation.
This PR just uses the already measured `total` duration, which currently includes the time taken by `CpuBoundWork`. This differs from #10266, which added multiple metrics. I opted for a single metric for two reasons: First, it's enough to derive how busy the connection is. Second, I have ideas to remove `CpuBoundWork` from here soon (as in this or maybe next week) anyway, so for the moment I don't want to spend time adding and documenting metrics here that will probably become obsolete soon.
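To illustrate how the rate of change of `seconds_processing_messages` relates to saturation, here is a minimal sketch. It assumes the attribute is a cumulative number of seconds and that two samples are taken at known wall-clock times. The function name and the way samples are obtained are hypothetical and not part of this PR.

```python
# Sketch: estimate how busy a JSON-RPC connection is from two samples
# of the (assumed cumulative) seconds_processing_messages attribute.

def connection_utilization(prev_seconds, prev_time, cur_seconds, cur_time):
    """Return the fraction of wall-clock time spent processing messages
    between two samples, or None if the counter appears to have reset.

    prev_seconds / cur_seconds: sampled attribute values in seconds.
    prev_time / cur_time: sample timestamps in seconds.
    """
    elapsed = cur_time - prev_time
    if elapsed <= 0:
        raise ValueError("samples must be taken at increasing times")
    delta = cur_seconds - prev_seconds
    if delta < 0:
        # The value went backwards, e.g. after a restart; skip this interval.
        return None
    return delta / elapsed


if __name__ == "__main__":
    # 27 seconds of processing within a 30 second interval.
    ratio = connection_utilization(100.0, 0.0, 127.0, 30.0)
    print(f"utilization: {ratio:.0%}")
```

A ratio that stays close to 1 means the connection spent nearly the whole interval processing messages, which suggests it is near saturation; a ratio near 0 means it was mostly idle.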