Skip to content

Commit 3674af7

Browse files
authored
Merge pull request #10553 from Icinga/jsonrpc-utilization-metrics
Add JSON-RPC utilization metrics and troubleshooting documentation
2 parents c2c9b6b + 2a7fb13 commit 3674af7

File tree

8 files changed

+132
-0
lines changed

8 files changed

+132
-0
lines changed

doc/15-troubleshooting.md

Lines changed: 37 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2055,3 +2055,40 @@ Value data: 65534
20552055

20562056
More details in [this blogpost](https://www.netways.de/blog/2019/01/24/windows-blocking-icinga-2-with-ephemeral-port-range/)
20572057
and this [MS help entry](https://support.microsoft.com/en-us/help/196271/when-you-try-to-connect-from-tcp-ports-greater-than-5000-you-receive-t).
2058+
2059+
### Cluster Troubleshooting: Overloaded JSON-RPC Connections <a id="troubleshooting-jsonrpc-overload"></a>
2060+
2061+
If JSON-RPC connections are overloaded, messages are processed with a delay. This can show in symptoms like the master
2062+
lagging behind, for example showing check results only minutes after they were available on a satellite.
2063+
2064+
There are two ways this situation can be identified:
2065+
2066+
First, if a connection is overloaded, Icinga 2 will read data from it slower than it arrives, so pending messages are
2067+
accumulating in the TCP receive queue on the overloaded endpoint and the TCP send queue of other endpoints sending to
2068+
it. This can be checked by querying information about open TCP connections using the command
2069+
`ss --tcp --processes --numeric`. High values for Recv-Q on a socket used by the `icinga2` process can be a hint that
2070+
the local endpoint is not able to keep up with the messages from this connection. Note that small values (a few
2071+
kilobytes) are perfectly fine as messages can be in flight. Also, while the replay log is received, messages are
2072+
processed as fast as possible and thus the connection is operating at capacity, thus the size of the TCP receive queue
2073+
is only meaningful after processing the replay log has finished.
2074+
2075+
Second, Icinga 2.15.1 introduced a metric that can be used to estimate how much load there is on a particular
2076+
connection: the `seconds_processing_messages` attribute of `Endpoint` objects which can be
2077+
[queried using the API](12-icinga2-api.md#icinga2-api-config-objects-query). This value accumulates the total time spent
2078+
processing JSON-RPC messages from connections to that endpoint. In order to interpret this number, you have to query it
2079+
at least twice and calculate the rate at which the number increased. For example, a rate of 0.4 (increases by 0.4s every
2080+
second) means that the connection is at around 40% of its maximum capacity. In practice, the rate will never reach the
2081+
theoretical maximum of 1 as there's also some time spent reading the messages, so if it's close to 1, the connection
2082+
might be overloaded or is close to its capacity limit.
2083+
2084+
This limit in capacity exists because all there can be implicit dependencies between different JSON-RPC messages,
2085+
requiring them to be processed in the same order that they were sent. This is currently ensured by processing all
2086+
messages from the same connection sequentially.
2087+
2088+
To work around this limit, the following approaches are possible:
2089+
1. Try to redistribute load between connections, for example if the overloaded connection is between the master and
2090+
a satellite zone, try splitting this zone into two, distributing the load across two connections.
2091+
2. Reduce the load on that connection. Typically, the most frequent message type will be check results, so reducing
2092+
the check interval can be a first step.
2093+
3. As the messages are processed sequentially, the throughput is limited by the single core CPU performance of the
2094+
machine Icinga 2 is running on, switching to a more powerful one can increase the capacity of individual connections.

lib/base/atomic.hpp

Lines changed: 38 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,7 @@
44
#define ATOMIC_H
55

66
#include <atomic>
7+
#include <chrono>
78
#include <mutex>
89
#include <type_traits>
910
#include <utility>
@@ -34,6 +35,43 @@ class Atomic : public std::atomic<T> {
3435
}
3536
};
3637

38+
/**
39+
* Accumulates time durations atomically.
40+
*
41+
* @ingroup base
42+
*/
43+
class AtomicDuration
44+
{
45+
public:
46+
using Clock = std::chrono::steady_clock;
47+
48+
/**
49+
* Adds the elapsedTime to this instance.
50+
*
51+
* May be called multiple times to accumulate time.
52+
*
53+
* @param elapsedTime The distance between two time points
54+
*
55+
* @return This instance for method chaining
56+
*/
57+
AtomicDuration& operator+=(const Clock::duration& elapsedTime) noexcept
58+
{
59+
m_Sum.fetch_add(elapsedTime.count(), std::memory_order_relaxed);
60+
return *this;
61+
}
62+
63+
/**
64+
* @return The total accumulated time in seconds
65+
*/
66+
operator double() const noexcept
67+
{
68+
return std::chrono::duration<double>(Clock::duration(m_Sum.load(std::memory_order_relaxed))).count();
69+
}
70+
71+
private:
72+
Atomic<Clock::duration::rep> m_Sum {0};
73+
};
74+
3775
/**
3876
* Wraps any T into a std::atomic<T>-like interface that locks using a mutex.
3977
*

lib/remote/endpoint.cpp

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -130,6 +130,11 @@ void Endpoint::AddMessageReceived(const intrusive_ptr<ApiFunction>& method)
130130
m_MessageCounters.at(method).fetch_add(1, std::memory_order_relaxed);
131131
}
132132

133+
void Endpoint::AddMessageProcessed(const AtomicDuration::Clock::duration& duration)
134+
{
135+
m_InputProcessingTime += duration;
136+
}
137+
133138
double Endpoint::GetMessagesSentPerSecond() const
134139
{
135140
return m_MessagesSent.CalculateRate(Utility::GetTime(), 60);
@@ -162,3 +167,8 @@ Dictionary::Ptr Endpoint::GetMessagesReceivedPerType() const
162167

163168
return new Dictionary(std::move(result));
164169
}
170+
171+
double Endpoint::GetSecondsProcessingMessages() const
172+
{
173+
return m_InputProcessingTime;
174+
}

lib/remote/endpoint.hpp

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -49,6 +49,7 @@ class Endpoint final : public ObjectImpl<Endpoint>
4949
void AddMessageSent(int bytes);
5050
void AddMessageReceived(int bytes);
5151
void AddMessageReceived(const intrusive_ptr<ApiFunction>& method);
52+
void AddMessageProcessed(const AtomicDuration::Clock::duration& duration);
5253

5354
double GetMessagesSentPerSecond() const override;
5455
double GetMessagesReceivedPerSecond() const override;
@@ -58,6 +59,8 @@ class Endpoint final : public ObjectImpl<Endpoint>
5859

5960
Dictionary::Ptr GetMessagesReceivedPerType() const override;
6061

62+
double GetSecondsProcessingMessages() const override;
63+
6164
protected:
6265
void OnAllConfigLoaded() override;
6366

@@ -71,6 +74,8 @@ class Endpoint final : public ObjectImpl<Endpoint>
7174
mutable RingBuffer m_MessagesReceived{60};
7275
mutable RingBuffer m_BytesSent{60};
7376
mutable RingBuffer m_BytesReceived{60};
77+
78+
AtomicDuration m_InputProcessingTime;
7479
};
7580

7681
}

lib/remote/endpoint.ti

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -58,6 +58,10 @@ class Endpoint : ConfigObject
5858
[no_user_modify, no_storage] Dictionary::Ptr messages_received_per_type {
5959
get;
6060
};
61+
62+
[no_user_modify, no_storage] double seconds_processing_messages {
63+
get;
64+
};
6165
};
6266

6367
}

lib/remote/jsonrpcconnection.cpp

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -106,6 +106,9 @@ void JsonRpcConnection::HandleIncomingMessages(boost::asio::yield_context yc)
106106
l_TaskStats.InsertValue(Utility::GetTime(), 1);
107107

108108
auto total = ch::steady_clock::now() - start;
109+
if (m_Endpoint) {
110+
m_Endpoint->AddMessageProcessed(total);
111+
}
109112

110113
Log msg(total >= ch::seconds(5) ? LogWarning : LogDebug, "JsonRpcConnection");
111114
msg << "Processed JSON-RPC '" << rpcMethod << "' message for identity '" << m_Identity

test/CMakeLists.txt

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -89,6 +89,7 @@ set(base_test_SOURCES
8989
icingaapplication-fixture.cpp
9090
utils.cpp
9191
base-array.cpp
92+
base-atomic.cpp
9293
base-base64.cpp
9394
base-convert.cpp
9495
base-dictionary.cpp

test/base-atomic.cpp

Lines changed: 34 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,34 @@
1+
/* Icinga 2 | (c) 2025 Icinga GmbH | GPLv2+ */
2+
3+
#include "base/atomic.hpp"
4+
#include <BoostTestTargetConfig.h>
5+
6+
using namespace icinga;
7+
8+
BOOST_AUTO_TEST_SUITE(base_atomic)
9+
10+
BOOST_AUTO_TEST_CASE(duration_none)
11+
{
12+
BOOST_CHECK_EQUAL(static_cast<double>(AtomicDuration()), 0);
13+
}
14+
15+
BOOST_AUTO_TEST_CASE(duration_one)
16+
{
17+
AtomicDuration sum;
18+
19+
sum += std::chrono::seconds(1);
20+
21+
BOOST_CHECK_EQUAL(static_cast<double>(sum), 1);
22+
}
23+
24+
BOOST_AUTO_TEST_CASE(duration_two)
25+
{
26+
AtomicDuration sum;
27+
28+
sum += std::chrono::seconds(1);
29+
sum += std::chrono::seconds(2);
30+
31+
BOOST_CHECK_EQUAL(static_cast<double>(sum), 3);
32+
}
33+
34+
BOOST_AUTO_TEST_SUITE_END()

0 commit comments

Comments
 (0)