@@ -2055,3 +2055,40 @@ Value data: 65534
More details in [this blogpost](https://www.netways.de/blog/2019/01/24/windows-blocking-icinga-2-with-ephemeral-port-range/)
and this [MS help entry](https://support.microsoft.com/en-us/help/196271/when-you-try-to-connect-from-tcp-ports-greater-than-5000-you-receive-t).
+
+ ### Cluster Troubleshooting: Overloaded JSON-RPC Connections <a id="troubleshooting-jsonrpc-overload"></a>
+
+ If JSON-RPC connections are overloaded, messages are processed with a delay. This can show up in symptoms like the
+ master lagging behind, for example showing check results only minutes after they were available on a satellite.
+
+ There are two ways this situation can be identified:
+
+ First, if a connection is overloaded, Icinga 2 reads data from it more slowly than it arrives, so pending messages
+ accumulate in the TCP receive queue on the overloaded endpoint and in the TCP send queue of other endpoints sending to
+ it. This can be checked by querying information about open TCP connections using the command
+ `ss --tcp --processes --numeric`. High values for Recv-Q on a socket used by the `icinga2` process can be a hint that
+ the local endpoint is not able to keep up with the messages from this connection. Note that small values (a few
+ kilobytes) are perfectly fine as messages can be in flight. Also, while the replay log is received, messages are
+ processed as fast as possible and the connection is operating at capacity, so the size of the TCP receive queue is
+ only meaningful after processing the replay log has finished.
+
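+ As a quick way to apply this check, the following shell sketch (assuming a Linux system where `ss`, `grep` and `watch`
+ are available) filters the output down to sockets owned by the `icinga2` process and watches the Recv-Q column over
+ time; a value that keeps growing suggests the local endpoint cannot keep up with that connection:
+
+ ```bash
+ # Run as root so --processes can show which process owns each socket.
+ # One-off snapshot: Recv-Q is the second column, Send-Q the third.
+ ss --tcp --processes --numeric | grep icinga2
+
+ # Re-run the query every 5 seconds to see whether the queues keep growing
+ # or only spike briefly (short spikes are usually harmless).
+ watch -n 5 "ss --tcp --processes --numeric | grep icinga2"
+ ```
+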
+ Second, Icinga 2.15.1 introduced a metric that can be used to estimate how much load there is on a particular
+ connection: the `seconds_processing_messages` attribute of `Endpoint` objects which can be
+ [queried using the API](12-icinga2-api.md#icinga2-api-config-objects-query). This value accumulates the total time
+ spent processing JSON-RPC messages from connections to that endpoint. In order to interpret this number, you have to
+ query it at least twice and calculate the rate at which it increased. For example, a rate of 0.4 (it increases by
+ 0.4 seconds every second) means that the connection is at around 40% of its maximum capacity. In practice, the rate
+ will never reach the theoretical maximum of 1 as there's also some time spent reading the messages, so if it's close
+ to 1, the connection might be overloaded or close to its capacity limit.
+
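+ The following shell sketch illustrates one way to calculate that rate. The API URL, the endpoint name
+ `satellite-host` and the `root:icinga` credentials are placeholders, and `curl`, `jq`, `bc` and access to the Icinga 2
+ API are assumed to be available:
+
+ ```bash
+ # Query the accumulated processing time for one endpoint, twice, 60 seconds apart.
+ url='https://localhost:5665/v1/objects/endpoints/satellite-host?attrs=seconds_processing_messages'
+ first=$(curl -k -s -u root:icinga "$url" | jq '.results[0].attrs.seconds_processing_messages')
+ sleep 60
+ second=$(curl -k -s -u root:icinga "$url" | jq '.results[0].attrs.seconds_processing_messages')
+
+ # Rate at which the counter increased; values close to 1 indicate the
+ # connection is running near its capacity limit.
+ echo "scale=2; ($second - $first) / 60" | bc
+ ```
+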
+ This limit in capacity exists because there can be implicit dependencies between different JSON-RPC messages,
+ requiring them to be processed in the same order that they were sent. This is currently ensured by processing all
+ messages from the same connection sequentially.
+
+ To work around this limit, the following approaches are possible:
+ 1. Try to redistribute load between connections. For example, if the overloaded connection is between the master and
+ a satellite zone, try splitting this zone into two, distributing the load across two connections.
+ 2. Reduce the load on that connection. Typically, the most frequent message type will be check results, so
+ increasing the check interval can be a first step.
+ 3. As the messages are processed sequentially, the throughput is limited by the single-core CPU performance of the
+ machine Icinga 2 is running on; switching to a more powerful one can increase the capacity of individual connections.