
Conversation

@julianbrost (Contributor) commented Sep 29, 2025

The idea of CpuBoundWork was to prevent too many coroutines from performing long-running actions at the same time so that some worker threads are always available for other tasks. However, in practice, once all slots provided by CpuBoundWork were used, this would also block the handling of JSON-RPC messages, effectively bringing down the whole cluster communication.

The core idea of the replacement in this PR is still similar, the main differences are:

  • It is no longer used during JSON-RPC message handling. The corresponding handlers are rather quick, so they don't block the threads for long. Additionally, JSON-RPC message handling is essential for an Icinga 2 cluster to work, so making them wait for something makes little sense.
  • There is no longer an operation to wait for a slot. Instead, HTTP requests are now rejected with a 503 Service Unavailable error if no slot is available. This means that if there is too much load on an Icinga 2 instance from HTTP requests, this shows up as error messages instead of more and more waiting requests accumulating and response times increasing.

This does not limit the total number of running HTTP requests. In particular, those streaming the response (for example using chunked encoding) are no longer counted once they enter the streaming phase. There is a very specific reason for this: otherwise, a slow or malicious client would occupy the slot for the whole time it's reading the response, which could take a while. If that happens on multiple connections, the whole pool could be blocked by clients reading responses very slowly. Notably, this fixes a denial of service problem introduced by #10516 (not yet released), hence the blocker label.
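As an illustration of the mechanism, here is a minimal sketch of such a try-acquire slot pool. The member and method names follow the diff excerpts quoted further down; the surrounding class and the ScopedRelease helper (a stand-in for Icinga 2's Defer) are assumptions for illustration, not the actual implementation:

#include <functional>
#include <memory>
#include <mutex>
#include <utility>

// Stand-in for Icinga 2's Defer: runs the given callback when destroyed.
class ScopedRelease
{
public:
    explicit ScopedRelease(std::function<void()> fn) : m_Fn(std::move(fn)) {}
    ScopedRelease(const ScopedRelease&) = delete;
    ScopedRelease& operator=(const ScopedRelease&) = delete;
    ~ScopedRelease() { m_Fn(); }

private:
    std::function<void()> m_Fn;
};

class SlowSlotPool
{
public:
    // Returns a releaser on success or nullptr if all slots are taken;
    // the caller then answers with 503 instead of waiting for a slot.
    std::unique_ptr<ScopedRelease> TryAcquireSlowSlot()
    {
        std::unique_lock lock(m_SlowSlotsMutex);

        if (m_SlowSlotsAvailable == 0) {
            return nullptr;
        }

        m_SlowSlotsAvailable--;
        lock.unlock();

        return std::make_unique<ScopedRelease>([this] {
            std::unique_lock lock(m_SlowSlotsMutex);
            m_SlowSlotsAvailable++;
        });
    }

private:
    std::mutex m_SlowSlotsMutex;
    unsigned int m_SlowSlotsAvailable = 16; // e.g. one slot per worker thread
};

A handler that streams its response would simply destroy (or reset) the returned releaser once it enters the streaming phase, freeing the slot for other requests.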

Tests

Send enough slow requests at once. Slow requests can be forced, for example, by executing sleep() as a console command inside the running daemon:

cd "$(mktemp -d)" # create a directory for the outputs

for i in {1..17}; do
  curl -Ssku root:icinga https://localhost:5665/v1/console/execute-script --json '{"command":"sleep(3)"}' &> "$i" &
done

# wait for all the background requests to complete
wait

# extra loop + echo for pretty output as the files do not end in a newline
for f in *; do cat "$f"; echo; done

Example output:

{"results":[{"code":200,"result":null,"status":"Executed successfully."}]}
{"results":[{"code":200,"result":null,"status":"Executed successfully."}]}
{"results":[{"code":200,"result":null,"status":"Executed successfully."}]}
{"results":[{"code":200,"result":null,"status":"Executed successfully."}]}
{"results":[{"code":200,"result":null,"status":"Executed successfully."}]}
{"results":[{"code":200,"result":null,"status":"Executed successfully."}]}
{"results":[{"code":200,"result":null,"status":"Executed successfully."}]}
{"error":503,"status":"Too many requests already in progress, please try again later."}
{"results":[{"code":200,"result":null,"status":"Executed successfully."}]}
{"results":[{"code":200,"result":null,"status":"Executed successfully."}]}
{"results":[{"code":200,"result":null,"status":"Executed successfully."}]}
{"results":[{"code":200,"result":null,"status":"Executed successfully."}]}
{"results":[{"code":200,"result":null,"status":"Executed successfully."}]}
{"results":[{"code":200,"result":null,"status":"Executed successfully."}]}
{"results":[{"code":200,"result":null,"status":"Executed successfully."}]}
{"results":[{"code":200,"result":null,"status":"Executed successfully."}]}
{"results":[{"code":200,"result":null,"status":"Executed successfully."}]}

16 is the number of available slots on my machine. However, you might also see more requests failing if other clients send requests at the same time.

The idea of CpuBoundWork was to prevent too many coroutines from performing
long-running actions at the same time so that some worker threads are always
available for other tasks. However, in practice, once all slots provided by
CpuBoundWork were used, this would also block the handling of JSON-RPC
messages, effectively bringing down the whole cluster communication.

A replacement is added in the following commit.

This serves as a replacement for CpuBoundWork removed in the previous commit.
The core idea is still similar, the main differences are:

- It is no longer used during JSON-RPC message handling. The corresponding
  handlers are rather quick, so they don't block the threads for long.
  Additionally, JSON-RPC message handling is essential for an Icinga 2 cluster
  to work, so making them wait for something makes little sense.
- There is no longer an operation to wait for a slot. Instead, HTTP requests
  are now rejected with a 503 Service Unavailable error if no slot is
  available. This means that if there is too much load on an Icinga 2 instance
  from HTTP requests, this shows up as error messages instead of more and more
  waiting requests accumulating and response times increasing.

This commit does not limit the total number of running HTTP requests. In
particular, those streaming the response (for example using chunked encoding)
are no longer counted once they enter the streaming phase. There is a very
specific reason for this: otherwise, a slow or malicious client would block the
slot for the whole time it's reading the response, which could take a while. If
that happens on multiple connections, the whole pool could be blocked by
clients reading responses very slowly.
@julianbrost added the blocker, area/distributed, area/api and core/quality labels on Sep 29, 2025
@cla-bot added the cla/signed label on Sep 29, 2025
@julianbrost marked this pull request as draft on Sep 30, 2025 at 08:51
@julianbrost self-assigned this on Sep 30, 2025
@Al2Klimov (Member) left a comment

Just admit you don't like my #9990 even though it has shown amazing benchmark results. /s

Seriously speaking:

The idea of CpuBoundWork was to prevent too many coroutines from performing long-running actions at the same time so that some worker threads are always available for other tasks. However, in practice, once all slots provided by CpuBoundWork were used, this would also block the handling of JSON-RPC messages, effectively bringing down the whole cluster communication.

Precisely, CpuBoundWork was originally made to prevent a full house from freezing the whole I/O (accept(2), SSL, ...). Of course, waiting for a reply gets worse with O(time). But not even getting a connect(2) or SSL handshake done is the worst. It completely breaks the network stack, possibly including existing connections.

How does this PR prevent this?

  • It is no longer used during JSON-RPC message handling. The corresponding handlers are rather quick, so they don't block the threads for long.

Actually, they scale with O(input); I've already gotten Icinga to wait 5s+ for a slot due to 1MiB+ plugin output in check results. JSON decoding is included in CpuBoundWork for a good reason.

[2025-09-30 15:15:01 +0000] warning/JsonRpcConnection: Processed JSON-RPC 'event::CheckResult' message for identity '10.27.2.86' (took total 7194ms, waited 7159ms on semaphore).

Additionally, JSON-RPC message handling is essential for an Icinga 2 cluster to work, so making them wait for something makes little sense.

Except where something is another JSON-RPC message. See my arguments above.

If you don't want to wait for HTTP, the solution is dead simple: #10046


return std::make_unique<Defer>([this] {
    std::unique_lock lock(m_SlowSlotsMutex);
    m_SlowSlotsAvailable++;
Member:

Again, my semaphore from #9990 seems pretty fast. Especially a release could be just an atomic subtraction.
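For comparison, a lock-free variant of such a slot counter could look roughly like the following. This is only a sketch, not the semaphore from #9990, and the names are made up; here release is a single atomic increment of the free count (equivalently, a single subtraction of the in-use count):

#include <atomic>

// Sketch of a lock-free slot counter: release is one atomic operation,
// acquire retries a compare-exchange until it claims a slot or sees zero.
class SlotCounter
{
public:
    explicit SlotCounter(unsigned int slots) : m_Available(slots) {}

    bool TryAcquire()
    {
        auto available = m_Available.load(std::memory_order_relaxed);

        while (available > 0) {
            if (m_Available.compare_exchange_weak(available, available - 1,
                    std::memory_order_acquire, std::memory_order_relaxed)) {
                return true; // claimed a slot
            }
            // on failure, available now holds the current value; retry
        }

        return false; // no slot free: reject instead of waiting
    }

    void Release()
    {
        m_Available.fetch_add(1, std::memory_order_release);
    }

private:
    std::atomic<unsigned int> m_Available;
};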

m_SlowSlotsAvailable--;
lock.unlock();

return std::make_unique<Defer>([this] {
Member:

Seems a good reason to make Defer movable.
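For illustration, a movable scope guard along these lines could look roughly like the following; a generic sketch, not Icinga 2's actual Defer class:

#include <functional>
#include <utility>

// Sketch of a movable deferred action: moving transfers ownership of the
// callback, so only the final owner runs it on destruction.
class MovableDefer
{
public:
    explicit MovableDefer(std::function<void()> fn) : m_Fn(std::move(fn)) {}

    MovableDefer(MovableDefer&& other) noexcept : m_Fn(std::move(other.m_Fn))
    {
        other.m_Fn = nullptr; // the moved-from guard must not fire
    }

    MovableDefer& operator=(MovableDefer&& other) noexcept
    {
        if (this != &other) {
            if (m_Fn) {
                m_Fn(); // run our own pending action before taking over
            }
            m_Fn = std::move(other.m_Fn);
            other.m_Fn = nullptr;
        }
        return *this;
    }

    MovableDefer(const MovableDefer&) = delete;
    MovableDefer& operator=(const MovableDefer&) = delete;

    ~MovableDefer()
    {
        if (m_Fn) {
            m_Fn();
        }
    }

private:
    std::function<void()> m_Fn;
};

With a movable guard, TryAcquireSlowSlot could return it by value (for example wrapped in std::optional) instead of allocating it behind a std::unique_ptr.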

*/
if (!response.TryAcquireSlowSlot()) {
    HttpUtility::SendJsonError(response, request.Params(), 503,
        "Too many requests already in progress, please try again later.");
Member:

HTTP requests are now rejected with a 503 Service Unavailable error if no slot is available. This means that if there is too much load on an Icinga 2 instance from HTTP requests, this shows up as error messages instead of more and more waiting requests accumulating and response times increasing.

This indeed seems clever.

But at least waiting some time before giving up looks clever AND smart to me: https://www.haproxy.com/blog/protect-servers-with-haproxy-connection-limits-and-queues#:~:text=a%20client%20will%20wait%20for%20up%20to%2030%20seconds%20in%20the%20queue,%20after%20which%20HAProxy%20returns%20a%20503%20Service%20Unavailable%20response%20to%20them

I consider even the current behavior (accumulating requests and increased response times) better than the 503 approach. It handles rush hours more smoothly.

While we're on HTTP status codes: This PR lets 1 ApiUser DoS the cluster now; I'd expect a 429 before even running into 503 and/or waiting.
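One possible middle ground between waiting forever and rejecting immediately is a bounded wait before the 503. The following is a rough sketch using a condition variable with a timeout (all names hypothetical); in Icinga 2's coroutine context this would rather be built on ASIO timers (as #9990 does) so a waiting request doesn't block a worker thread:

#include <chrono>
#include <condition_variable>
#include <mutex>

// Sketch: wait a bounded time for a slot before answering 503, so short load
// spikes just queue up briefly while sustained overload still sheds requests.
class BoundedWaitSlots
{
public:
    explicit BoundedWaitSlots(unsigned int slots) : m_Available(slots) {}

    bool AcquireWithin(std::chrono::milliseconds timeout)
    {
        std::unique_lock lock(m_Mutex);

        if (!m_CV.wait_for(lock, timeout, [this] { return m_Available > 0; })) {
            return false; // still no slot after the timeout: reply with 503
        }

        m_Available--;
        return true;
    }

    void Release()
    {
        {
            std::unique_lock lock(m_Mutex);
            m_Available++;
        }
        m_CV.notify_one();
    }

private:
    std::mutex m_Mutex;
    std::condition_variable m_CV;
    unsigned int m_Available;
};

With such a scheme, short load spikes only add a small delay, while sustained overload still results in 503 responses once the timeout expires.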

Member:

💡

At least I'd introduce a new permission making an ApiUser wait as long as necessary, preventing 503.

This is especially useful for critical components which consume the API.

Contributor Author:

This PR lets 1 ApiUser DoS the cluster now

This PR doesn't increase the time an ApiUser can block a slot, so wasn't this the same before? It was just hidden on the server side by waiting instead of retrying on the client side. If you send another request, you get another chance at obtaining that slot, just like when the server waits for you.

I'd expect a 429 before even running into 503 and/or waiting.

That's a whole other issue. Regardless of what we do to CpuBoundWork, we can consider doing some rate-limiting, like per ApiUser. That could avoid some overload situations, but not all (if there are just many different users putting load on the server).
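Purely as an illustration of what per-ApiUser rate-limiting could look like (nothing like this exists in this PR; all names are hypothetical), a simple token bucket per user might be enough to answer with 429 before the global slot limit is even reached:

#include <algorithm>
#include <chrono>
#include <map>
#include <mutex>
#include <string>

// Sketch of a per-user token bucket: each ApiUser may issue up to `burst`
// requests at once and then `ratePerSecond` sustained; excess requests
// would be answered with 429 Too Many Requests.
class PerUserRateLimiter
{
    using Clock = std::chrono::steady_clock;

    struct Bucket
    {
        double Tokens;
        Clock::time_point LastRefill;
    };

public:
    PerUserRateLimiter(double ratePerSecond, double burst)
        : m_Rate(ratePerSecond), m_Burst(burst) {}

    bool Allow(const std::string& apiUser)
    {
        std::lock_guard lock(m_Mutex);

        auto now = Clock::now();
        auto [it, inserted] = m_Buckets.try_emplace(apiUser, Bucket{m_Burst, now});
        auto& bucket = it->second;

        if (!inserted) {
            std::chrono::duration<double> elapsed = now - bucket.LastRefill;
            bucket.Tokens = std::min(m_Burst, bucket.Tokens + elapsed.count() * m_Rate);
            bucket.LastRefill = now;
        }

        if (bucket.Tokens < 1.0) {
            return false; // over the limit: respond with 429
        }

        bucket.Tokens -= 1.0;
        return true;
    }

private:
    std::mutex m_Mutex;
    std::map<std::string, Bucket> m_Buckets;
    double m_Rate;
    double m_Burst;
};

In such a setup, Allow() would be checked right after authentication, before trying to acquire a slow slot.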

@julianbrost (Contributor Author):

Let me first take a step back and add some more context. Overall, I'd say there are three different situations that an Icinga 2 node can be in with regard to CpuBoundWork or a possible replacement:

  1. Low load, in particular low enough that no limiting is necessary and any lock/acquire operation just succeeds on the first attempt.
  2. Some high load spikes, as in there are some peaks but when these are smoothed out a bit, the load can still be handled without problems.
  3. Overload, i.e. Icinga 2 gets constantly sent requests faster than it can handle them.

(1) is trivial and handled well by any approach, even removing CpuBoundWork without a replacement. The other two are where the approaches differ. The approach of waiting handles (2) better, as from the outside you see a bit of delay during the load spikes, but apart from that, things are fine. However, when faced with (3), that approach results in everything waiting longer and longer, until nothing works any more (at some point, HTTP clients will probably run into a timeout before their requests get handled and also our JSON-RPC connections will disconnect with a "no messages received" timeout). In contrast, this PR handles (3) better, in that it sheds load instead of having an ever-growing queue of waiting HTTP requests. Though as it's currently implemented, it has the downside of showing a worse behavior in situation (2) because it will tell clients to go away and try again later immediately, even if waiting for like 10ms would have been enough.

Just admit you don't like my #9990 even though it has shown amazing benchmark results. /s

I don't like spending so much time on optimizing that when I'm not even convinced that what it does is even the right approach.

Precisely, CpuBoundWork was originally made to prevent a full house from freezing the whole I/O (accept(2), SSL, ...). Of course, waiting for a reply gets worse with O(time).

That sounds good in theory, but when viewed in the context of Icinga 2, what is it worth to get a successful TLS handshake if, afterwards, neither your HTTP requests nor your JSON-RPC messages would be processed without a big delay?

I'm not even sure that simply removing CpuBoundWork would be worse in scenario (2). I mean, if it's just a temporary peak, does it make much of a difference whether the limiting is done explicitly using CpuBoundWork or just implicitly by the size of the corresponding thread pool? The situation would look different if there were non-CpuBoundWork tasks that are required for those in CpuBoundWork to continue, but are there any? If not, doesn't that just make the difference between waiting for the TLS handshake to start instead of waiting for the handling of the request to start?

But not even getting a connect(2) or SSL handshake done is the worst. It completely breaks the network stack, possibly including existing connections.

I disagree with that; why is this supposed to be the worst? Just knowing right away that this won't work is better than getting a connection and then having to wait 10 minutes for anything on that connection. And during that time, it takes up resources. Even worse, if that client then gives up, closes the connection and retries, you've wasted resources on that TLS handshake and will do so repeatedly, making everything else even slower.

How does this PR prevent this?

  • It is no longer used during JSON-RPC message handling. The corresponding handlers are rather quick, so they don't block the threads for long.

Actually, they scale with O(input); I've already gotten Icinga to wait 5s+ for a slot due to 1MiB+ plugin output in check results. JSON decoding is included in CpuBoundWork for a good reason.

[2025-09-30 15:15:01 +0000] warning/JsonRpcConnection: Processed JSON-RPC 'event::CheckResult' message for identity '10.27.2.86' (took total 7194ms, waited 7159ms on semaphore).

That message just shows that you already end up waiting for an unhealthy duration, which is a sign that you're pushing more check results through that node than it's able to handle. I don't see how the size of the plugin output and JSON decoding is particularly relevant here; you should get a very similar result with smaller check results at a faster pace.

Additionally, JSON-RPC message handling is essential for an Icinga 2 cluster to work, so making them wait for something makes little sense.

Except where something is another JSON-RPC message. See my arguments above.

I don't understand what that's trying to say.

@Al2Klimov (Member):

  2. Some high load spikes, as in there are some peaks but when these are smoothed out a bit, the load can still be handled without problems.
  3. Overload, i.e. Icinga 2 gets constantly sent requests faster than it can handle them.

as it's currently implemented, it has the downside of showing a worse behavior in situation (2) because it will tell clients to go away and try again later immediately, even if waiting for like 10ms would have been enough.

Yes. Exactly.

You don't even wait those 10ms, though I'd even wait 3s+. Speaking of waiting, that should be pretty easy to build once #9990 is merged, as it uses ASIO timers.

Instead, you unconditionally (alternative: #10572 (comment)) make the API rude in case of (2), which can break unaware clients, just to be safe against (3). But given that the semaphore is acquired for authenticated ApiUsers anyway, the only proper fix for (3) lies in the specific environment's requests/capacity ratio. Only the admin can fix that. If he doesn't, neither of our approaches will provide a good outcome for him.

But we can influence the API behavior in case of (2) in a user-friendly way, which is waiting. If you don't want to wait forever, let's discuss a reasonable threshold.

I don't like spending so much time on optimizing that when I'm not even convinced that what it does is even the right approach.

The time has already been spent, and regarding right/wrong, see the beginning of this comment of mine.

what is it worth to get a successful TLS handshake if afterwards, if neither your HTTP requests nor your JSON-RPC messages would be processed without a big delay?

That depends on whether the big delay implies running into a timeout anyway.

But not even getting a connect(2) or SSL handshake done is the worst. It completely breaks the network stack, possibly including existing connections.

I disagree with that, why is this supposed to be the worst? Just knowing right away that this won't work is better than getting a connection and then having to wait 10 minutes for anything on that connection.

Even if it's a cluster connection?
