
Conversation

@julianbrost (Contributor) commented Sep 29, 2025

The idea of CpuBoundWork was to prevent too many coroutines from performing long-running actions at the same time so that some worker threads are always available for other tasks. However, in practice, once all slots provided by CpuBoundWork were used, this would also block the handling of JSON-RPC messages, effectively bringing down the whole cluster communication.

The core idea of the replacement in this PR is still similar, the main differences are:

  • It is no longer used during JSON-RPC message handling. The corresponding handlers are rather quick, so they don't block the threads for long. Additionally, JSON-RPC message handling is essential for an Icinga 2 cluster to work, so making them wait for something makes little sense.
  • There is no longer an operation to wait for a slot. Instead, HTTP requests are now rejected with a 503 Service Unavailable error if no slot is available. This means that if there is too much load on an Icinga 2 instance from HTTP requests, this shows up as error messages instead of more and more waiting requests accumulating and response times increasing.

This does not limit the total number of running HTTP requests. In particular, those streaming the response (for example using chunked encoding) are no longer counted once they enter the streaming phase. There is a very specific reason for this: otherwise, a slow or malicious client would occupy the slot for the whole time it's reading the response, which could take a while. If that happens on multiple connections, the whole pool could be blocked by clients reading responses very slowly. Notably, this fixes a denial of service problem introduced by #10516 (not yet released), hence the blocker label.
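As an illustration of the mechanism, here is a minimal sketch of such a try-acquire slot pool. The member and method names follow the diff excerpts quoted further down; the surrounding class and the ScopedRelease helper (a stand-in for Icinga 2's Defer) are assumptions for illustration, not the actual implementation:

#include <functional>
#include <memory>
#include <mutex>
#include <utility>

// Stand-in for Icinga 2's Defer: runs the given callback when destroyed.
class ScopedRelease
{
public:
    explicit ScopedRelease(std::function<void()> fn) : m_Fn(std::move(fn)) {}
    ScopedRelease(const ScopedRelease&) = delete;
    ScopedRelease& operator=(const ScopedRelease&) = delete;
    ~ScopedRelease() { m_Fn(); }

private:
    std::function<void()> m_Fn;
};

class SlowSlotPool
{
public:
    // Returns a releaser on success or nullptr if all slots are taken;
    // the caller then answers with 503 instead of waiting for a slot.
    std::unique_ptr<ScopedRelease> TryAcquireSlowSlot()
    {
        std::unique_lock lock(m_SlowSlotsMutex);

        if (m_SlowSlotsAvailable == 0) {
            return nullptr;
        }

        m_SlowSlotsAvailable--;
        lock.unlock();

        return std::make_unique<ScopedRelease>([this] {
            std::unique_lock lock(m_SlowSlotsMutex);
            m_SlowSlotsAvailable++;
        });
    }

private:
    std::mutex m_SlowSlotsMutex;
    unsigned int m_SlowSlotsAvailable = 16; // e.g. one slot per worker thread
};

A handler that streams its response would simply destroy (or reset) the returned releaser once it enters the streaming phase, freeing the slot for other requests.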

Tests

Send enough slow requests at once. Slow requests can be forced, for example, by executing sleep() as a console command inside the running daemon:

cd "$(mktemp -d)" # create a directory for the outputs

for i in {1..17}; do
  curl -Ssku root:icinga https://localhost:5665/v1/console/execute-script --json '{"command":"sleep(3)"}' &> "$i" &
done

# wait for all the background requests to complete
wait

# extra loop + echo for pretty output as the files do not end in a newline
for f in *; do cat "$f"; echo; done

Example output:

{"results":[{"code":200,"result":null,"status":"Executed successfully."}]}
{"results":[{"code":200,"result":null,"status":"Executed successfully."}]}
{"results":[{"code":200,"result":null,"status":"Executed successfully."}]}
{"results":[{"code":200,"result":null,"status":"Executed successfully."}]}
{"results":[{"code":200,"result":null,"status":"Executed successfully."}]}
{"results":[{"code":200,"result":null,"status":"Executed successfully."}]}
{"results":[{"code":200,"result":null,"status":"Executed successfully."}]}
{"error":503,"status":"Too many requests already in progress, please try again later."}
{"results":[{"code":200,"result":null,"status":"Executed successfully."}]}
{"results":[{"code":200,"result":null,"status":"Executed successfully."}]}
{"results":[{"code":200,"result":null,"status":"Executed successfully."}]}
{"results":[{"code":200,"result":null,"status":"Executed successfully."}]}
{"results":[{"code":200,"result":null,"status":"Executed successfully."}]}
{"results":[{"code":200,"result":null,"status":"Executed successfully."}]}
{"results":[{"code":200,"result":null,"status":"Executed successfully."}]}
{"results":[{"code":200,"result":null,"status":"Executed successfully."}]}
{"results":[{"code":200,"result":null,"status":"Executed successfully."}]}

16 is the number of available slots on my machine. However, you might also see more requests failing if other clients send requests at the same time.

The idea of CpuBoundWork was to prevent too many coroutines from performing
long-running actions at the same time so that some worker threads are always
available for other tasks. However, in practice, once all slots provided by
CpuBoundWork were used, this would also block the handling of JSON-RPC
messages, effectively bringing down the whole cluster communication.

A replacement is added in the following commit.

This serves as a replacement for CpuBoundWork removed in the previous commit.
The core idea is still similar, the main differences are:

- It is no longer used during JSON-RPC message handling. The corresponding
  handlers are rather quick, so they don't block the threads for long.
  Additionally, JSON-RPC message handling is essential for an Icinga 2 cluster
  to work, so making them wait for something makes little sense.
- There is no longer an operation to wait for a slot. Instead, HTTP requests
  are now rejected with a 503 Service Unavailable error if no slot is
  available. This means that if there is too much load on an Icinga 2 instance
  from HTTP requests, this shows up as error messages instead of more and more
  waiting requests accumulating and response times increasing.

This commit does not limit the total number of running HTTP requests. In
particular, those streaming the response (for example using chunked encoding)
are no longer counted once they enter the streaming phase. There is a very
specific reason for this: otherwise, a slow or malicious client would block the
slot for the whole time it's reading the response, which could take a while. If
that happens on multiple connections, the whole pool could be blocked by
clients reading responses very slowly.
@julianbrost added the blocker, area/distributed, area/api and core/quality labels on Sep 29, 2025
@cla-bot added the cla/signed label on Sep 29, 2025
@julianbrost marked this pull request as draft on Sep 30, 2025 at 08:51
@julianbrost self-assigned this on Sep 30, 2025
@Al2Klimov (Member) left a comment

Just admit you don't like my #9990 even though it has shown amazing benchmark results. /s

Seriously speaking:

The idea of CpuBoundWork was to prevent too many coroutines from performing long-running actions at the same time so that some worker threads are always available for other tasks. However, in practice, once all slots provided by CpuBoundWork were used, this would also block the handling of JSON-RPC messages, effectively bringing down the whole cluster communication.

Precisely, CpuBoundWork was originally made to prevent a full house from freezing the whole I/O (accept(2), SSL, ...). Of course, waiting for a reply gets worse with O(time). But not even getting a connect(2) or SSL handshake done is the worst. It completely breaks the network stack, possibly including existing connections.

How does this PR prevent this?

  • It is no longer used during JSON-RPC message handling. The corresponding handlers are rather quick, so they don't block the threads for long.

Actually, they scale with O(input); I've already gotten Icinga to wait 5s+ for a slot due to 1MiB+ plugin output in check results. JSON decoding is included in CpuBoundWork for a good reason.

[2025-09-30 15:15:01 +0000] warning/JsonRpcConnection: Processed JSON-RPC 'event::CheckResult' message for identity '10.27.2.86' (took total 7194ms, waited 7159ms on semaphore).

Additionally, JSON-RPC message handling is essential for an Icinga 2 cluster to work, so making them wait for something makes little sense.

Except where something is another JSON-RPC message. See my arguments above.

If you don't want to wait for HTTP, the solution is dead simple: #10046


return std::make_unique<Defer>([this] {
    std::unique_lock lock(m_SlowSlotsMutex);
    m_SlowSlotsAvailable++;
Member:

Again, my semaphore from #9990 seems pretty fast. Especially a release could be just an atomic subtraction.
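For comparison, a lock-free variant of such a slot counter could look roughly like the following. This is only a sketch, not the semaphore from #9990, and the names are made up; here release is a single atomic increment of the free count (equivalently, a single subtraction of the in-use count):

#include <atomic>

// Sketch of a lock-free slot counter: release is one atomic operation,
// acquire retries a compare-exchange until it claims a slot or sees zero.
class SlotCounter
{
public:
    explicit SlotCounter(unsigned int slots) : m_Available(slots) {}

    bool TryAcquire()
    {
        auto available = m_Available.load(std::memory_order_relaxed);

        while (available > 0) {
            if (m_Available.compare_exchange_weak(available, available - 1,
                    std::memory_order_acquire, std::memory_order_relaxed)) {
                return true; // claimed a slot
            }
            // on failure, available now holds the current value; retry
        }

        return false; // no slot free: reject instead of waiting
    }

    void Release()
    {
        m_Available.fetch_add(1, std::memory_order_release);
    }

private:
    std::atomic<unsigned int> m_Available;
};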

m_SlowSlotsAvailable--;
lock.unlock();

return std::make_unique<Defer>([this] {
Member:

Seems a good reason to make Defer movable.
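For illustration, a movable scope guard along these lines could look roughly like the following; a generic sketch, not Icinga 2's actual Defer class:

#include <functional>
#include <utility>

// Sketch of a movable deferred action: moving transfers ownership of the
// callback, so only the final owner runs it on destruction.
class MovableDefer
{
public:
    explicit MovableDefer(std::function<void()> fn) : m_Fn(std::move(fn)) {}

    MovableDefer(MovableDefer&& other) noexcept : m_Fn(std::move(other.m_Fn))
    {
        other.m_Fn = nullptr; // the moved-from guard must not fire
    }

    MovableDefer& operator=(MovableDefer&& other) noexcept
    {
        if (this != &other) {
            if (m_Fn) {
                m_Fn(); // run our own pending action before taking over
            }
            m_Fn = std::move(other.m_Fn);
            other.m_Fn = nullptr;
        }
        return *this;
    }

    MovableDefer(const MovableDefer&) = delete;
    MovableDefer& operator=(const MovableDefer&) = delete;

    ~MovableDefer()
    {
        if (m_Fn) {
            m_Fn();
        }
    }

private:
    std::function<void()> m_Fn;
};

With a movable guard, TryAcquireSlowSlot could return it by value (for example wrapped in std::optional) instead of allocating it behind a std::unique_ptr.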

*/
if (!response.TryAcquireSlowSlot()) {
    HttpUtility::SendJsonError(response, request.Params(), 503,
        "Too many requests already in progress, please try again later.");
Member:

HTTP requests are now rejected with a 503 Service Unavailable error if no slot is available. This means that if there is too much load on an Icinga 2 instance from HTTP requests, this shows up as error messages instead of more and more waiting requests accumulating and response times increasing.

This indeed seems clever.

But at least waiting some time before giving up looks clever AND smart to me: https://www.haproxy.com/blog/protect-servers-with-haproxy-connection-limits-and-queues#:~:text=a%20client%20will%20wait%20for%20up%20to%2030%20seconds%20in%20the%20queue,%20after%20which%20HAProxy%20returns%20a%20503%20Service%20Unavailable%20response%20to%20them

I consider even the current behavior (accumulating requests and increased response times) better than the 503 approach. It handles rush hours more smoothly.

While we're on HTTP status codes: This PR lets 1 ApiUser DoS the cluster now; I'd expect a 429 before even running into 503 and/or waiting.
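One possible middle ground between waiting forever and rejecting immediately is a bounded wait before the 503. The following is a rough sketch using a condition variable with a timeout (all names hypothetical); in Icinga 2's coroutine context this would rather be built on ASIO timers (as #9990 does) so a waiting request doesn't block a worker thread:

#include <chrono>
#include <condition_variable>
#include <mutex>

// Sketch: wait a bounded time for a slot before answering 503, so short load
// spikes just queue up briefly while sustained overload still sheds requests.
class BoundedWaitSlots
{
public:
    explicit BoundedWaitSlots(unsigned int slots) : m_Available(slots) {}

    bool AcquireWithin(std::chrono::milliseconds timeout)
    {
        std::unique_lock lock(m_Mutex);

        if (!m_CV.wait_for(lock, timeout, [this] { return m_Available > 0; })) {
            return false; // still no slot after the timeout: reply with 503
        }

        m_Available--;
        return true;
    }

    void Release()
    {
        {
            std::unique_lock lock(m_Mutex);
            m_Available++;
        }
        m_CV.notify_one();
    }

private:
    std::mutex m_Mutex;
    std::condition_variable m_CV;
    unsigned int m_Available;
};

With such a scheme, short load spikes only add a small delay, while sustained overload still results in 503 responses once the timeout expires.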

Member:

💡

At least I'd introduce a new permission making an ApiUser wait as long as necessary, preventing 503.

This is especially useful for critical components which consume the API.

Contributor Author:

This PR lets 1 ApiUser DoS the cluster now

This PR doesn't increase the time an ApiUser can block a slot, so wasn't this the same before? It was just hidden on the server side by waiting instead of retrying on the client side. If you send another request, you get another chance at obtaining that slot, just like when the server waits for you.

I'd expect a 429 before even running into 503 and/or waiting.

That's a whole other issue. Regardless of what we do to CpuBoundWork, we can consider doing some rate-limiting, like per ApiUser. That could avoid some overload situations, but not all (if there are just many different users putting load on the server).
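Purely as an illustration of what per-ApiUser rate-limiting could look like (nothing like this exists in this PR; all names are hypothetical), a simple token bucket per user might be enough to answer with 429 before the global slot limit is even reached:

#include <algorithm>
#include <chrono>
#include <map>
#include <mutex>
#include <string>

// Sketch of a per-user token bucket: each ApiUser may issue up to `burst`
// requests at once and then `ratePerSecond` sustained; excess requests
// would be answered with 429 Too Many Requests.
class PerUserRateLimiter
{
    using Clock = std::chrono::steady_clock;

    struct Bucket
    {
        double Tokens;
        Clock::time_point LastRefill;
    };

public:
    PerUserRateLimiter(double ratePerSecond, double burst)
        : m_Rate(ratePerSecond), m_Burst(burst) {}

    bool Allow(const std::string& apiUser)
    {
        std::lock_guard lock(m_Mutex);

        auto now = Clock::now();
        auto [it, inserted] = m_Buckets.try_emplace(apiUser, Bucket{m_Burst, now});
        auto& bucket = it->second;

        if (!inserted) {
            std::chrono::duration<double> elapsed = now - bucket.LastRefill;
            bucket.Tokens = std::min(m_Burst, bucket.Tokens + elapsed.count() * m_Rate);
            bucket.LastRefill = now;
        }

        if (bucket.Tokens < 1.0) {
            return false; // over the limit: respond with 429
        }

        bucket.Tokens -= 1.0;
        return true;
    }

private:
    std::mutex m_Mutex;
    std::map<std::string, Bucket> m_Buckets;
    double m_Rate;
    double m_Burst;
};

In such a setup, Allow() would be checked right after authentication, before trying to acquire a slow slot.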

@julianbrost (Contributor Author):

Let me first take a step back and add some more context. Overall, I'd say there are three different situations that an Icinga 2 node can be in with regard to CpuBoundWork or a possible replacement:

  1. Low load, in particular low enough that no limiting is necessary and any lock/acquire operation just succeeds on the first attempt.
  2. Some high load spikes, as in there are some peaks but when these are smoothed out a bit, the load can still be handled without problems.
  3. Overload, i.e. Icinga 2 gets constantly sent requests faster than it can handle them.

(1) is trivial and handled well by any approach, even removing CpuBoundWork without a replacement. The other two are where the approaches differ. The approach of waiting handles (2) better, as from the outside you see a bit of delay during the load spikes, but apart from that, things are fine. However, when faced with (3), that approach results in everything waiting longer and longer, until nothing works any more (at some point, HTTP clients will probably run into a timeout before their requests get handled and also our JSON-RPC connections will disconnect with a "no messages received" timeout). In contrast, this PR handles (3) better, in that it sheds load instead of having an ever-growing queue of waiting HTTP requests. Though as it's currently implemented, it has the downside of showing a worse behavior in situation (2) because it will tell clients to go away and try again later immediately, even if waiting for like 10ms would have been enough.

Just admit you don't like my #9990 even though it has shown amazing benchmark results. /s

I don't like spending so much time on optimizing that when I'm not even convinced that what it does is even the right approach.

Precisely, CpuBoundWork was originally made to prevent a full house from freezing the whole I/O (accept(2), SSL, ...). Of course, waiting for a reply gets worse with O(time).

That sounds good in theory, but when viewed in the context of Icinga 2, what is it worth to get a successful TLS handshake if, afterwards, neither your HTTP requests nor your JSON-RPC messages would be processed without a big delay?

I'm not even sure that simply removing CpuBoundWork would be worse in scenario (2). I mean, if it's just a temporary peak, does it make much of a difference whether the limiting is done explicitly using CpuBoundWork or just implicitly by the size of the corresponding thread pool? The situation would look different if there were non-CpuBoundWork tasks that are required for those in CpuBoundWork to continue, but are there any? If not, doesn't that just make the difference between waiting for the TLS handshake to start instead of waiting for the handling of the request to start?

But not even getting a connect(2) or SSL handshake done is the worst. It completely breaks the network stack, possibly including existing connections.

I disagree with that; why is this supposed to be the worst? Just knowing right away that this won't work is better than getting a connection and then having to wait 10 minutes for anything on that connection. And during that time, it takes up resources. Even worse, if that client then gives up, closes the connection and retries, you've wasted resources on that TLS handshake and will do so repeatedly, making everything else even slower.

How does this PR prevent this?

  • It is no longer used during JSON-RPC message handling. The corresponding handlers are rather quick, so they don't block the threads for long.

Actually, they scale with O(input); I've already gotten Icinga to wait 5s+ for a slot due to 1MiB+ plugin output in check results. JSON decoding is included in CpuBoundWork for a good reason.

[2025-09-30 15:15:01 +0000] warning/JsonRpcConnection: Processed JSON-RPC 'event::CheckResult' message for identity '10.27.2.86' (took total 7194ms, waited 7159ms on semaphore).

That message just shows that you already end up waiting for an unhealthy duration, which is a sign that you're pushing more check results through that node than it's able to handle. I don't see how the size of the plugin output and JSON decoding is particularly relevant here; you should get a very similar result with smaller check results at a faster pace.

Additionally, JSON-RPC message handling is essential for an Icinga 2 cluster to work, so making them wait for something makes little sense.

Except where something is another JSON-RPC message. See my arguments above.

I don't understand what that's trying to say.

@Al2Klimov (Member):

  2. Some high load spikes, as in there are some peaks but when these are smoothed out a bit, the load can still be handled without problems.
  3. Overload, i.e. Icinga 2 gets constantly sent requests faster than it can handle them.

as it's currently implemented, it has the downside of showing a worse behavior in situation (2) because it will tell clients to go away and try again later immediately, even if waiting for like 10ms would have been enough.

Yes. Exactly.

You don't even wait those 10ms, though I'd even wait 3s+. Speaking of waiting, that should be pretty easy to build once #9990 is merged, as it uses ASIO timers.

Instead, you unconditionally (alternative: #10572 (comment)) make the API rude in case of (2), which can break unaware clients, just to be safe against (3). But given that the semaphore is acquired for authenticated ApiUsers anyway, the only proper fix for (3) lies in the specific environment's requests/capacity ratio. Only the admin can fix that. If he doesn't, neither of our approaches will provide a good outcome for him.

But we can influence the API behavior in case of (2) in a user-friendly way, which is waiting. If you don't want to wait forever, let's discuss a reasonable threshold.

I don't like spending so much time on optimizing that when I'm not even convinced that what it does is even the right approach.

The time has already been spent, and regarding right/wrong, see the beginning of this comment of mine.

what is it worth to get a successful TLS handshake if afterwards, if neither your HTTP requests nor your JSON-RPC messages would be processed without a big delay?

That depends on whether the big delay implies running into a timeout anyway.

But not even getting a connect(2) or SSL handshake done is the worst. It completely breaks the network stack, possibly including existing connections.

I disagree with that, why is this supposed to be the worst? Just knowing right away that this won't work is better than getting a connection and then having to wait 10 minutes for anything on that connection.

Even if it's a cluster connection?
