Replace CpuBoundWork
#10572
Conversation
The idea of CpuBoundWork was to prevent too many coroutines from performing long-running actions at the same time so that some worker threads are always available for other tasks. However, in practice, once all slots provided by CpuBoundWork were used, this would also block the handling of JSON-RPC messages, effectively bringing down the whole cluster communication. A replacement is added in the following commit.
This serves as a replacement for CpuBoundWork, removed in the previous commit. The core idea is still similar; the main differences are:
- It is no longer used during JSON-RPC message handling. The corresponding handlers are rather quick, so they don't block the threads for long. Additionally, JSON-RPC message handling is essential for an Icinga 2 cluster to work, so making it wait for something makes little sense.
- There is no longer an operation to wait for a slot. Instead, HTTP requests are now rejected with a 503 Service Unavailable error if no slot is available. This means that if there is too much load on an Icinga 2 instance from HTTP requests, it shows up as error messages instead of as more and more waiting requests accumulating and response times increasing.

This commit does not limit the total number of running HTTP requests. In particular, requests that stream the response (for example using chunked encoding) are no longer counted once they enter the streaming phase. There is a very specific reason for this: otherwise, a slow or malicious client would block the slot for the whole time it is reading the response, which could take a while. If that happens on multiple connections, the whole pool could be blocked by clients reading responses very slowly.
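As a rough illustration of the mechanism (not the actual code from this PR; the member names are merely modeled on the diff shown further down), this boils down to a non-blocking try-acquire on a counter plus a matching release:

```cpp
// Minimal sketch of a non-blocking slot counter; names are assumptions.
#include <mutex>

class SlowSlots
{
public:
    explicit SlowSlots(unsigned available) : m_SlowSlotsAvailable(available) {}

    // Returns true if a slot was taken. The caller must call Release() later,
    // e.g. via a Defer, or give the slot up when switching to response streaming.
    bool TryAcquire()
    {
        std::unique_lock lock(m_SlowSlotsMutex);

        if (m_SlowSlotsAvailable == 0) {
            return false; // caller responds with 503 Service Unavailable
        }

        m_SlowSlotsAvailable--;
        return true;
    }

    void Release()
    {
        std::unique_lock lock(m_SlowSlotsMutex);
        m_SlowSlotsAvailable++;
    }

private:
    std::mutex m_SlowSlotsMutex;
    unsigned m_SlowSlotsAvailable;
};
```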
Just admit you don't like my #9990 even though it has shown amazing benchmark results. /s
Seriously speaking:
The idea of CpuBoundWork was to prevent too many coroutines from performing long-running actions at the same time so that some worker threads are always available for other tasks. However, in practice, once all slots provided by CpuBoundWork were used, this would also block the handling of JSON-RPC messages, effectively bringing down the whole cluster communication.
Precisely, CpuBoundWork was originally made to prevent a full house from freezing the whole I/O (accept(2), SSL, ...). Of course, waiting for a reply gets worse with O(time). But not even getting a connect(2) or SSL handshake done is the worst. It completely breaks the network stack, possibly including existing connections.
How does this PR prevent this?
- It is no longer used during JSON-RPC message handling. The corresponding handlers are rather quick, so they don't block the threads for long.
Actually, they scale with O(input); I've already got Icinga to wait 5s+ for a slot due to 1MiB+ plugin output in check results. JSON decoding is included in CpuBoundWork for a good reason.
[2025-09-30 15:15:01 +0000] warning/JsonRpcConnection: Processed JSON-RPC 'event::CheckResult' message for identity '10.27.2.86' (took total 7194ms, waited 7159ms on semaphore).
Additionally, JSON-RPC message handling is essential for an Icinga 2 cluster to work, so making them wait for something makes little sense.
Except where something is another JSON-RPC message. See my arguments above.
If you don't want to wait for HTTP, the solution is dead simple: #10046
```cpp
return std::make_unique<Defer>([this] {
    std::unique_lock lock(m_SlowSlotsMutex);
    m_SlowSlotsAvailable++;
```
Again, my semaphore from #9990 seems pretty fast. Especially a release could be just an atomic subtraction.
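For comparison, a lock-free variant in which releasing really is a single atomic operation could look roughly like this (an illustrative sketch, not the #9990 implementation):

```cpp
// Illustrative lock-free slot counter: acquire is a CAS loop on the number of
// free slots, release is a single atomic increment. Not taken from #9990.
#include <atomic>

class AtomicSlots
{
public:
    explicit AtomicSlots(int available) : m_Available(available) {}

    bool TryAcquire()
    {
        int avail = m_Available.load(std::memory_order_relaxed);

        while (avail > 0) {
            if (m_Available.compare_exchange_weak(avail, avail - 1, std::memory_order_acquire)) {
                return true; // got a slot
            }
            // compare_exchange_weak reloaded avail on failure, just retry
        }

        return false; // no free slot
    }

    void Release()
    {
        m_Available.fetch_add(1, std::memory_order_release); // one atomic op, no mutex
    }

private:
    std::atomic<int> m_Available;
};
```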
```cpp
m_SlowSlotsAvailable--;
lock.unlock();
```
```cpp
return std::make_unique<Defer>([this] {
```
Seems a good reason to make Defer movable.
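A movable Defer could be sketched like this, assuming the class essentially wraps a callback that runs in the destructor (the real class may differ, e.g. it also offers cancellation):

```cpp
// Sketch of a movable Defer: moving transfers the callback and disarms the source.
#include <functional>
#include <utility>

class Defer
{
public:
    explicit Defer(std::function<void()> func) : m_Func(std::move(func)) {}

    Defer(Defer&& other) noexcept : m_Func(std::move(other.m_Func))
    {
        other.m_Func = nullptr; // the moved-from Defer must not fire
    }

    Defer& operator=(Defer&& other) noexcept
    {
        if (this != &other) {
            if (m_Func) {
                m_Func(); // run our own callback before taking over the other one
            }

            m_Func = std::move(other.m_Func);
            other.m_Func = nullptr;
        }

        return *this;
    }

    Defer(const Defer&) = delete;
    Defer& operator=(const Defer&) = delete;

    ~Defer()
    {
        if (m_Func) {
            m_Func();
        }
    }

private:
    std::function<void()> m_Func;
};
```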
```cpp
*/
if (!response.TryAcquireSlowSlot()) {
    HttpUtility::SendJsonError(response, request.Params(), 503,
        "Too many requests already in progress, please try again later.");
```
HTTP requests are now rejected with a 503 Service Unavailable error if there's no slot available. This means that if there is too much load on an Icinga 2 instance from HTTP requests, this shows in error messages instead of more and more waiting requests accumulating and increased response times.
This indeed seems clever:
- https://stackoverflow.com/questions/42957767/is-there-a-better-way-limit-requests-at-the-door#:~:text=A%20better%20way%20to%20do,telling%20it%20to%20back%20off
- https://stackoverflow.com/questions/72897199/uvicorn-backlog-vs-limit-concurrency#:~:text=limit-concurrency%20is%20just%20here%20to%20tell,%20after%20x%20responses%20issue%20a%20503
- https://docs.nginx.com/nginx/admin-guide/security-controls/controlling-access-proxied-http/#:~:text=For%20requests%20that%20arrive%20at%20the%20full%20bucket,%20NGINX%20will%20respond%20with%20the%20503%20Service%20Unavailable
But at least waiting some time before giving up looks clever AND smart to me: https://www.haproxy.com/blog/protect-servers-with-haproxy-connection-limits-and-queues#:~:text=a%20client%20will%20wait%20for%20up%20to%2030%20seconds%20in%20the%20queue,%20after%20which%20HAProxy%20returns%20a%20503%20Service%20Unavailable%20response%20to%20them
I consider even the current behavior (accumulating requests and increased response times) better than the 503 approach. It handles rush hours more smoothly.
While we're on HTTP status codes: this PR now lets a single ApiUser DoS the cluster. I'd expect a 429 before even running into a 503 and/or waiting.
💡
At least I'd introduce a new permission making an ApiUser wait as long as necessary, preventing 503.
This is especially useful for critical components which consume the API.
This PR lets 1 ApiUser DoS the cluster now
This PR doesn't increase the time an ApiUser can block a slot, so wasn't this the same before? It was just hidden on the server side by waiting instead of by retrying on the client side. If you send another request, you get another chance at obtaining that slot, just like you do when the server waits for you.
I'd expect a 429 before even running into 503 and/or waiting.
That's a whole other issue. Regardless of what we do to CpuBoundWork, we can consider doing some rate-limiting, like per ApiUser. That could avoid some overload situations, but not all (if there are just many different users putting load on the server).
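Per-ApiUser rate limiting is not part of this PR, but as a sketch of what it could look like: a simple token bucket keyed by user name that a handler consults before doing any work, answering 429 when the bucket is empty (all names here are hypothetical):

```cpp
// Hypothetical per-user token bucket; a handler would send 429 Too Many Requests
// whenever Allow() returns false. Not part of this PR.
#include <algorithm>
#include <chrono>
#include <map>
#include <mutex>
#include <string>

class PerUserRateLimiter
{
public:
    PerUserRateLimiter(double ratePerSecond, double burst)
        : m_Rate(ratePerSecond), m_Burst(burst) {}

    bool Allow(const std::string& user)
    {
        using namespace std::chrono;
        auto now = steady_clock::now();

        std::unique_lock lock(m_Mutex);
        auto& bucket = m_Buckets[user];

        if (bucket.LastRefill == steady_clock::time_point{}) {
            bucket.Tokens = m_Burst; // first request of this user, start with a full bucket
        } else {
            double elapsed = duration<double>(now - bucket.LastRefill).count();
            bucket.Tokens = std::min(m_Burst, bucket.Tokens + elapsed * m_Rate);
        }

        bucket.LastRefill = now;

        if (bucket.Tokens < 1.0) {
            return false; // rate exceeded
        }

        bucket.Tokens -= 1.0;
        return true;
    }

private:
    struct Bucket
    {
        double Tokens = 0;
        std::chrono::steady_clock::time_point LastRefill;
    };

    double m_Rate, m_Burst;
    std::mutex m_Mutex;
    std::map<std::string, Bucket> m_Buckets;
};
```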
Let me first take a step back and add some more context. Overall, I'd say there are three different situations that the Icinga 2 node can be regarding
(1) is trivial and handled well by any approach, even removing CpuBoundWork.
I don't like spending so much time on optimizing that when I'm not even convinced that what it does is even the right approach.
That sounds good in theory, but when viewed in the context of Icinga 2, what is a successful TLS handshake worth if afterwards neither your HTTP requests nor your JSON-RPC messages would be processed without a big delay? I'm not even sure if simply removing
I disagree with that, why is this supposed to be the worst? Just knowing right away that this won't work is better than getting a connection and then having to wait 10 minutes for anything on that connection, and during that time, taking up resources. Even worse, if that client then gives up, closes the connection and retries, you've wasted resources on that TLS handshake and will do so repeatedly, making everything else even slower.
That message just shows that you end up waiting for an unhealthy duration already, which is a sign that you're just pushing more check results through that node than it's able to handle. I don't see how the size of the plugin output and JSON decoding are particularly relevant here; you should get a very similar result with smaller check results at a faster pace.
I don't understand what that's trying to say.
Yes. Exactly. You don't even wait those 10ms, though I'd even wait 3s+. Speaking about waiting, that should be pretty easy to build once #9990 is merged, as it uses ASIO timers.

Instead, you unconditionally (alternative: #10572 (comment)) make the API rude in case of (2), which can break unaware clients, just to be safe against (3). But, given the semaphore is acquired for authn-ed ApiUsers anyway, the only proper fix for (3) lies in the specific environment's requests/capacity ratio. Only the admin can fix that. If he doesn't, neither of our approaches will provide a good outcome for him.

But we can influence the API behavior in case of (2) in a user-friendly way, which is waiting. If you don't want to wait forever, let's discuss a reasonable threshold.
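For illustration, here is a minimal sketch of the bounded-wait idea with Boost.Asio: keep trying to grab a slot, suspending only the coroutine on a steady_timer between attempts, and fall back to 503 once a configurable threshold has passed. The names (tryAcquire, the 50ms poll interval) are assumptions for the sketch; this is not how #9990 or this PR implement it, and a real semaphore would wake waiters directly instead of polling.

```cpp
// Hedged sketch: bounded wait for a slot using an ASIO steady_timer inside a
// coroutine, giving up only after a deadline. Polling is used for simplicity.
#include <boost/asio/io_context.hpp>
#include <boost/asio/spawn.hpp>
#include <boost/asio/steady_timer.hpp>
#include <chrono>
#include <functional>

bool AcquireSlotWithin(boost::asio::io_context& io, boost::asio::yield_context yc,
    const std::function<bool()>& tryAcquire, std::chrono::seconds threshold)
{
    using std::chrono::steady_clock;

    auto deadline = steady_clock::now() + threshold;
    boost::asio::steady_timer timer(io);

    while (!tryAcquire()) {
        if (steady_clock::now() >= deadline) {
            return false; // caller answers with 503 Service Unavailable
        }

        // Suspend only this coroutine, not the worker thread.
        timer.expires_after(std::chrono::milliseconds(50));
        timer.async_wait(yc);
    }

    return true;
}
```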
The time has already been spent, and regarding right/wrong, see the beginning of this comment of mine.
That depends on whether a big delay implies running into a timeout anyway.
Even if it's a cluster connection?
The idea of CpuBoundWork was to prevent too many coroutines from performing long-running actions at the same time so that some worker threads are always available for other tasks. However, in practice, once all slots provided by CpuBoundWork were used, this would also block the handling of JSON-RPC messages, effectively bringing down the whole cluster communication.

The core idea of the replacement in this PR is still similar; the main differences are:
- It is no longer used during JSON-RPC message handling. The corresponding handlers are rather quick, so they don't block the threads for long. Additionally, JSON-RPC message handling is essential for an Icinga 2 cluster to work, so making it wait for something makes little sense.
- There is no longer an operation to wait for a slot. Instead, HTTP requests are now rejected with a 503 Service Unavailable error if no slot is available.

This does not limit the total number of running HTTP requests. In particular, requests that stream the response (for example using chunked encoding) are no longer counted once they enter the streaming phase. There is a very specific reason for this: otherwise, a slow or malicious client would block the slot for the whole time it is reading the response, which could take a while. If that happens on multiple connections, the whole pool could be blocked by clients reading responses very slowly. In particular, this fixes a denial of service problem introduced by #10516 (not yet released), hence the blocker label.
Tests
Send enough slow requests at once. Slow requests can be forced, for example, by executing sleep() as a console command inside the running daemon. Example output:
16 is the number of available slots on my machine. However, you might also see more requests failing if other clients send requests at the same time.