
Conversation

jschmidt-icinga
Contributor

@jschmidt-icinga jschmidt-icinga commented Jul 23, 2025

Fixes #10409.
Fixes #10142, since many of the additional parameters HttpHandlers needed before are no longer necessary.

Description

This PR mostly does four things:

  • Simplify the HttpHandler signature by encapsulating the additional parameters and interfaces in classes that extend the boost::beast request and response classes.
  • Add generic interface and connection handling for HttpHandlers to stream responses via chunked encoding.
  • Refactor the ObjectQueryHandler and EventsHandler to make use of these streaming capabilities.
  • Add unit-tests for HttpServerConnection and the new HttpResponse and HttpRequest classes.

I will explain the implementation and the considerations I've made in detail below.

HttpRequest and HttpResponse

The rationale for adding these classes is that they serve as an abstraction between HttpServerConnection, which no longer has to offer an interface to the handlers directly (i.e. via a server parameter passed down to the handlers and StartStreaming() being called by them), and the HttpHandlers, which no longer need to concern themselves with connection handling.

For the response class I've implemented a custom body type SerializableBody with a writer that the serializer uses to write the added content to the connection, adding chunked encoding and tracking progress as needed. A previous iteration of this PR also implemented the BodyReader, but that added unnecessary complexity and was removed.

This wasn't easily achievable with any of the regular beast body types, because we needed both the functionality of the buffer_body, specifically the ability to interrupt serialization once the buffer content has been fully written to the connection and pick up again on the next call to Flush() when more data is available, and that of the dynamic_body, which can allocate additional memory as needed.
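To illustrate the idea, here is a minimal sketch of such a body type in plain Boost.Beast (not the PR's actual SerializableBody; names and details are simplified): a writer that hands out whatever is currently buffered, pauses serialization with need_buffer while the handler is still producing data, and finishes once no more data will follow.

#include <boost/beast/core.hpp>
#include <boost/beast/core/multi_buffer.hpp>
#include <boost/beast/http.hpp>
#include <boost/optional.hpp>
#include <cstddef>
#include <cstdint>
#include <utility>

// Sketch only: combines the "pause and resume" behavior of buffer_body with the
// growable storage of dynamic_body, as described above.
struct StreamableBody
{
	struct value_type
	{
		boost::beast::multi_buffer buffer; // grows on demand, like dynamic_body
		bool more = false;                 // true while the handler may still append data
	};

	static std::uint64_t size(const value_type& body) { return body.buffer.size(); }

	class writer
	{
	public:
		using const_buffers_type = boost::beast::multi_buffer::const_buffers_type;

		template<bool isRequest, class Fields>
		writer(const boost::beast::http::header<isRequest, Fields>&, value_type& body) : m_Body(body) {}

		void init(boost::beast::error_code& ec) { ec = {}; }

		boost::optional<std::pair<const_buffers_type, bool>> get(boost::beast::error_code& ec)
		{
			// Everything returned by the previous call has been written by now.
			m_Body.buffer.consume(m_Pending);
			m_Pending = m_Body.buffer.size();

			if (m_Pending == 0) {
				if (m_Body.more) {
					// Nothing buffered yet, but the handler isn't done: pause the
					// serializer; it resumes once more data has been appended.
					ec = boost::beast::http::error::need_buffer;
				} else {
					ec = {};
				}
				return boost::none;
			}

			ec = {};
			// Hand the buffered bytes to the serializer; the bool tells it whether
			// more data may follow (i.e. whether the body is finished).
			return std::make_pair(m_Body.buffer.data(), m_Body.more);
		}

	private:
		value_type& m_Body;
		std::size_t m_Pending = 0; // bytes handed out by the previous get() call
	};
};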

Streaming and non-streaming responses

To start a streaming response, the handler first has to call response.StartStreaming(), which enables chunked encoding and allows the response header to be sent on the next response.Flush(yc) without setting content_length.

In the case of a non-streaming response, the handler can use the HttpUtility::SendJson(Body|Error) convenience functions, same as before, or write the body as usual and simply return. The handler no longer has to set content_length; that is now done automatically in the first call to response.Flush(yc), which by default happens once after the handler returns to ProcessRequest() in httpserverconnection.cpp.

Due to the new custom body it is no longer possible to use string operators on the body (that could have been implemented, but would have been more work); instead, an output stream operator is added, which makes the interface at least as convenient. There is, however, a bug in beast that makes this slightly less efficient than it could be (though still more efficient than string +=), see the comment in here. I intend to write a minimal reproducer and report this bug to the beast GitHub repo.
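As a rough usage sketch (the handler signature and the data source are illustrative only, not the actual HttpHandler interface), the two modes look roughly like this:

// Streaming: enable chunked encoding, then alternate between appending to the
// body and flushing; the header goes out together with the first Flush(yc).
void ExampleStreamingHandler(HttpRequest& request, HttpResponse& response, boost::asio::yield_context yc)
{
	response.result(boost::beast::http::status::ok);
	response.StartStreaming();

	for (const auto& chunk : ProduceChunks()) { // ProduceChunks() is a hypothetical data source
		response.body() << chunk;
		response.Flush(yc); // sends the buffered data as one or more HTTP chunks
	}
	// returning finishes the chunked body
}

// Non-streaming: just fill the body and return; content_length is set automatically
// on the single Flush() that ProcessRequest() performs after the handler returns.
void ExampleSimpleHandler(HttpRequest& request, HttpResponse& response, boost::asio::yield_context yc)
{
	response.result(boost::beast::http::status::ok);
	response.body() << R"({"results": []})";
}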

Refactoring ObjectQueryHandler and EventsHandler

Events were streamed before, but without chunked encoding and only by setting up the server connection directly via StartStreaming(). Now proper HTTP/1.1-compliant chunked encoding is used and no special treatment by the server connection is necessary.

In ObjectQueryHandler we now make use of the new ability to pass a generator function to the JsonEncoder, added in #10414. The for-loop iterating over the config objects becomes a generator function that feeds the serialized objects into the JsonEncoder one by one, flushing the response occasionally.
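The pattern looks roughly like the following sketch (names and the generator signature are hypothetical, not the actual JsonEncoder API from #10414): the encoder pulls one element at a time instead of receiving a fully materialized array, so only one serialized object needs to be held in memory at once.

#include <functional>
#include <optional>
#include <string>
#include <vector>

// A generator as a callable that returns the next pre-serialized object,
// or nullopt once the sequence is exhausted.
using JsonGenerator = std::function<std::optional<std::string>()>;

JsonGenerator MakeObjectGenerator(const std::vector<std::string>& serializedObjects)
{
	auto it = serializedObjects.begin();
	auto end = serializedObjects.end();

	return [it, end]() mutable -> std::optional<std::string> {
		if (it == end)
			return std::nullopt; // tells the encoder the array is complete
		return *it++;            // hand out the next object
	};
}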

The same could easily be applied to other handlers like ModifyObjectHandler and DeleteObjectHandler, but since these don't have as large a memory footprint, whether that is necessary can be determined in a future PR.

HttpServerConnection::DetectClientShutdown

This new coroutine was added to HttpServerConnection to detect shutdowns initiated by the HTTP client while we're streaming a response. Before this PR, we'd detect shutdowns either when writes failed or when the next request was being read. This had the disadvantage that during long-running responses there were situations where it was impossible for the client to complete a graceful SSL shutdown within a reasonable time frame. Even with the new streaming capabilities added by this PR, waiting until writes failed would lead to an "application data after close notify" error on the client's side and an exception being logged on the server's side.

In my opinion, a general approach as part of HttpServerConnection's connection handling made the most sense, and this coroutine was the simplest way to do it. There might have been ways to let handlers enable or disable this functionality, but that would have been more code for essentially no benefit, as the coroutine does no harm in any of the other use cases.

The coroutine basically initiates an async_fill() whenever reading a request is done:

and for as long as the connection isn't in the process of (or done) disconnecting:

while (!m_ShuttingDown) {
	m_CanRead.WaitForClear(yc);

Initially I intended to use boost::asio's async_wait function with wait_type::wait_read to avoid even as much as an async_fill in case a client didn't initiate a shutdown. However, likely due to differences in the respective libcs, this would not work reliably on the Windows and Alpine builds. So instead I chose the following approach:

boost::system::error_code ec;
m_Stream->async_fill(yc[ec]);
if (ec && !m_ShuttingDown) {
	Log(LogInformation, "HttpServerConnection")
		<< "Detected shutdown from client: " << m_PeerAddress << ".";
	Disconnect(yc);
	break;
}

This tries to fill the stream's buffer with as little as a single byte from the connection, which either yields until one or more bytes are readable, or returns an error code when the connection is closed for whatever reason. In any case where an error code is returned, we will want to attempt a graceful shutdown, but mostly this will be responding to a shutdown the client has initiated.

In case no error is returned and the connection is being reused for another request, the first few bytes of the request will be read into the stream's buffer and an AsioDualEvent flag will be set to allow the regular continuation of HttpServerConnection::ProcessMessage() to read the next request. This "locking" is needed to avoid two async_fills (the one here and the one in http::async_read) being in progress at the same time, which causes exceptions/errors under certain race conditions.
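Putting the quoted pieces together, the coroutine has roughly the following shape (a simplified reconstruction, not the PR's exact code; the AsioDualEvent signalling after a successful fill is only indicated as a comment):

void DetectClientShutdown(boost::asio::yield_context yc)
{
	while (!m_ShuttingDown) {
		// Wait until the current request has been read completely, so that this
		// coroutine and ProcessMessage() never run an async_fill() concurrently.
		m_CanRead.WaitForClear(yc);

		boost::system::error_code ec;
		m_Stream->async_fill(yc[ec]); // yields until >= 1 byte is readable or the connection failed

		if (ec && !m_ShuttingDown) {
			// Most likely the client started a shutdown while we were idle or streaming;
			// answer with a graceful disconnect instead of waiting for a failing write.
			Log(LogInformation, "HttpServerConnection")
				<< "Detected shutdown from client: " << m_PeerAddress << ".";
			Disconnect(yc);
			break;
		}

		// Otherwise the first bytes of the next request now sit in the stream's buffer:
		// set the AsioDualEvent flag so ProcessMessage() may call http::async_read() again.
	}
}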

Unit Testing

The largest part of the new lines in this PR comes from the addition of unit tests for HttpServerConnection and the new HTTP message classes. To get this working I had to add a number of testing fixtures for setting up certificates, TLS connections, and log message pattern matching. These were all added in a generally useful way, so it's possible to reuse them for other medium-complexity unit tests later on. After this PR is merged, I intend to use these fixtures to add testing for ApiListener, JsonRpcConnection, and some individual HttpHandlers.

Generating the certificates is relatively time consuming and takes over 1.5 s on my system, so we wouldn't want to do it once per test. That's why I've added cmake-test fixtures (not to be confused with boost-test fixtures), which are implemented as "test cases" themselves that set up the certificates in a temporary directory and declare a dependency for all tests that use them. CTest then schedules the setup fixture test to run before, and the cleanup fixture test to run after, all dependent tests. This looks a bit verbose inside CMakeLists.txt, which stems from our add_boost_test() cmake function being old and limited legacy code. Maybe this can be revisited in the future to make it more elegant, e.g. automated discovery of test targets similar to doctest's cmake integration. Anyway, after the initial setup of the certificates, each test case takes about 80-100 ms on my system, with the exception of remote_httpserverconnection/liveness_disconnect, which has to wait 10 s for the disconnect to be triggered in HttpServerConnection::CheckLiveness().

For HttpServerConnection I've focused on testing connection handling with a simple test handler, so the tests don't rely on or test any specific behavior of the regular handlers. I've verified that all tests (including client_shutdown using HttpServerConnection::StartStreaming in the handler) also run on the master branch, with some modifications to the UnitTestHandler and some of the log pattern matching commented out. I've decided against making them a separate PR, because then almost every commit in this PR would have to incrementally adapt that handler to the changes just to keep it compiling.

For HttpResponse and HttpRequest, parsing and serializing the messages is tested, in streaming and non-streaming conditions, and also their use together with some of the auxiliary functions in HttpUtility.

If anyone can think of test cases I missed, I'll be happy to add them.


Tests (yhabteab)

On a Debian VM of the following size:

RAM - 8GB
VCPUs - 4 VCPU
Disk - 50GB

Icinga 2 is running as a container within that VM, using a locally built image from this PR. No other process is involved in the resulting graphs, since I've generated them using docker stats with a timeframe of 30m and Icinga 2 is the only container running.

I've used this script from outside of the VM (from my local machine) to stress test it:

#!/bin/zsh

# Usage:
#
#   while :; do ./do-request.sh 100; printf .; done

for ((i = 0; i < "$1"; i++)); do
    curl -sSk \
        -u root:a8f80869d65de8ae \
        -o /dev/null \
        'https://10.27.0.95:5665/v1/objects/services?pretty=1' &
done

wait "$(jobs -p)"
icinga2 daemon -C
[2025-08-01 10:09:00 +0000] information/cli: Icinga application loader (version: v2.15.0-68-ge9f7f1c5f)
[2025-08-01 10:09:00 +0000] information/cli: Loading configuration file(s).
[2025-08-01 10:09:01 +0000] information/ConfigItem: Committing config item(s).
[2025-08-01 10:09:01 +0000] information/ApiListener: My API identity: icinga-master
[2025-08-01 10:09:03 +0000] information/ConfigItem: Instantiated 1 NotificationComponent.
[2025-08-01 10:09:03 +0000] information/ConfigItem: Instantiated 1 CheckerComponent.
[2025-08-01 10:09:03 +0000] information/ConfigItem: Instantiated 16124 Dependencies.
[2025-08-01 10:09:03 +0000] information/ConfigItem: Instantiated 1 User.
[2025-08-01 10:09:03 +0000] information/ConfigItem: Instantiated 1 UserGroup.
[2025-08-01 10:09:03 +0000] information/ConfigItem: Instantiated 3 TimePeriods.
[2025-08-01 10:09:03 +0000] information/ConfigItem: Instantiated 3 ServiceGroups.
[2025-08-01 10:09:03 +0000] information/ConfigItem: Instantiated 12419 Services.
[2025-08-01 10:09:03 +0000] information/ConfigItem: Instantiated 1 ScheduledDowntime.
[2025-08-01 10:09:03 +0000] information/ConfigItem: Instantiated 3 Zones.
[2025-08-01 10:09:03 +0000] information/ConfigItem: Instantiated 12 Notifications.
[2025-08-01 10:09:03 +0000] information/ConfigItem: Instantiated 2 NotificationCommands.
[2025-08-01 10:09:03 +0000] information/ConfigItem: Instantiated 1 FileLogger.
[2025-08-01 10:09:03 +0000] information/ConfigItem: Instantiated 1 IcingaApplication.
[2025-08-01 10:09:03 +0000] information/ConfigItem: Instantiated 530 Hosts.
[2025-08-01 10:09:03 +0000] information/ConfigItem: Instantiated 2 HostGroups.
[2025-08-01 10:09:03 +0000] information/ConfigItem: Instantiated 1 Endpoint.
[2025-08-01 10:09:03 +0000] information/ConfigItem: Instantiated 1 ApiUser.
[2025-08-01 10:09:03 +0000] information/ConfigItem: Instantiated 1 ApiListener.
[2025-08-01 10:09:03 +0000] information/ConfigItem: Instantiated 248 CheckCommands.

Before:

It couldn't even run for a full minute before the OOM killer was triggered:

[14170.329916] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=containerd.service,mems_allowed=0,global_oom,task_memcg=/system.slice/docker-369ef8f43cd345cbd20e290bb24d8d209f2dda377ff4e9b77170a4b72fca7036.scope,task=icinga2,pid=106144,uid=5665
[14170.333017] Out of memory: Killed process 106144 (icinga2) total-vm:8940468kB, anon-rss:7622224kB, file-rss:204kB, shmem-rss:0kB, UID:5665 pgtables:15304kB oom_score_adj:0
[CPU % and MEM % usage graphs]

After:
[CPU % and MEM % usage graphs]

@jschmidt-icinga jschmidt-icinga added this to the 2.16.0 milestone Jul 23, 2025
@jschmidt-icinga jschmidt-icinga added area/api REST API core/quality Improve code, libraries, algorithms, inline docs labels Jul 23, 2025
@cla-bot cla-bot bot added the cla/signed label Jul 23, 2025
Member

@yhabteab yhabteab left a comment


First of all, chapeau for your great work and the effort you put into setting up the unit tests. I didn't review all the tests in detail yet, but they look very comprehensive and reusable for future tests as well. For the time being, though, I have a few comments regarding the non-test code, so here we go :)!

@jschmidt-icinga
Contributor Author

jschmidt-icinga commented Jul 28, 2025

  • Added HttpResponse::SendFile() method (WIP)
  • Reformatted new code with clang-format (please tell me if you find anything that doesn't match the style convention)
  • Fixed potential concurrency issues with Boost.Test asserts

@jschmidt-icinga jschmidt-icinga force-pushed the http-handlers-stream-refactor branch from 5f8e238 to d5bde69 on July 28, 2025 08:41
@jschmidt-icinga
Contributor Author

  • Removed the BodyReader implementation from SerializableBody, which wasn't really needed at all. The HttpResponseJsonWriter also lost some complexity due to no longer needing to conform to the BodyReader interface.

@jschmidt-icinga jschmidt-icinga force-pushed the http-handlers-stream-refactor branch 2 times, most recently from f1fe71f to 55b0bd1 on July 28, 2025 09:55
@jschmidt-icinga
Contributor Author

  • Rebased onto master (and then forgot about older boost versions again)

@jschmidt-icinga jschmidt-icinga force-pushed the http-handlers-stream-refactor branch 2 times, most recently from 3cf3765 to 4b1020e on July 31, 2025 11:32
@jschmidt-icinga
Contributor Author

  • Added a few more asserts and made the tests use the parser-based overload of http::read() instead of the message one 1.

Footnotes

  1. Because, as @yhabteab found out, when it returns an error code (even when the message was successfully read and the error stems from reading an end_of_stream error into the read buffer), the message is never moved back out of the parser object. I have no idea why they built it that way, and I don't think it caused any errors or false negatives, but better safe than sorry.
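For context, the difference between the two overloads looks roughly like this (plain Beast, synchronous and simplified for brevity): with the parser-based overload the message can still be recovered via release() even when read() reports an error such as end_of_stream alongside a completely parsed message.

#include <boost/beast/core.hpp>
#include <boost/beast/http.hpp>

template<class SyncReadStream>
boost::beast::http::response<boost::beast::http::string_body>
ReadResponse(SyncReadStream& stream, boost::beast::flat_buffer& buffer)
{
	namespace http = boost::beast::http;

	// Parser-based overload: the parser owns the message while reading...
	http::response_parser<http::string_body> parser;
	boost::beast::error_code ec;
	http::read(stream, buffer, parser, ec);

	// ...so even if ec is set (e.g. end_of_stream was read into the buffer after a
	// complete message), the parsed message can still be taken out of the parser.
	return parser.release();
}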

yhabteab previously approved these changes Aug 1, 2025
Member

@yhabteab yhabteab left a comment


LGTM!

PS: I've updated the PR description to include before and after MEM usage graphs that demonstrate the difference in memory usage with this change. So, please have a look at them :)!

Contributor

@julianbrost julianbrost left a comment


So far, these are my comments for everything except the tests. I'll have a look at them next.

Comment on lines +38 to +41
/* Preferably, we would return an ostream object here instead. However
* there seems to be a bug in boost::beast where if the ostream, or rather its
* streambuf object is moved into the return value, the chunked encoding gets
* mangled, leading to the client disconnecting.
Contributor


Is this a known issue? If so, please add a reference, otherwise, is this something that would be worth reporting?

Contributor Author


Yes, definitely. I mentioned in the PR description that once I have some spare time I'll see if I can put together a minimal reproducer and report the issue.

Contributor

@julianbrost julianbrost left a comment


And now my comments for the test fixtures. The actual tests are a task for another day 😅

@julianbrost
Contributor

For the record: apart from the CPU usage, which has to be looked into, I'm happy with this PR now. And hopefully that just needs some tweaking of some thresholds. Or maybe some profiling that hints at something that can easily be tweaked.

Response time also was bound to go up, especially on really fast connections (like localhost)

I'm not sure if the connection speed is much of a factor here. Just checked: for my test requests, the response size is about 30 MiB, so stretched over 7 seconds, we aren't talking about massive transfer rates here.

@jschmidt-icinga
Contributor Author

jschmidt-icinga commented Aug 27, 2025

Ok, so from my preliminary testing, I can only reproduce a difference of this magnitude without optimization. Enabling -O2 -DNDEBUG severely cuts down on the CPU-time impact, while increasing the flush threshold to around 64k reduces it to ~0 on my system (I actually found a slight improvement in user time with even larger thresholds). Response time seems mostly unchanged between 1f92ec6 and this PR when optimizations are enabled, regardless of flush threshold.

Maybe you can repeat the test with optimizations enabled on your end and see if it's still a problem. I'll do a few more tests tomorrow and find the ideal value for the flush threshold (and maybe prepare some charts to show the impact of this value).

@yhabteab
Member

I have also tested this as described in #10516 (comment) and got almost the same results!

Request GET /v1/objects/hosts (from [::1]:38850, user: root, agent: curl/8.14.1, status: OK) took total 5076ms.
Request GET /v1/objects/hosts (from [::1]:38818, user: root, agent: curl/8.14.1, status: OK) took total 5113ms.
Request GET /v1/objects/hosts (from [::1]:38840, user: root, agent: curl/8.14.1, status: OK) took total 5185ms

And profiling it with dtrace on my Mac shows that the write_character(s) methods of the output adapter take a lot more CPU than I would expect them to, since this is basically transferring data from one buffer to another. However, the multi_buffer::prepare and buffer_copy functions seem to do a lot of work each time.

[dtrace profiling screenshots: Bildschirmfoto 2025-08-27 um 17 45 09, 17 53 41, 17 54 11]

@yhabteab
Member

while increasing the flush-threshold to around 64k is reducing it to ~0 (I actually found a slight improvement in user time with even larger thresholds) on my system.

Sorry, I didn't see your comment yesterday while I was submitting mine! As you can see in the screenshots I posted yesterday, doing I/O ops too often isn't the problem; it's filling the actual buffer via the write_character(s) methods. With the following patch applied, the response time has decreased by approximately 50% compared to before, and CPU usage has dropped by about 30%.

Expand me
diff --git a/lib/remote/httpmessage.cpp b/lib/remote/httpmessage.cpp
index 19011e432..53855ff9a 100644
--- a/lib/remote/httpmessage.cpp
+++ b/lib/remote/httpmessage.cpp
@@ -25,9 +25,6 @@ public:
 	explicit HttpResponseJsonWriter(HttpResponse& msg) : m_Message{msg}
 	{
 		m_Message.body().Start();
-#if BOOST_VERSION >= 107000
-		m_Message.body().Buffer().reserve(m_MinPendingBufferSize);
-#endif /* BOOST_VERSION */
 	}

 	~HttpResponseJsonWriter() override { m_Message.body().Finish(); }
@@ -36,9 +33,7 @@ public:

 	void write_characters(const char* s, std::size_t length) override
 	{
-		auto buf = m_Message.body().Buffer().prepare(length);
-		boost::asio::buffer_copy(buf, boost::asio::const_buffer{s, length});
-		m_Message.body().Buffer().commit(length);
+		m_Message.body() << std::string_view{s, length};
 	}

 	void MayFlush(boost::asio::yield_context& yield) override
@@ -111,7 +106,6 @@ HttpResponse::HttpResponse(Shared<AsioTlsStream>::Ptr stream, HttpServerConnecti
 void HttpResponse::Clear()
 {
 	ASSERT(!m_SerializationStarted);
-	boost::beast::http::response<body_type>::operator=({});
 }

 void HttpResponse::Flush(boost::asio::yield_context yc)
@@ -175,7 +169,7 @@ void HttpResponse::SendFile(const String& path, const boost::asio::yield_context
 	while (remaining) {
 		auto maxTransfer = std::min(remaining, maxChunkSize);

-		auto buf = *body().Buffer().prepare(maxTransfer).begin();
+		auto buf = body().Buffer().prepare(maxTransfer);
 		fp.read(static_cast<char*>(buf.data()), buf.size());
 		body().Buffer().commit(buf.size());

diff --git a/lib/remote/httpmessage.hpp b/lib/remote/httpmessage.hpp
index 10d00fd49..43d9f65b5 100644
--- a/lib/remote/httpmessage.hpp
+++ b/lib/remote/httpmessage.hpp
@@ -10,6 +10,7 @@
 #include "remote/url.hpp"
 #include <boost/beast/http.hpp>
 #include <boost/version.hpp>
+#include <boost/asio/streambuf.hpp>

 namespace icinga {

@@ -20,11 +21,8 @@ namespace icinga {
  * which uses a multi_buffer, with the ability to continue serialization when
  * new data arrives of the @c boost::beast::http::buffer_body.
  *
- * @tparam DynamicBuffer A buffer conforming to the boost::beast interface of the same name
- *
  * @ingroup remote
  */
-template<class DynamicBuffer>
 struct SerializableBody
 {
 	class writer;
@@ -33,7 +31,7 @@ struct SerializableBody
 	{
 	public:
 		template<typename T>
-		value_type& operator<<(T&& right)
+		std::ostream& operator<<(T&& right)
 		{
 			/* Preferably, we would return an ostream object here instead. However
 			 * there seems to be a bug in boost::beast where if the ostream, or rather its
@@ -54,8 +52,8 @@ struct SerializableBody
 			 * responses are handled via a reader instance, this shouldn't be too much of a
 			 * problem.
 			 */
-			boost::beast::ostream(m_Buffer) << std::forward<T>(right);
-			return *this;
+			m_OStream << std::forward<T>(right);
+			return m_OStream;
 		}

 		[[nodiscard]] std::size_t Size() const { return m_Buffer.size(); }
@@ -63,7 +61,7 @@ struct SerializableBody
 		void Finish() { m_More = false; }
 		bool Finished() { return !m_More; }
 		void Start() { m_More = true; }
-		DynamicBuffer& Buffer() { return m_Buffer; }
+		boost::asio::streambuf& Buffer() { return m_Buffer; }

 		friend class writer;

@@ -72,7 +70,8 @@ struct SerializableBody
 		 * for simple messages and can still be written with http::async_write().
 		 */
 		bool m_More = false;
-		DynamicBuffer m_Buffer;
+		boost::asio::streambuf m_Buffer;
+		std::ostream m_OStream{&m_Buffer};
 	};

 	static std::uint64_t size(const value_type& body) { return body.Size(); }
@@ -95,7 +94,7 @@ struct SerializableBody
 	class writer
 	{
 	public:
-		using const_buffers_type = typename DynamicBuffer::const_buffers_type;
+		using const_buffers_type = boost::asio::streambuf::const_buffers_type;

 #if BOOST_VERSION > 106600
 		template<bool isRequest, class Fields>
@@ -200,7 +199,7 @@ private:
  *
  * @ingroup remote
  */
-class HttpResponse : public boost::beast::http::response<SerializableBody<boost::beast::multi_buffer>>
+class HttpResponse : public boost::beast::http::response<SerializableBody>
 {
 public:
 	explicit HttpResponse(Shared<AsioTlsStream>::Ptr stream, HttpServerConnection::Ptr server = nullptr);
diff --git a/test/CMakeLists.txt b/test/CMakeLists.txt
index 5498a6d83..b69bdac8a 100644
--- a/test/CMakeLists.txt
+++ b/test/CMakeLists.txt
@@ -278,7 +278,6 @@ add_boost_test(base
     remote_certs_fixture/cleanup_certs
     remote_httpmessage/request_parse
     remote_httpmessage/request_params
-    remote_httpmessage/response_clear
     remote_httpmessage/response_flush_nothrow
     remote_httpmessage/response_flush_throw
     remote_httpmessage/response_write_empty
@@ -298,7 +297,6 @@ add_boost_test(base
     remote_httpserverconnection/reuse_connection
     remote_httpserverconnection/wg_abort
     remote_httpserverconnection/client_shutdown
-    remote_httpserverconnection/handler_throw_error
     remote_httpserverconnection/handler_throw_streaming
     remote_httpserverconnection/liveness_disconnect
     remote_configpackageutility/ValidateName
@@ -313,7 +311,6 @@ if(BUILD_TESTING)
   set_tests_properties(
     base-remote_httpmessage/request_parse
     base-remote_httpmessage/request_params
-    base-remote_httpmessage/response_clear
     base-remote_httpmessage/response_flush_nothrow
     base-remote_httpmessage/response_flush_throw
     base-remote_httpmessage/response_write_empty
@@ -333,7 +330,6 @@ if(BUILD_TESTING)
     base-remote_httpserverconnection/reuse_connection
     base-remote_httpserverconnection/wg_abort
     base-remote_httpserverconnection/client_shutdown
-    base-remote_httpserverconnection/handler_throw_error
     base-remote_httpserverconnection/handler_throw_streaming
     base-remote_httpserverconnection/liveness_disconnect
     PROPERTIES FIXTURES_REQUIRED ssl_certs)

The handler_throw_error and response_clear unit tests will fail with this patch, but I didn't investigate why yet. The patch uses Asio's streambuf in combination with std::ostream; other than that there's no difference to the original code. However, as opposed to Beast's multi_buffer, Asio's streambuf doesn't have the constraint that "calling members of the underlying buffer before the output stream is destroyed results in undefined behavior", so it uses a persistent std::ostream member instead of creating a new one for each << operation.

PR bf04a55

Request GET /v1/objects/hosts (from [::1]:43996, user: root, agent: curl/8.14.1, status: OK) took total 4971ms.
Request GET /v1/objects/hosts (from [::1]:43988, user: root, agent: curl/8.14.1, status: OK) took total 5137ms.
Request GET /v1/objects/hosts (from [::1]:43968, user: root, agent: curl/8.14.1, status: OK) took total 5260ms.
Request GET /v1/objects/hosts (from [::1]:43970, user: root, agent: curl/8.14.1, status: OK) took total 5296ms.
User time (seconds): 17.08
System time (seconds): 0.19
Percent of CPU this job got: 191%
Maximum resident set size (kbytes): 86272

After applying the patch

Request GET /v1/objects/hosts (from [::1]:55296, user: root, agent: curl/8.14.1, status: OK) took total 2434ms.
Request GET /v1/objects/hosts (from [::1]:55316, user: root, agent: curl/8.14.1, status: OK) took total 2438ms.
Request GET /v1/objects/hosts (from [::1]:55304, user: root, agent: curl/8.14.1, status: OK) took total 2438ms.
Request GET /v1/objects/hosts (from [::1]:55308, user: root, agent: curl/8.14.1, status: OK) took total 2454ms.
User time (seconds): 8.96
System time (seconds): 0.18
Percent of CPU this job got: 142%
Maximum resident set size (kbytes): 86008

@jschmidt-icinga
Contributor Author

Asio's streambuf doesn't have the constraint that "calling members of the underlying buffer before the output stream is destroyed results in undefined behavior", so it uses a persistent std::ostream member instead of creating a new one for each << operation.

That isn't relevant here, since everything is going through the output adapter and not the ostream operator. My guess (since from your numbers I'm assuming you're still running the test unoptimized) would be that some low-level functionality of asio's streambuf is compiled into the dynamic libraries (with optimization) and that's why it's faster. As I said before, streambuf uses multi_buffer internally and I'd be surprised if it was any faster with optimizations enabled.

This all comes down to whether it's a priority for us that our binary should not have performance regressions in an unoptimized debug build (I didn't test -Og, but I'd bet it's still a lot closer than your results).

I'll finish a few more tests and then I'll get back to you with my own results.

@yhabteab
Member

This all comes down to whether it's a priority for us that our binary should not have performance regressions in an unoptimized debug build (I didn't test -Og, but I'd bet it's still a lot closer than your results).

I didn't use a debug binary for my previous tests! I just built an image with #10505 and ran it as described in #10516 (comment). That's the exact executable that will also be used by end users if they use our image.

As I said before, streambuf uses multi_buffer internally and I'd be surprised if it was any faster with optimizations enabled.

Are you sure about this? Beast depends on Asio's functionality, but never the other way around. Asio's streambuf implements the std::streambuf interface, so it has its own techniques to manage the buffers and doesn't depend on any of Beast's functionality.

@jschmidt-icinga
Contributor Author

Beast depends on Asio's functionality, but never the other way around. Asio's streambuf implements the std::streambuf interface, so it has its own techniques to manage the buffers and doesn't depend on any of Beast's functionality.

My bad. I must have misremembered. I thought this was part of beast. Seems it uses a std::vector inside.

I didn't use debug binary for my previous tests!

Then why do you get these absurdly high response times? Mine are more along the lines of 610 ms per response for 1f92ec6 (on my potato notebook, mind you), with my PR starting at ~690 ms at a 4k flush threshold (and decreasing significantly at higher thresholds).

@julianbrost
Contributor

@jschmidt-icinga You're right, compiler optimizations (or the lack thereof) are to blame here. Only adding -O2 to CXXFLAGS changed the numbers for me as follows:

Base (1f92ec6)

Request GET /v1/objects/hosts ([...], status: OK) took total 876ms.
Request GET /v1/objects/hosts ([...], status: OK) took total 885ms.
Request GET /v1/objects/hosts ([...], status: OK) took total 887ms.
Request GET /v1/objects/hosts ([...], status: OK) took total 889ms.
	User time (seconds): 4.08
	System time (seconds): 0.68
	Maximum resident set size (kbytes): 699536

PR (bf04a55)

Request GET /v1/objects/hosts ([...], status: OK) took total 888ms.
Request GET /v1/objects/hosts ([...], status: OK) took total 889ms.
Request GET /v1/objects/hosts ([...], status: OK) took total 893ms.
Request GET /v1/objects/hosts ([...], status: OK) took total 893ms.
	User time (seconds): 4.58
	System time (seconds): 0.42
	Maximum resident set size (kbytes): 81212

So with that, it's just a 0.7% increase in response time and a 5% increase in CPU usage for an 88% decrease in memory usage (again, take the exact numbers with a grain of salt, they are from a single run; repeating it shows similar numbers though). That sounds more than acceptable for that saving in memory.

This all comes down on whether it's a priority for us that our binary should not have performance regressions in an unoptimized debug binary

So that in itself wouldn't be a problem for me.

Though I haven't looked in detail into what @yhabteab found, should we have a closer look at this nonetheless?

@jschmidt-icinga
Contributor Author

jschmidt-icinga commented Aug 28, 2025

I've got the following results with a container compiled with -O2 -DNDEBUG. Per test I ran the test script three times, averaged the four response times per run, and then averaged those values over the three runs. For each run I doubled the value of HttpResponseJsonWriter::m_MinPendingBufferSize, from 4k to 128k, plus 1M as an additional test.

test      response time (ms)   user time (s)   mem (kB)
1f92ec6   611                  2.96            707074
4k        690                  3.54             78473
8k        651                  3.47             79154
16k       596                  3.26             78977
32k       563                  3.10             79718
64k       543                  3.01             80483
128k      534                  2.98             78865
1M        532                  2.94             86795

At 1M, which I initially tried just to make sure a really big number doesn't yield any additional gains, I did an additional test that only calls async_flush() conditionally when the serializer is done. That looked promising, so I ran it again at lower threshold values, where I would have expected flushing manually to matter more, but it didn't seem to make a difference, so I left those results out here.

I also want to note that the values for this PR were incredibly consistent compared to 1f92ec6 (likely because of fewer memory allocations), which fluctuated a lot more, so I averaged over 6 runs for it instead of 3 for the others.

As you can see, the response time actually gets better than 1f92ec6, while user time breaks even at about 64k/128k. Given that below 1M the maximum memory usage doesn't seem to be impacted much, I'd actually set this (and the threshold in SendFile()) to 128k unless anyone has objections.

Edit:
@julianbrost @yhabteab I'd be curious how boost::asio::streambuf performs, but I'd be very surprised if there was a significant difference. It doesn't do anything special or different. The interface is very similar to the DynamicBuffer interface (which is the reason it worked here as an almost drop-in replacement), and while the buffer it uses internally is contiguous, as opposed to multi_buffer, which is more std::deque-like, I don't think that matters much in our case. multi_buffer will probably need to allocate one or two additional chunks above the threshold, adding cost to future cycles, while streambuf will have to reallocate the entire buffer once or twice, which costs more up front but is free from then on. Both should be negligible.
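For reference, both buffer types implement the same DynamicBuffer-style prepare/commit/consume cycle, which is why one could stand in for the other almost without code changes; only the internal storage strategy differs. A small self-contained sketch:

#include <boost/asio/buffer.hpp>
#include <boost/asio/streambuf.hpp>
#include <boost/beast/core/multi_buffer.hpp>
#include <string>

// Works with both boost::beast::multi_buffer and boost::asio::streambuf,
// since both expose prepare()/commit()/data()/consume()/size().
template<class DynamicBuffer>
void AppendAndDrain(DynamicBuffer& buf, const std::string& payload)
{
	auto writable = buf.prepare(payload.size());                       // reserve writable space
	boost::asio::buffer_copy(writable, boost::asio::buffer(payload));  // fill it
	buf.commit(payload.size());                                        // make it readable

	// ... a writer/serializer would now pass buf.data() to async_write ...
	buf.consume(buf.size());                                           // drop what was sent
}

int main()
{
	boost::beast::multi_buffer mb; // storage: a sequence of blocks, appended as needed
	boost::asio::streambuf sb;     // storage: one contiguous array, reallocated as needed
	AppendAndDrain(mb, "hello");
	AppendAndDrain(sb, "hello");
}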

@julianbrost
Contributor

I'd actually set this (and the threshold in SendFile()) to 128k unless anyone has objections.

Sounds fine for me. (Afterwards I'll spin it up for a test again, but I wouldn't expect any big surprises there).

The interface is very similar to the DynamicBuffer interface (which is the reason it worked here as an almost-drop-in replacement) and while the buffer it uses internally is contiguous, as opposed to multi_buffer, which is more std::deque-like

Fun fact: I'm reading the documentation of asio::streambuf as "we currently do that but if we feel like it, we could change it to something more like multi_buffer at any time":

The basic_streambuf class's public interface is intended to permit the following implementation strategies:

  • A single contiguous character array, which is reallocated as necessary to accommodate changes in the size of the character sequence. This is the implementation approach currently used in Asio.
  • A sequence of one or more character arrays, where each array is of the same size. Additional character array objects are appended to the sequence to accommodate changes in the size of the character sequence.
  • A sequence of one or more character arrays of varying sizes. Additional character array objects are appended to the sequence to accommodate changes in the size of the character sequence.

@jschmidt-icinga jschmidt-icinga force-pushed the http-handlers-stream-refactor branch from bf04a55 to 0b0b43d on August 28, 2025 10:23
@jschmidt-icinga
Contributor Author

jschmidt-icinga commented Aug 28, 2025

  • As mentioned, bumped the flush threshold to 128k (and renamed + commented the value accordingly).
  • I also set the pre-allocated size with which the output adapter initializes the buffer to 1.25 times the threshold, because otherwise an additional chunk would always be allocated before the first flush. In the above test case this doesn't give any noticeable improvement, but it can't hurt either.

@yhabteab I've looked at the new docker ContainerFile and tested with the same values of CMAKE_BUILD_TYPE=RelWithDebInfo without any additional CMAKE_CXX_FLAGS and get about the same values. Same with Release and -O2 -g -DNDEBUG, which RelWithDebInfo boils down to. I can't think of anything else that could be the problem...

@jschmidt-icinga jschmidt-icinga force-pushed the http-handlers-stream-refactor branch from 0b0b43d to 7373f36 on August 28, 2025 11:22
@yhabteab
Member

I can't think of anything else that could be the problem...

But I do 🙈! The new Containerfile from #10505 sets CMAKE_BUILD_TYPE to ReleaseWithDebInfo, but that's not a valid CMake value, so CMake probably just defaults to Debug, I guess. After changing it to RelWithDebInfo and rebuilding the images, I now get roughly the same numbers as you. This time, asio::streambuf performs similarly to beast::multi_buffer and there are no significant differences between them. However, in debug builds (which I mistakenly assumed were release builds xD), Asio's implementation appears to be superior to Beast's.

@jschmidt-icinga
Contributor Author

jschmidt-icinga commented Aug 28, 2025

The new Containerfile from #10505 sets CMAKE_BUILD_TYPE to ReleaseWithDebInfo, but that's not a valid CMake value

I find this funny because in my last comment, I first wrote RelWithDebugInfo instead before I edited it. 😆

However, in debug builds (which I mistakenly assumed it's a release build xD), Asio's implementation appears to be superior to Beast's implementation.

I think I found the reason why: look for debug_check() in the multi_buffer source. This puts asserts on basically all variables of the buffer's control block at multiple points during each consume()/commit()/prepare() operation. boost::asio::streambuf on the other hand implements these functions using the setg()/gptr()/pptr() interface of the std::streambuf base, which has no asserts at all.

@julianbrost
Contributor

I think I found the reason why: look for debug_check() in the multi_buffer source.

I doubt that's the culprit. They are all inside #if BOOST_BEAST_MULTI_BUFFER_DEBUG_CHECK, and it looks like you'd have to set that explicitly; I didn't find any place in the Boost headers where this is set (and neither do we set it inside Icinga 2).

Copy link
Contributor

@julianbrost julianbrost left a comment


Afterwards I'll spin it up for a test again, but I wouldn't expect any big surprises there

Just did that, unsurprisingly it still worked fine, so for me, this PR is ready 🚀

Note: that doesn't mean the code added by this PR is set in stone; there may still be room for improvement, but I think the current state is fine, and any further changes would be easier to grasp in follow-up PRs rather than inside a 2k+ line, 200+ comment PR.

@julianbrost julianbrost merged commit 87df80d into master Aug 29, 2025
30 checks passed
@julianbrost julianbrost deleted the http-handlers-stream-refactor branch August 29, 2025 09:33