Introduce Endpoint#seconds_{reading_messages,awaiting_semaphore,processing_messages}⏱️ #10266
base: master
Conversation
Force-pushed from 5d1b07f to 92a4f45
Force-pushed from ef2a7d3 to 3ebbfc6
	[no_user_modify, no_storage] double seconds_processing_messages {
		get;
	};
It works! 👍
< HTTP/1.1 200 OK
< Server: Icinga/v2.14.0-375-gecea52568
< Content-Type: application/json
< Content-Length: 591
<
{
"results": [
{
"attrs": {
"seconds_processing_messages": 3.1292e-05,
"seconds_reading_messages": 0.000155709
},
"joins": {},
"meta": {},
"name": "dummy",
"type": "Endpoint"
},
{
"attrs": {
"seconds_processing_messages": 0,
"seconds_reading_messages": 0
},
"joins": {},
"meta": {},
"name": "ws-aklimov7777777.local",
"type": "Endpoint"
}
]
}
Force-pushed from 3ebbfc6 to f845407
Also works! 👍
lib/methods/clusterzonechecktask.cpp
Outdated
new PerfdataValue("sum_bytes_received_per_second", bytesReceivedPerSecond) | ||
new PerfdataValue("sum_bytes_received_per_second", bytesReceivedPerSecond), | ||
new PerfdataValue("sum_seconds_reading_messages", secondsReadingMessages), | ||
new PerfdataValue("sum_seconds_processing_messages", secondsProcessingMessages) |
{
"counter": false,
"crit": null,
"label": "sum_seconds_reading_messages",
"max": null,
"min": null,
"type": "PerfdataValue",
"unit": "",
"value": 0.000787875,
"warn": null
},
{
"counter": false,
"crit": null,
"label": "sum_seconds_processing_messages",
"max": null,
"min": null,
"type": "PerfdataValue",
"unit": "",
"value": 3.45e-05,
"warn": null
}
lib/methods/icingachecktask.cpp
Outdated
perfdata->Add(new PerfdataValue("sum_bytes_sent_per_second", bytesSentPerSecond)); | ||
perfdata->Add(new PerfdataValue("sum_bytes_received_per_second", bytesReceivedPerSecond)); | ||
perfdata->Add(new PerfdataValue("sum_seconds_reading_messages", secondsReadingMessages)); | ||
perfdata->Add(new PerfdataValue("sum_seconds_processing_messages", secondsProcessingMessages)); |
{
"counter": false,
"crit": null,
"label": "sum_seconds_reading_messages",
"max": null,
"min": null,
"type": "PerfdataValue",
"unit": "",
"value": 0.000264792,
"warn": null
},
{
"counter": false,
"crit": null,
"label": "sum_seconds_processing_messages",
"max": null,
"min": null,
"type": "PerfdataValue",
"unit": "",
"value": 5.2666e-05,
"warn": null
}
		get;
	};

	[no_user_modify, no_storage] double seconds_processing_messages {
Colleagues, in addition we could also (just for the API) record seconds_processing_messages PER message type. I.e. there should also be another attribute returning a dict like {"event::CheckResult":42.0,...} saying that this endpoint e.g. spent 42 seconds handling (already read and decoded) event::CheckResult messages since program start.
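As a purely illustrative aside (not part of the PR), the data shape of such a per-message-type attribute could look like the following self-contained C++ sketch; the class name, locking and map type are assumptions of mine, only the example keys come from the suggestion above.

#include <chrono>
#include <iostream>
#include <map>
#include <mutex>
#include <string>

// Hypothetical accumulator for "seconds spent processing messages, per RPC method".
// A real implementation inside Icinga 2 would use its own Dictionary and locking
// primitives; this only demonstrates the proposed shape, e.g. {"event::CheckResult": 42.0, ...}.
class PerMessageProcessingTime
{
public:
	void Add(const std::string& method, std::chrono::duration<double> elapsed)
	{
		std::lock_guard<std::mutex> lock(m_Mutex);
		m_Seconds[method] += elapsed.count();
	}

	std::map<std::string, double> Snapshot() const
	{
		std::lock_guard<std::mutex> lock(m_Mutex);
		return m_Seconds;
	}

private:
	mutable std::mutex m_Mutex;
	std::map<std::string, double> m_Seconds;
};

int main()
{
	PerMessageProcessingTime stats;
	stats.Add("event::CheckResult", std::chrono::duration<double>(0.042));
	stats.Add("event::Heartbeat", std::chrono::duration<double>(0.001));

	for (const auto& [method, seconds] : stats.Snapshot())
		std::cout << method << ": " << seconds << "s\n";
}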
- Yes, there's also Log HTTP/RPC message processing stats #10141, but that's a debug log unless a single message takes excessively long. So, unless combined with Logger#object_filter: restrict Logger to messages referring to specific objects #9844, it's pretty useless in a large environment.
- Log HTTP/RPC message processing stats #10141 (comment) legitimately mentions that we already have logs which warn if the semaphore wait delays message processing noticeably, i.e. you have more busy connections than semaphore slots.
- But if you have enough slots and your messages themselves take long to process (Give PerfdataWriter a 👷♂️WorkQueue like e.g. GraphiteWriter🖋️ #10267?), the suggested aggregation tells you.
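To illustrate why the aggregation is useful even though the attributes are cumulative since program start: the actionable signal is the delta between two samples divided by the sampling interval. A minimal, self-contained sketch (my own illustration, not code from this PR) of turning two samples of seconds_processing_messages into a busy ratio:

#include <chrono>
#include <iostream>

// Given two samples of the cumulative seconds_processing_messages attribute
// (e.g. fetched via /v1/objects/endpoints), compute the fraction of the
// sampling interval spent processing already-read messages.
double ProcessingBusyRatio(double secondsBefore, double secondsAfter, std::chrono::duration<double> interval)
{
	return (secondsAfter - secondsBefore) / interval.count();
}

int main()
{
	using namespace std::chrono_literals;

	// Example: the counter grew by 12.3s within a 60s window, i.e. this
	// endpoint's connections were busy processing messages ~20% of the time.
	std::cout << ProcessingBusyRatio(100.0, 112.3, 60s) << "\n";
}

If that ratio approaches 1 while seconds_awaiting_semaphore stays near zero, the bottleneck is the message handlers themselves rather than the number of semaphore slots.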
Force-pushed from 20460e6 to 055c6b6
Please also rebase; the Actions runs are pretty outdated.
Force-pushed from 67ddc24 to 531991f
Force-pushed from 167e8c0 to 70cee1c
Force-pushed from 70cee1c to 124c2e8
lib/remote/jsonrpcconnection.cpp
Outdated
	// Only once we receive at least one byte, we know there must be a message to read.
	if (!m_Stream->in_avail()) {
		m_Stream->async_fill(yc);
	}

	// Only then we can start measuring the time it takes to read it.
	if (m_Endpoint) {
		readStarted = AtomicDuration::Clock::now();
	}

	jsonString = JsonRpc::ReadMessage(m_Stream, yc, m_Endpoint ? -1 : 1024 * 1024);
this metric would also report an agent connection as slow where there are only a few check messages every 5 minutes and a few heartbeat messages
🎉 I've filtered out those 5 minutes!
It now reports very surprising numbers for me:
$ curl -isSku root:icinga 'https://localhost:5665/v1/objects/endpoints/master-2' -X GET --json '{"pretty":true, "attrs":["seconds_awaiting_semaphore", "seconds_processing_messages", "seconds_reading_messages"]}'
HTTP/1.1 200 OK
Server: Icinga/v2.14.0-470-gc67f8ddf1
Content-Type: application/json
Content-Length: 371
{
"results": [
{
"attrs": {
"seconds_awaiting_semaphore": 0.00343897,
"seconds_processing_messages": 3.792230195,
"seconds_reading_messages": 3.976059603
},
"joins": {},
"meta": {},
"name": "master-2",
"type": "Endpoint"
}
]
}
That's 3.97s of reading versus 3.79s of processing for a local connection between two containers. I would not expect reading the message to be more expensive than parsing (JSON parsing is counted as part of processing, not reading) and handling it. So there could still be something funky with that message, or we do something pretty wrong during reading.
I wouldn't say JSON decoding is "something pretty wrong", but I could imagine it taking about as long as processing.
JSON parsing is counted as part of processing
So JSON parsing does not explain why reading reports such high numbers.
Yes, I've overlooked that elephant. These are my numbers, by the way:
➜ hotspare git:(e241a240a) for i in 10.27.1.87 10.27.1.90 10.27.3.6 10.27.0.129; do curl -ksSu root:123456 -X GET -H 'Accept: application/json' -d '{"attrs":["seconds_awaiting_semaphore","seconds_processing_messages","seconds_reading_messages"]}' "https://${i}:5665/v1/objects/endpoints?pretty=1"; done
{
"results": [
{
"attrs": {
"seconds_awaiting_semaphore": 0,
"seconds_processing_messages": 0,
"seconds_reading_messages": 0
},
"joins": {},
"meta": {},
"name": "akzl1",
"type": "Endpoint"
},
{
"attrs": {
"seconds_awaiting_semaphore": 0.020272539,
"seconds_processing_messages": 4.389837786,
"seconds_reading_messages": 2.188152555
},
"joins": {},
"meta": {},
"name": "akzl2",
"type": "Endpoint"
},
{
"attrs": {
"seconds_awaiting_semaphore": 0.03928695,
"seconds_processing_messages": 8.916709728,
"seconds_reading_messages": 2.046101043
},
"joins": {},
"meta": {},
"name": "akzl3",
"type": "Endpoint"
},
{
"attrs": {
"seconds_awaiting_semaphore": 0.00137814,
"seconds_processing_messages": 0.066405319,
"seconds_reading_messages": 0.037797953
},
"joins": {},
"meta": {},
"name": "akzl4",
"type": "Endpoint"
}
]
}
{
"results": [
{
"attrs": {
"seconds_awaiting_semaphore": 0.050616719,
"seconds_processing_messages": 12.356168013,
"seconds_reading_messages": 7.742184025
},
"joins": {},
"meta": {},
"name": "akzl1",
"type": "Endpoint"
},
{
"attrs": {
"seconds_awaiting_semaphore": 0,
"seconds_processing_messages": 0,
"seconds_reading_messages": 0
},
"joins": {},
"meta": {},
"name": "akzl2",
"type": "Endpoint"
},
{
"attrs": {
"seconds_awaiting_semaphore": 0.001335411,
"seconds_processing_messages": 0.068493356,
"seconds_reading_messages": 0.034790134
},
"joins": {},
"meta": {},
"name": "akzl4",
"type": "Endpoint"
},
{
"attrs": {
"seconds_awaiting_semaphore": 0.0013662,
"seconds_processing_messages": 0.07003537,
"seconds_reading_messages": 0.036346842
},
"joins": {},
"meta": {},
"name": "akzl3",
"type": "Endpoint"
}
]
}
{
"results": [
{
"attrs": {
"seconds_awaiting_semaphore": 0.001642797,
"seconds_processing_messages": 0.071626259,
"seconds_reading_messages": 0.041599371
},
"joins": {},
"meta": {},
"name": "akzl2",
"type": "Endpoint"
},
{
"attrs": {
"seconds_awaiting_semaphore": 0,
"seconds_processing_messages": 0,
"seconds_reading_messages": 0
},
"joins": {},
"meta": {},
"name": "akzl3",
"type": "Endpoint"
},
{
"attrs": {
"seconds_awaiting_semaphore": 0.0015372369999999999,
"seconds_processing_messages": 0.072465629,
"seconds_reading_messages": 0.052499105
},
"joins": {},
"meta": {},
"name": "akzl1",
"type": "Endpoint"
},
{
"attrs": {
"seconds_awaiting_semaphore": 0.022459207,
"seconds_processing_messages": 4.591767845,
"seconds_reading_messages": 2.317401209
},
"joins": {},
"meta": {},
"name": "akzl4",
"type": "Endpoint"
}
]
}
{
"results": [
{
"attrs": {
"seconds_awaiting_semaphore": 0.001503529,
"seconds_processing_messages": 0.071794129,
"seconds_reading_messages": 0.049342663
},
"joins": {},
"meta": {},
"name": "akzl2",
"type": "Endpoint"
},
{
"attrs": {
"seconds_awaiting_semaphore": 0.001511789,
"seconds_processing_messages": 0.070537203,
"seconds_reading_messages": 0.049436677
},
"joins": {},
"meta": {},
"name": "akzl1",
"type": "Endpoint"
},
{
"attrs": {
"seconds_awaiting_semaphore": 0,
"seconds_processing_messages": 0,
"seconds_reading_messages": 0
},
"joins": {},
"meta": {},
"name": "akzl4",
"type": "Endpoint"
},
{
"attrs": {
"seconds_awaiting_semaphore": 0.019803705,
"seconds_processing_messages": 4.484441092,
"seconds_reading_messages": 3.16510494
},
"joins": {},
"meta": {},
"name": "akzl3",
"type": "Endpoint"
}
]
}
➜ hotspare git:(e241a240a)
I've added a fourth metric and ran (echo -n '2:{},';sleep 55;echo -n '2:{},'; sleep 80000) |openssl s_client -connect 127.0.0.1:5665 -cert prefix/var/lib/icinga2/certs/ws-aklimov7777777.local.crt -key prefix/var/lib/icinga2/certs/ws-aklimov7777777.local.key
against my local node:
"seconds_awaiting_messages": 54.949417417,
"seconds_awaiting_semaphore": 2.541e-06,
"seconds_processing_messages": 9.9251e-05,
"seconds_reading_messages": 0.000117417,
So yes, my numbers are correct: seconds_reading_messages covers just the reading itself, and especially not waiting for the next message to appear on the horizon (the 55s show up in seconds_awaiting_messages).
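For context on the probe: '2:{},' is a netstring frame, i.e. the payload length in decimal, a colon, the payload and a trailing comma, so it carries nothing but an empty JSON object. A stand-alone sketch of that framing (my own illustration, not Icinga 2's actual NetString code):

#include <iostream>
#include <string>

// Netstring framing as used by the "2:{}," probes above:
// "<decimal length>:<payload>," -- so "2:{}," carries the two-byte payload "{}".
std::string EncodeNetString(const std::string& payload)
{
	return std::to_string(payload.size()) + ":" + payload + ",";
}

int main()
{
	std::cout << EncodeNetString("{}") << "\n";                 // prints 2:{},
	std::cout << EncodeNetString(R"({"method":"x"})") << "\n";  // prints 14:{"method":"x"},
}

Reading such a tiny frame takes microseconds, which is why the 55 seconds between the two probes end up in seconds_awaiting_messages rather than in seconds_reading_messages.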
That might be a useful test for you, but what is the point of actually adding it to the PR?
It expresses how idle stuff is and adds up to 100% of the uptime together with the other numbers. Or, to put it in David Kriesel's words (CCC, "SpiegelMining"):
Raw data is cool af! Keep all of it unless your storage is limited.
But I can drop it again if you insist.
It expresses how idle stuff is and adds up to 100% of the uptime together with the other numbers.
Doesn't that even make it redundant then?
Primarily, I find it tedious that you keep changing the scope of the PR without proper communication. I mean, this started roughly as "we want to know the load factor of JSON-RPC connections", for which a single metric would have been sufficient. You created a PR adding three metrics without further explanation of what they do (hoping the names would be fully self-explanatory?), of which one turned out to be suboptimal (#10266 (comment)), after which we said to remove it (that happened offline, so I can't link to it unfortunately). Then instead, you pushed a version where that metric is collected differently, without much explanation (#10266 (comment)). And now this is another iteration of "I've pushed something new to the PR that wasn't discussed or explained beforehand, go figure out yourself what it does and if/why it makes sense to do this".
This isn't even about whether these metrics would be good or bad; it's just that repeatedly changing the scope of the PR, combined with the lack of proper communication and explanations, drags the whole thing out unnecessarily.
But I can drop it again if you insist.
So please remove it so that we have a chance of getting this done any time soon.
It expresses how idle stuff is
please remove it
Done.
Force-pushed from 124c2e8 to e6814c8
Force-pushed from f461342 to c67f8dd
Force-pushed from c67f8dd to 8f76f7f
Force-pushed from 8f76f7f to d63e728
Force-pushed from d63e728 to 8d2c0da
Force-pushed from 8d2c0da to 8808f44
…ssing_messages} for the API
Force-pushed from 8808f44 to 7de13b6
Generally it looks fine to me now. However, the seconds_reading_messages metric is useless IMO, as it just indicates the time needed to read the message from the Asio-internal buffer. This isn't a change request though: since you already have an unresolved thread about this with @julianbrost, do whatever you have agreed on with him.
If your uplink is slow enough, it actually does indicate the reading time. But yes, I can start that stopwatch only after I have at least one byte of the message; otherwise, seconds_reading_messages would also contain idle time.
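For readers following this thread, a minimal, self-contained sketch of the stopwatch pattern being described: wait for the first byte, then measure only the read itself and add the elapsed time to a shared accumulator. The AtomicDuration name is borrowed from the code excerpt above; this implementation is an assumption of mine, not the PR's actual class:

#include <atomic>
#include <chrono>
#include <iostream>

// Assumed stand-in for the AtomicDuration type referenced in the excerpt above:
// an accumulator of elapsed seconds that several connections can add to concurrently.
class AtomicDuration
{
public:
	using Clock = std::chrono::steady_clock;

	void Add(Clock::duration elapsed)
	{
		double seconds = std::chrono::duration<double>(elapsed).count();
		double current = m_Seconds.load(std::memory_order_relaxed);

		// Retry until the addition is applied atomically.
		while (!m_Seconds.compare_exchange_weak(current, current + seconds))
			;
	}

	double GetSeconds() const
	{
		return m_Seconds.load(std::memory_order_relaxed);
	}

private:
	std::atomic<double> m_Seconds{0};
};

int main()
{
	AtomicDuration secondsReadingMessages;

	// Pattern from the discussion: start the stopwatch only once the first byte
	// of a message is available, so idle time between messages is not counted.
	auto readStarted = AtomicDuration::Clock::now();
	// ... read and frame-decode one message here ...
	secondsReadingMessages.Add(AtomicDuration::Clock::now() - readStarted);

	std::cout << secondsReadingMessages.GetSeconds() << "\n";
}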
ref/NC/820479