Introduce Endpoint#seconds_{reading_messages,awaiting_semaphore,processing_messages}⏱️ #10266
base: master
Conversation
Force-pushed from 5d1b07f to 92a4f45
Force-pushed from ef2a7d3 to 3ebbfc6
	[no_user_modify, no_storage] double seconds_processing_messages {
		get;
	};
It works! 👍
< HTTP/1.1 200 OK
< Server: Icinga/v2.14.0-375-gecea52568
< Content-Type: application/json
< Content-Length: 591
<
{
"results": [
{
"attrs": {
"seconds_processing_messages": 3.1292e-05,
"seconds_reading_messages": 0.000155709
},
"joins": {},
"meta": {},
"name": "dummy",
"type": "Endpoint"
},
{
"attrs": {
"seconds_processing_messages": 0,
"seconds_reading_messages": 0
},
"joins": {},
"meta": {},
"name": "ws-aklimov7777777.local",
"type": "Endpoint"
}
]
}
Force-pushed from 3ebbfc6 to f845407
Also works! 👍
lib/methods/clusterzonechecktask.cpp
Outdated
new PerfdataValue("sum_bytes_received_per_second", bytesReceivedPerSecond) | ||
new PerfdataValue("sum_bytes_received_per_second", bytesReceivedPerSecond), | ||
new PerfdataValue("sum_seconds_reading_messages", secondsReadingMessages), | ||
new PerfdataValue("sum_seconds_processing_messages", secondsProcessingMessages) |
{
"counter": false,
"crit": null,
"label": "sum_seconds_reading_messages",
"max": null,
"min": null,
"type": "PerfdataValue",
"unit": "",
"value": 0.000787875,
"warn": null
},
{
"counter": false,
"crit": null,
"label": "sum_seconds_processing_messages",
"max": null,
"min": null,
"type": "PerfdataValue",
"unit": "",
"value": 3.45e-05,
"warn": null
}
lib/methods/icingachecktask.cpp
Outdated
perfdata->Add(new PerfdataValue("sum_bytes_sent_per_second", bytesSentPerSecond)); | ||
perfdata->Add(new PerfdataValue("sum_bytes_received_per_second", bytesReceivedPerSecond)); | ||
perfdata->Add(new PerfdataValue("sum_seconds_reading_messages", secondsReadingMessages)); | ||
perfdata->Add(new PerfdataValue("sum_seconds_processing_messages", secondsProcessingMessages)); |
{
"counter": false,
"crit": null,
"label": "sum_seconds_reading_messages",
"max": null,
"min": null,
"type": "PerfdataValue",
"unit": "",
"value": 0.000264792,
"warn": null
},
{
"counter": false,
"crit": null,
"label": "sum_seconds_processing_messages",
"max": null,
"min": null,
"type": "PerfdataValue",
"unit": "",
"value": 5.2666e-05,
"warn": null
}
		get;
	};

	[no_user_modify, no_storage] double seconds_processing_messages {
Colleagues, in addition we could also (just for the API) record seconds_processing_messages PER message type. I.e. there should also be another attribute returning a dict like {"event::CheckResult":42.0,...} saying that this endpoint e.g. spent 42 seconds handling (already read and decoded) event::CheckResult messages since program start.
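As a purely illustrative aside (not part of the PR), the data shape of such a per-message-type attribute could look like the following self-contained C++ sketch; the class name, locking and map type are assumptions of mine, only the example keys come from the suggestion above.

#include <chrono>
#include <iostream>
#include <map>
#include <mutex>
#include <string>

// Hypothetical accumulator for "seconds spent processing messages, per RPC method".
// A real implementation inside Icinga 2 would use its own Dictionary and locking
// primitives; this only demonstrates the proposed shape, e.g. {"event::CheckResult": 42.0, ...}.
class PerMessageProcessingTime
{
public:
	void Add(const std::string& method, std::chrono::duration<double> elapsed)
	{
		std::lock_guard<std::mutex> lock(m_Mutex);
		m_Seconds[method] += elapsed.count();
	}

	std::map<std::string, double> Snapshot() const
	{
		std::lock_guard<std::mutex> lock(m_Mutex);
		return m_Seconds;
	}

private:
	mutable std::mutex m_Mutex;
	std::map<std::string, double> m_Seconds;
};

int main()
{
	PerMessageProcessingTime stats;
	stats.Add("event::CheckResult", std::chrono::duration<double>(0.042));
	stats.Add("event::Heartbeat", std::chrono::duration<double>(0.001));

	for (const auto& [method, seconds] : stats.Snapshot())
		std::cout << method << ": " << seconds << "s\n";
}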
- Yes, there's also Log HTTP/RPC message processing stats #10141, but that's a debug log unless a single message takes excessively long. So, unless combined with Logger#object_filter: restrict Logger to messages referring to specific objects #9844, it's pretty useless in a large environment.
- Log HTTP/RPC message processing stats #10141 (comment) legitimately mentions that we already have logs which warn if the semaphore wait delays message processing noticeably, i.e. you have more busy connections than semaphore slots.
- But if you have enough slots and your messages themselves take long to process (Give PerfdataWriter a 👷♂️WorkQueue like e.g. GraphiteWriter🖋️ #10267?), the suggested aggregation tells you.
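To illustrate why the aggregation is useful even though the attributes are cumulative since program start: the actionable signal is the delta between two samples divided by the sampling interval. A minimal, self-contained sketch (my own illustration, not code from this PR) of turning two samples of seconds_processing_messages into a busy ratio:

#include <chrono>
#include <iostream>

// Given two samples of the cumulative seconds_processing_messages attribute
// (e.g. fetched via /v1/objects/endpoints), compute the fraction of the
// sampling interval spent processing already-read messages.
double ProcessingBusyRatio(double secondsBefore, double secondsAfter, std::chrono::duration<double> interval)
{
	return (secondsAfter - secondsBefore) / interval.count();
}

int main()
{
	using namespace std::chrono_literals;

	// Example: the counter grew by 12.3s within a 60s window, i.e. this
	// endpoint's connections were busy processing messages ~20% of the time.
	std::cout << ProcessingBusyRatio(100.0, 112.3, 60s) << "\n";
}

If that ratio approaches 1 while seconds_awaiting_semaphore stays near zero, the bottleneck is the message handlers themselves rather than the number of semaphore slots.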
Force-pushed from 20460e6 to 055c6b6
Please also rebase; the Actions runs are pretty outdated.
Force-pushed from 67ddc24 to 531991f
Force-pushed from 167e8c0 to 70cee1c
Force-pushed from 70cee1c to 124c2e8
lib/remote/jsonrpcconnection.cpp
Outdated
	// Only once we receive at least one byte, we know there must be a message to read.
	if (!m_Stream->in_avail()) {
		m_Stream->async_fill(yc);
	}

	// Only then we can start measuring the time it takes to read it.
	if (m_Endpoint) {
		readStarted = AtomicDuration::Clock::now();
	}

	jsonString = JsonRpc::ReadMessage(m_Stream, yc, m_Endpoint ? -1 : 1024 * 1024);
this metric would also report an agent connection as slow where there are only a few check messages every 5 minutes and a few heartbeat messages
🎉 I've filtered out those 5 minutes!
It now reports very surprising numbers for me:
$ curl -isSku root:icinga 'https://localhost:5665/v1/objects/endpoints/master-2' -X GET --json '{"pretty":true, "attrs":["seconds_awaiting_semaphore", "seconds_processing_messages", "seconds_reading_messages"]}'
HTTP/1.1 200 OK
Server: Icinga/v2.14.0-470-gc67f8ddf1
Content-Type: application/json
Content-Length: 371
{
"results": [
{
"attrs": {
"seconds_awaiting_semaphore": 0.00343897,
"seconds_processing_messages": 3.792230195,
"seconds_reading_messages": 3.976059603
},
"joins": {},
"meta": {},
"name": "master-2",
"type": "Endpoint"
}
]
}
That's 3.97s of reading versus 3.79s of processing for a local connection between two containers. I would not expect reading the message to be more expensive than parsing (JSON parsing is counted as part of processing, not reading) and handling it. So there could still be something funky with that message, or we do something pretty wrong during reading.
I wouldn't say JSON decoding is "something pretty wrong", but I could imagine it taking about as long as processing.
JSON parsing is counted as part of processing
So JSON parsing does not explain why reading reports such high numbers.
Yes, I've overlooked that elephant. These are my numbers, by the way:
➜ hotspare git:(e241a240a) for i in 10.27.1.87 10.27.1.90 10.27.3.6 10.27.0.129; do curl -ksSu root:123456 -X GET -H 'Accept: application/json' -d '{"attrs":["seconds_awaiting_semaphore","seconds_processing_messages","seconds_reading_messages"]}' "https://${i}:5665/v1/objects/endpoints?pretty=1"; done
{
"results": [
{
"attrs": {
"seconds_awaiting_semaphore": 0,
"seconds_processing_messages": 0,
"seconds_reading_messages": 0
},
"joins": {},
"meta": {},
"name": "akzl1",
"type": "Endpoint"
},
{
"attrs": {
"seconds_awaiting_semaphore": 0.020272539,
"seconds_processing_messages": 4.389837786,
"seconds_reading_messages": 2.188152555
},
"joins": {},
"meta": {},
"name": "akzl2",
"type": "Endpoint"
},
{
"attrs": {
"seconds_awaiting_semaphore": 0.03928695,
"seconds_processing_messages": 8.916709728,
"seconds_reading_messages": 2.046101043
},
"joins": {},
"meta": {},
"name": "akzl3",
"type": "Endpoint"
},
{
"attrs": {
"seconds_awaiting_semaphore": 0.00137814,
"seconds_processing_messages": 0.066405319,
"seconds_reading_messages": 0.037797953
},
"joins": {},
"meta": {},
"name": "akzl4",
"type": "Endpoint"
}
]
}
{
"results": [
{
"attrs": {
"seconds_awaiting_semaphore": 0.050616719,
"seconds_processing_messages": 12.356168013,
"seconds_reading_messages": 7.742184025
},
"joins": {},
"meta": {},
"name": "akzl1",
"type": "Endpoint"
},
{
"attrs": {
"seconds_awaiting_semaphore": 0,
"seconds_processing_messages": 0,
"seconds_reading_messages": 0
},
"joins": {},
"meta": {},
"name": "akzl2",
"type": "Endpoint"
},
{
"attrs": {
"seconds_awaiting_semaphore": 0.001335411,
"seconds_processing_messages": 0.068493356,
"seconds_reading_messages": 0.034790134
},
"joins": {},
"meta": {},
"name": "akzl4",
"type": "Endpoint"
},
{
"attrs": {
"seconds_awaiting_semaphore": 0.0013662,
"seconds_processing_messages": 0.07003537,
"seconds_reading_messages": 0.036346842
},
"joins": {},
"meta": {},
"name": "akzl3",
"type": "Endpoint"
}
]
}
{
"results": [
{
"attrs": {
"seconds_awaiting_semaphore": 0.001642797,
"seconds_processing_messages": 0.071626259,
"seconds_reading_messages": 0.041599371
},
"joins": {},
"meta": {},
"name": "akzl2",
"type": "Endpoint"
},
{
"attrs": {
"seconds_awaiting_semaphore": 0,
"seconds_processing_messages": 0,
"seconds_reading_messages": 0
},
"joins": {},
"meta": {},
"name": "akzl3",
"type": "Endpoint"
},
{
"attrs": {
"seconds_awaiting_semaphore": 0.0015372369999999999,
"seconds_processing_messages": 0.072465629,
"seconds_reading_messages": 0.052499105
},
"joins": {},
"meta": {},
"name": "akzl1",
"type": "Endpoint"
},
{
"attrs": {
"seconds_awaiting_semaphore": 0.022459207,
"seconds_processing_messages": 4.591767845,
"seconds_reading_messages": 2.317401209
},
"joins": {},
"meta": {},
"name": "akzl4",
"type": "Endpoint"
}
]
}
{
"results": [
{
"attrs": {
"seconds_awaiting_semaphore": 0.001503529,
"seconds_processing_messages": 0.071794129,
"seconds_reading_messages": 0.049342663
},
"joins": {},
"meta": {},
"name": "akzl2",
"type": "Endpoint"
},
{
"attrs": {
"seconds_awaiting_semaphore": 0.001511789,
"seconds_processing_messages": 0.070537203,
"seconds_reading_messages": 0.049436677
},
"joins": {},
"meta": {},
"name": "akzl1",
"type": "Endpoint"
},
{
"attrs": {
"seconds_awaiting_semaphore": 0,
"seconds_processing_messages": 0,
"seconds_reading_messages": 0
},
"joins": {},
"meta": {},
"name": "akzl4",
"type": "Endpoint"
},
{
"attrs": {
"seconds_awaiting_semaphore": 0.019803705,
"seconds_processing_messages": 4.484441092,
"seconds_reading_messages": 3.16510494
},
"joins": {},
"meta": {},
"name": "akzl3",
"type": "Endpoint"
}
]
}
➜ hotspare git:(e241a240a)
I've added a fourth metric and ran (echo -n '2:{},';sleep 55;echo -n '2:{},'; sleep 80000) |openssl s_client -connect 127.0.0.1:5665 -cert prefix/var/lib/icinga2/certs/ws-aklimov7777777.local.crt -key prefix/var/lib/icinga2/certs/ws-aklimov7777777.local.key
against my local node:
"seconds_awaiting_messages": 54.949417417,
"seconds_awaiting_semaphore": 2.541e-06,
"seconds_processing_messages": 9.9251e-05,
"seconds_reading_messages": 0.000117417,
So yes, my numbers are correct: seconds_reading_messages covers just the reading itself, and especially not waiting for the next message to appear on the horizon (the 55s show up in seconds_awaiting_messages).
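For context on the probe: '2:{},' is a netstring frame, i.e. the payload length in decimal, a colon, the payload and a trailing comma, so it carries nothing but an empty JSON object. A stand-alone sketch of that framing (my own illustration, not Icinga 2's actual NetString code):

#include <iostream>
#include <string>

// Netstring framing as used by the "2:{}," probes above:
// "<decimal length>:<payload>," -- so "2:{}," carries the two-byte payload "{}".
std::string EncodeNetString(const std::string& payload)
{
	return std::to_string(payload.size()) + ":" + payload + ",";
}

int main()
{
	std::cout << EncodeNetString("{}") << "\n";                 // prints 2:{},
	std::cout << EncodeNetString(R"({"method":"x"})") << "\n";  // prints 14:{"method":"x"},
}

Reading such a tiny frame takes microseconds, which is why the 55 seconds between the two probes end up in seconds_awaiting_messages rather than in seconds_reading_messages.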
That might be a useful test for you, but what is the point of actually adding it to the PR?
It expresses how idle stuff is and adds up to 100% of the uptime together with the other numbers. Or, to put it in David Kriesel's words (CCC, "SpiegelMining"):
Raw data is cool af! Keep all of it unless your storage is limited.
But I can drop it again if you insist.
It expresses how idle stuff is and adds up to 100% of the uptime together with the other numbers.
Doesn't that even make it redundant then?
Primarily, I find it tedious that you keep changing the scope of the PR without proper communication. I mean, this started roughly as "we want to know the load factor of JSON-RPC connections", for which a single metric would have been sufficient. You created a PR adding three metrics without further explanation of what they do (hoping the names would be fully self-explanatory?), of which one turned out to be suboptimal (#10266 (comment)), after which we said to remove it (that happened offline, so I can't link to it unfortunately). Then instead, you pushed a version where that metric is collected differently, without much explanation (#10266 (comment)). And now this is another iteration of "I've pushed something new to the PR that wasn't discussed or explained beforehand, go figure out yourself what it does and if/why it makes sense to do this".
This isn't even about whether these metrics would be good or bad; it's just that repeatedly changing the scope of the PR, combined with the lack of proper communication and explanations, drags the whole thing out unnecessarily.
But I can drop it again if you insist.
So please remove it so that we have a chance of getting this done any time soon.
It expresses how idle stuff is
please remove it
Done.
Force-pushed from 124c2e8 to e6814c8
Force-pushed from f461342 to c67f8dd
Force-pushed from c67f8dd to 8f76f7f
Force-pushed from 8f76f7f to d63e728
Force-pushed from d63e728 to 8d2c0da
Force-pushed from 8d2c0da to 8808f44
…ssing_messages} for the API
Force-pushed from 8808f44 to 7de13b6
Generally it looks fine to me now. However, the seconds_reading_messages metric is useless IMO, as it just indicates the time needed to read the message from the Asio-internal buffer. This isn't a change request though: since you already have an unresolved thread about this with @julianbrost, do whatever you have agreed on with him.
If your uplink is slow enough, it actually does indicate the reading time. But yes, I can start that stopwatch only after I have at least one byte of the message; otherwise, seconds_reading_messages would also contain idle time.
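For readers following this thread, a minimal, self-contained sketch of the stopwatch pattern being described: wait for the first byte, then measure only the read itself and add the elapsed time to a shared accumulator. The AtomicDuration name is borrowed from the code excerpt above; this implementation is an assumption of mine, not the PR's actual class:

#include <atomic>
#include <chrono>
#include <iostream>

// Assumed stand-in for the AtomicDuration type referenced in the excerpt above:
// an accumulator of elapsed seconds that several connections can add to concurrently.
class AtomicDuration
{
public:
	using Clock = std::chrono::steady_clock;

	void Add(Clock::duration elapsed)
	{
		double seconds = std::chrono::duration<double>(elapsed).count();
		double current = m_Seconds.load(std::memory_order_relaxed);

		// Retry until the addition is applied atomically.
		while (!m_Seconds.compare_exchange_weak(current, current + seconds))
			;
	}

	double GetSeconds() const
	{
		return m_Seconds.load(std::memory_order_relaxed);
	}

private:
	std::atomic<double> m_Seconds{0};
};

int main()
{
	AtomicDuration secondsReadingMessages;

	// Pattern from the discussion: start the stopwatch only once the first byte
	// of a message is available, so idle time between messages is not counted.
	auto readStarted = AtomicDuration::Clock::now();
	// ... read and frame-decode one message here ...
	secondsReadingMessages.Add(AtomicDuration::Clock::now() - readStarted);

	std::cout << secondsReadingMessages.GetSeconds() << "\n";
}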
ref/NC/820479