Skip to content

Conversation

Al2Klimov
Copy link
Member

@Al2Klimov Al2Klimov commented Dec 10, 2024

@Al2Klimov Al2Klimov self-assigned this Dec 10, 2024
@cla-bot cla-bot bot added the cla/signed label Dec 10, 2024
@Al2Klimov Al2Klimov changed the title Introduce Benchmark* Introduce Benchmark Dec 10, 2024
@Al2Klimov Al2Klimov force-pushed the json-rpc-read-process-metrics branch 2 times, most recently from 5d1b07f to 92a4f45 Compare December 10, 2024 14:50
@Al2Klimov Al2Klimov changed the title Introduce Benchmark Benchmark message reading/processing time per endpoint Dec 10, 2024
@Al2Klimov Al2Klimov force-pushed the json-rpc-read-process-metrics branch from ef2a7d3 to 3ebbfc6 Compare December 10, 2024 15:34

[no_user_modify, no_storage] double seconds_processing_messages {
get;
};
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works! 👍

< HTTP/1.1 200 OK
< Server: Icinga/v2.14.0-375-gecea52568
< Content-Type: application/json
< Content-Length: 591
<
{
    "results": [
        {
            "attrs": {
                "seconds_processing_messages": 3.1292e-05,
                "seconds_reading_messages": 0.000155709
            },
            "joins": {},
            "meta": {},
            "name": "dummy",
            "type": "Endpoint"
        },
        {
            "attrs": {
                "seconds_processing_messages": 0,
                "seconds_reading_messages": 0
            },
            "joins": {},
            "meta": {},
            "name": "ws-aklimov7777777.local",
            "type": "Endpoint"
        }
    ]
}

@Al2Klimov Al2Klimov force-pushed the json-rpc-read-process-metrics branch from 3ebbfc6 to f845407 Compare December 10, 2024 16:01
Copy link
Member Author

@Al2Klimov Al2Klimov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also works! 👍

new PerfdataValue("sum_bytes_received_per_second", bytesReceivedPerSecond)
new PerfdataValue("sum_bytes_received_per_second", bytesReceivedPerSecond),
new PerfdataValue("sum_seconds_reading_messages", secondsReadingMessages),
new PerfdataValue("sum_seconds_processing_messages", secondsProcessingMessages)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

                        {
                            "counter": false,
                            "crit": null,
                            "label": "sum_seconds_reading_messages",
                            "max": null,
                            "min": null,
                            "type": "PerfdataValue",
                            "unit": "",
                            "value": 0.000787875,
                            "warn": null
                        },
                        {
                            "counter": false,
                            "crit": null,
                            "label": "sum_seconds_processing_messages",
                            "max": null,
                            "min": null,
                            "type": "PerfdataValue",
                            "unit": "",
                            "value": 3.45e-05,
                            "warn": null
                        }

perfdata->Add(new PerfdataValue("sum_bytes_sent_per_second", bytesSentPerSecond));
perfdata->Add(new PerfdataValue("sum_bytes_received_per_second", bytesReceivedPerSecond));
perfdata->Add(new PerfdataValue("sum_seconds_reading_messages", secondsReadingMessages));
perfdata->Add(new PerfdataValue("sum_seconds_processing_messages", secondsProcessingMessages));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

                        {
                            "counter": false,
                            "crit": null,
                            "label": "sum_seconds_reading_messages",
                            "max": null,
                            "min": null,
                            "type": "PerfdataValue",
                            "unit": "",
                            "value": 0.000264792,
                            "warn": null
                        },
                        {
                            "counter": false,
                            "crit": null,
                            "label": "sum_seconds_processing_messages",
                            "max": null,
                            "min": null,
                            "type": "PerfdataValue",
                            "unit": "",
                            "value": 5.2666e-05,
                            "warn": null
                        }

@Al2Klimov Al2Klimov removed their assignment Dec 10, 2024
@Al2Klimov Al2Klimov marked this pull request as ready for review December 10, 2024 16:02
@Al2Klimov Al2Klimov added enhancement New feature or request area/distributed Distributed monitoring (master, satellites, clients) area/api REST API area/checks Check execution and results labels Dec 10, 2024
get;
};

[no_user_modify, no_storage] double seconds_processing_messages {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Colleagues, addition we could also (just for the API) record seconds_processing_messages PER message. I.e. there should be also another attribute returning a dict like {"event::CheckResult":42.0,... saying that this endpoint e.g. spent 42 seconds handling (already read and decoded) event::CheckResult messages since program start.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

@Al2Klimov Al2Klimov Dec 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Al2Klimov Al2Klimov force-pushed the json-rpc-read-process-metrics branch 2 times, most recently from 20460e6 to 055c6b6 Compare December 11, 2024 09:41
@Al2Klimov Al2Klimov changed the title Benchmark message reading/processing time per endpoint Introduce Endpoint#seconds_{reading_messages,awaiting_semaphore,processing_messages} Dec 11, 2024
@Al2Klimov Al2Klimov requested a review from yhabteab December 16, 2024 11:33
@yhabteab yhabteab added this to the 2.15.0 milestone Jan 13, 2025
@Al2Klimov Al2Klimov removed this from the 2.15.0 milestone Jan 21, 2025
Copy link
Contributor

@julianbrost julianbrost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also rebase, the Actions are pretty outdated.

@Al2Klimov Al2Klimov force-pushed the json-rpc-read-process-metrics branch 2 times, most recently from 67ddc24 to 531991f Compare March 10, 2025 10:20
@Al2Klimov Al2Klimov force-pushed the json-rpc-read-process-metrics branch from 167e8c0 to 70cee1c Compare March 11, 2025 15:45
@Al2Klimov Al2Klimov self-assigned this Mar 17, 2025
@Al2Klimov Al2Klimov force-pushed the json-rpc-read-process-metrics branch from 70cee1c to 124c2e8 Compare March 17, 2025 09:26
Comment on lines 77 to 87
// Only once we receive at least one byte, we know there must be a message to read.
if (!m_Stream->in_avail()) {
m_Stream->async_fill(yc);
}

// Only then we can start measuring the time it takes to read it.
if (m_Endpoint) {
readStarted = AtomicDuration::Clock::now();
}

jsonString = JsonRpc::ReadMessage(m_Stream, yc, m_Endpoint ? -1 : 1024 * 1024);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this metric would also report an agent connection as slow where there's a few check messages every 5 minutes and a few heartbeat messages.

🎉

I've filtered out those 5 minutes!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It now reports very surprising numbers for me:

$ curl -isSku root:icinga 'https://localhost:5665/v1/objects/endpoints/master-2' -X GET --json '{"pretty":true, "attrs":["seconds_awaiting_semaphore", "seconds_processing_messages", "seconds_reading_messages"]}'
HTTP/1.1 200 OK
Server: Icinga/v2.14.0-470-gc67f8ddf1
Content-Type: application/json
Content-Length: 371

{
    "results": [
        {
            "attrs": {
                "seconds_awaiting_semaphore": 0.00343897,
                "seconds_processing_messages": 3.792230195,
                "seconds_reading_messages": 3.976059603
            },
            "joins": {},
            "meta": {},
            "name": "master-2",
            "type": "Endpoint"
        }
    ]
}

That's 3.97s reading for 3.79s processing for a local connection between two containers, I would not expect reading the message to be more expensive than parsing (JSON parsing is counted as part of processing, not reading) and handling it. So there could still be something funky with that message or we do something pretty wrong during reading.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd not say JSON decoding is "something pretty wrong", but I could imagine it taking comparably as long as processing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JSON parsing is counted as part of processing

So JSON parsing does not explain why reading reports such high numbers.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I've overseen that elephant. These are my numbers btw.:

➜  hotspare git:(e241a240a) for i in 10.27.1.87 10.27.1.90 10.27.3.6  10.27.0.129; do curl -ksSu root:123456 -X GET -H 'Accept: application/json' -d '{"attrs":["seconds_awaiting_semaphore","seconds_processing_messages","seconds_reading_messages"]}' "https://${i}:5665/v1/objects/endpoints?pretty=1"; done
{
    "results": [
        {
            "attrs": {
                "seconds_awaiting_semaphore": 0,
                "seconds_processing_messages": 0,
                "seconds_reading_messages": 0
            },
            "joins": {},
            "meta": {},
            "name": "akzl1",
            "type": "Endpoint"
        },
        {
            "attrs": {
                "seconds_awaiting_semaphore": 0.020272539,
                "seconds_processing_messages": 4.389837786,
                "seconds_reading_messages": 2.188152555
            },
            "joins": {},
            "meta": {},
            "name": "akzl2",
            "type": "Endpoint"
        },
        {
            "attrs": {
                "seconds_awaiting_semaphore": 0.03928695,
                "seconds_processing_messages": 8.916709728,
                "seconds_reading_messages": 2.046101043
            },
            "joins": {},
            "meta": {},
            "name": "akzl3",
            "type": "Endpoint"
        },
        {
            "attrs": {
                "seconds_awaiting_semaphore": 0.00137814,
                "seconds_processing_messages": 0.066405319,
                "seconds_reading_messages": 0.037797953
            },
            "joins": {},
            "meta": {},
            "name": "akzl4",
            "type": "Endpoint"
        }
    ]
}
{
    "results": [
        {
            "attrs": {
                "seconds_awaiting_semaphore": 0.050616719,
                "seconds_processing_messages": 12.356168013,
                "seconds_reading_messages": 7.742184025
            },
            "joins": {},
            "meta": {},
            "name": "akzl1",
            "type": "Endpoint"
        },
        {
            "attrs": {
                "seconds_awaiting_semaphore": 0,
                "seconds_processing_messages": 0,
                "seconds_reading_messages": 0
            },
            "joins": {},
            "meta": {},
            "name": "akzl2",
            "type": "Endpoint"
        },
        {
            "attrs": {
                "seconds_awaiting_semaphore": 0.001335411,
                "seconds_processing_messages": 0.068493356,
                "seconds_reading_messages": 0.034790134
            },
            "joins": {},
            "meta": {},
            "name": "akzl4",
            "type": "Endpoint"
        },
        {
            "attrs": {
                "seconds_awaiting_semaphore": 0.0013662,
                "seconds_processing_messages": 0.07003537,
                "seconds_reading_messages": 0.036346842
            },
            "joins": {},
            "meta": {},
            "name": "akzl3",
            "type": "Endpoint"
        }
    ]
}
{
    "results": [
        {
            "attrs": {
                "seconds_awaiting_semaphore": 0.001642797,
                "seconds_processing_messages": 0.071626259,
                "seconds_reading_messages": 0.041599371
            },
            "joins": {},
            "meta": {},
            "name": "akzl2",
            "type": "Endpoint"
        },
        {
            "attrs": {
                "seconds_awaiting_semaphore": 0,
                "seconds_processing_messages": 0,
                "seconds_reading_messages": 0
            },
            "joins": {},
            "meta": {},
            "name": "akzl3",
            "type": "Endpoint"
        },
        {
            "attrs": {
                "seconds_awaiting_semaphore": 0.0015372369999999999,
                "seconds_processing_messages": 0.072465629,
                "seconds_reading_messages": 0.052499105
            },
            "joins": {},
            "meta": {},
            "name": "akzl1",
            "type": "Endpoint"
        },
        {
            "attrs": {
                "seconds_awaiting_semaphore": 0.022459207,
                "seconds_processing_messages": 4.591767845,
                "seconds_reading_messages": 2.317401209
            },
            "joins": {},
            "meta": {},
            "name": "akzl4",
            "type": "Endpoint"
        }
    ]
}
{
    "results": [
        {
            "attrs": {
                "seconds_awaiting_semaphore": 0.001503529,
                "seconds_processing_messages": 0.071794129,
                "seconds_reading_messages": 0.049342663
            },
            "joins": {},
            "meta": {},
            "name": "akzl2",
            "type": "Endpoint"
        },
        {
            "attrs": {
                "seconds_awaiting_semaphore": 0.001511789,
                "seconds_processing_messages": 0.070537203,
                "seconds_reading_messages": 0.049436677
            },
            "joins": {},
            "meta": {},
            "name": "akzl1",
            "type": "Endpoint"
        },
        {
            "attrs": {
                "seconds_awaiting_semaphore": 0,
                "seconds_processing_messages": 0,
                "seconds_reading_messages": 0
            },
            "joins": {},
            "meta": {},
            "name": "akzl4",
            "type": "Endpoint"
        },
        {
            "attrs": {
                "seconds_awaiting_semaphore": 0.019803705,
                "seconds_processing_messages": 4.484441092,
                "seconds_reading_messages": 3.16510494
            },
            "joins": {},
            "meta": {},
            "name": "akzl3",
            "type": "Endpoint"
        }
    ]
}
➜  hotspare git:(e241a240a)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a fourth metric and ran (echo -n '2:{},';sleep 55;echo -n '2:{},'; sleep 80000) |openssl s_client -connect 127.0.0.1:5665 -cert prefix/var/lib/icinga2/certs/ws-aklimov7777777.local.crt -key prefix/var/lib/icinga2/certs/ws-aklimov7777777.local.key against my local node:

            "seconds_awaiting_messages": 54.949417417,
            "seconds_awaiting_semaphore": 2.541e-06,
            "seconds_processing_messages": 9.9251e-05,
            "seconds_reading_messages": 0.000117417,

So yes, my numbers are correct. seconds_reading_messages is just reading and especially not waiting for the next message to appear on the horizon (55s, seconds_awaiting_messages).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might be a useful test for you, but what is the point for actually adding it to the PR?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It expresses how idle stuff is and add up to 100% of the uptime together with the other numbers. Or, to say it with David Kriesel (CCC, "SpiegelMining"):

Raw data is cool af! Keep all of it unless your storage is limited.

But I can drop it again if you insist.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It expresses how idle stuff is and add up to 100% of the uptime together with the other numbers.

Doesn't that even make it redundant then?

Primarily, I find it tedious that you keep changing the scope of the PR without proper communication. I mean this started roughly as "we want to know the load factor of JSON-RPC connections" for which a single metric would have been sufficient. You created a PR adding three metrics without further explanation what they do (hoping the names would be fully self-explanatory?), of which one turned out to be suboptimal (#10266 (comment)), after which we said to remove it (that happened offline, so I can't link to it unfortunately). Then instead, you pushed a version where that metric is collected differently without much explanation (#10266 (comment)). And now this is another iteration of "I've pushed something new to the PR that wasn't discussed before or is explained, go figure it out yourself what it does and if/why it makes sense to do this".

I mean this isn't even about whether these metrics would be good or bad, it's just that repeated change the scope of the PR combined with the lack of proper communication/explanations just drags the whole thing out unnecessarily.

But I can drop it again if you insist.

So please remove it so that we have a chance of getting this done any time soon.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It expresses how idle stuff is

please remove it

Done.

@Al2Klimov Al2Klimov removed their assignment Mar 17, 2025
@Al2Klimov Al2Klimov force-pushed the json-rpc-read-process-metrics branch from 124c2e8 to e6814c8 Compare March 17, 2025 10:27
@Al2Klimov Al2Klimov requested a review from julianbrost March 20, 2025 16:01
@Al2Klimov Al2Klimov force-pushed the json-rpc-read-process-metrics branch 2 times, most recently from f461342 to c67f8dd Compare March 21, 2025 09:06
@Al2Klimov Al2Klimov force-pushed the json-rpc-read-process-metrics branch from c67f8dd to 8f76f7f Compare April 3, 2025 14:13
@Al2Klimov Al2Klimov changed the title Introduce Endpoint#seconds_{reading_messages,awaiting_semaphore,processing_messages} Introduce Endpoint#seconds_{reading_messages,awaiting_semaphore,processing_messages}⏱️ Apr 3, 2025
@Al2Klimov Al2Klimov changed the title Introduce Endpoint#seconds_{reading_messages,awaiting_semaphore,processing_messages}⏱️ Introduce Endpoint#seconds_{awaiting_messages,reading_messages,awaiting_semaphore,processing_messages}⏱️ Apr 3, 2025
@Al2Klimov Al2Klimov changed the title Introduce Endpoint#seconds_{awaiting_messages,reading_messages,awaiting_semaphore,processing_messages}⏱️ Introduce Endpoint#seconds_{reading_messages,awaiting_semaphore,processing_messages}⏱️ Apr 4, 2025
@Al2Klimov Al2Klimov self-assigned this Apr 4, 2025
@Al2Klimov Al2Klimov force-pushed the json-rpc-read-process-metrics branch from 8f76f7f to d63e728 Compare April 7, 2025 08:27
@Al2Klimov Al2Klimov removed their assignment Apr 7, 2025
@Al2Klimov Al2Klimov force-pushed the json-rpc-read-process-metrics branch from d63e728 to 8d2c0da Compare August 15, 2025 09:23
@Al2Klimov Al2Klimov force-pushed the json-rpc-read-process-metrics branch from 8d2c0da to 8808f44 Compare August 26, 2025 14:16
@Al2Klimov Al2Klimov force-pushed the json-rpc-read-process-metrics branch from 8808f44 to 7de13b6 Compare August 26, 2025 16:21
Copy link
Member

@yhabteab yhabteab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally it looks fine to me now. However, the seconds_reading_messages metric is useless IMO as this just indicates the time needed to read the message from the Asio internal buffer. This isn't a change request though, since you already have an unresolved thread about this with @julianbrost, do whatever you have agreed on with him.

@Al2Klimov
Copy link
Member Author

seconds_reading_messages metric is useless IMO as this just indicates the time needed to read the message from the Asio internal buffer

If your uplink is slow enough, it actually indicates the reading time.

But yes, I can start that stopwatch only after I have at least one byte of the message. Otherwise, seconds_reading_messages would also contain idle time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/api REST API area/checks Check execution and results area/distributed Distributed monitoring (master, satellites, clients) cla/signed enhancement New feature or request ref/NC
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants