julianbrost
Contributor

So far, calling Checkable::IsReachable() traversed all possible paths to its parents. If a parent was reachable via multiple paths, all of its parents were evaluated multiple times, resulting in worst-case exponential complexity.

With this PR, the implementation keeps track of which checkables were already visited and reuses the already-computed reachability instead of repeating the computation, ensuring a worst-case runtime that is linear in the size of the graph.
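The caching idea can be sketched in isolation. The following is a minimal C++ illustration, not the code from this PR: `Node`, `ReachabilityChecker` and `BuildLayeredGraph` are hypothetical names, and reachability is simplified to "every parent is up and itself reachable", ignoring redundancy groups and the other dependency semantics of Icinga 2.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a checkable: its own state plus its parents.
struct Node {
    bool up = true;
    std::vector<const Node*> parents;
};

// Evaluates reachability recursively, optionally remembering results,
// and counts how many node evaluations were actually performed.
struct ReachabilityChecker {
    bool useCache;
    std::size_t evaluations = 0;
    std::unordered_map<const Node*, bool> cache;

    explicit ReachabilityChecker(bool enableCache) : useCache(enableCache) {}

    bool IsReachable(const Node* node) {
        assert(node != nullptr);
        if (useCache) {
            auto it = cache.find(node);
            if (it != cache.end())
                return it->second; // already computed via another path
        }
        ++evaluations;
        bool reachable = true;
        for (const Node* parent : node->parents) {
            if (!parent->up || !IsReachable(parent)) {
                reachable = false;
                break;
            }
        }
        if (useCache)
            cache.emplace(node, reachable);
        return reachable;
    }
};

// Builds a layered graph with two nodes per level, where each node
// depends on both nodes of the level below (node 2*i+j -> 2*(i-1)+k).
std::vector<Node> BuildLayeredGraph(std::size_t levels) {
    std::vector<Node> nodes(2 * levels);
    for (std::size_t i = 1; i < levels; ++i)
        for (std::size_t j = 0; j < 2; ++j)
            for (std::size_t k = 0; k < 2; ++k)
                nodes[2 * i + j].parents.push_back(&nodes[2 * (i - 1) + k]);
    return nodes;
}
```

In this simplified model, evaluating one top-level node of an n-level graph of this shape takes 2^n - 1 evaluations without the cache and 2n - 1 with it. For 25 levels that is 33,554,431 versus 49 evaluations per call, which is only an indication of the order of magnitude, not a measurement of the real implementation.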

Two additional commits prepare for this change:

  • Add T::ConstPtr as a typedef for intrusive_ptr<const T>, together with the changes needed to support it (changing some functions to accept a const Object* instead of an Object*, and marking the reference counter attribute as mutable). This was done because Checkable::IsReachable() is const, i.e. `this` is of type const Checkable*, so a Checkable::Ptr can't be constructed from it, only the new Checkable::ConstPtr.
  • Move the actual implementation of Checkable::IsReachable() and DependencyGroup::GetState() to a new helper class. The original methods are kept and now transparently call the ones in the helper class. This lets both functions access common storage (i.e. the cache needed for this change) without having to pass the cache explicitly through a public interface or resort to thread_local.
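To illustrate why the reference counter has to be mutable, here is a minimal, self-contained sketch of an intrusive pointer to a const object. It does not use Boost and is not the Icinga 2 code: `IntrusivePtr`, `AddRef`, `Release`, `RefCount` and `Self` are hypothetical names chosen for this example only.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical base class carrying the reference counter.
class Object {
public:
    Object() = default;
    Object(const Object&) = delete;
    Object& operator=(const Object&) = delete;
    virtual ~Object() = default;

    std::size_t RefCount() const { return m_References; }

private:
    // mutable: taking a reference to a const object must still be able
    // to update the counter.
    mutable std::size_t m_References = 0;

    friend void AddRef(const Object* o);
    friend void Release(const Object* o);
};

// Both helpers accept a const Object*, so they work for T and const T.
void AddRef(const Object* o) { ++o->m_References; }
void Release(const Object* o) {
    if (--o->m_References == 0)
        delete o; // deleting through a pointer to const is valid C++
}

// Minimal intrusive pointer; T may be const-qualified.
template<typename T>
class IntrusivePtr {
public:
    explicit IntrusivePtr(T* p = nullptr) : m_Ptr(p) { if (m_Ptr) AddRef(m_Ptr); }
    IntrusivePtr(const IntrusivePtr& other) : m_Ptr(other.m_Ptr) { if (m_Ptr) AddRef(m_Ptr); }
    IntrusivePtr& operator=(const IntrusivePtr&) = delete;
    ~IntrusivePtr() { if (m_Ptr) Release(m_Ptr); }

    T* operator->() const { return m_Ptr; }
    T* get() const { return m_Ptr; }

private:
    T* m_Ptr;
};

class Checkable : public Object {
public:
    using Ptr = IntrusivePtr<Checkable>;
    using ConstPtr = IntrusivePtr<const Checkable>;

    // Inside a const member function, `this` is a const Checkable*,
    // so only a ConstPtr can be constructed from it.
    ConstPtr Self() const { return ConstPtr(this); }
};
```

With a non-mutable counter, `AddRef` could not take a `const Object*`, and `Self()` would need a `const_cast` to hand out a counted reference.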

Test

I've prepared a config file that creates the following dependency graph, which is pretty much the worst case for the old implementation. Note that this diagram is reduced to 4 levels; the actual config generates 25 levels, which makes the issue much more obvious when running it.

graph TD;
    2-->0;
    2-->1;
    3-->0;
    3-->1;
    4-->2;
    4-->3;
    5-->2;
    5-->3;
    6-->4;
    6-->5;
    7-->4;
    7-->5;
bobby-little-dependencies.conf
include "/etc/icinga2/icinga2.conf"

var h = "bobby-little-dependencies"
object Host h {
	check_command = "dummy"
}

for (var i in range(25)) { // change this number to scale the depth of the dependency graph
	for (var j in range(2)) {
		var s = 2*i + j 
		object Service s use (h) {
			host_name = h
			check_command = "dummy"
			check_interval = 10s
			retry_interval = 10s
			vars.dummy_text = {{
				log(macro("checking $host.name$!$service.name$"))
			}}
		}
		if (i > 0) {
			for (var k in range(2)) {
				var d = 2*(i-1) + k
				// Log each edge in Mermaid syntax so the generated graph can be visualized.
				log(String(s) + "-->" + String(d) + ";")
				object Dependency d use (s, d, h) {
					parent_host_name = h
					parent_service_name = d
					child_host_name = h
					child_service_name = s
				}
			}
		}
	}
}

This can easily be started in a container:

docker run --rm -it -v $(pwd)/bobby-little-dependencies.conf:/icinga2.conf:ro icinga/icinga2:dev icinga2 daemon -c /icinga2.conf

Before

Heavy CPU usage (fluctuating somewhere between 200% and 800% on my machine), checks aren't executed every 10s as configured.

After

Almost no CPU usage (mostly idle, peaks of around 2% on my machine), regular check execution.

ref/IP/59145
ref/IP/59671

@julianbrost julianbrost added bug Something isn't working core/quality Improve code, libraries, algorithms, inline docs ref/IP area/runtime Downtimes, comments, dependencies, events labels Jul 28, 2025
@cla-bot cla-bot bot added the cla/signed label Jul 28, 2025
Contributor

@jschmidt-icinga jschmidt-icinga left a comment


I've tested this with the given config snippet and verified that this indeed fixes the exponential CPU-time use by the master-branch code. It's still not perfect, I now get ~2% usage vs. ~0.5% with about the same amount of objects (dependencies, hosts and services) in a flat hierarchy, but maybe that can't be avoided anyway, I don't know...

@julianbrost
Contributor Author

> now get ~2% usage vs. ~0.5% with about the same amount of objects (dependencies, hosts and services) in a flat hierarchy

What exactly do you mean by flat? I'd expect the most comparable result from a long chain of dependencies (i.e. 0 -> 1 -> 2 -> 3 -> ...) with the same number of checkables (i.e. length being twice the depth of the graph in my example), though that will probably still be a bit cheaper as you only have half the Dependency objects which are iterated over when determining the reachability.

@jschmidt-icinga
Contributor

> What exactly do you mean by flat?

Same number of objects total, but each service depends on a single host (i.e. the one host).

> I'd expect the most comparable result from a long chain of dependencies (i.e. 0 -> 1 -> 2 -> 3 -> ...) with the same number of checkables

That's probably it.

This allows using ref-counted pointers to const objects. Adds a second typedef
so that T::ConstPtr can be used similar to how T::Ptr currently is.
@julianbrost julianbrost force-pushed the dependency-eval-complexity branch 2 times, most recently from 05c92c8 to 0f6185a Compare July 30, 2025 15:20
@julianbrost julianbrost requested a review from yhabteab July 31, 2025 08:14
Member

@yhabteab yhabteab left a comment


Before:

Bildschirmaufnahme.2025-07-31.um.10.15.07.mov

After:

Bildschirmaufnahme.2025-07-31.um.10.23.12.mov

Checkable::IsReachable() and DependencyGroup::GetState() call each other
recursively. Moving them to a common helper class allows adding caching to them
in a later commit without having to pass a cache between the functions (through
a public interface) or resorting to thread_local variables.

So far, calling Checkable::IsReachable() traversed all possible paths to its
parents. If a parent was reachable via multiple paths, all of its parents
were evaluated multiple times, resulting in worst-case exponential complexity.

With this commit, the implementation keeps track of which checkables were
already visited and reuses the already-computed reachability instead of repeating
the computation, ensuring a worst-case runtime that is linear in the size of the
graph.
@yhabteab yhabteab merged commit 1f92ec6 into master Aug 5, 2025
29 checks passed
@yhabteab yhabteab deleted the dependency-eval-complexity branch August 5, 2025 09:57