Queue immediate reconciliation on kustomization dependency #1412
base: main
Conversation
matheuscscp left a comment
Thanks very much @fogninid! This contribution will be a really good one!!
matheuscscp left a comment
Great! We should think about writing a test for this somehow. After fixing these comments I will run some manual tests myself 👍
Can you give me some pointers on how this could be tested better? I would really like to have automated checks for the error cases as well, especially for the pathological one of cyclic dependencies, but those seem realistic only as e2e tests with a fairly complex setup and a long run time.
Would this cause multiple reconciliations of the same object, given that we already add the object to the queue here?
Then, in this PR, if the dependency resolves faster, we add the object to the queue a second time.
You are right, that re-queuing is not necessary anymore: I removed it.
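For readers following along, here is a minimal sketch of the mechanism under discussion, not the PR's actual diff: with controller-runtime v0.15+ the controller can watch Kustomizations in their role as dependencies and map each event to the objects that depend on them, so a dependent is enqueued as soon as its dependency changes instead of on the next timed requeue. The KustomizationReconciler stub, the requestsForDependents name, and the full-list lookup are illustrative assumptions.

package controller

import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	kustomizev1 "github.com/fluxcd/kustomize-controller/api/v1"
)

// KustomizationReconciler is pared down to what this sketch needs.
type KustomizationReconciler struct {
	client.Client
}

func (r *KustomizationReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Reconciliation logic elided; only the watch wiring matters here.
	return ctrl.Result{}, nil
}

func (r *KustomizationReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&kustomizev1.Kustomization{}).
		// Watch Kustomizations a second time, in their role as dependencies,
		// and map each event to the objects that depend on them. A predicate
		// restricting this to readiness/revision transitions would normally
		// be added as well.
		Watches(
			&kustomizev1.Kustomization{},
			handler.EnqueueRequestsFromMapFunc(r.requestsForDependents),
		).
		Complete(r)
}

// requestsForDependents returns a request for every Kustomization that lists
// the changed object in spec.dependsOn. A real controller would use a field
// index instead of listing everything on each event.
func (r *KustomizationReconciler) requestsForDependents(ctx context.Context, obj client.Object) []reconcile.Request {
	var list kustomizev1.KustomizationList
	if err := r.List(ctx, &list); err != nil {
		return nil
	}
	var reqs []reconcile.Request
	for _, k := range list.Items {
		for _, d := range k.Spec.DependsOn {
			ns := d.Namespace
			if ns == "" {
				ns = k.GetNamespace()
			}
			if d.Name == obj.GetName() && ns == obj.GetNamespace() {
				reqs = append(reqs, reconcile.Request{
					NamespacedName: types.NamespacedName{Name: k.GetName(), Namespace: k.GetNamespace()},
				})
				break
			}
		}
	}
	return reqs
}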
This disables the controller flag that everyone is using now; we need to deprecate it and edit its description to say that it is no longer in use.
I see that the same flag is also used for retrying error conditions, including those related to retrieving artifacts from the source. Watching for object updates cannot really cover those cases, so at least some of those requeues should be left in anyway. If you want, it should be possible to split those cases between "transient errors" (which should be retried with a requeueAfter delay, or even a non-nil err) and "source/dependency has a not-ready status" (which could simply return from the reconciliation loop and wait for the watcher to queue the object again as soon as that status changes). For now I have pushed again the version that queues an additional reconciliation, which might not be necessary for the normal code path.
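For context, the flag being discussed is, as far as I understand, the controller's --requeue-dependency interval; the 30s default matches the polling behaviour described later on this page, while the exact help text below is an assumption. A minimal sketch of how such a flag is bound:

package main

import (
	"flag"
	"time"
)

// requeueDependency is the polling interval discussed above.
var requeueDependency time.Duration

func main() {
	flag.DurationVar(&requeueDependency, "requeue-dependency", 30*time.Second,
		"The interval at which failing dependencies are re-evaluated.")
	flag.Parse()
	// The same interval currently also paces retries for transient errors
	// (e.g. failed artifact downloads), which is why it cannot simply be
	// dropped once readiness is watched instead of polled.
}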
@fogninid I propose we make this feature optional at first. Let's add a feature gate called EnableDependencyQueueing.
Signed-off-by: Daniele Fognini <[email protected]>
@stefanprodan I added the optional feature gate as you suggested. As far as I can see, all tests currently run with either true or false for the option, but it is not clear to me which one is preferable to set (or whether it would even be feasible to run both variants for some of the tests).
// EnableDependencyQueueing
EnableDependencyQueueing: false,
Suggested change:
- // EnableDependencyQueueing
- EnableDependencyQueueing: false,
+ // EnableDependencyQueueing
+ // opt-in from v1.6
+ EnableDependencyQueueing: false,
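To illustrate where such a default lives, here is a minimal sketch of an opt-in gate declared with a default and consulted by the controller. It is not the controller's actual feature-gate plumbing, which goes through the shared fluxcd runtime helper; the Enabled helper and the overrides map are assumptions.

package features

// EnableDependencyQueueing controls whether a kustomization is queued for
// reconciliation as soon as one of its dependencies becomes ready, instead of
// relying only on time-based requeue-dependency retries.
const EnableDependencyQueueing = "EnableDependencyQueueing"

// defaults lists the supported gates with their default values; the gate is
// opt-in, so it starts disabled.
var defaults = map[string]bool{
	EnableDependencyQueueing: false,
}

// Enabled reports whether a gate is on, falling back to its default when it
// was never overridden (for example via the --feature-gates flag).
func Enabled(feature string, overrides map[string]bool) bool {
	if v, ok := overrides[feature]; ok {
		return v
	}
	return defaults[feature]
}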
// EnableDependencyQueueing controls whether reconciliation of a kustomization
// should be queued once one of its dependencies becomes ready, or if only
// time-based retries with reque-dependecy delays should be attempted
Suggested change:
- // time-based retries with reque-dependecy delays should be attempted
+ // time-based retries with requeue-dependency delays should be attempted
DependencyRequeueInterval: 2 * time.Second,
EnableDependencyQueueing: true,
Suggested change:
- DependencyRequeueInterval: 2 * time.Second,
- EnableDependencyQueueing: true,
+ DependencyRequeueInterval: time.Minute,
+ EnableDependencyQueueing: true,
Let's set the requeue interval here to 1m; this should cause the test to fail if the watcher doesn't work.
As I was trying to explain above (#1412 (comment)), the requeue interval is currently also used for retries that are not related to readiness.
For example, the test TestKustomizationReconciler_ArtifactDownload/recovers_after_not_found_errors fails with that change, because it explicitly sets some "invalid" statuses on the resources, which cannot be covered by the predicates used to filter the watchers.
It is possible to change the test to simulate conditions that would match the readiness predicates...
... but I suppose it is better to change the controller logic so that retries due to unexpected errors are not mixed with expected (watchable) non-ready states.
In my opinion those kinds of retries should be handled either with a delay of obj.GetRetryInterval(), or left as return ..{}, err for the runtime framework to handle.
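A rough sketch of that split, with assumed names rather than the controller's current code: a watchable "dependency not ready" state returns without a timed requeue and relies on the dependency watcher, while everything else keeps a time-based retry.

package controller

import (
	"errors"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// errDependencyNotReady is a hypothetical sentinel for "dependency has a
// not-ready status", the case that is observable through the watcher.
var errDependencyNotReady = errors.New("dependency not ready")

// handleDependencyError shows the proposed split: watchable not-ready states
// return without a timed requeue, everything else keeps a time-based retry.
func handleDependencyError(err error, retryInterval time.Duration, watcherEnabled bool) (ctrl.Result, error) {
	if errors.Is(err, errDependencyNotReady) && watcherEnabled {
		// Expected, watchable state: the watch on the dependency will enqueue
		// this object again as soon as its Ready condition flips.
		return ctrl.Result{}, nil
	}
	// Transient failure (e.g. artifact download error): retry after the
	// object's retry interval, or return the error for the runtime to back off.
	return ctrl.Result{RequeueAfter: retryInterval}, nil
}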
Root cause: Flux has a known issue where dependent kustomizations don't immediately reconcile when their dependency becomes ready. Instead, they poll with a retry interval, causing significant startup delays. Issue: fluxcd/kustomize-controller#1412
- Dependent kustomizations wait for dependencies using the polling interval
- Status can become stale, showing "dependency not ready" even when ready
- observedGeneration: -1 indicates Flux hasn't attempted reconciliation

Solution: Disable wait:true on the parent ceph-csi kustomization
- Allows namespace creation to proceed without blocking on health checks
- Child kustomizations still have proper dependencies and will wait
- Reduces startup time from minutes to seconds

The dependency chain is still enforced:
1. ceph-csi (creates namespace) - no longer blocks on health
2. ceph-csi-shared-secret (depends on ceph-csi for namespace)
3. ceph-csi-cephfs/rbd (depend on shared-secret for config)
4. ceph-csi-shared-storage (depends on drivers for StorageClass CRD)
Root cause: Flux dependency polling causes cascading delays during startup. When wait:true is enabled, each kustomization waits for health checks AND dependencies use 30s polling intervals instead of immediate reconciliation. Issue: fluxcd/kustomize-controller#1412
- Dependencies poll every 30s instead of reacting to ready events
- Status becomes stale, showing "dependency not ready" even when ready
- Cascading effect: each layer adds 30-60s delay
- Total startup time: 5-10 minutes instead of <1 minute

Solution: Disable wait on foundation-layer kustomizations:
- snapshot-controller: Creates CRDs and controller (no health check needed)
- ceph-csi (app): Creates namespace only (already fixed in parent)
- external-secrets: Creates CRDs and operator (dependencies handle health)

Impact:
- Startup time reduced from minutes to seconds
- Dependencies still enforced (dependsOn unchanged)
- Child kustomizations still wait for parents
- Health checks happen at application layer, not infrastructure

The dependency chain remains intact:
1. CRDs deploy immediately (no wait)
2. Operators deploy immediately after CRDs (no wait)
3. Applications wait for operators via dependencies
4. Storage classes wait for CSI drivers
Dependents of a kustomization that are in the "wait dependency" status should be reconciled immediately after the dependency becomes ready or is reconciled with a new revision.
This should not make any functional change compared to the current logic, only improve latency compared to the current polling of requeue-dependency.
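As a sketch of the "becomes ready or is reconciled with a new revision" trigger described here, assuming it is expressed as a controller-runtime update predicate (not necessarily how this PR implements it) and using the fluxcd Ready condition helpers:

package controller

import (
	apimeta "k8s.io/apimachinery/pkg/api/meta"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"

	kustomizev1 "github.com/fluxcd/kustomize-controller/api/v1"
	"github.com/fluxcd/pkg/apis/meta"
)

// dependencyChangePredicate admits an update only when the dependency becomes
// ready, or applies a new revision while staying ready.
func dependencyChangePredicate() predicate.Predicate {
	return predicate.Funcs{
		UpdateFunc: func(e event.UpdateEvent) bool {
			oldObj, okOld := e.ObjectOld.(*kustomizev1.Kustomization)
			newObj, okNew := e.ObjectNew.(*kustomizev1.Kustomization)
			if !okOld || !okNew {
				return false
			}
			wasReady := apimeta.IsStatusConditionTrue(oldObj.Status.Conditions, meta.ReadyCondition)
			isReady := apimeta.IsStatusConditionTrue(newObj.Status.Conditions, meta.ReadyCondition)
			if isReady && !wasReady {
				// The dependency just became ready.
				return true
			}
			// It stayed ready but was reconciled with a new revision.
			return isReady && oldObj.Status.LastAppliedRevision != newObj.Status.LastAppliedRevision
		},
	}
}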