@fogninid

Dependents of a Kustomization that are in a "wait dependency" state should be reconciled immediately after the dependency becomes ready or is reconciled with a new revision.

This should not introduce any functional change compared to the current logic; it only improves latency compared to the current polling with the requeue-dependency interval.
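A rough sketch of the idea, assuming controller-runtime's map-function handler: a watch on Kustomization objects fans out a ready dependency into reconcile requests for its dependents. The dependencyMapper type and the ".metadata.dependsOn" field index are illustrative assumptions, not the exact identifiers in this PR:

```go
package controller

import (
	"context"

	kustomizev1 "github.com/fluxcd/kustomize-controller/api/v1"
	"github.com/fluxcd/pkg/runtime/conditions"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// dependencyMapper is an illustrative helper, not the controller's real type.
type dependencyMapper struct {
	client.Client
}

// requestsForDependents maps a Kustomization event to reconcile requests for
// the objects that declare it in spec.dependsOn.
func (m *dependencyMapper) requestsForDependents(ctx context.Context, obj client.Object) []reconcile.Request {
	dep, ok := obj.(*kustomizev1.Kustomization)
	if !ok || !conditions.IsReady(dep) {
		// Ignore anything that is not a ready Kustomization, so dependents
		// are only woken up when their wait could actually end.
		return nil
	}

	var list kustomizev1.KustomizationList
	// ".metadata.dependsOn" is a hypothetical field index on spec.dependsOn.
	if err := m.List(ctx, &list, client.MatchingFields{
		".metadata.dependsOn": client.ObjectKeyFromObject(dep).String(),
	}); err != nil {
		return nil
	}

	reqs := make([]reconcile.Request, 0, len(list.Items))
	for i := range list.Items {
		reqs = append(reqs, reconcile.Request{
			NamespacedName: client.ObjectKeyFromObject(&list.Items[i]),
		})
	}
	return reqs
}
```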

@matheuscscp (Member) left a comment

Thanks very much @fogninid! This contribution will be a really good one!!

@matheuscscp (Member) left a comment

Great! We should think about writing a test for this somehow. After fixing these comments I will run some manual tests myself 👍

@fogninid (Author) commented Apr 8, 2025

Great! We should think about writing a test for this somehow. [...]

Can you give me some pointers on how this could be tested better?
So far I have been relying only on this existing e2e test, which matches my expectations by completing some 30s quicker after the change.

I would also really like to have automated checks for the error cases, especially the pathological one of cyclic dependencies, but I see those as realistic only as e2e tests with a quite complex setup and a long run time.

@stefanprodan stefanprodan added the enhancement label on Apr 13, 2025
@stefanprodan stefanprodan changed the title from "queue immediate reconciliation on kustomization dependency" to "Queue immediate reconciliation on kustomization dependency" on Apr 13, 2025
@stefanprodan (Member) commented Apr 14, 2025

Would this cause multiple reconciliations of the same object given that we add the object to the queue here:

return ctrl.Result{RequeueAfter: r.requeueDependency}, nil

Then, in this PR, if the dependency resolves faster, we add the object to the queue for a 2nd time.

@fogninid (Author)

Would this cause multiple reconciliations of the same object given that we add the object to the queue here:

return ctrl.Result{RequeueAfter: r.requeueDependency}, nil

Then, in this PR, if the dependency resolves faster, we add the object to the queue for a 2nd time.

you are right, that re-queuing is not necessary anymore: I removed it

@stefanprodan (Member) commented Apr 14, 2025

you are right, that re-queuing is not necessary anymore: I removed it

This disables the controller flag that everyone is using now; we need to deprecate it and edit its description to say that it is no longer in use.

@fogninid (Author)

you are right, that re-queuing is not necessary anymore: I removed it

This disables the controller flag that everyone is using now; we need to deprecate it and edit its description to say that it is no longer in use.

I see that the same flag is also used for retrying error conditions, including those related to retrieving artifacts from the source.

Watching for object updates cannot really cover those cases, so at least some of those requeues should be kept anyway.

If you want, it should be possible to split those cases between "transient errors" (which should be retried with a RequeueAfter delay, or even a non-nil err) and "source/dependency has a not-ready status" (which could just return from the reconciliation loop and wait for the watcher to queue the object again as soon as that status changes).

For now I have pushed the version that queues an additional reconciliation again, which might not be necessary for the normal code path.
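A minimal sketch of that split, assuming a hypothetical errDependencyNotReady sentinel for the watchable case: transient failures keep a time-based retry via obj.GetRetryInterval(), while the not-ready case simply returns and relies on the dependency watch.

```go
package controller

import (
	"errors"

	kustomizev1 "github.com/fluxcd/kustomize-controller/api/v1"
	ctrl "sigs.k8s.io/controller-runtime"
)

// errDependencyNotReady is a hypothetical sentinel for the expected,
// watchable "dependency not ready" state.
var errDependencyNotReady = errors.New("dependency is not ready")

// resultForDependencyError sketches the proposed split: an expected not-ready
// dependency ends the reconciliation and relies on the watcher, while any
// other (transient) error keeps a time-based retry.
func resultForDependencyError(obj *kustomizev1.Kustomization, err error) (ctrl.Result, error) {
	if err == nil {
		return ctrl.Result{}, nil
	}
	if errors.Is(err, errDependencyNotReady) {
		// Watchable state: the dependency watcher will enqueue this object
		// again as soon as the dependency becomes ready.
		return ctrl.Result{}, nil
	}
	// Transient failure (e.g. the source artifact could not be fetched yet):
	// retry after the object's retry interval instead of waiting on a watch.
	return ctrl.Result{RequeueAfter: obj.GetRetryInterval()}, nil
}
```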

@stefanprodan (Member)

@fogninid I propose we make this feature optional at first. Let's add a feature gate called EnableDependencyQueueing and based on its value we add the watcher to the controller manager.
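A rough sketch of that wiring, assuming controller-runtime's current builder API; the Options and Reconciler types and the stubbed mapper below are illustrative stand-ins, not the PR's actual code:

```go
package controller

import (
	"context"

	kustomizev1 "github.com/fluxcd/kustomize-controller/api/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// Options is an illustrative stand-in for the controller's real options struct.
type Options struct {
	EnableDependencyQueueing bool
}

// Reconciler is a minimal stand-in for KustomizationReconciler.
type Reconciler struct {
	client.Client
}

func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Real reconciliation logic lives in the controller; omitted here.
	return ctrl.Result{}, nil
}

// requestsForDependents would map a ready dependency to its dependents,
// as in the earlier sketch; stubbed here to keep the example short.
func (r *Reconciler) requestsForDependents(ctx context.Context, obj client.Object) []reconcile.Request {
	return nil
}

// SetupWithManager registers the dependency watcher only when the proposed
// EnableDependencyQueueing feature gate is enabled.
func (r *Reconciler) SetupWithManager(mgr ctrl.Manager, opts Options) error {
	b := ctrl.NewControllerManagedBy(mgr).
		For(&kustomizev1.Kustomization{})

	if opts.EnableDependencyQueueing {
		// Watch Kustomization status updates and fan out to dependents;
		// in practice a readiness predicate would filter these events.
		b = b.Watches(
			&kustomizev1.Kustomization{},
			handler.EnqueueRequestsFromMapFunc(r.requestsForDependents),
		)
	}
	return b.Complete(r)
}
```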

@fogninid (Author)

@stefanprodan I added the optional feature-gate as you suggested.

As far as I can see, all tests currently run with either true or false for the option, but it is not clear to me which one is preferable to set (or whether it would even be feasible to run both variants for some of the tests).

Comment on lines +74 to +75
// EnableDependencyQueueing
EnableDependencyQueueing: false,
Member


Suggested change:
-// EnableDependencyQueueing
-EnableDependencyQueueing: false,
+// EnableDependencyQueueing
+// opt-in from v1.6
+EnableDependencyQueueing: false,


// EnableDependencyQueueing controls whether reconciliation of a kustomization
// should be queued once one of its dependencies becomes ready, or if only
// time-based retries with reque-dependecy delays should be attempted
Member


Suggested change:
-// time-based retries with reque-dependecy delays should be attempted
+// time-based retries with requeue-dependency delays should be attempted

Comment on lines 187 to +188
DependencyRequeueInterval: 2 * time.Second,
EnableDependencyQueueing: true,
Member


Suggested change:
-DependencyRequeueInterval: 2 * time.Second,
-EnableDependencyQueueing: true,
+DependencyRequeueInterval: time.Minute,
+EnableDependencyQueueing: true,

Let's set the requeue interval here to 1m; this should cause the test to fail if the watcher doesn't work.

Author


As I was trying to explain above (#1412 (comment)), the requeue interval is currently also used for retries that are not related to readiness.

For example, the test TestKustomizationReconciler_ArtifactDownload/recovers_after_not_found_errors fails with that change, because it explicitly sets some "invalid" statuses on the resources which cannot be matched by the predicates used to filter the watchers.
It would be possible to change the test to simulate conditions that match the readiness predicates...
... but I suppose it is better to change the controller logic to avoid mixing retries due to unexpected errors with expected (watchable) non-ready states.

In my opinion those kinds of retries should be handled either with a delay of obj.GetRetryInterval(), or left as return ..{}, err for the runtime framework to handle.
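For illustration, a readiness-transition predicate in the style the Flux controllers already use for source revision changes might look roughly like this (the type name is hypothetical); an arbitrary "invalid" status set by a test would not pass this filter, which is why those retries cannot be covered by the watch alone:

```go
package controller

import (
	kustomizev1 "github.com/fluxcd/kustomize-controller/api/v1"
	"github.com/fluxcd/pkg/runtime/conditions"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// dependencyReadyPredicate is an illustrative name, not the PR's real type.
type dependencyReadyPredicate struct {
	predicate.Funcs
}

// Update lets an event through only when a Kustomization transitions to Ready
// or publishes a new applied revision; all other status churn is ignored.
func (dependencyReadyPredicate) Update(e event.UpdateEvent) bool {
	oldObj, okOld := e.ObjectOld.(*kustomizev1.Kustomization)
	newObj, okNew := e.ObjectNew.(*kustomizev1.Kustomization)
	if !okOld || !okNew {
		return false
	}
	becameReady := !conditions.IsReady(oldObj) && conditions.IsReady(newObj)
	newRevision := oldObj.Status.LastAppliedRevision != newObj.Status.LastAppliedRevision
	return becameReady || newRevision
}
```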

sulibot added a commit to sulibot/home-ops that referenced this pull request Nov 25, 2025
Root cause: Flux has a known issue where dependent kustomizations don't
immediately reconcile when their dependency becomes ready. Instead, they
poll with a retry interval, causing significant startup delays.

Issue: fluxcd/kustomize-controller#1412
- Dependent kustomizations wait for dependencies using polling interval
- Status can become stale, showing "dependency not ready" even when ready
- observedGeneration: -1 indicates Flux hasn't attempted reconciliation

Solution: Disable wait:true on parent ceph-csi kustomization
- Allows namespace creation to proceed without blocking on health checks
- Child kustomizations still have proper dependencies and will wait
- Reduces startup time from minutes to seconds

The dependency chain is still enforced:
1. ceph-csi (creates namespace) - no longer blocks on health
2. ceph-csi-shared-secret (depends on ceph-csi for namespace)
3. ceph-csi-cephfs/rbd (depend on shared-secret for config)
4. ceph-csi-shared-storage (depends on drivers for StorageClass CRD)
sulibot added a commit to sulibot/home-ops that referenced this pull request Nov 25, 2025
Root cause: Flux dependency polling causes cascading delays during startup.
When wait:true is enabled, each kustomization waits for health checks AND
dependencies use 30s polling intervals instead of immediate reconciliation.

Issue: fluxcd/kustomize-controller#1412
- Dependencies poll every 30s instead of reacting to ready events
- Status becomes stale, showing "dependency not ready" even when ready
- Cascading effect: each layer adds 30-60s delay
- Total startup time: 5-10 minutes instead of <1 minute

Solution: Disable wait on foundation-layer kustomizations:
- snapshot-controller: Creates CRDs and controller (no health check needed)
- ceph-csi (app): Creates namespace only (already fixed in parent)
- external-secrets: Creates CRDs and operator (dependencies handle health)

Impact:
- Startup time reduced from minutes to seconds
- Dependencies still enforced (dependsOn unchanged)
- Child kustomizations still wait for parents
- Health checks happen at application layer, not infrastructure

The dependency chain remains intact:
1. CRDs deploy immediately (no wait)
2. Operators deploy immediately after CRDs (no wait)
3. Applications wait for operators via dependencies
4. Storage classes wait for CSI drivers