Provide a metric for cohort resource reservations #9833

Open
mszadkow wants to merge 2 commits into kubernetes-sigs:main from epam:feat/7539-cohort-metrics-usage

Conversation

@mszadkow
Contributor

@mszadkow mszadkow commented Mar 12, 2026

What type of PR is this?

/kind feature

What this PR does / why we need it:

Add metrics for cohort's resource reservations.

Which issue(s) this PR fixes:

Relates to #7539

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Observability:  Introduce the cohort resource reservations metric.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. labels Mar 12, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mszadkow
Once this PR has been reviewed and has the lgtm label, please assign gabesaba for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from pajakd and tenzen-y March 12, 2026 16:11
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 12, 2026
@netlify

netlify bot commented Mar 12, 2026

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit fa2f796
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/69bdca9804b89000082eb56a

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 12, 2026
@mszadkow mszadkow force-pushed the feat/7539-cohort-metrics-usage branch from 1cc9dc8 to 081ab40 Compare March 13, 2026 13:37
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Mar 13, 2026
@mszadkow mszadkow force-pushed the feat/7539-cohort-metrics-usage branch 2 times, most recently from f87ae90 to a26f57a Compare March 16, 2026 18:57
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 16, 2026
@mszadkow mszadkow force-pushed the feat/7539-cohort-metrics-usage branch from a26f57a to c94a40d Compare March 17, 2026 08:44
@vladikkuzn
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 17, 2026
@k8s-ci-robot
Contributor

LGTM label has been added.

Details
Git tree hash: 63f91d381ca482809071ea08ba34d1800d818d03

@mszadkow mszadkow force-pushed the feat/7539-cohort-metrics-usage branch from c94a40d to 35c178e Compare March 18, 2026 08:25
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 18, 2026
@k8s-ci-robot k8s-ci-robot requested a review from vladikkuzn March 18, 2026 08:25
@k8s-ci-robot
Contributor

New changes are detected. LGTM label has been removed.

@mszadkow mszadkow force-pushed the feat/7539-cohort-metrics-usage branch from 11cf83f to 8761860 Compare March 19, 2026 18:07
@mszadkow
Contributor Author

@mbobrovskyi please check again

@mszadkow mszadkow force-pushed the feat/7539-cohort-metrics-usage branch from 8761860 to b85a8be Compare March 20, 2026 08:02
@mszadkow mszadkow force-pushed the feat/7539-cohort-metrics-usage branch from b85a8be to 58b1de4 Compare March 20, 2026 10:22
@mszadkow mszadkow changed the title Provide a metric for cohort resource usage Provide a metric for cohort resource reservations Mar 20, 2026
@mszadkow mszadkow force-pushed the feat/7539-cohort-metrics-usage branch from 58b1de4 to 06f08e6 Compare March 20, 2026 11:55
@mszadkow mszadkow force-pushed the feat/7539-cohort-metrics-usage branch 2 times, most recently from 35dee7e to ff6a34b Compare March 20, 2026 12:41
Comment on lines +68 to +70
// collectCohortMetricPoints prepares metric values for the target cohort and each ancestor.
// When simulateRemoval=true, it computes the remaining ancestor subtree values after
// subtracting the target cohort subtree quotas and reservations.
Contributor

Is this the intention?

// collectCohortMetricPoints prepares subtree metric points for the target cohort
// and all cohorts on the path from the target to the root.
// When simulateRemoval=true, it computes post-removal subtree values by subtracting
// the target cohort's subtree contribution from each cohort's current subtree totals.
// This is used when clearing metrics so ancestor subtree gauges are updated too,
// rather than left with stale values after the target cohort is removed.

Contributor Author

Yes, that's the point of this function
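
The behavior confirmed above can be illustrated in isolation. The following is a minimal sketch, not the PR's code: `node`, `subtreeTotal`, and `pointsAfterRemoval` are hypothetical names, and a single `int64` per cohort stands in for the per-flavor-resource quantity maps. It assumes Go 1.21 or later for the built-in `max`.

```go
package main

import "fmt"

// node is a hypothetical stand-in for a cohort in the hierarchy.
type node struct {
	name     string
	own      int64 // quantity attributed directly to this cohort
	parent   *node
	children []*node
}

// subtreeTotal returns the sum over n and all of its descendants.
// Results are memoized in cache so that walking an ancestor path
// does not recompute the same subtree repeatedly.
func subtreeTotal(n *node, cache map[*node]int64) int64 {
	if v, ok := cache[n]; ok {
		return v
	}
	total := n.own
	for _, c := range n.children {
		total += subtreeTotal(c, cache)
	}
	cache[n] = total
	return total
}

// point is a hypothetical metric sample: one value per cohort.
type point struct {
	name string
	qty  int64
}

// pointsAfterRemoval walks from target to the root and reports, for each
// cohort on the path, its subtree total minus the target's subtree total,
// clamped at zero. This is the "post-removal snapshot" idea.
func pointsAfterRemoval(target *node) []point {
	cache := make(map[*node]int64)
	removed := subtreeTotal(target, cache)
	var pts []point
	for n := target; n != nil; n = n.parent {
		pts = append(pts, point{n.name, max(subtreeTotal(n, cache)-removed, 0)})
	}
	return pts
}

func main() {
	root := &node{name: "root", own: 1}
	mid := &node{name: "mid", own: 2, parent: root}
	leaf := &node{name: "leaf", own: 3, parent: mid}
	other := &node{name: "other", own: 4, parent: root}
	root.children = []*node{mid, other}
	mid.children = []*node{leaf}

	// Removing "mid" (subtree total 5) leaves 0 for "mid" and 5 for "root".
	for _, p := range pointsAfterRemoval(mid) {
		fmt.Println(p.name, p.qty)
	}
}
```

The target itself always comes out as zero in this sketch, which matches the intent of clearing its gauges while refreshing those of its ancestors.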

Comment on lines 83 to 120
+ removedSubtreeQuota := ch.resourceNode.SubtreeQuota

+ // Cache subtree reservation aggregations once per cohort during this run to avoid
+ // repeated recursion when walking ancestor paths.
+ reservationsCache := make(map[*cohort]resources.FlavorResourceQuantities)
+ removedSubtreeReservations := totalSubtreeReservationsWithCache(ch, reservationsCache)

  var points []cohortMetricPoint
  for ancestor := range ch.PathSelfToRoot() {
-     quotas := ancestor.resourceNode.SubtreeQuota
+     cohortSubtreeQuota := ancestor.resourceNode.SubtreeQuota
+     cohortSubtreeReservations := totalSubtreeReservationsWithCache(ancestor, reservationsCache)

+     var ancestorCurrentSubtreeReservations resources.FlavorResourceQuantities
      if simulateRemoval {
-         quotas = removedQuota
+         // In removal mode, subtract target subtree values from the ancestor current values
+         // to obtain the post-removal snapshot used for clearing metrics.
+         cohortSubtreeQuota = removedSubtreeQuota
+         cohortSubtreeReservations = removedSubtreeReservations
+         ancestorCurrentSubtreeReservations = totalSubtreeReservationsWithCache(ancestor, reservationsCache)
      }

-     for flr, qty := range quotas {
+     flavorResourceKeys := sets.New[resources.FlavorResource]()
+     flavorResourceKeys.Insert(slices.Collect(maps.Keys(cohortSubtreeQuota))...)
+     flavorResourceKeys.Insert(slices.Collect(maps.Keys(cohortSubtreeReservations))...)

+     for fr := range flavorResourceKeys {
+         quotaQty := cohortSubtreeQuota[fr]
+         reservationsQty := cohortSubtreeReservations[fr]
          if simulateRemoval {
-             qty = max(ancestor.resourceNode.SubtreeQuota[flr]-qty, 0)
+             quotaQty = max(ancestor.resourceNode.SubtreeQuota[fr]-quotaQty, 0)
+             reservationsQty = max(ancestorCurrentSubtreeReservations[fr]-reservationsQty, 0)
          }
          points = append(points, cohortMetricPoint{
-             cohortName: ancestor.Name,
-             flavorResource: flr,
-             qty: qty,
+             cohortName: ancestor.Name,
+             flavorResource: fr,
+             quotaQty: quotaQty,
+             reservationsQty: reservationsQty,
          })
Contributor

The code is quite complex, I would say. It may be worth considering a helper to subtract the usage maps, represented as FlavorResourceQuantities maps, to hide the technicalities behind an abstraction and avoid some weird tricks with pinning and unpinning the cohort variables.

I imagine the overall flow could be something like:

currentCohortQuota := ch.resourceNode.SubtreeQuota
for ancestor := range ch.PathSelfToRoot() {
  ancestorCohortQuota := ancestor.resourceNode.SubtreeQuota
  if simulateRemoval {
    // Or: currentCohortQuotaWithRemove = currentCohortQuota.Sub(ancestorCohortQuota),
    // if we also need to keep the previous value.
    currentCohortQuota.Subtract(ancestorCohortQuota)
  }
  // ...
}

IIUC, this would later allow us to skip the simulateRemoval handling in the flow.

Let me know if I am missing something or if you encounter any obstacles.
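
A subtraction helper of the kind suggested above might look like the following sketch. It is an illustration under stated assumptions, not Kueue's API: `flavorResource`, `quantities`, and `subClamped` are hypothetical stand-ins for `resources.FlavorResource`, `resources.FlavorResourceQuantities`, and the proposed helper. It mirrors two details visible in the diff: the result covers the union of keys from both maps, and each difference is clamped at zero. The built-in `max` requires Go 1.21 or later.

```go
package main

import "fmt"

// flavorResource is a hypothetical stand-in for resources.FlavorResource.
type flavorResource struct {
	flavor   string
	resource string
}

// quantities is a hypothetical stand-in for resources.FlavorResourceQuantities.
type quantities map[flavorResource]int64

// subClamped returns a new map holding a[k]-b[k] for every key present in
// either map, clamped at zero. Neither input map is modified, so the
// caller keeps the pre-subtraction values.
func (a quantities) subClamped(b quantities) quantities {
	out := make(quantities, len(a))
	for k, v := range a {
		// A key missing from b reads as zero, leaving v unchanged.
		out[k] = max(v-b[k], 0)
	}
	for k := range b {
		if _, ok := a[k]; !ok {
			// Present only in b: 0 - b[k] clamps to zero.
			out[k] = 0
		}
	}
	return out
}

func main() {
	cpu := flavorResource{"default", "cpu"}
	mem := flavorResource{"default", "memory"}
	gpu := flavorResource{"default", "gpu"}

	ancestor := quantities{cpu: 10, mem: 4}
	target := quantities{cpu: 3, gpu: 2}

	remaining := ancestor.subClamped(target)
	fmt.Println(remaining[cpu], remaining[mem], remaining[gpu]) // prints "7 4 0"
}
```

Returning a fresh map rather than mutating the receiver avoids accidentally changing a cohort's stored subtree totals while only computing a snapshot for metrics.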

}

- removedQuota := ch.resourceNode.SubtreeQuota
+ removedSubtreeQuota := ch.resourceNode.SubtreeQuota
Contributor

Let's also add some unit tests for the function, which would allow quick verification of correctness with a debugger without running the heavy machinery of integration tests.

@mszadkow mszadkow force-pushed the feat/7539-cohort-metrics-usage branch from ff6a34b to 60b5ea1 Compare March 20, 2026 14:05
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 20, 2026
@k8s-ci-robot
Contributor

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@mszadkow mszadkow force-pushed the feat/7539-cohort-metrics-usage branch from 60b5ea1 to fa2f796 Compare March 20, 2026 22:30
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 20, 2026
@k8s-ci-robot
Contributor

@mszadkow: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kueue-test-integration-multikueue-main fa2f796 link true /test pull-kueue-test-integration-multikueue-main
pull-kueue-test-e2e-certmanager-main fa2f796 link true /test pull-kueue-test-e2e-certmanager-main
pull-kueue-test-integration-extended-main fa2f796 link true /test pull-kueue-test-integration-extended-main
pull-kueue-test-e2e-multikueue-dra-main fa2f796 link true /test pull-kueue-test-e2e-multikueue-dra-main
pull-kueue-test-e2e-dra-main fa2f796 link true /test pull-kueue-test-e2e-dra-main
pull-kueue-test-unit-main fa2f796 link true /test pull-kueue-test-unit-main
pull-kueue-test-scheduling-perf-main fa2f796 link true /test pull-kueue-test-scheduling-perf-main
pull-kueue-test-integration-baseline-main fa2f796 link true /test pull-kueue-test-integration-baseline-main
pull-kueue-test-e2e-tas-main fa2f796 link true /test pull-kueue-test-e2e-tas-main
pull-kueue-test-e2e-certmanager-upgrade-main fa2f796 link true /test pull-kueue-test-e2e-certmanager-upgrade-main
pull-kueue-test-e2e-main-1-33 fa2f796 link true /test pull-kueue-test-e2e-main-1-33
pull-kueue-verify-main fa2f796 link true /test pull-kueue-verify-main
pull-kueue-build-image-main fa2f796 link true /test pull-kueue-build-image-main
pull-kueue-test-e2e-multikueue-main fa2f796 link true /test pull-kueue-test-e2e-multikueue-main
pull-kueue-test-e2e-main-1-34 fa2f796 link true /test pull-kueue-test-e2e-main-1-34
pull-kueue-test-e2e-main-1-35 fa2f796 link true /test pull-kueue-test-e2e-main-1-35
pull-kueue-populator-test-unit-main fa2f796 link true /test pull-kueue-populator-test-unit-main
pull-kueue-test-e2e-upgrade-main fa2f796 link true /test pull-kueue-test-e2e-upgrade-main
pull-kueue-populator-verify-main fa2f796 link true /test pull-kueue-populator-verify-main
pull-kueue-test-tas-scheduling-perf-main fa2f796 link true /test pull-kueue-test-tas-scheduling-perf-main
pull-kueue-test-e2e-kueueviz-main fa2f796 link true /test pull-kueue-test-e2e-kueueviz-main
pull-kueue-test-e2e-customconfigs-main fa2f796 link true /test pull-kueue-test-e2e-customconfigs-main
pull-kueue-populator-test-integration-main fa2f796 link true /test pull-kueue-populator-test-integration-main
pull-kueue-populator-test-e2e-main fa2f796 link true /test pull-kueue-populator-test-e2e-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

6 participants