Releases: kubernetes-sigs/kueue
v0.16.4
Changes since v0.16.3:
Changes by Kind
Feature
- Helm: Allow setting log level (#9944, @gabesaba)
- TAS: Extend support for handling NoSchedule taints when the TASReplaceNodeOnNodeTaints feature gate is enabled. (#10003, @j-skiba)
- VisibilityOnDemand: Introduce a new Kueue deployment argument, --visibility-server-port, which allows passing a custom port when starting the visibility server. (#9976, @Nilsachy)
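As a sketch, the new argument can be supplied through the manager container's args in the Kueue deployment; the deployment shape below is illustrative (container name and other args may differ per install), and the port value is an example:

```yaml
# Hypothetical excerpt of the kueue-controller-manager Deployment,
# showing the new --visibility-server-port flag; all other fields are illustrative.
spec:
  template:
    spec:
      containers:
        - name: manager
          args:
            - --visibility-server-port=9443
```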
Bug or Regression
- LWS integration: Fixed a bug where the kueue.x-k8s.io/job-uid label was not set on the workloads. (#10010, @mbobrovskyi)
- MultiKueue: Enable AllowWatchBookmarks for remote client watches to prevent idle watch connections from being terminated by HTTP proxies with idle timeouts (e.g., Cloudflare 524 errors). (#9990, @trilamsr)
- Scheduling: Fix an issue where the scheduler could indefinitely re-queue a workload that was once inadmissible but became admissible after an update. The issue affected workloads that don't specify resource requests explicitly but rely on defaulting based on limits. (#9913, @mimowo)
- Scheduling: Fixed SchedulingEquivalenceHashing so that equivalent workloads which become inadmissible through the preemption path with no candidates are also covered by the mechanism. As a safety measure while the broader fix is validated, the beta SchedulingEquivalenceHashing feature gate is temporarily disabled by default. (#10007, @mimowo)
- StatefulSet integration: Fixed a bug where the kueue.x-k8s.io/job-uid label was not set on the workloads. (#9902, @mbobrovskyi)
- TAS: Fixed a bug where pods could become stuck in a Pending state during node replacement. This may occur when a node gets tainted or becomes NotReady after the topology assignment phase, but before the pods are ungated. (#9978, @j-skiba)
- TAS: Fix a bug where workloads that only specify resource limits, without requests, could not correctly perform the second-pass scheduling responsible for NodeHotSwap and ProvisioningRequests. (#9947, @mimowo)
- VisibilityOnDemand: Fix non-deterministic workload ordering with UsageBasedAdmissionFairSharing enabled. (#9955, @sohankunkerkar)
Other (Cleanup or Flake)
v0.15.7
Changes since v0.15.6:
Changes by Kind
Feature
- Helm: Allow setting log level (#9944, @gabesaba)
- TAS: Extend support for handling NoSchedule taints when the TASReplaceNodeOnNodeTaints feature gate is enabled. (#10002, @j-skiba)
- VisibilityOnDemand: Introduce a new Kueue deployment argument, --visibility-server-port, which allows passing a custom port when starting the visibility server. (#9975, @Nilsachy)
Bug or Regression
- LWS integration: Fixed a bug where the kueue.x-k8s.io/job-uid label was not set on the workloads. (#10011, @mbobrovskyi)
- MultiKueue: Enable AllowWatchBookmarks for remote client watches to prevent idle watch connections from being terminated by HTTP proxies with idle timeouts (e.g., Cloudflare 524 errors). (#9989, @trilamsr)
- Scheduling: Fix an issue where the scheduler could indefinitely re-queue a workload that was once inadmissible but became admissible after an update. The issue affected workloads that don't specify resource requests explicitly but rely on defaulting based on limits. (#9912, @mimowo)
- Scheduling: Fixed SchedulingEquivalenceHashing so that equivalent workloads which become inadmissible through the preemption path with no candidates are also covered by the mechanism. As a safety measure while the broader fix is validated, the beta SchedulingEquivalenceHashing feature gate is temporarily disabled by default. (#10008, @mimowo)
- StatefulSet integration: Fixed a bug where the kueue.x-k8s.io/job-uid label was not set on the workloads. (#9903, @mbobrovskyi)
- TAS: Fixed a bug where pods could become stuck in a Pending state during node replacement. This may occur when a node gets tainted or becomes NotReady after the topology assignment phase, but before the pods are ungated. (#9977, @j-skiba)
- TAS: Fix a bug where workloads that only specify resource limits, without requests, could not correctly perform the second-pass scheduling responsible for NodeHotSwap and ProvisioningRequests. (#9948, @mimowo)
- VisibilityOnDemand: Fix non-deterministic workload ordering with UsageBasedAdmissionFairSharing enabled. (#9956, @sohankunkerkar)
v0.16.3
Changes since v0.16.2:
Changes by Kind
Feature
- Observability: Add scheduler logs for the scheduling cycle phase boundaries. (#9813, @sohankunkerkar)
- Scheduling: Add the alpha SchedulerLongRequeueInterval feature gate (disabled by default) to increase the inadmissible workload requeue interval from 1s to 10s. This may help to mitigate, on large environments with many pending workloads, issues with frequent re-queues that prevent the scheduler from reaching schedulable workloads deeper in the queue and result in constant re-evaluation of the same top workloads. (#9819, @mbobrovskyi)
- Scheduling: Add the alpha SchedulerTimestampPreemptionBuffer feature gate (disabled by default) to use a 5-minute buffer so that workloads with scheduling timestamps within this buffer don't preempt each other based on LowerOrNewerEqualPriority. (#9837, @mbobrovskyi)
Bug or Regression
- FailureRecoveryPolicy: forcefully delete stuck pods (without grace period) in addition to transitioning them to the Failed phase. This fixes a scenario where foreground propagating deletions were blocked by a stuck pod. (#9673, @kshalot)
- Fix a race where updated workload priority could remain stuck in the inadmissible queue and delay rescheduling. (#9678, @sohankunkerkar)
- In fair sharing preemption, bypass DRS strategy gates when the preemptor ClusterQueue is within nominal quota for contested resources, allowing preemption even if the CQ's aggregate DRS is high due to borrowing on other flavors. (#9592, @mukund-wayve)
- Kueueviz: fetch the Cohort CRD directly, instead of deriving from ClusterQueues (#9720, @samzong)
- LeaderWorkerSet: fix workload recreation delay during rolling updates by watching for workload deletions. (#9680, @PannagaRao)
- Observability: Fix missing replica_role=leader gauge metrics after HA role transition. (#9794, @IrvingMg)
- Scheduling: Fix a BestEffortFIFO performance issue where many equivalent workloads could prevent the scheduler from reaching schedulable workloads deeper in the queue. Kueue now skips redundant evaluation by bulk-moving same-hash workloads to inadmissible when one representative is categorized as NoFit. (#9698, @sohankunkerkar)
- Scheduling: Fix a bug where Kueue's scheduler could issue duplicate preemption requests and events for the same workload. (#9627, @sohankunkerkar)
- Scheduling: Fixed a race condition where a workload could simultaneously exist in the scheduler's heap and the "inadmissible workloads" list. This fix prevents unnecessary scheduler cycles and temporary double counting in the pending workloads metric. (#9638, @sohankunkerkar)
- Scheduling: Reduced the maximum sleep time between scheduling cycles from 100ms to 10ms. This change fixes a bug where the 100ms delay was excessive on busy systems, in which completed workloads can trigger requeue events every second. In such cases, the scheduler could spend up to 10% of the time between requeue events sleeping. Reducing the delay allows the scheduler to spend more time progressing through the ClusterQueue heap between requeue events. (#9763, @mimowo)
- StatefulSet integration: fix a bug where, when using generateName, the Workload names generated for two different StatefulSets would conflict, preventing the second StatefulSet from running. (#9693, @IrvingMg)
- TAS: Fix a performance bug where snapshotting would take very long due to a List and DeepCopy of all Nodes. The cached set of nodes is now maintained in an event-driven fashion. (#9783, @mbobrovskyi)
- TAS: support ResourceTransformations to define "virtual" resources which allow putting a cap on some "virtual" credits across multiple flavors; see sharing quotas for quota-only resources. This is considered a bug since there was no validation preventing such configuration before. (#9688, @mbobrovskyi)
- VisibilityOnDemand: Fix a bug where running Kueue with a custom --kubeconfig flag caused the visibility server to fail to initialize, because the custom value of the flag was not propagated to it, leading to errors such as: "Unable to create and start visibility server","error":"unable to apply VisibilityServerOptions: failed to get delegated authentication kubeconfig: failed to get delegated authentication kubeconfig: ..." (#9805, @Nilsachy)
v0.15.6
Changes since v0.15.5:
Changes by Kind
Feature
- Observability: Add scheduler logs for the scheduling cycle phase boundaries. (#9815, @sohankunkerkar)
- Scheduling: Add the alpha SchedulerLongRequeueInterval feature gate (disabled by default) to increase the inadmissible workload requeue interval from 1s to 10s. This may help to mitigate, on large environments with many pending workloads, issues with frequent re-queues that prevent the scheduler from reaching schedulable workloads deeper in the queue and result in constant re-evaluation of the same top workloads. (#9820, @mbobrovskyi)
- Scheduling: Add the alpha SchedulerTimestampPreemptionBuffer feature gate (disabled by default) to use a 5-minute buffer so that workloads with scheduling timestamps within this buffer don't preempt each other based on LowerOrNewerEqualPriority. (#9838, @mbobrovskyi)
Bug or Regression
- FailureRecoveryPolicy: forcefully delete stuck pods (without grace period) in addition to transitioning them to the Failed phase. This fixes a scenario where foreground propagating deletions were blocked by a stuck pod. (#9672, @kshalot)
- Fix a race where updated workload priority could remain stuck in the inadmissible queue and delay rescheduling. (#9661, @sohankunkerkar)
- In fair sharing preemption, bypass DRS strategy gates when the preemptor ClusterQueue is within nominal quota for contested resources, allowing preemption even if the CQ's aggregate DRS is high due to borrowing on other flavors. (#9593, @mukund-wayve)
- Kueueviz: fetch the Cohort CRD directly, instead of deriving from ClusterQueues (#9744, @samzong)
- LeaderWorkerSet: fix workload recreation delay during rolling updates by watching for workload deletions. (#9631, @PannagaRao)
- Scheduling: Fix a BestEffortFIFO performance issue where many equivalent workloads could prevent the scheduler from reaching schedulable workloads deeper in the queue. Kueue now skips redundant evaluation by bulk-moving same-hash workloads to inadmissible when one representative is categorized as NoFit. (#9698, @sohankunkerkar)
- Scheduling: Fix a bug where Kueue's scheduler could issue duplicate preemption requests and events for the same workload. (#9641, @sohankunkerkar)
- Scheduling: Fixed a race condition where a workload could simultaneously exist in the scheduler's heap and the "inadmissible workloads" list. This fix prevents unnecessary scheduler cycles and temporary double counting in the pending workloads metric. (#9639, @sohankunkerkar)
- Scheduling: Reduced the maximum sleep time between scheduling cycles from 100ms to 10ms. This change fixes a bug where the 100ms delay was excessive on busy systems, in which completed workloads can trigger requeue events every second. In such cases, the scheduler could spend up to 10% of the time between requeue events sleeping. Reducing the delay allows the scheduler to spend more time progressing through the ClusterQueue heap between requeue events. (#9762, @mimowo)
- StatefulSet integration: fix a bug where, when using generateName, the Workload names generated for two different StatefulSets would conflict, preventing the second StatefulSet from running. (#9695, @IrvingMg)
- TAS: Fix a performance bug where snapshotting would take very long due to a List and DeepCopy of all Nodes. The cached set of nodes is now maintained in an event-driven fashion. (#9786, @mbobrovskyi)
- TAS: support ResourceTransformations to define "virtual" resources which allow putting a cap on some "virtual" credits across multiple flavors; see sharing quotas for quota-only resources. This is considered a bug since there was no validation preventing such configuration before. (#9691, @mbobrovskyi)
- VisibilityOnDemand: Fix a bug where running Kueue with a custom --kubeconfig flag caused the visibility server to fail to initialize, because the custom value of the flag was not propagated to it, leading to errors such as: "Unable to create and start visibility server","error":"unable to apply VisibilityServerOptions: failed to get delegated authentication kubeconfig: failed to get delegated authentication kubeconfig: ..." (#9806, @Nilsachy)
v0.16.2
Changes since v0.16.1:
Changes by Kind
Feature
- KueueViz Helm: Add podSecurityContext and containerSecurityContext configuration options to KueueViz Helm chart for restricted pod security profile compliance (#9319, @ziadmoubayed)
- Observability: Increased the maximum finite bucket boundary for admission_wait_time_seconds histogram from ~2.84 hours to ~11.3 hours for better observability of long queue times. (#9507, @mukund-wayve)
Bug or Regression
- ElasticJobs: fix the temporary double-counting of quota during workload replacement. In particular, it was causing double-counting of quota requests for unchanged PodSets. (#9364, @benkermani)
- FairSharing: workloads fitting within their ClusterQueue's nominal quota are now preferred over workloads that require borrowing, preventing heavy borrowing on one flavor from deprioritizing a CQ's nominal entitlement on another flavor. (#9532, @mukund-wayve)
- Fix non-deterministic workload ordering in ClusterQueue by adding a UID tie-breaker to the queue ordering function. (#9140, @sohankunkerkar)
- Fix serverName substitution in the kustomize prometheus ServiceMonitor TLS patch for cert-manager deployments. (#9188, @IrvingMg)
- Fixed an invalid field name in the ClusterQueue printer columns. The "Cohort" column will now correctly display the assigned cohort in kubectl, k9s, and other UI tools instead of being blank. (#9422, @polinasand)
- Fixed a bug that prevented managing workloads with duplicated environment variable names in initContainers. This issue manifested when creating the Workload via the API. (#9126, @monabil08)
- FlavorFungibility: fix a bug where the semantics of the flavorFungibility.preference enum values (i.e. PreemptionOverBorrowing and BorrowingOverPreemption) were swapped. (#9486, @tenzen-y)
- LeaderWorkerSet: fix an occasional race condition resulting in workload deletion getting stuck during scale down. (#9135, @PannagaRao)
- MultiKueue: Fix a bug where the remote Job object was occasionally left behind by MultiKueue GC, even when the corresponding Job object on the management cluster was deleted. This issue was observed for LeaderWorkerSet. (#9310, @sohankunkerkar)
- MultiKueue: for the StatefulSet integration, copy the entire StatefulSet onto the worker clusters. This allows for proper management (and replacement) of Pods on the worker clusters. (#9539, @IrvingMg)
- Observability: Fix missing "replica-role" in the logs from the NonTasUsageReconciler. (#9456, @IrvingMg)
- Observability: Fix the stale "replica-role" value in scheduler logs after leader election. (#9431, @IrvingMg)
- Scheduling: Fix a bug where inadmissible workloads would be re-queued too frequently at scale. This resulted in excessive processing, lock contention, and starvation of workloads deeper in the queue. The fix is to throttle the process with a batch period of 1s per CQ or Cohort. (#9490, @gabesaba)
- TAS: Fix a bug where, for a LeaderWorkerSet with multiple PodTemplates (.spec.leaderWorkerTemplate.leaderTemplate and .spec.leaderWorkerTemplate.workerTemplate), Pod indexes were not correctly evaluated during rank-based ordering assignments. (#9368, @tenzen-y)
- TAS: fix a bug where NodeHotSwap could assign a Pod, based on rank-ordering, to a node already occupied by another running Pod. (#9282, @j-skiba)
v0.15.5
Changes since v0.15.4:
Changes by Kind
Feature
- KueueViz Helm: Add podSecurityContext and containerSecurityContext configuration options to KueueViz Helm chart for restricted pod security profile compliance (#9320, @ziadmoubayed)
- Observability: Increased the maximum finite bucket boundary for admission_wait_time_seconds histogram from ~2.84 hours to ~11.3 hours for better observability of long queue times. (#9530, @mukund-wayve)
- TAS: Introduce the TASReplaceNodeOnNodeTaints feature gate (alpha) to allow TAS workloads to be evicted or replaced when a node is tainted with NoExecute. (#9441, @j-skiba)
Bug or Regression
- ElasticJobs: fix the temporary double-counting of quota during workload replacement. In particular, it was causing double-counting of quota requests for unchanged PodSets. (#9365, @benkermani)
- FairSharing: workloads fitting within their ClusterQueue's nominal quota are now preferred over workloads that require borrowing, preventing heavy borrowing on one flavor from deprioritizing a CQ's nominal entitlement on another flavor. (#9533, @mukund-wayve)
- Fix non-deterministic workload ordering in ClusterQueue by adding a UID tie-breaker to the queue ordering function. (#9164, @sohankunkerkar)
- Fix serverName substitution in the kustomize prometheus ServiceMonitor TLS patch for cert-manager deployments. (#9190, @IrvingMg)
- Fixed an invalid field name in the ClusterQueue printer columns. The "Cohort" column will now correctly display the assigned cohort in kubectl, k9s, and other UI tools instead of being blank. (#9447, @polinasand)
- Fixed a bug that prevented managing workloads with duplicated environment variable names in initContainers. This issue manifested when creating the Workload via the API. (#9127, @monabil08)
- LeaderWorkerSet: fix an occasional race condition resulting in workload deletion getting stuck during scale down. (#9135, @PannagaRao)
- MultiKueue: Fix a bug where the remote Job object was occasionally left behind by MultiKueue GC, even when the corresponding Job object on the management cluster was deleted. This issue was observed for LeaderWorkerSet. (#9309, @sohankunkerkar)
- Scheduling: Fix a bug where inadmissible workloads would be re-queued too frequently at scale. This resulted in excessive processing, lock contention, and starvation of workloads deeper in the queue. The fix is to throttle the process with a batch period of 1s per CQ or Cohort. (#9232, @gabesaba)
- TAS: Fix a bug where, for a LeaderWorkerSet with multiple PodTemplates (.spec.leaderWorkerTemplate.leaderTemplate and .spec.leaderWorkerTemplate.workerTemplate), Pod indexes were not correctly evaluated during rank-based ordering assignments. (#9369, @tenzen-y)
- TAS: fix a bug where NodeHotSwap could assign a Pod, based on rank-ordering, to a node already occupied by another running Pod. (#9283, @j-skiba)
Full Changelog: v0.15.4...v0.15.5
v0.16.1
Changes since v0.16.0:
Changes by Kind
Feature
- KueueViz backend and frontend resource requests/limits are now configurable via Helm values (kueueViz.backend.resources and kueueViz.frontend.resources). (#8981, @david-gang)
Bug or Regression
- Fix Visibility API OpenAPI schema generation to prevent schema resolution errors when visibility v1beta1/v1beta2 APIServices are installed. The visibility schema issues result in the following error when re-applying the manifest for Kueue 0.16.0: failed to load open api schema while syncing cluster cache: error getting openapi resources: SchemaError(sigs.k8s.io/kueue/apis/visibility/v1beta1.PendingWorkloadsSummary.items): unknown model in reference: "sigs.k8s.io~1kueue~1apis~1visibility~1v1beta1.PendingWorkload" (#8901, @vladikkuzn)
- Fix a bug where finished or deactivated workloads blocked ClusterQueue deletion and finalizer removal. (#8936, @sohankunkerkar)
- LeaderWorkerSet: Fix a bug where rolling updates with maxSurge could get stuck. (#8886, @PannagaRao)
- LeaderWorkerSet: Fixed a bug that prevented deleting a Pod after the LeaderWorkerSet was deleted (#8882, @mbobrovskyi)
- Metrics certificate is now reloaded when certificate data is updated. (#9099, @MaysaMacedo)
- MultiKueue & ElasticJobs: fix a bug where the new size of a Job was not reflected on the worker cluster. (#9055, @ichekrygin)
- Observability: Fix Prometheus ServiceMonitor selector and RBAC to enable metrics scraping. (#8980, @IrvingMg)
- Observability: Fixed a bug where workloads that finished before a Kueue restart were not tracked in the gauge metrics for finished workloads. (#8827, @mbobrovskyi)
- Observability: fix a bug where the "replica-role" (leader / follower) log decorator was missing in the log lines output by the webhooks for LeaderWorkerSet and StatefulSet. (#8820, @mszadkow)
- PodIntegration: Fix a bug where Kueue would occasionally remove custom finalizers when removing the kueue.x-k8s.io/managed finalizer. (#8903, @mykysha)
- RayJob integration: Make the RayJob top-level workload managed by Kueue when autoscaling via ElasticJobsViaWorkloadSlices is enabled. If you are an alpha user of the ElasticJobsViaWorkloadSlices feature for RayJobs, then upgrading Kueue may impact running live jobs which have autoscaling / workload slicing enabled. For example, if you upgrade Kueue before scaling-up completes, the new pods will be stuck in the SchedulingGated state. (#9039, @hiboyang)
- TAS: Fix a bug where TAS ignored resources excluded by excludeResourcePrefixes for node placement. (#8990, @sohankunkerkar)
- TAS: Fixed a bug where pending workloads could be stuck, not being considered by Kueue's scheduler, after a restart of Kueue. The workloads would be considered for scheduling again after any update to their ClusterQueue. (#9056, @sohankunkerkar)
Other (Cleanup or Flake)
- KueueViz: Switch to the v1beta2 API (#8804, @mbobrovskyi)
v0.15.4
Changes since v0.15.3:
Changes by Kind
Feature
- KueueViz backend and frontend resource requests/limits are now configurable via Helm values (kueueViz.backend.resources and kueueViz.frontend.resources). (#8982, @david-gang)
Bug or Regression
- Fix a bug where finished or deactivated workloads blocked ClusterQueue deletion and finalizer removal. (#8940, @sohankunkerkar)
- LeaderWorkerSet: Fix a bug where rolling updates with maxSurge could get stuck. (#8887, @PannagaRao)
- LeaderWorkerSet: Fixed a bug that prevented deleting a Pod after the LeaderWorkerSet was deleted (#8883, @mbobrovskyi)
- Metrics certificate is now reloaded when certificate data is updated. (#9100, @MaysaMacedo)
- MultiKueue & ElasticJobs: fix a bug where the new size of a Job was not reflected on the worker cluster. (#9044, @ichekrygin)
- Observability: Fix Prometheus ServiceMonitor selector and RBAC to enable metrics scraping. (#8979, @IrvingMg)
- PodIntegration: Fix a bug where Kueue would occasionally remove custom finalizers when removing the kueue.x-k8s.io/managed finalizer. (#8905, @mykysha)
- RayJob integration: Make the RayJob top-level workload managed by Kueue when autoscaling via ElasticJobsViaWorkloadSlices is enabled. If you are an alpha user of the ElasticJobsViaWorkloadSlices feature for RayJobs, then upgrading Kueue may impact running live jobs which have autoscaling / workload slicing enabled. For example, if you upgrade Kueue before scaling-up completes, the new pods will be stuck in the SchedulingGated state. After updating Kueue, cluster admins should likely migrate from the old RayJob with ElasticJobsViaWorkloadSlices to a new one (by recreating it). (#9070, @mimowo)
- TAS: Fix a bug where TAS ignored resources excluded by excludeResourcePrefixes for node placement. (#8991, @sohankunkerkar)
- TAS: Fixed a bug where pending workloads could be stuck, not being considered by Kueue's scheduler, after a restart of Kueue. The workloads would be considered for scheduling again after any update to their ClusterQueue. (#9057, @sohankunkerkar)
- TAS: Fixed handling of the scenario where a Topology instance is re-created (for example, to add a new Topology level). Previously, this would cause cache corruption, leading to issues such as:
- TAS: Lower verbosity of expected missing pod index label logs. (#8702, @IrvingMg)
v0.16.0
Changes since v0.15.0:
Urgent Upgrade Notes
(No, really, you MUST read this before you upgrade)
- Removed the FlavorFungibilityImplicitPreferenceDefault feature gate. Configure flavor selection preference using the ClusterQueue field spec.flavorFungibility.preference instead. (#8134, @mbobrovskyi)
- The short name "wl" for workloads has been removed to avoid potential conflicts with the in-tree workload object coming into Kubernetes. If you rely on "wl" in your "kubectl" commands, you need to migrate to other short names ("kwl", "kueueworkload") or the full resource name ("workloads.kueue.x-k8s.io"). (#8472, @kannon92)
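For instance, commands that used the removed short name can switch to one of the remaining names (a sketch; "my-namespace" is a placeholder, and output depends on your cluster):

```shell
# Before (no longer works in v0.16):
#   kubectl get wl -n my-namespace
# After, use a remaining short name or the full resource name:
kubectl get kwl -n my-namespace
kubectl get kueueworkload -n my-namespace
kubectl get workloads.kueue.x-k8s.io -n my-namespace
```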
Changes by Kind
API Change
- Add field multiplyBy for ResourceTransformation (#7599, @calvin0327)
- Kueue v0.16 starts using the v1beta2 API version for storage. The new API brings an optimization for the internal representation of TopologyAssignment (in WorkloadStatus) which allows using TAS for larger workloads (under the assumptions described in issue #7220, it allows increasing the maximal workload size from approx. 20k to approx. 60k nodes). All new Kueue objects created after the upgrade will be stored using v1beta2. However, existing objects are only auto-converted to the new storage version by Kubernetes during a write request. This means that Kueue API objects that rarely receive updates, such as Topologies, ResourceFlavors, or long-running Workloads, may remain in the older v1beta1 format indefinitely. Ensuring all objects are migrated to v1beta2 is essential for compatibility with future Kueue upgrades. We tentatively plan to discontinue support for v1beta1 in version 0.18. To ensure your environment is consistent, we recommend running the following migration script after installing Kueue v0.16 and verifying cluster stability: https://raw.githubusercontent.com/kubernetes-sigs/kueue/main/hack/migrate-to-v1beta2.sh. The script triggers a "no-op" update for all existing Kueue objects, forcing the API server to pass them through conversion webhooks and save them in the v1beta2 version. Migration instructions (including the official script): #8018 (comment). (#8020, @mbobrovskyi)
- MultiKueue: Allow up to 20 clusters per MultiKueueConfig. (#8614, @IrvingMg)
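One way to run the recommended migration after installing v0.16 and verifying cluster stability (a sketch, assuming kubectl access to the cluster; review the script before executing it):

```shell
# Download the official migration script from the URL given above,
# inspect it, then run it against the current kubectl context.
curl -fsSLO https://raw.githubusercontent.com/kubernetes-sigs/kueue/main/hack/migrate-to-v1beta2.sh
less migrate-to-v1beta2.sh   # review before executing
bash migrate-to-v1beta2.sh
```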
Feature
- CLI: Support "kwl" and "kueueworkload" as a shortname for Kueue Workloads. (#8379, @kannon92)
- ElasticJobs: Support RayJob InTreeAutoscaling by using the ElasticJobsViaWorkloadSlices feature. (#8082, @hiboyang)
- Enable Pod-based integrations by default (#8096, @sohankunkerkar)
- Logs now include a replica-role field to identify Kueue instance roles (leader/follower/standalone). (#8107, @IrvingMg)
- MultiKueue: Add support for StatefulSet workloads (#8611, @IrvingMg)
- MultiKueue: ClusterQueues with both MultiKueue and ProvisioningRequest admission checks are marked as inactive with reason "MultiKueueWithProvisioningRequest", as this configuration is invalid on manager clusters. (#8451, @IrvingMg)
- MultiKueue: trigger workload eviction on the management cluster when the corresponding workload is evicted on the remote worker cluster. In particular, this fixes an issue with workloads using ProvisioningRequests, which could get stuck in a worker cluster that never has enough capacity to admit them. (#8477, @mszadkow)
- Observability: Add more details (the preemptionMode) to the QuotaReserved condition message, and the related event, about the skipped flavors which were considered for preemption.
  Before: "Quota reserved in ClusterQueue preempt-attempts-cq, wait time since queued was 9223372037s; Flavors considered: main: on-demand(Preempt;insufficient unused quota for cpu in flavor on-demand, 1 more needed)"
  After: "Quota reserved in ClusterQueue preempt-attempts-cq, wait time since queued was 9223372037s; Flavors considered: main: on-demand(preemptionMode=Preempt;insufficient unused quota for cpu in flavor on-demand, 1 more needed)" (#8024, @mykysha)
- Observability: Introduce the counter metrics for finished workloads: kueue_finished_workloads_total and kueue_local_queue_finished_workloads_total. (#8694, @mbobrovskyi)
- Observability: Introduce the gauge metrics for finished workloads: kueue_finished_workloads and kueue_local_queue_finished_workloads. (#8724, @mbobrovskyi)
- Security: Support customization (TLSMinVersion and CipherSuites) for the TLS used by Kueue's webhook server and the visibility server. (#8563, @kannon92)
- TAS: extend the information in condition messages and events about nodes excluded from calculating the assignment due to various recognized reasons such as taints, node affinity, and node resource constraints. (#8043, @sohankunkerkar)
- waitForPodsReady.recoveryTimeout now defaults to the value of waitForPodsReady.timeout when not specified. (#8493, @IrvingMg)
Bug or Regression
-
DRA: fix the race condition bug leading to undefined behavior due to concurrent operations
on the Workload object, manifested by the "WARNING: DATA RACE" in test logs. (#8073, @mbobrovskyi) -
FailureRecovery: Fix Pod Termination Controller's MaxConcurrentReconciles (#8664, @gabesaba)
-
Fix ClusterQueue deletion getting stuck when pending workloads are deleted after being assumed by the scheduler. (#8543, @sohankunkerkar)
-
Fix EnsureWorkloadSlices to finish old slice when new is admitted as replacement (#8456, @sohankunkerkar)
-
Fix
TrainJobcontroller not correctly setting thePodSetcount value based onnumNodesfor the expected number of training nodes. (#8135, @kaisoz) -
Fix a bug that WorkloadPriorityClass value changes do not trigger Workload priority updates. (#8442, @ASverdlov)
-
Fix a performance bug as some "read-only" functions would be taking unnecessary "write" lock. (#8181, @ErikJiang)
-
Fix the race condition bug where the kueue_pending_workloads metric may not be updated to 0 after the last
workload is admitted and there are no new workloads incoming. (#8037, @Singularity23x0) -
Fixed a bug that Kueue's scheduler would re-evaluate and update already finished workloads, significantly
impacting overall scheduling throughput. This re-evaluation of a finished workload would be triggered when:- Kueue is restarted
- There is any event related to LimitRange or RuntimeClass instances referenced by the workload (#8186, @mbobrovskyi)
-
Fixed a bug where workloads requesting zero quantity of a resource not defined in the ClusterQueue were incorrectly rejected. (#8241, @IrvingMg)
-
Fixed the following bugs for the StatefulSet integration by ensuring the Workload object
has the ownerReference to the StatefulSet:- Kueue doesn't keep the StatefulSet as deactivated
- Kueue marks the Workload as Finished if all StatefulSet's Pods are deleted
- changing the "queue-name" label could occasionally result in the StatefulSet getting stuck (#4799, @mbobrovskyi)
-
HC: Avoid redundant requeuing of inadmissible workloads when multiple ClusterQueues in the same cohort hierarchy are processed. (#8441, @sohankunkerkar)
-
Integrations based on Pods: skip using finalizers on the Pods created and managed by integrations.
In particular we skip setting finalizers for Pods managed by the built in Serving Workloads Deployments,
StatefulSets, and LeaderWorkerSets.This improves performance of suspending the workloads, and fixes occasional race conditions when a StatefulSet
could get stuck when deactivating and re-activating in a short interval. (#8530, @mbobrovskyi) -
JobFramework: Fixed a bug that allowed a deactivated workload to be activated. (#8424, @chengjoey)
-
Kubeflow TrainJob v2: fix the bug to prevent duplicate pod template overrides when starting the Job is retried. (#8269, @j-skiba)
-
LeaderWorkerSet: Fixed a bug that prevented deleting the workload when the LeaderWorkerSet was scaled down. (#8671, @mbobrovskyi)
-
LeaderWorkerSet: add missing RBAC configuration for editor and viewer roles to kustomize and helm. (#8513, @kannon92)
- MultiKueue now waits for WorkloadAdmitted (instead of QuotaReserved) before deleting workloads from non-selected worker clusters. To revert to the previous behavior, disable the `MultiKueueWaitForWorkloadAdmitted` feature gate. (#8592, @IrvingMg)
- MultiKueue via ClusterProfile: Fix a panic when the configuration for ClusterProfiles wasn't provided in the ConfigMap. (#8071, @mszadkow)
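The `MultiKueueWaitForWorkloadAdmitted` entry above suggests disabling the feature gate to revert the behavior. As a sketch, Kueue feature gates are passed via the `--feature-gates` argument of the kueue-controller-manager Deployment; the surrounding arguments below are illustrative and vary per installation:

```yaml
# Excerpt of the kueue-controller-manager Deployment spec.
# Only the --feature-gates argument is relevant here.
spec:
  template:
    spec:
      containers:
      - name: manager
        args:
        - --config=/controller_manager_config.yaml   # existing args vary per install
        - --feature-gates=MultiKueueWaitForWorkloadAdmitted=false
```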
- MultiKueue: Fix a bug where a priority change made by mutating the `kueue.x-k8s.io/priority-class` label on the management cluster was not propagated to the worker clusters. (#8464, @mbobrovskyi)
- MultiKueue: Fixed status sync for CRD-based jobs (JobSet, Kubeflow, Ray, etc.) that was blocked while the local job was suspended. (#8308, @IrvingMg)
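The `kueue.x-k8s.io/priority-class` label referenced in the MultiKueue priority entry above names a WorkloadPriorityClass and is set on the job object itself. A minimal sketch, with hypothetical job, queue, and class names:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: sample-job                                 # hypothetical name
  labels:
    kueue.x-k8s.io/queue-name: user-queue          # hypothetical LocalQueue
    kueue.x-k8s.io/priority-class: high-priority   # WorkloadPriorityClass name
spec:
  suspend: true   # Kueue admits the job by unsuspending it
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "1"
```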
- MultiKueue: fix a bug where, for the Pod integration, the AdmissionCheck status would be kept Pending indefinitely, even when the Pods are already running. The analogous fix is also applied to batch/Job when the `MultiKueueBatchJobWithManagedBy` feature gate is disabled. (#8189, @IrvingMg)
- MultiKueue: fix the eviction when initiated by the manager cluster (due to, e.g., Preemption or WaitForPodsReady timeout). (#8151, @mbobrovskyi)
- Observability: Revert the changes in PR #8599 for transitioning the QuotaReserved and Admitted conditions to `False` for Finished workloads. This introduced a regression, because users lost the useful information about the timestamp of the last transition of these conditions to True, without an API replacement to serve that information. (#8599, @mbobrovskyi)
- ProvisioningRequest: Fixed a bug that prevented events from being updated when the AdmissionCheck state changed. (#8394, @mbobrovskyi)
- Scheduling: fix a bug that evictions submitted by scheduler (preemptions and eviction due to TAS N...
v0.15.3
Changes since v0.15.2:
Changes by Kind
Feature
Bug or Regression
- Add LWS editor and viewer roles to kustomize and helm (#8515, @kannon92)
- FailureRecovery: Fix the Pod Termination Controller's `MaxConcurrentReconciles` (#8665, @gabesaba)
- Fix ClusterQueue deletion getting stuck when pending workloads are deleted after being assumed by the scheduler. (#8548, @sohankunkerkar)
- Fix a bug where WorkloadPriorityClass value changes did not trigger Workload priority updates. (#8499, @ASverdlov)
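For reference, a WorkloadPriorityClass is a cluster-scoped Kueue object whose `value` field the fix above now keeps in sync with admitted Workloads. A minimal sketch, with a hypothetical name:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: high-priority   # hypothetical name, referenced by the
                        # kueue.x-k8s.io/priority-class label on jobs
value: 10000            # higher value means higher workload priority
description: "Sample class; value changes now propagate to Workloads"
```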
- HC: Avoid redundant requeuing of inadmissible workloads when multiple ClusterQueues in the same cohort hierarchy are processed. (#8510, @sohankunkerkar)
- Integrations based on Pods: skip using finalizers on the Pods created and managed by integrations. In particular, we skip setting finalizers for Pods managed by the built-in serving workloads: Deployments, StatefulSets, and LeaderWorkerSets. This improves the performance of suspending the workloads, and fixes occasional race conditions where a StatefulSet could get stuck when deactivating and re-activating in a short interval. (#8573, @mbobrovskyi)
- JobFramework: Fixed a bug that allowed a deactivated workload to be activated. (#8438, @chengjoey)
- LeaderWorkerSet: Fixed a bug that prevented deleting the workload when the LeaderWorkerSet was scaled down. (#8673, @mbobrovskyi)
- MultiKueue now waits for WorkloadAdmitted (instead of QuotaReserved) before deleting workloads from non-selected worker clusters. To revert to the previous behavior, disable the `MultiKueueWaitForWorkloadAdmitted` feature gate. (#8600, @IrvingMg)
- MultiKueue: Fix a bug where a priority change made by mutating the `kueue.x-k8s.io/priority-class` label on the management cluster was not propagated to the worker clusters. (#8574, @mbobrovskyi)
- MultiKueue: fix the eviction when initiated by the manager cluster (due to, e.g., Preemption or WaitForPodsReady timeout). (#8402, @mbobrovskyi)
- ProvisioningRequest: Fixed a bug that prevented events from being updated when the AdmissionCheck state changed. (#8404, @mbobrovskyi)
- Revert the changes in PR #8599 for transitioning the QuotaReserved and Admitted conditions to `False` for Finished workloads. This introduced a regression, because users lost the useful information about the timestamp of the last transition of these conditions to True, without an API replacement to serve that information. (#8612, @mbobrovskyi)
- Scheduling: fix a bug where setting a workload priority class label (`kueue.x-k8s.io/priority-class`) that was previously absent was ignored. (#8584, @andrewseif)
- TAS: Fix a bug where Pod indexes for MPIJob with `runLauncherAsWorker` were not correctly evaluated during rank-based ordering assignments. (#8663, @tenzen-y)
- TAS: Fixed an issue where workloads could remain in the second-pass scheduling queue (used for the integration of TAS with ProvisioningRequests, and for TAS Node Hot Swap) even if they no longer need to be in the queue. (#8431, @skools-here)
- TAS: fix the TAS resource flavor controller to extract only scheduling-relevant node updates, preventing unnecessary reconciliation. (#8453, @Ladicle)
- TAS: significantly improves scheduling performance by replacing Pod listing with an event-driven cache for non-TAS Pods, thereby avoiding expensive DeepCopy operations during each scheduling cycle. (#8484, @gabesaba)
Full Changelog: v0.15.2...v0.15.3