-[Controller Hot-Loop When Feature Gate is Disabled](#controller-hot-loop-when-feature-gate-is-disabled)
-[Cross-SIG Impact](#cross-sig-impact)
@@ -69,7 +68,7 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 -[ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
 -[ ] (R) Production readiness review completed
 -[ ] (R) Production readiness review approved
--[] "Implementation History" section is up-to-date for milestone
+-[x] "Implementation History" section is up-to-date for milestone
 -[ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
-[ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
@@ -405,7 +404,7 @@ spec:
 - **Non-Numeric Taint Values**: When a pod toleration uses `Lt` or `Gt` operators, it only matches taints with numeric values. If a node has a taint with a non-numeric value, the toleration will not match, and the pod cannot schedule on that node.
 - Pod toleration: `{key: "node.kubernetes.io/sla", operator: "Gt", value: "900"}`
 - **Result**: Toleration does not match and pod cannot schedule on this node
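The non-numeric case above can be sketched in Go. This is a minimal illustration, not the scheduler's actual code: `sketchTaint`, `sketchToleration`, and `toleratesNumeric` are hypothetical names, and the sketch assumes that `Gt` matches when the taint value is greater than the toleration value (and `Lt` when it is smaller).

```go
package main

import "strconv"

// sketchTaint and sketchToleration are simplified, hypothetical stand-ins
// for the real API types, reduced to the fields this sketch needs.
type sketchTaint struct {
	Key   string
	Value string
}

type sketchToleration struct {
	Key      string
	Operator string // only "Gt" and "Lt" are handled here
	Value    string
}

// toleratesNumeric reports whether tol matches taint under the assumed
// semantics: Gt matches when taint value > toleration value, Lt when
// taint value < toleration value.
func toleratesNumeric(tol sketchToleration, taint sketchTaint) bool {
	if tol.Key != taint.Key {
		return false
	}
	tolVal, err := strconv.ParseInt(tol.Value, 10, 64)
	if err != nil {
		return false // API validation is expected to reject this earlier
	}
	taintVal, err := strconv.ParseInt(taint.Value, 10, 64)
	if err != nil {
		return false // a non-numeric taint value never matches Gt/Lt
	}
	switch tol.Operator {
	case "Gt":
		return taintVal > tolVal
	case "Lt":
		return taintVal < tolVal
	default:
		return false
	}
}
```

With the toleration from the example (`Gt`, `"900"`), a taint value such as `high` fails to parse, so the function returns `false` and the pod stays unscheduled on that node.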
@@ -432,20 +431,6 @@ spec:
 - Consider caching parsed values in scheduler data structures if performance issues arise
 - Feature gate allows disabling if performance problems occur

-#### Edge Cases in Numeric Parsing
-
-**Risk**: Unexpected behavior with edge cases like integer overflow, leading zeros, or malformed input could cause scheduling failures. Leading zeros in values (e.g., `"0950"`) could create user confusion about whether values are treated as strings or numbers.
-
-**Mitigation**:
-
-- Use Go's standard `strconv.ParseInt()` with well-defined error handling
-- Comprehensive unit tests covering edge cases (overflow, underflow, malformed strings, leading zeros)
-- API validation rejects pods with unparseable values rather than silently failing
-- **API validation explicitly rejects values with leading zeros** when using numeric operators to eliminate confusion
-- Clear error messages help users identify and fix configuration issues
-- Documentation clearly states that leading zeros are not permitted for numeric operators
-- **Performance validation via scheduler-perf tests** to ensure no measurable scheduling latency degradation from integer parsing overhead
-
 #### Taint Misconfiguration Detection

 **Risk**: Node taints intended for numeric comparison may contain non-numeric values (e.g., `node.kubernetes.io/sla=high` instead of `node.kubernetes.io/sla=950`). Since taint values are not validated at node registration time, these misconfigurations are only detected during scheduling when a pod with `Lt`/`Gt` tolerations attempts to match. This can lead to pods remaining in `Pending` state without clear indication of the root cause.
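One way to surface this kind of misconfiguration before it shows up as `Pending` pods is to scan node taints for values that do not parse as integers. The following is a rough sketch under stated assumptions: `nodeTaint` and `findNonNumericTaints` are hypothetical names, and the taint list is assumed to have been collected elsewhere (for example, from a node listing).

```go
package main

import (
	"fmt"
	"strconv"
)

// nodeTaint is a hypothetical flattened record: one taint on one node.
type nodeTaint struct {
	Node  string
	Key   string
	Value string
}

// findNonNumericTaints returns one message per taint under the given key
// whose value cannot be parsed as a base-10 64-bit integer.
func findNonNumericTaints(taints []nodeTaint, key string) []string {
	var problems []string
	for _, t := range taints {
		if t.Key != key {
			continue
		}
		if _, err := strconv.ParseInt(t.Value, 10, 64); err != nil {
			problems = append(problems,
				fmt.Sprintf("node %s: taint %s=%q is not an integer", t.Node, t.Key, t.Value))
		}
	}
	return problems
}
```

Run against a list containing `node.kubernetes.io/sla=high` and `node.kubernetes.io/sla=950`, the function reports only the first taint.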
@@ -558,12 +543,6 @@ func validateTolerations(tolerations []core.Toleration, fldPath *field.Path) fie
 toleration.Value, "value must be a valid integer for numeric operators"))
 continue
 }
-
-// Reject values with leading zeros to prevent confusion
-toleration.Value, "leading zeros are not allowed in numeric values (use '950' instead of '0950')"))
-}
 }
 }
 return allErrors
@@ -625,7 +604,7 @@ N/A

 ##### Unit tests

-All core changes must be covered by unit tests, in both Taint API, validation, and scheduler sides. Tests must specifically cover leading zeros behavior (e.g., `"0950"` vs `"950"`):
+All core changes must be covered by unit tests, in both Taint API, validation, and scheduler sides.
@@ -649,7 +628,7 @@ Update the following integration tests to include new operators:
 - Dynamic taint addition/removal
 - Pod rescheduling after taint changes
 - Integration with NodeAffinity
-- Feature gate on/off
+- Feature gate on/off

 ##### e2e tests
@@ -682,9 +661,11 @@ The existing e2e tests will be extended to cover the new taints cases introduced
 ### Upgrade / Downgrade Strategy

 #### Upgrade
+
 Enable the feature gate in kube-apiserver first, then in kube-scheduler. This ensures the API server can accept and validate pods with the new operators before the kube-scheduler tries to process them.

 #### Downgrade
+
 Disable the feature gate in kube-scheduler first, then in kube-apiserver. This stops the kube-scheduler from processing the new operators before the API server stops accepting new pods that use them, which prevents the scheduler from trying to handle features the API server would reject.

 **What happens when the scheduler doesn't recognize Gt/Lt operators:**
@@ -695,7 +676,7 @@ When the feature gate is disabled and the scheduler encounters a pod with `Gt`/`
 - Pod is considered to have untolerated taints
 - Filter returns `UnschedulableAndUnresolvable` status
 - Pod remains in Pending state.
-- Feature gate on/off test cases
+- Feature gate on/off test cases
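The gate-disabled behavior listed above can be modeled with a small Go sketch. It is illustrative only and does not reproduce the TaintToleration plugin: `gatedToleration`, `filterStatus`, and the `numericOpsEnabled` flag are hypothetical, the statuses are plain strings, and `Gt`/`Lt` are assumed to compare the taint value against the toleration value.

```go
package main

import "strconv"

// gatedToleration is a simplified, hypothetical stand-in for a pod toleration.
type gatedToleration struct {
	Key      string
	Operator string
	Value    string
}

// filterStatus returns "Success" if any toleration matches the taint, and
// "UnschedulableAndUnresolvable" otherwise. When numericOpsEnabled is false,
// Gt and Lt are treated as unrecognized and therefore never match.
func filterStatus(tols []gatedToleration, taintKey, taintValue string, numericOpsEnabled bool) string {
	for _, tol := range tols {
		if tol.Key != taintKey {
			continue
		}
		switch tol.Operator {
		case "Exists":
			return "Success"
		case "Equal":
			if tol.Value == taintValue {
				return "Success"
			}
		case "Gt", "Lt":
			if !numericOpsEnabled {
				continue // gate off: the taint stays untolerated
			}
			want, errTol := strconv.ParseInt(tol.Value, 10, 64)
			got, errTaint := strconv.ParseInt(taintValue, 10, 64)
			if errTol != nil || errTaint != nil {
				continue
			}
			if (tol.Operator == "Gt" && got > want) || (tol.Operator == "Lt" && got < want) {
				return "Success"
			}
		}
	}
	return "UnschedulableAndUnresolvable"
}
```

Calling `filterStatus` with the same `Gt` toleration and taint value while flipping only `numericOpsEnabled` mirrors the feature gate on/off test cases.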

 ### Version Skew Strategy
@@ -728,12 +709,12 @@ Impact on existing pods with Gt/Lt operators when feature is disabled:

 1. **Already-running pods**: Continue running normally. The kubelet doesn't need to re-evaluate tolerations for running pods.

-2. **Unscheduled/pending pods**:
+2. **Unscheduled/pending pods**:
 - Remain in the cluster but cannot be scheduled
 - The scheduler's TaintToleration plugin won't recognize Gt/Lt operators and will treat them as non-matching
 - These pods will remain in Pending state with events indicating untolerated taints

-3. **New pod creation**:
+3. **New pod creation**:
 - API server validation will **reject** new pods with Gt/Lt operators
- Pods must specify exact SLA values, not ranges. A pod cannot say "accept any node with SLA > 950"
-- Multiple tolerations required: If nodes have varying SLA values (e.g., 950, 960, 970, 980, 990), pods need separate `Equal` tolerations for each value they're willing to accept:
-```yaml
-tolerations:
-- key: node.kubernetes.io/sla
-  operator: Equal
-  value: "950"
-- key: node.kubernetes.io/sla
-  operator: Equal
-  value: "960"
-- key: node.kubernetes.io/sla
-  operator: Equal
-  value: "970"
-# ... and so on
-```
-- Poor semantics for "best effort" workloads since you can't easily express "I'll take any spot/preemptible node regardless of SLA" without enumerating all possible low-SLA values
-- Changes to node SLA classification schemes require updating all pod manifests
-
+1. **New Dedicated SLA API Resource:** Create `SLAPolicy` CRD
3. **Node Labels + Enhanced NodeAffinity:** Use labels instead of taints, extend NodeAffinity matching.
+- **Pros:** Leverages existing label system.
+- **Cons:**
+  - No default push-back behavior
+  - No eviction semantics
+  - Labels aren't meant for operational constraints.
+
+4. **Add Separate `NumValue int64` Field:** Add a dedicated numeric field alongside the existing `Value string` field in Taint/Toleration structs.
+- **Pros:**
+  - Eliminates parsing overhead and errors
+  - Type-safe integer handling
+  - Better performance for numeric comparisons
+- **Cons:**
+  - Not aesthetically pleasing API design with dual fields
+  - Users might set wrong field or both fields accidentally
+  - Complex validation logic for field combinations
+  - Memory/storage overhead for additional field
+  - API complexity and documentation burden
+
+5. **Use Existing `Equal` Operator with Numeric Values (No New Operators):** Instead of introducing `Lt`/`Gt`, use the existing `Equal` operator with numeric taint values. For example:
- Pods must specify exact SLA values, not ranges. A pod cannot say "accept any node with SLA > 950"
+- Multiple tolerations required: If nodes have varying SLA values (e.g., 950, 960, 970, 980, 990), pods need separate `Equal` tolerations for each value they're willing to accept:
+```yaml
+tolerations:
+- key: node.kubernetes.io/sla
+  operator: Equal
+  value: "950"
+- key: node.kubernetes.io/sla
+  operator: Equal
+  value: "960"
+- key: node.kubernetes.io/sla
+  operator: Equal
+  value: "970"
+...
+```
+- Poor semantics for "best effort" workloads since you can't easily express "I'll take any spot/preemptible node regardless of SLA" without enumerating all possible low-SLA values
+- Changes to node SLA classification schemes require updating all pod manifests
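To make the enumeration cost of the `Equal`-only alternative concrete, here is a small Go sketch that builds the toleration list a pod would need. `equalToleration` and `equalTolerationsFor` are hypothetical names used only for illustration; the point is that the list grows by one entry per accepted SLA value.

```go
package main

import "strconv"

// equalToleration is a simplified, hypothetical stand-in for a pod toleration.
type equalToleration struct {
	Key      string
	Operator string
	Value    string
}

// equalTolerationsFor builds one Equal toleration per accepted SLA value,
// which is what the Equal-only alternative requires of every pod manifest.
func equalTolerationsFor(key string, slaValues []int64) []equalToleration {
	out := make([]equalToleration, 0, len(slaValues))
	for _, v := range slaValues {
		out = append(out, equalToleration{
			Key:      key,
			Operator: "Equal",
			Value:    strconv.FormatInt(v, 10),
		})
	}
	return out
}
```

For the five SLA values mentioned above (950 through 990), this yields five tolerations, and any change to the classification scheme means regenerating the list for every pod.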