From 361c1c94a58d8f8f48bab1996a69298ef079f384 Mon Sep 17 00:00:00 2001
From: Ajay Sundar Karuppasamy
Date: Wed, 24 Sep 2025 17:32:03 +0000
Subject: [PATCH] add kep 5359 for swap api

---
 .../5359-swap-awareness-api/README.md         | 826 ++++++++++++++++++
 .../sig-node/5359-swap-awareness-api/kep.yaml |  44 +
 2 files changed, 870 insertions(+)
 create mode 100644 keps/sig-node/5359-swap-awareness-api/README.md
 create mode 100644 keps/sig-node/5359-swap-awareness-api/kep.yaml

diff --git a/keps/sig-node/5359-swap-awareness-api/README.md b/keps/sig-node/5359-swap-awareness-api/README.md
new file mode 100644
index 00000000000..7153b3e2af9
--- /dev/null
+++ b/keps/sig-node/5359-swap-awareness-api/README.md
@@ -0,0 +1,826 @@
+# KEP-5359: Swap Awareness API
+
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non Goals](#non-goals)
+- [Proposal](#proposal)
+  - [User Stories](#user-stories)
+    - [Use Case 1: Swap-Disabled](#use-case-1-swap-disabled)
+    - [Use Case 2: Explicit limits on swap usage](#use-case-2-explicit-limits-on-swap-usage)
+    - [Use Case 3: Swap in Guaranteed pods](#use-case-3-swap-in-guaranteed-pods)
+  - [Notes / Constraints / Caveats](#notes--constraints--caveats)
+  - [Risks and Mitigations](#risks-and-mitigations)
+- [Design Details](#design-details)
+  - [Node Configuration](#node-configuration)
+  - [Proposed Design: Limits-Only Model](#proposed-design-limits-only-model)
+  - [Swap limit semantics](#swap-limit-semantics)
+  - [NodeInfo Exposure](#nodeinfo-exposure)
+  - [User Experience Examples](#user-experience-examples)
+    - [Use Case 1: Swap-Disabled Workload](#use-case-1-swap-disabled-workload)
+    - [Use Case 2: Swap-Enabled Workload](#use-case-2-swap-enabled-workload)
+    - [Use Case 3: Unlimited Swap](#use-case-3-unlimited-swap)
+- [Test Plan](#test-plan)
+- [Graduation Criteria](#graduation-criteria)
+  - [Alpha](#alpha)
+  - [Beta](#beta)
+  - [GA](#ga)
+- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
+- [Version Skew Strategy](#version-skew-strategy)
+- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
+  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
+  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
+  - [Monitoring Requirements](#monitoring-requirements)
+  - [Dependencies](#dependencies)
+  - [Scalability](#scalability)
+  - [Troubleshooting](#troubleshooting)
+- [Implementation History](#implementation-history)
+- [Drawbacks](#drawbacks)
+- [Alternatives](#alternatives)
+  - [Using Dynamic Resource Allocation](#using-dynamic-resource-allocation)
+- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
+
+## Summary
+
+This KEP proposes a new API that allows users to control the amount of swap a
+container can use. The current swap behavior in Kubernetes is implicit and can
+lead to under-utilization of the swap provisioned on a node. Explicit API
+control over swap enables Kubernetes users to manage swap decisions for their
+workloads directly, rather than relying on assumptions about their preferences.
+This KEP also proposes removing the existing swap restrictions on other
+features and QoS classes, e.g. In-Place Pod Resize and CPU pinning in the
+Guaranteed class, allowing those workloads to benefit from swap as well. The
+proposal introduces a "WorkloadControlledSwap" mode in which swap usage is
+explicitly defined by the user per container. This allows for better resource
+management and safer overcommitment of swap resources on a node.
+
+## Motivation
+
+This KEP aims to give Kubernetes workloads greater control over swap usage,
+addressing limitations of the current "LimitedSwap" mode. It allows application
+owners to disable swap, or to provision a larger swap allowance for their
+containers, as best fits their needs. Enabling workloads to define swap limits
+promotes safer, more efficient swap usage, balancing performance, cost and OOM
+protection.
+
+### Goals
+
+To effectively manage swap utilization in workloads, the primary goals of this
+KEP are to
+
+- provide an API that allows application owners to specify the degree of
+  swap an application can use.
+- offer the ability to disable swap entirely for a container by setting
+  `swap.limit=0`.
+- enable workloads to declare the maximum _acceptable_ swap limits for
+  their containers.
+- enable users to configure swap with other QoS classes, such as
+  Guaranteed, while still giving the required protection by keeping the
+  default setting as 'disabled'.
+- allow safe overcommitment of swap to fully leverage available node capacity.
+- facilitate Kubernetes node features such as In-Place Pod Resize and CPU
+  pinning on swap-enabled nodes by eliminating implicit swap assumptions on
+  pods.
+
+### Non Goals
+
+- define new swap scheduling behavior for workloads; this is managed by a
+  separate KEP for placement control.
+- change eviction behavior for swap-enabled nodes; this will be
+  investigated in a separate future KEP if improvements are needed.
+
+## Proposal
+
+This proposal introduces a new `swapBehavior` mode in the `kubeletConfiguration`
+called `WorkloadControlledSwap`. When this mode is enabled on a node, swap usage
+is no longer implicitly calculated (as in `LimitedSwap` mode) but is instead
+explicitly defined by the user on a per-container basis.
+
+This is achieved by introducing a new `swap` resource field under
+`resources.limits` for a container. This "limits-only" model allows users to
+specify the maximum amount of swap a container can use. If this limit is not
+specified, the container will not be allowed to use swap, providing a safe
+default.
+
+This explicit per-container limit allows for:
+
+1. Disabling swap for specific containers by setting `swap: "0"`.
+1. Granting specific swap allowances to containers that can benefit from it,
+   e.g. `swap: "1Gi"`.
+1. Enabling swap for QoS classes that were previously incompatible, like
+   Guaranteed pods, because the user intent is now explicit.
+1. Removing restrictions on the In-Place Pod Resize feature on swap-enabled
+   nodes, as resizing memory limits no longer has any side effects on swap.
+1. Safer overcommitment of swap on a node, as the control is granular.
+
+### User Stories
+
+#### Use Case 1: Swap-Disabled
+
+A user wants to run a workload that should never use swap.
+
+#### Use Case 2: Explicit limits on swap usage
+
+Modern applications with multiple containers often have varying swap
+requirements, e.g. a log uploader might tolerate more swap than the main
+web server.
+
+#### Use Case 3: Swap in Guaranteed pods
+
+A user has a Guaranteed pod (with CPU pinning) that runs a memory-intensive
+process. They want to allow this pod to use a small, fixed amount of swap as a
+safety net against OOM kills, which was previously not possible.
+
+### Notes / Constraints / Caveats
+
+1. **Why is swap not an allocatable resource?**
+
+Swap is not modeled as a conventional, allocatable resource because swap is
+only consumed when memory pressure occurs.
+If swap space were 'accounted for' without
+being actively used, it could result in scenarios where swap is reserved
+unnecessarily, leading to underutilization of other available resources. If
+use cases for `resources.swap` arise in the future, this can be revisited.
+
+1. **The "swap:0" placement problem**
+
+A key question is whether `swap: "0"` controls placement or just usage. This
+proposal adopts the position that limits control usage, not placement.
+
+- Swap limits are managed at the container level, while placement is
+  determined at the pod level. A "swap:0" container can coexist with
+  another workload that utilizes swap.
+- If workload separation for swap is desired, explicit placement controls
+  like taints or nodeSelector should be the preferred option, separating the
+  API concerns of workload placement from resource usage.
+- `limits` should not overload the meaning of "swap:0" to mean "I require a
+  non-swap node". Swap-aware scheduling is investigated in a separate KEP
+  (xref: [#5424](https://github.com/kubernetes/enhancements/issues/5424)).
+
+### Risks and Mitigations
+
+<<[UNRESOLVED kannon@]>>
+
+1. **Risk: Discuss the implications of overcommitting swap further**
+
+- What should Kubernetes do to ensure a node does not end up in a state where
+  all pods have swap provisioned but can no longer utilize it?
+
+> This can be addressed by the operator provisioning additional swap in
+response. Kubernetes cannot react when swap is full because swap is a node
+resource; with a better observability story this concern could be reduced.
+
+- A better observability story is needed; users should be able to tell when
+  swap is overcommitted or a swap capacity crunch is approaching.
+
+> We already have swap metrics for capacity and usage at the node, pod and
+container level.
+> - Would a new metric such as `kubelet_node_swap_allocated_bytes` address this
+concern?
+> - This would be the sum of `resources.limits.swap` across all containers
+running on that node. It would help operators create precise alerts on their
+(allocated / capacity) ratio.
+> - A new `SwapPressure` condition in NPD.
+
+<<[/UNRESOLVED]>>
+
+1. Risk: User confusion between `LimitedSwap` and `WorkloadControlledSwap`
+   modes.
+
+Mitigation: the node's swap behavior will be exposed as a field in `nodeInfo`
+so that it is observable by the user.
+
+## Design Details
+
+### Node Configuration
+
+A new `swapBehavior` value is introduced in the `kubeletConfiguration`:
+
+```yaml
+kubeletConfiguration:
+  memorySwap:
+    swapBehavior: "WorkloadControlledSwap" # Node-level swap enabled, but workloads control usage
+```
+
+### Proposed Design: Limits-Only Model
+
+Swap limits are configured per container for a cleaner resource model. This
+avoids the ambiguity of swap requests.
+
+- **Rationale:** a "policy" fits naturally at the pod level, but swap "limits"
+  are container specific, since the kernel accounts swap per process. Starting
+  with container-level limits gives us flexibility and an unambiguous design.
+  If we started with pod-level limits, the limit would implicitly apply to all
+  containers, and we would need to reconsider how to support individual
+  container limits in the future (e.g. would a container limit override the
+  pod limit?).
+- This also avoids having to handle conflicts with the current
+  `PodLevelResource` behavior of applying the limit as the request and using
+  it at admission time.
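+
+To make the limits-only model more concrete, below is a minimal, illustrative
+sketch (in Go) of how the new resource name and its validation could look. The
+`ResourceSwap` constant and the `ValidateContainerSwap` helper are hypothetical
+names used only for this sketch; the actual constants and validation logic
+would live in the API packages listed in the test plan.
+
+```go
+// Illustrative sketch only: not part of the proposed API surface.
+package swapapi
+
+import (
+    "fmt"
+
+    v1 "k8s.io/api/core/v1"
+)
+
+// ResourceSwap is a hypothetical resource name for the new limits-only entry.
+const ResourceSwap v1.ResourceName = "swap"
+
+// ValidateContainerSwap enforces the limits-only model: "swap" may appear
+// under resources.limits but never under resources.requests, and it must be
+// non-negative. An unset limit means the safe default of no swap.
+func ValidateContainerSwap(c *v1.Container) error {
+    if _, ok := c.Resources.Requests[ResourceSwap]; ok {
+        return fmt.Errorf("container %q: swap may only be set under resources.limits", c.Name)
+    }
+    if limit, ok := c.Resources.Limits[ResourceSwap]; ok && limit.Sign() < 0 {
+        return fmt.Errorf("container %q: swap limit must be >= 0", c.Name)
+    }
+    return nil
+}
+```
+
+The corresponding user-facing container spec is shown below.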
+
+```yaml
+resources:
+  limits:
+    memory: "2Gi"
+    swap: "1Gi"   # Maximum swap this container can use
+  requests:
+    memory: "1Gi"
+    # No swap 'requests' as this doesn't make sense
+```
+
+### Swap limit semantics
+
+The default behavior for all pods in "WorkloadControlledSwap" mode is "No swap"
+(`swap=0`).
+
+| workload behavior \ mode | NoSwap | LimitedSwap | WorkloadControlledSwap |
+|---|---|---|---|
+| No explicit swap limit - Burstable QoS | will not swap | swap as per calculated limit | will not swap (default) |
+| No explicit swap limit - Guaranteed / BestEffort | will not swap | will not swap | will not swap (default) |
+| `swap.limit` set | will not swap (no effect) | swap as per calculated limit (user limit has no effect) | maximum swap as per user request |
+| `swap.limit=0` (disable) | will not swap | swap as per calculated limit for Burstable | will not swap |
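+
+To illustrate how these semantics could translate into the cgroup v2 setting
+the kubelet ultimately writes (`memory.swap.max`), here is a rough sketch. The
+function and parameter names are hypothetical, and the `LimitedSwap` formula
+shown (proportional to the container's memory request, as described in
+KEP-2400) is a simplification of the existing behavior.
+
+```go
+// Illustrative sketch only: how a kubelet might derive the cgroup v2
+// memory.swap.max value for a container under each swapBehavior mode.
+package swapmapping
+
+// SwapBehavior mirrors the kubelet configuration modes discussed above.
+type SwapBehavior string
+
+const (
+    NoSwap                 SwapBehavior = "NoSwap"
+    LimitedSwap            SwapBehavior = "LimitedSwap"
+    WorkloadControlledSwap SwapBehavior = "WorkloadControlledSwap"
+)
+
+// SwapMaxBytes returns the value (in bytes) to write to memory.swap.max.
+// swapLimitBytes is the container's resources.limits.swap; nil means unset.
+func SwapMaxBytes(mode SwapBehavior, memRequestBytes, nodeMemBytes, nodeSwapBytes int64, swapLimitBytes *int64) int64 {
+    switch mode {
+    case LimitedSwap:
+        // Existing behavior: swap is apportioned in proportion to the
+        // container's memory request (Burstable QoS only; Guaranteed and
+        // BestEffort pods get no swap, omitted here for brevity). Any
+        // user-specified limit is ignored.
+        if nodeMemBytes <= 0 {
+            return 0
+        }
+        return int64(float64(memRequestBytes) / float64(nodeMemBytes) * float64(nodeSwapBytes))
+    case WorkloadControlledSwap:
+        // Proposed behavior: honor the explicit limit; default to no swap.
+        if swapLimitBytes == nil {
+            return 0
+        }
+        return *swapLimitBytes
+    default:
+        // NoSwap (and any unknown mode) disables swap.
+        return 0
+    }
+}
+```
+
+Under this sketch, a container that sets `swap: "1Gi"` on a
+`WorkloadControlledSwap` node would get `memory.swap.max = 1Gi`, while the same
+container on a `LimitedSwap` node would still receive the proportionally
+calculated value, matching the table above.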
+
+**Note on placement:** When an explicit limit is set, the
+`NodeDeclaredFeatures` KEP is an option to explore for implicit scheduling
+control; a separate Swap Scheduling KEP is investigating this further. An
+explicit limit is a clear signal of user intent for the workload to run on a
+node in `WorkloadControlledSwap` mode, and the scheduler could use it to place
+such workloads appropriately. It could also protect against the conflicting
+case by not placing workloads that explicitly disable swap on a `LimitedSwap`
+node.
+
+**Note on coexistence:** Kubernetes cannot provide built-in protection when
+users run some nodes in `LimitedSwap` mode and others in
+`WorkloadControlledSwap` mode within the same cluster. This placement control
+can be achieved with taints or label selectors. NFD (Node Feature Discovery) is
+seen as the path for swap labels, which will help with grouping swap nodes for
+maintenance or migration. Existing workloads on `LimitedSwap` nodes will
+continue to work, preserving the existing behavior of swap-enabled nodes.
+
+<<[UNRESOLVED skanzhelev@]>>
+When do we even need `LimitedSwap`? As `WorkloadControlledSwap` is more
+powerful, why do we need the limited mode? Should these be separate modes, or
+should this be the implicit behavior of the swap design?
+
+With `LimitedSwap` already being adopted in production by many users,
+overriding its behavior may not be preferred. `WorkloadControlledSwap` enables
+the additional use cases and can coexist with the current behavior.
+<<[/UNRESOLVED]>>
+
+### NodeInfo Exposure
+
+Swap behavior will be exposed in the `Node` status to enable monitoring and
+selection:
+
+```
+nodeInfo:
+  ...
+  swap:
+    behavior: WorkloadControlledSwap
+    capacity: 53687087104
+```
+
+This makes a node's swap configuration observable to operators and tooling for
+monitoring and node selection.
+
+### User Experience Examples
+
+#### Use Case 1: Swap-Disabled Workload
+
+Disabling swap can be achieved by setting `swap: "0"`. The `nodeSelector` is
+used for an explicit placement preference with NFD.
+
+```yaml
+# I don't want swap, prefer non-swap nodes
+spec:
+  nodeSelector:
+    feature.node.kubernetes.io/memory-swap: "false"
+  containers:
+  - resources:
+      limits:
+        memory: "2Gi"
+        swap: "0"
+```
+
+#### Use Case 2: Swap-Enabled Workload
+
+```yaml
+# I want swap capability, place only on a swap-enabled node with LimitedSwap
+spec:
+  nodeSelector:
+    feature.node.kubernetes.io/memory-swap: "true"
+    feature.node.kubernetes.io/memory-swap.behavior: LimitedSwap
+  containers:
+  - resources:
+      limits:
+        memory: "2Gi"
+        swap: "1Gi"
+```
+
+#### Use Case 3: Unlimited Swap
+
+```yaml
+# I want as much swap as the node allows
+spec:
+  nodeSelector:
+    feature.node.kubernetes.io/memory-swap: "true"
+    feature.node.kubernetes.io/memory-swap.behavior: WorkloadControlledSwap
+  containers:
+  - resources:
+      limits:
+        memory: "2Gi"
+        swap: "8Gi" # Large limit = effectively unlimited
+```
+
+## Test Plan
+
+1. I/we understand the owners of the involved components may require
+   updates to existing tests to make this code solid enough prior to
+   committing the changes necessary to implement this enhancement.
+
+**Unit Tests**
+
+- `k8s.io/apis/core`
+- `k8s.io/apis/core/v1/validations`
+- `k8s.io/features`
+- `k8s.io/kubelet`
+- `k8s.io/kubelet/container`
+
+**Integration Tests**
+
+Unit and E2E tests provide sufficient coverage for the feature. Integration
+tests may be added to cover any gaps that are discovered in the future.
+
+**e2e tests**
+ - Verify pod with explicit swap on `WorkloadControlledSwap` node uses swap.
+ - Verify pod with no limit on `WorkloadControlledSwap` node does not use swap. + - Verify pod with `swap:"0"` on `WorkloadControlledSwap` node does not use swap. + - Verify that a Guaranteed pod with explicit swap set on `WorkloadControlledSwap` + node uses swap. + +## Graduation Criteria + +### Alpha + +- Feature implemented behind a feature flag `WorkloadControlledSwap` +- Initial e2e tests completed and enabled. +- Public documentation on workload controlled swap is updated. + +### Beta + +- API controlled swap functionality is running behind feature flag for at least one release. +- No major bugs reported and user feedback is positive. + +### GA + +- No major bugs reported for three months. + +## Upgrade / Downgrade Strategy + +API server should be upgraded before Kubelets. Kubelets should be downgraded +before the API server. + +## Version Skew Strategy + +## Production Readiness Review Questionnaire + + + +### Feature Enablement and Rollback + + + +###### How can this feature be enabled / disabled in a live cluster? + + + +- [x] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: `WorkloadControlledSwap` + - Components depending on the feature gate: kubelet, kube-apiserver + +###### Does enabling the feature change any default behavior? + + +Yes. KEP introduces safe default with WorkloadControlledSwap - if explicitly specified use the limits for swap, otherwise set it as 0 (no swap). To ensure backward compatibility, this change will be a new node behavior, so existing users who are working with the LimitedSwap swap behavior will not be impacted. The api set limits are not applicable in LimitedSwap configured nodes. + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + + + +Yes. To roll back, the feature gate should be disabled in the API server and +kubelets, and components should be restarted. If a Pod was created with a +`resources.limits.swap` field while the gate was enabled, those will be ignored by +kubelets once the feature is disabled. + +###### What happens if we reenable the feature if it was previously rolled back? + +If the feature is re-enabled, the kubelet will once again recognize and enforce +the swap limits for any Pods that have the field defined. + +###### Are there any tests for feature enablement/disablement? + + + +- Unit test for the API's validation with the feature enabled and disabled. +- Unit test for the kubelet with the feature enabled and disabled. +- Unit test for API on the new field. First enable the feature gate, create a Pod with a container including `resources.limits.swap` field, validation should pass and the Pod API should match the expected result. Second, disable the feature gate, validate the Pod API should still pass and it should match the expected result. Lastly, re-enable the feature gate, validate the Pod API should pass and it should match the expected result. + +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout or rollback fail? Can it impact already running workloads? + + + +If this feature is being actively used in a cluster that has this feature +partially enabled on some nodes, pods on nodes with WorkloadControlledSwap +enabled may configure different swap limits than pods on nodes without this +feature. + +###### What specific metrics should inform a rollback? + + + +Swap is not configured on the workload even when limits are specified. + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? 
+ + + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + + +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? + + + +###### How can someone using this feature know that it is working for their instance? + + + +- [ ] Events + - Event Reason: +- [ ] API .status + - Condition name: + - Other field: +- [ ] Other (treat as last resort) + - Details: + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + + +- [ ] Metrics + - Metric name: + - [Optional] Aggregation method: + - Components exposing the metric: +- [ ] Other (treat as last resort) + - Details: + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + + + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + + +### Scalability + + + +###### Will enabling / using this feature result in any new API calls? + + + +No. + +###### Will enabling / using this feature result in introducing new API types? + + + +Enabling this feature will introduce a new field `resources.limits.swap` to the [Container](https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/core/types.go#L2601) API spec. + +###### Will enabling / using this feature result in any new calls to the cloud provider? + + + +No. + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + + + +This feature adds a new key-value pair to the resources.limits map within the [v1.Container](https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/core/types.go#L2601) spec for each container that specifies a swap limit. Key: "swap" (4 bytes) and Value: a string like "1Gi" (3 bytes) or "500Mi" (5 bytes). The total increase per container could be 10-15 bytes per container. + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + + + +No. + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + + + +When the workload is configured with swap and node is under memory pressure, swap utilization may result in increased CPU and I/O usage to offload memory (RAM) to disk. + + +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + + + +Enabling this feature will add swap utilization for the workload and can result in resource exhaustion of 'swap resource' if swap is overcommitted. + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? + +###### What are other known failure modes? + + + +###### What steps should be taken if SLOs are not being met to determine the problem? + +## Implementation History + + + +## Drawbacks + + + +## Alternatives + + + +### Using Dynamic Resource Allocation + +Another possible way to realize this KEP is to leverage Dynamic Resource Allocation (DRA) framework to manage swap. In this model, "swap" could be defined as a ResourceClass, and pods would use a ResourceClaim to request a specific swap limit. DRA requires a full ecosystem of CRDs, a node-level driver, and Kubelet plugins. 
This is massive overhead for what is ultimately setting a single cgroup value (`memory.swap.max`). The simplicity of the `resources.limits` approach is preferable over the complex DRA approach. + + +## Infrastructure Needed (Optional) + + + + diff --git a/keps/sig-node/5359-swap-awareness-api/kep.yaml b/keps/sig-node/5359-swap-awareness-api/kep.yaml new file mode 100644 index 00000000000..f38dfc7b48c --- /dev/null +++ b/keps/sig-node/5359-swap-awareness-api/kep.yaml @@ -0,0 +1,44 @@ +title: Swap Awareness API +kep-number: 5359 +authors: + - "@ajaysundar" +owning-sig: sig-node +participating-sigs: +status: provisional +creation-date: 2025-09-24 +reviewers: + - "@SergeyKanzhelev" + - "@kannon92" +approvers: + - "@oscar.doe" + +see-also: + - "/keps/sig-node/2400-node-swap" + +# The target maturity stage in the current dev cycle for this KEP. +# If the purpose of this KEP is to deprecate a user-visible feature +# and a Deprecated feature gates are added, they should be deprecated|disabled|removed. +stage: alpha + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v1.35" + +# The milestone at which this feature was, or is targeted to be, at each stage. +milestone: + alpha: "v1.35" + beta: "v1.36" + stable: "v1.37" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: WorkloadControlledSwap + components: + - kube-apiserver + - kubelet +disable-supported: true + +# The following PRR answers are required at beta release +metrics: \ No newline at end of file