KEP-5007: Update for Beta Promotion of DRA Device Binding Conditions #5487

ttsuuubasa · 2025-08-18T02:56:54Z

One-line PR description: updating KEP docs for promotion to beta

Issue link: DRA: Device Binding Conditions #5007

Other comments:
This PR updates KEP-5007 for beta promotion by expanding the Test Plan, the Graduation Criteria, and the Production Readiness Review Questionnaire.
The feature has been available as alpha since v1.34 and is now ready to meet the beta requirements.

Key updates include:
- Update the Design Details section to reflect the API field that changed in the alpha release.
- Clarify the Test Plan and Graduation Criteria for the beta release.
- Fill in the Production Readiness Review Questionnaire so that the feature can be assessed for stability in production environments.
These changes are intended to let the feature be reviewed for promotion to beta.

Next Steps & Feedback
- Reviewers: please check the completeness of Graduation Criteria and Test Plan.
- Does the questionnaire cover all your concerns for production readiness?
- Are there any additional failure modes or edge cases we should include?
Feedback is welcome!
cc: @johnbelamaric @pohly @dom4ha @sanposhiho

Signed-off-by: Tsubasa Watanabe <[email protected]>

k8s-ci-robot · 2025-08-18T02:57:01Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ttsuuubasa
Once this PR has been reviewed and has the lgtm label, please assign dom4ha for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

keps/sig-scheduling/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2025-08-18T02:57:02Z

Welcome @ttsuuubasa!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot · 2025-08-18T02:57:03Z

Hi @ttsuuubasa. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

pacoxu · 2025-09-01T10:05:20Z

/ok-to-test

macsko · 2025-09-03T08:38:25Z

keps/sig-scheduling/5007-device-attach-before-pod-scheduled/README.md

 ###### What specific metrics should inform a rollback?

 <!--
 What signals should users be paying attention to when the feature is young
 that might indicate a serious problem?
 -->
-Will consider in the beta timeframe.
+
+N/A


Please don't leave the N/A here and below

macsko · 2025-09-03T08:38:28Z

keps/sig-scheduling/5007-device-attach-before-pod-scheduled/README.md

@@ -753,16 +734,19 @@ We expect no non-infra related flakes in the last month as a GA graduation crite

 - Gather feedback from developers and surveys
 - Resolve the following issues
-  - Scheduler does not guarantee to pick up the same node for the Pod after the restart
  - If Scheduler picks up another node for the Pod after the restart, devices are unnecessarily left on the original nodes
    (Composable DRA controller needs to have the function to detach a device automatically if it is not used by a Pod for a certain period of time)
  - Pods which are not bound yet (in api-server) and not unschedulable (in api-server) are not visible by cluster autoscaler, so there is a risk that the node will be turned down
  - The in-flight events cache may grow too large when waiting in PreBind


What about these points? Are they resolved?

ttsuuubasa · 2025-09-09

We removed this point because the issue is expected to be resolved by using Node nomination.

sanposhiho · 2025-09-10

> - Scheduler does not guarantee to pick up the same node for the Pod after the restart
> - If Scheduler picks up another node for the Pod after the restart, devices are unnecessarily left on the original nodes (Composable DRA controller needs to have the function to detach a device automatically if it is not used by a Pod for a certain period of time)
> - Pods which are not bound yet (in api-server) and not unschedulable (in api-server) are not visible by cluster autoscaler, so there is a risk that the node will be turned down

Yes, they are resolved by the latest NNN enhancement.

> - The in-flight events cache may grow too large when waiting in PreBind

But I'm not sure about this one. I believe it can still happen: the in-flight events are cleared only after pods go through PreBind/WaitOnPermit.

sanposhiho · 2025-09-10

Have we discussed a potential solution for this in-flight event problem? I don't recall one myself.
I know @ttsuuubasa is a new owner and probably doesn't have the full context. But @macsko @dom4ha, do you remember anything discussed before?

macsko · 2025-09-10

> But I'm not sure about this one. I believe it can still happen: the in-flight events are cleared only after pods go through PreBind/WaitOnPermit.

We changed that in kubernetes/kubernetes#130189. Now, the in-flight events are cleared between WaitOnPermit and PreBind.

macsko · 2025-09-10

> They are resolved by the latest NNN enhancement

Assuming the NNN enhancement graduates to beta in v1.35.

sanposhiho · 2025-09-10

> We changed that in kubernetes/kubernetes#130189. Now, the in-flight events are cleared between WaitOnPermit and PreBind.

Ah right! I didn't remember that change.

Updated KEP-5007 for promotion to beta

eb0038a

Signed-off-by: Tsubasa Watanabe <[email protected]>

k8s-ci-robot added the cncf-cla: yes label (Indicates the PR's author has signed the CNCF CLA.) Aug 18, 2025

k8s-ci-robot added the kind/kep label (Categorizes KEP tracking issues and PRs modifying the KEP directory) Aug 18, 2025

k8s-ci-robot added the sig/scheduling (Categorizes an issue or PR as relevant to SIG Scheduling.) and needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test.) labels Aug 18, 2025

k8s-ci-robot requested review from dom4ha and macsko August 18, 2025 02:57

github-project-automation bot added this to SIG Scheduling Aug 18, 2025

github-project-automation bot moved this to Needs Triage in SIG Scheduling Aug 18, 2025

k8s-ci-robot added the size/L label (Denotes a PR that changes 100-499 lines, ignoring generated files.) Aug 18, 2025

ttsuuubasa mentioned this pull request Aug 18, 2025

DRA: Device Binding Conditions #5007

Open

6 tasks

k8s-ci-robot added the ok-to-test label (Indicates a non-member PR verified by an org member that is safe to test.) and removed the needs-ok-to-test label (Indicates a PR that requires an org member to verify it is safe to test.) Sep 1, 2025

macsko reviewed Sep 3, 2025

helayoty moved this from Needs Triage to Needs Review in SIG Scheduling Sep 9, 2025
