KEP-5007: Update for Beta Promotion of DRA Device Binding Conditions #5487
Signed-off-by: Tsubasa Watanabe <[email protected]>
/ok-to-test
###### What specific metrics should inform a rollback?

<!--
What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->
Will consider in the beta timeframe.

N/A
Please don't leave the N/A here and below
@@ -753,16 +734,19 @@ We expect no non-infra related flakes in the last month as a GA graduation crite

- Gather feedback from developers and surveys
- Resolve the following issues
  - Scheduler does not guarantee to pick up the same node for the Pod after the restart
  - If Scheduler picks up another node for the Pod after the restart, devices are unnecessarily left on the original nodes
    (Composable DRA controller needs to have the function to detach a device automatically if it is not used by a Pod for a certain period of time)
  - Pods which are not bound yet (in api-server) and not unschedulable (in api-server) are not visible by cluster autoscaler, so there is a risk that the node will be turned down
  - The in-flight events cache may grow too large when waiting in PreBind
What about these points? Are they resolved?
We removed this point because this issue is expected to be resolved by using Node nomination.
> - Scheduler does not guarantee to pick up the same node for the Pod after the restart
> - If Scheduler picks up another node for the Pod after the restart, devices are unnecessarily left on the original nodes
>   (Composable DRA controller needs to have the function to detach a device automatically if it is not used by a Pod for a certain period of time)
> - Pods which are not bound yet (in api-server) and not unschedulable (in api-server) are not visible by cluster autoscaler, so there is a risk that the node will be turned down

Yes, they are resolved by the latest NNN enhancement.

> - The in-flight events cache may grow too large when waiting in PreBind

But, I'm not sure about this one. I believe it should still happen: the in-flight events are cleared up after pods going through PreBind/WaitOnPermit.
Have we discussed a potential solution for this in-flight event problem? I don't recall one myself.
I know @ttsuuubasa is a new owner and probably doesn't have the full context, but @macsko @dom4ha, do you remember anything being discussed before?
> But, I'm not sure about this one. I believe it should still happen: the in-flight events are cleared up after pods going through PreBind/WaitOnPermit.

We changed that in kubernetes/kubernetes#130189. Now, the in-flight events are cleared between WaitOnPermit and PreBind.
> They are resolved by the latest NNN enhancement

Assuming the NNN enhancement graduates to beta in v1.35.
> We changed that in kubernetes/kubernetes#130189. Now, the in-flight events are cleared between WaitOnPermit and PreBind.

Ah right! I didn't remember that change
Other comments:
This PR updates KEP-5007 for beta promotion by enhancing the Test Plan, Graduation Criteria and the Production Readiness Review Questionnaire.
This feature has been available as alpha since v1.34 and is now ready to meet the beta requirements.
Key updates include:
These changes are intended to prepare the feature for review for promotion to beta.
Next Steps & Feedback
Feedback is welcome!
cc: @johnbelamaric @pohly @dom4ha @sanposhiho