
Conversation

HirazawaUi
Contributor

@HirazawaUi HirazawaUi commented Aug 23, 2025

  • One-line PR description: This KEP aims to ensure that a brief kubelet restart does not affect the status of Pods on the node.
  • Other comments:

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 23, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: HirazawaUi
Once this PR has been reviewed and has the lgtm label, please assign dchen1107 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 23, 2025
@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Aug 23, 2025
@k8s-ci-robot k8s-ci-robot requested a review from mrunalp August 23, 2025 12:07
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Aug 23, 2025
@HirazawaUi HirazawaUi force-pushed the kep-4781 branch 2 times, most recently from eca940b to acbdb7e Compare August 25, 2025 15:34
@HirazawaUi HirazawaUi changed the title [WIP] KEP-4781 restarting kubelet does not change pod status KEP-4781 restarting kubelet does not change pod status Aug 25, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 25, 2025
Because the old state is preserved without an immediate health check, there is a delay in recognizing containers that became unhealthy during or after the kubelet's downtime. Services that rely on Pod readiness for service discovery might continue directing traffic to Pods whose containers are no longer healthy but are still reported as Ready.
To reduce the risk caused by this delay, we plan to trigger a probe immediately after the restart.
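
Below is a minimal sketch, not the actual kubelet prober code, of the mitigation just described: a hypothetical per-container probe worker fires its first probe immediately when the container's pre-restart status was carried over, instead of waiting a full probe period. All names and fields are illustrative.

```go
package main

import (
	"fmt"
	"time"
)

// probeWorker is a hypothetical stand-in for the kubelet's per-container
// prober worker; the names and fields are illustrative, not the real API.
type probeWorker struct {
	containerName string
	period        time.Duration
	restoredState bool        // status was carried over across a kubelet restart
	probe         func() bool // returns true if the container is healthy
}

// run sketches the mitigation: when a container's pre-restart status was
// preserved, fire the first probe immediately instead of waiting a full
// period, so a container that became unhealthy during the kubelet's
// downtime is noticed as soon as possible.
func (w *probeWorker) run(stop <-chan struct{}) {
	if w.restoredState {
		w.report(w.probe()) // immediate probe right after the restart
	}
	ticker := time.NewTicker(w.period)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			w.report(w.probe())
		case <-stop:
			return
		}
	}
}

func (w *probeWorker) report(healthy bool) {
	fmt.Printf("container %q ready=%v\n", w.containerName, healthy)
}

func main() {
	stop := make(chan struct{})
	w := &probeWorker{
		containerName: "app",
		period:        10 * time.Second,
		restoredState: true,
		probe:         func() bool { return true },
	}
	go w.run(stop)
	time.Sleep(time.Second) // let the immediate probe fire, then stop
	close(stop)
}
```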

## Design Details
Contributor Author

@HirazawaUi HirazawaUi Aug 25, 2025


I did not follow the implementation approach of the previous KEP. After reviewing the POC PR related to that KEP, I found its implementation somewhat cumbersome, and it also presented some potential edge-case issues.

After tracing the pod status transition process, I adopted a new implementation method to achieve the goal: consistently relying on the detection results of the probeManager. This approach simplifies the implementation and helps us avoid certain edge cases. This section also analyzes the kubelet's behavioral differences under several scenarios. Could you please take a look?

My POC PR: kubernetes/kubernetes#133676

@SergeyKanzhelev @thockin


2. We ensure that if the `Started` field in the container status is true, the container is considered started (since the startupProbe only runs during container startup and will not execute again once completed); see the sketch after this list.

3. If the Kubelet restart occurs within the `nodeMonitorGracePeriod` and the Pod’s Ready condition is set to false, we will set the container’s ready status to false. It will remain in this state until subsequent probes reset it to true.
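
A minimal sketch of the rule in item 2, using the `k8s.io/api/core/v1` types; `needsStartupProbe` is a hypothetical helper for illustration, not code from the KEP or the PoC.

```go
package kubeletrestart

import (
	corev1 "k8s.io/api/core/v1"
)

// needsStartupProbe is a hypothetical helper illustrating item 2 above:
// the startup probe only runs while a container is starting and never runs
// again once it has succeeded, so if the recorded status already has
// Started == true, the kubelet can keep treating the container as started
// after its own restart instead of re-running the startup probe.
func needsStartupProbe(c corev1.Container, status corev1.ContainerStatus) bool {
	if c.StartupProbe == nil {
		return false // no startup probe configured, nothing to run
	}
	if status.Started != nil && *status.Started {
		return false // startup already completed before the kubelet restart
	}
	return true
}
```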

> If the Kubelet restart occurs within the `nodeMonitorGracePeriod`

Does this mean the case where the kubelet has been down for longer than nodeMonitorGracePeriod and then restarts afterward?

So basically, the Node Lifecycle Controller notices that the Lease hasn’t been updated past nodeMonitorGracePeriod, marks the Node as NotReady, and flips the Pods’ Ready condition to False. After the kubelet restarts, it fetches the Pod info for its own Node from the API server, and the prober manager simply carries over that Ready condition value, right?

Contributor Author

@HirazawaUi HirazawaUi Sep 5, 2025


The scenario here indeed warrants a more detailed explanation.

Since syncPod cannot wait for the prober manager to trigger a probe before updating the container status, the pod status update always happens before the probe. This means that when we first update the container status, we do not yet know the container's actual state.

  • For a short kubelet restart, we can confidently assume that the container's state has not changed. We therefore retain the container's state and let the prober manager trigger a probe to correctly update the state in the pod.
  • However, for a prolonged kubelet restart, when the node is already NotReady, we can no longer assume that the container's state in the pod is unchanged. In this case we follow the previous behavior and set the container's Ready field to false. (As mentioned in the KEP, before this change, when a pod is first added to the prober manager the probe result is set to an initial value; the initial value for the readiness probe is Failure, which sets the container's Ready field to false.) The probe then runs and the container's state is updated correctly.

In summary:

  • For a short kubelet restart, we inherit the container's state from before the restart.
  • For a prolonged kubelet restart, we follow the pre-change behavior: first set the container's Ready field to false, wait for the actual probe result, and eventually drive the container's state to its actual value. Compared to the pre-change behavior, this still has an advantage: it avoids unnecessary state transitions of the container's Ready field (true -> false -> true) and instead transitions directly from false -> true. This prevents meaningless state flapping and reduces reconciliation work for controllers that watch container status, such as the EndpointSlice controller or external controllers that depend on EndpointSlice (sketched below).
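
A minimal sketch of the seeding rule summarized above, using the `k8s.io/api/core/v1` types; `initialReady` and its parameters are hypothetical and not taken from the PoC implementation.

```go
package kubeletrestart

import (
	corev1 "k8s.io/api/core/v1"
)

// initialReady is a hypothetical helper illustrating the seeding rule:
//
//   - Short kubelet restart (node still Ready): inherit the container's
//     pre-restart Ready value and let the prober manager's next probe
//     confirm or correct it.
//   - Prolonged restart (node already marked NotReady because the downtime
//     exceeded nodeMonitorGracePeriod): seed Ready to false, matching the
//     pre-change behavior where the readiness probe's initial result is
//     Failure, and let the first real probe drive it from false -> true,
//     avoiding a true -> false -> true flap.
func initialReady(nodeMarkedNotReady bool, previous corev1.ContainerStatus) bool {
	if nodeMarkedNotReady {
		return false
	}
	return previous.Ready
}
```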


Thanks for the clear explanation!

> For a prolonged kubelet restart, we follow the pre-change behavior by first setting the container's Ready field to false, waiting for the actual probe result, and eventually driving the container's state to its actual value.

I think I finally understand the part I was a bit unclear about regarding how the UPDATE PodStatus step in your diagram determines readiness. If the kubelet is down longer than nodeMonitorGracePeriod, the container’s ready condition is set to false. In that case, in your PoC, the section below is where the ready state becomes false, right?
https://github.com/kubernetes/kubernetes/blob/d207ce94fe550ec35ff6a6b120faf759b8cb9fae/pkg/kubelet/prober/prober_manager.go#L336-L339

Contributor Author


Yes.

@thockin
Member

thockin commented Sep 9, 2025


I am strongly in favor of this KEP, but I leave the specific details for people most familiar with Kubelet to iron out :)

@HirazawaUi
Contributor Author

> Can we include the history of this?

I don’t have many ideas for now, so I’ve simply placed these links in the Motivation section. If you feel the wording needs further description or that some context should be added to the links, please let me know — I’ll be happy to make the necessary changes.
