[Feature][RayCluster]: Implement the HeadReady condition #2261
Conversation
    return false, "HeadPodNotRunning", "Head pod is not running"
}
for _, cond := range pod.Status.Conditions {
    if cond.Type == corev1.PodReady && cond.Status != corev1.ConditionTrue {
Could we use the utils.IsRunningAndReady here?
My thought was that utils.IsRunningAndReady only tells us whether the Pod phase is Running and the Pod is ready; it doesn't give us finer-grained information, such as whether the failure is because the Pod is not running yet or because it is not ready yet (the raylet may not have passed its readiness check), so I wrote my own check.
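For context, a minimal sketch of what such a fine-grained check could look like (the function name and message strings are illustrative and based on the diff excerpt above, not the final merged code):

// Sketch only: distinguish "head Pod not running yet" from "head Pod running
// but not ready yet" (e.g. the raylet has not passed its readiness probe).
// Assumes corev1 "k8s.io/api/core/v1".
func checkHeadRunningAndReady(pod *corev1.Pod) (ok bool, reason, message string) {
    if pod.Status.Phase != corev1.PodRunning {
        return false, "HeadPodNotRunning", "Head pod is not running"
    }
    for _, cond := range pod.Status.Conditions {
        if cond.Type == corev1.PodReady && cond.Status != corev1.ConditionTrue {
            return false, "HeadPodNotReady", "Head pod is running but not ready"
        }
    }
    return true, "HeadPodRunningAndReady", "Head pod is running and ready"
}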
// Check if the head node is running and ready with proper reason and message
if features.Enabled(features.RayClusterStatusConditions) {
    isRayHeadRunning, reason, message := utils.CheckRayHeadRunningAndReady(ctx, runtimePods)
I think we can use PodReady directly. We don't need to check whether the Pod is running, and we don't need to generate the reason and the message ourselves.
Thanks for the suggestion. I tried using PodReady as much as possible and noticed that the Pod status condition's reason and message are empty all the time. Since metav1.Condition requires a non-empty reason:
// reason contains a programmatic identifier indicating the reason for the condition's last transition.
// Producers of specific condition types may define expected values and meanings for this field,
// and whether the values are considered a guaranteed API.
// The value should be a CamelCase string.
// This field may not be empty.
// +required
// +kubebuilder:validation:Required
// +kubebuilder:validation:MaxLength=1024
// +kubebuilder:validation:MinLength=1
// +kubebuilder:validation:Pattern=`^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$`
Reason string `json:"reason" protobuf:"bytes,5,opt,name=reason"`
otherwise we will see this error during reconcile:
{"level":"info","ts":"2024-07-23T00:24:52.098Z","logger":"controllers.RayCluster","msg":"Error updating status","RayCluster":{"name":"rayservice-sample-raycluster-45jnx","namespace":"default"},"reconcileID":"2fb18e10-3cd9-41a0-b9e8-4660293fb634","name":"rayservice-sample-raycluster-45jnx","error":"RayCluster.ray.io \"rayservice-sample-raycluster-45jnx\" is invalid: status.conditions[0].reason: Invalid value: \"\": status.conditions[0].reason in body should be at least 1 chars long","RayCluster":{"apiVersion":"ray.io/v1","kind":"RayCluster","namespace":"default","name":"rayservice-sample-raycluster-45jnx"}}
Thus I added the reasons HeadPodRunningAndReady and HeadPodNotReady respectively.
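As an illustration, the resulting condition could be built roughly like this (a sketch using the reason strings above; the message text and function name are illustrative, not the merged code):

// Sketch: metav1.Condition rejects an empty Reason, so a non-empty CamelCase
// identifier is always supplied, even though the upstream PodReady condition
// leaves Reason empty when the Pod is ready.
// Assumes rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
// and metav1 "k8s.io/apimachinery/pkg/apis/meta/v1".
func buildHeadReadyCondition(headReady bool) metav1.Condition {
    cond := metav1.Condition{
        Type:    string(rayv1.HeadReady),
        Status:  metav1.ConditionFalse,
        Reason:  "HeadPodNotReady",
        Message: "Head pod is not ready",
    }
    if headReady {
        cond.Status = metav1.ConditionTrue
        cond.Reason = "HeadPodRunningAndReady"
        cond.Message = "Head pod is running and ready"
    }
    return cond
}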
Could you also rebase with the master branch? Thanks!
replicaHeadReadyCondition := metav1.Condition{
    Type:   string(rayv1.HeadReady),
    Status: metav1.ConditionFalse,
    Reason: "HeadPodNotReady",
This reason is not particularly useful because it's already implied by the HeadReady condition being False. Can the reason be more specific, like CrashLoopBackOff, ContainerCreating, Pending, etc.?
I hope to make the condition consistent with the Pod's PodReady condition, in which both Reason and Message are empty. However, @cchen777 found that metav1.Condition has a limitation where Reason must not be empty, which is different from PodCondition. See #2261 (comment) for more details.
We can't keep Reason and Message empty due to metav1.Condition's limitation, but I still want to make our HeadReady condition closer to PodReady. Hence, I may prefer to keep the Reason and Condition simple ("HeadPodNotReady" is OK for me). We can add more details, such as CrashLoopBack, ContainerCreating, and Pending, when we observe enough use cases in the future.
I am curious whether there are obvious use cases for details about whether the Pod is ready or not. I assume that if there are obvious use cases, the upstream PodCondition should not leave Reason and Message empty.
pod status condition's reason and message are empty all the time
Apologies, the statement I made above was actually not true: Reason is empty when the Pod is running and ready, but when the Pod is not ready yet there are actually non-empty reasons. I think we can propagate the reason from the Pod itself, but keep a customized reason HeadPodRunningAndReady when it's up and running to fit metav1.Condition's limitation.
The reason being empty or something like HeadPodReady when the pod is ready is fine I think. But when it's not ready, the reason should never be empty because we have information from the Pod for why it's not ready (failing readiness probe, crashing, OOMkill, etc)
Updated the logic to use kuberay/ray-operator/controllers/ray/utils/util.go, lines 82 to 96 in 41bd563:
if cond.Type == corev1.PodReady {
    if cond.Status == corev1.ConditionTrue {
        replicaPodReadyCondition = metav1.Condition{
            Type:    string(condType),
            Status:  metav1.ConditionTrue,
            Reason:  v1.PodRunningAndReady, // metav1.Condition.Reason requires a non-empty value
            Message: cond.Message,
        }
    } else {
        replicaPodReadyCondition = metav1.Condition{
            Type:    string(condType),
            Status:  metav1.ConditionFalse,
            Reason:  cond.Reason, // PodReady condition comes with a reason when it's not ready, e.g. ContainersNotReady
            Message: cond.Message,
        }
kubectl get raycluster rayservice-sample-raycluster-tnm5m -o json | jq .status
{
"conditions": [
{
"lastTransitionTime": "2024-07-25T04:25:32Z",
"message": "containers with unready status: [ray-head]",
"reason": "ContainersNotReady",
"status": "False",
"type": "Ready"
}
],
"desiredCPU": "2500m",
"desiredGPU": "0",
"desiredMemory": "4Gi",
"desiredTPU": "0",
"desiredWorkerReplicas": 1,
"endpoints": {
"client": "10001",
"dashboard": "8265",
"gcs-server": "6379",
"metrics": "8080",
"serve": "8000"
},
"head": {
"podIP": "10.244.0.9",
"podName": "rayservice-sample-raycluster-tnm5m-head-m8h95",
"serviceIP": "10.244.0.9",
"serviceName": "rayservice-sample-raycluster-tnm5m-head-svc"
},
"lastUpdateTime": "2024-07-25T04:25:54Z",
"maxWorkerReplicas": 5,
"minWorkerReplicas": 1,
"observedGeneration": 1
}
replicaHeadReadyCondition := metav1.Condition{
    Type: string(rayv1.HeadReady),
Maybe we can rename it from HeadReady to HeadPodReady because the upstream PodCondition uses PodReady instead of Ready for the condition type.
Updated to HeadPodReady, with "Ready" as the value:
HeadPodReady RayClusterConditionType = "Ready"
#2266 has been merged.
// Check if the head node is running and ready by checking the head pod's status.
if features.Enabled(features.RayClusterStatusConditions) {
    headPod, err := common.GetRayClusterHeadPod(ctx, r, newInstance)
It's possible for GetRayClusterHeadPod to return nil, nil when no head Pod exists. We need to handle the case.
cc @rueian maybe we can return a non-nil error when no head Pod exists?
I found that we always return a non-nil error.
headPod, err := common.GetRayClusterHeadPod(ctx, r, newInstance)
if headPod == nil {
    // return an error
}
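A minimal sketch of what that guard could look like in the reconciler (the error message is illustrative, not the merged wording):

// Sketch: GetRayClusterHeadPod can return (nil, nil) when no head Pod exists,
// so guard against a nil Pod before inspecting its status. Assumes "fmt" is imported.
headPod, err := common.GetRayClusterHeadPod(ctx, r, newInstance)
if err != nil {
    return err
}
if headPod == nil {
    return fmt.Errorf("head Pod not found for RayCluster %s/%s", newInstance.Namespace, newInstance.Name)
}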
for _, cond := range pod.Status.Conditions {
    if cond.Type == corev1.PodReady {
if cond.Type == corev1.PodReady {
    reason := cond.Reason
    if reason == "" {
        // metav1.Condition.Reason requires a non-empty value
        reason = rayv1.UnknownReason
    }
    replicaPodReadyCondition = metav1.Condition{
        Type:    string(condType),
        Status:  metav1.ConditionStatus(cond.Status),
        Reason:  reason,
        Message: cond.Message,
    }
    break
}
func FindPodReadyCondition(pod *corev1.Pod, condType rayv1.RayClusterConditionType) metav1.Condition {
    replicaPodReadyCondition := metav1.Condition{
Is it possible for pod.Status.Conditions not to have corev1.PodReady? If so, we will create a condition with an empty Reason.
Yes, you are right. I updated the logic here to have an initial Condition (kuberay/ray-operator/controllers/ray/utils/util.go, lines 75 to 79 in 01e90f6):
replicaPodReadyCondition := metav1.Condition{
    Type:   string(condType),
    Status: metav1.ConditionFalse,
    Reason: rayv1.UnknownReason,
}
plus this logic (kuberay/ray-operator/controllers/ray/utils/util.go, lines 95 to 98 in 01e90f6):
// Update the reason if it's not empty
if reason != "" {
    replicaPodReadyCondition.Reason = reason
}
to prevent an empty reason from sneaking into the later SetStatusCondition call.
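Putting the two excerpts together, the overall shape of the helper is roughly the following (a sketch assembled from the snippets above; the constants v1.PodRunningAndReady and rayv1.UnknownReason are taken from those excerpts, with their imports assumed):

// Sketch of FindPodReadyCondition: start from a default condition with a
// non-empty UnknownReason, then mirror the Pod's PodReady condition,
// substituting a non-empty reason when the Pod is ready (where PodReady
// leaves Reason empty) and propagating the Pod's own reason otherwise.
func FindPodReadyCondition(pod *corev1.Pod, condType rayv1.RayClusterConditionType) metav1.Condition {
    replicaPodReadyCondition := metav1.Condition{
        Type:   string(condType),
        Status: metav1.ConditionFalse,
        Reason: rayv1.UnknownReason,
    }
    for _, cond := range pod.Status.Conditions {
        if cond.Type != corev1.PodReady {
            continue
        }
        replicaPodReadyCondition.Status = metav1.ConditionStatus(cond.Status)
        replicaPodReadyCondition.Message = cond.Message
        reason := cond.Reason
        if cond.Status == corev1.ConditionTrue {
            reason = v1.PodRunningAndReady // metav1.Condition.Reason requires a non-empty value
        }
        // Update the reason if it's not empty
        if reason != "" {
            replicaPodReadyCondition.Reason = reason
        }
        break
    }
    return replicaPodReadyCondition
}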
// Update the reason if it's not empty
if reason != "" {
this seems to be unnecessary?
I think it could happen when cond.Status != corev1.ConditionTrue and cond.Reason, for some reason, gives us an empty string. That should be pretty rare when we are checking cond.Type == PodReady, but just to be safe I added a check here so it falls back to the default non-empty reason rayv1.UnknownReason.
Why are these changes needed?
In this PR, we do the following:
- Implement the HeadReady condition for RayCluster
- Expose the --feature-gate option in the helm chart
Related issue number
#2237, which is part of ray-project/enhancements#54 (step 5 in the design doc).
Checks
I've tested the feature with the feature gate enabled and verified that the RayService controller works properly using the new condition.