Skip to content

Conversation

Future-Outlier
Copy link
Member

@Future-Outlier Future-Outlier commented Sep 2, 2025

Why are these changes needed?

In RayJob controller, when using existing RayCluster, I need to use clusterSelector to select my RayCluster.

spec:
  entrypoint: python -c "import ray; ray.init(); print(ray.cluster_resources())"
  clusterSelector:
    ray.io/cluster: raycluster-kuberay-for-rayjob-retry

However, I noticed that our validation logic is not implemented in utils/validation.go and is not invoked before the switch statement in the RayJob reconciliation logic, which does not align with the RayJob controller’s design philosophy.

I think we should move the logic from

if useValue, ok = rayJob.Spec.ClusterSelector[RayJobDefaultClusterSelectorKey]; !ok {
return fmt.Errorf("failed to get cluster name in ClusterSelector map, the default key is %v", RayJobDefaultClusterSelectorKey)
}

to

func ValidateRayJobSpec(rayJob *rayv1.RayJob) error {
// KubeRay has some limitations for the suspend operation. The limitations are a subset of the limitations of
// Kueue (https://kueue.sigs.k8s.io/docs/tasks/run_rayjobs/#c-limitations). For example, KubeRay allows users
// to suspend a RayJob with autoscaling enabled, but Kueue doesn't.
if rayJob.Spec.Suspend && !rayJob.Spec.ShutdownAfterJobFinishes {
return fmt.Errorf("a RayJob with shutdownAfterJobFinishes set to false is not allowed to be suspended")
}
if rayJob.Spec.TTLSecondsAfterFinished < 0 {
return fmt.Errorf("TTLSecondsAfterFinished must be a non-negative integer")
}
if !rayJob.Spec.ShutdownAfterJobFinishes && rayJob.Spec.TTLSecondsAfterFinished > 0 {
return fmt.Errorf("a RayJob with shutdownAfterJobFinishes set to false cannot have TTLSecondsAfterFinished")
}
isClusterSelectorMode := len(rayJob.Spec.ClusterSelector) != 0
if rayJob.Spec.Suspend && isClusterSelectorMode {
return fmt.Errorf("the ClusterSelector mode doesn't support the suspend operation")
}
if rayJob.Spec.RayClusterSpec == nil && !isClusterSelectorMode {
return fmt.Errorf("one of RayClusterSpec or ClusterSelector must be set")
}
// InteractiveMode does not support backoffLimit > 1.
// When a RayJob fails (e.g., due to a missing script) and retries,
// spec.JobId remains set, causing the new job to incorrectly transition
// to Running instead of Waiting or Failed.
// After discussion, we decided to disallow retries in InteractiveMode
// to avoid ambiguous state handling and unintended behavior.
// https://github.com/ray-project/kuberay/issues/3525
if rayJob.Spec.SubmissionMode == rayv1.InteractiveMode && rayJob.Spec.BackoffLimit != nil && *rayJob.Spec.BackoffLimit > 0 {
return fmt.Errorf("BackoffLimit is incompatible with InteractiveMode")
}
if rayJob.Spec.SubmissionMode == rayv1.SidecarMode {
if rayJob.Spec.SubmitterPodTemplate != nil {
return fmt.Errorf("Currently, SidecarMode doesn't support SubmitterPodTemplate")
}
if rayJob.Spec.SubmitterConfig != nil {
return fmt.Errorf("Currently, SidecarMode doesn't support SubmitterConfig")
}
if rayJob.Spec.RayClusterSpec.HeadGroupSpec.Template.Spec.RestartPolicy != "" && rayJob.Spec.RayClusterSpec.HeadGroupSpec.Template.Spec.RestartPolicy != corev1.RestartPolicyNever {
return fmt.Errorf("restartPolicy for head Pod should be Never or unset when using SidecarMode")
}
}
if rayJob.Spec.RayClusterSpec != nil {
if err := ValidateRayClusterSpec(rayJob.Spec.RayClusterSpec, rayJob.Annotations); err != nil {
return err
}
}
// Validate whether RuntimeEnvYAML is a valid YAML string. Note that this only checks its validity
// as a YAML string, not its adherence to the runtime environment schema.
if _, err := UnmarshalRuntimeEnvYAML(rayJob.Spec.RuntimeEnvYAML); err != nil {
return err
}
if rayJob.Spec.ActiveDeadlineSeconds != nil && *rayJob.Spec.ActiveDeadlineSeconds <= 0 {
return fmt.Errorf("activeDeadlineSeconds must be a positive integer")
}
if rayJob.Spec.BackoffLimit != nil && *rayJob.Spec.BackoffLimit < 0 {
return fmt.Errorf("backoffLimit must be a positive integer")
}
if !features.Enabled(features.RayJobDeletionPolicy) && rayJob.Spec.DeletionStrategy != nil {
return fmt.Errorf("RayJobDeletionPolicy feature gate must be enabled to use the DeletionStrategy feature")
}
if rayJob.Spec.DeletionStrategy != nil {
onSuccessPolicy := rayJob.Spec.DeletionStrategy.OnSuccess
onFailurePolicy := rayJob.Spec.DeletionStrategy.OnFailure
if onSuccessPolicy.Policy == nil {
return fmt.Errorf("the DeletionPolicyType field of DeletionStrategy.OnSuccess cannot be unset when DeletionStrategy is enabled")
}
if onFailurePolicy.Policy == nil {
return fmt.Errorf("the DeletionPolicyType field of DeletionStrategy.OnFailure cannot be unset when DeletionStrategy is enabled")
}
if isClusterSelectorMode {
switch *onSuccessPolicy.Policy {
case rayv1.DeleteCluster:
return fmt.Errorf("the ClusterSelector mode doesn't support DeletionStrategy=DeleteCluster on success")
case rayv1.DeleteWorkers:
return fmt.Errorf("the ClusterSelector mode doesn't support DeletionStrategy=DeleteWorkers on success")
}
switch *onFailurePolicy.Policy {
case rayv1.DeleteCluster:
return fmt.Errorf("the ClusterSelector mode doesn't support DeletionStrategy=DeleteCluster on failure")
case rayv1.DeleteWorkers:
return fmt.Errorf("the ClusterSelector mode doesn't support DeletionStrategy=DeleteWorkers on failure")
}
}
if (*onSuccessPolicy.Policy == rayv1.DeleteWorkers || *onFailurePolicy.Policy == rayv1.DeleteWorkers) && IsAutoscalingEnabled(rayJob.Spec.RayClusterSpec) {
// TODO (rueian): This can be supported in a future Ray version. We should check the RayVersion once we know it.
return fmt.Errorf("DeletionStrategy=DeleteWorkers currently does not support RayCluster with autoscaling enabled")
}
if rayJob.Spec.ShutdownAfterJobFinishes && (*onSuccessPolicy.Policy == rayv1.DeleteNone || *onFailurePolicy.Policy == rayv1.DeleteNone) {
return fmt.Errorf("shutdownAfterJobFinshes is set to 'true' while deletion policy is 'DeleteNone'")
}
}
return nil
}

Related issue number

#4030

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Copy link
Collaborator

@owenowenisme owenowenisme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great👍👍!

Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
@rueian rueian changed the title [Bug] RayJob Spec ClusterSelector validation logic [Refactor] RayJob Spec ClusterSelector validation logic Sep 5, 2025
Signed-off-by: Future-Outlier <[email protected]>
Comment on lines 176 to 181
if !ok {
return fmt.Errorf("failed to get cluster name in ClusterSelector map, the key is %v", RayJobClusterSelectorKey)
}
if len(clusterName) == 0 {
return fmt.Errorf("cluster name in ClusterSelector should not be empty")
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if !ok {
return fmt.Errorf("failed to get cluster name in ClusterSelector map, the key is %v", RayJobClusterSelectorKey)
}
if len(clusterName) == 0 {
return fmt.Errorf("cluster name in ClusterSelector should not be empty")
}
if len(clusterName) == 0 {
return fmt.Errorf("%s in ClusterSelector should not be empty", RayJobClusterSelectorKey)
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated, thank you!

@rueian rueian merged commit b775821 into ray-project:master Sep 8, 2025
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants