@win5923
Collaborator

@win5923 win5923 commented Nov 10, 2025

Why are these changes needed?

Currently, when users update a RayCluster spec (e.g., change the image), they must manually delete the Pods for the change to take effect.

Ref: #2534 (comment)

Changes

  • Added a spec.upgradeStrategy.type field to the RayCluster CRD
  • The field supports two values:
    • Recreate: old Pods are deleted and automatically recreated with the new spec.
    • None: no new Pods are created in response to spec changes while the strategy is set to None.

Implementation

  1. Calculates a hash of the original pod template (before any modifications)
  2. Stores the hash in the pod's annotation: ray.io/pod-template-hash
  3. The hash is calculated before RayStartParams and other fields are modified by the operator
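
The steps above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `PodTemplate` is a hypothetical, simplified stand-in for the real pod template type, the image tags are illustrative, and the choice of SHA-256 over a JSON encoding is an assumption. Comparing the stored hash with the hash of the desired template to decide on recreation, and treating a Pod without the annotation as up to date, are also assumptions. Only the annotation key `ray.io/pod-template-hash` comes from the description.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// PodTemplateHashAnnotation is the annotation key named in the PR description.
const PodTemplateHashAnnotation = "ray.io/pod-template-hash"

// PodTemplate is a hypothetical, simplified stand-in for the real pod template.
type PodTemplate struct {
	Image string            `json:"image"`
	Env   map[string]string `json:"env,omitempty"`
}

// HashTemplate returns a deterministic hex digest of the template as the user
// wrote it, i.e. before the operator modifies any fields.
// encoding/json sorts map keys, so equal templates serialize identically.
func HashTemplate(t PodTemplate) (string, error) {
	raw, err := json.Marshal(t)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:]), nil
}

// NeedsRecreate reports whether the hash stored on a Pod differs from the hash
// of the desired template. Assumption: a Pod without the annotation is treated
// as up to date, so Pods created before the feature existed are left alone.
func NeedsRecreate(podAnnotations map[string]string, desired PodTemplate) (bool, error) {
	want, err := HashTemplate(desired)
	if err != nil {
		return false, err
	}
	got, ok := podAnnotations[PodTemplateHashAnnotation]
	return ok && got != want, nil
}

func main() {
	// Hash the original template and store it as the Pod annotation.
	original := PodTemplate{Image: "rayproject/ray:2.48.0"}
	hash, err := HashTemplate(original)
	if err != nil {
		panic(err)
	}
	annotations := map[string]string{PodTemplateHashAnnotation: hash}

	// The user updates the image; the stored hash no longer matches.
	updated := PodTemplate{Image: "rayproject/ray:2.49.0"}
	recreate, err := NeedsRecreate(annotations, updated)
	if err != nil {
		panic(err)
	}
	fmt.Println("recreate:", recreate)
}
```

Hashing the user-provided template rather than the final Pod spec keeps operator-injected changes (such as those derived from RayStartParams) from being mistaken for a user edit.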

Example:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-kuberay
spec:
  upgradeStrategy:
    type: Recreate
  rayVersion: '2.48.0'

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@win5923 win5923 marked this pull request as draft November 10, 2025 16:24
@win5923 win5923 force-pushed the raycluster-upgradeStrategy branch 5 times, most recently from 3fd9821 to 710166a November 10, 2025 16:41
@win5923 win5923 force-pushed the raycluster-upgradeStrategy branch from 710166a to d261b0b November 10, 2025 17:11
@win5923 win5923 changed the title from "[draft] Support recreate pods for RayCluster using RayClusterSpec" to "[draft] Support recreate pods for RayCluster using RayClusterSpec.upgradeStrategy" Nov 10, 2025
@win5923
Collaborator Author

win5923 commented Nov 10, 2025

Hi @andrewsykim, I followed your previous comments and added a spec.upgradeStrategy API to RayCluster. However, I'm concerned this approach may introduce some issues:

  1. Confusion with existing API: We already have upgradeStrategy for RayService. Adding another upgradeStrategy to RayCluster could confuse users and create an unclear separation of concerns.
  2. Breaking RayJob workflows: For RayJob, setting upgradeStrategy=Recreate on the RayCluster would cause pod recreation during job execution, leading to job interruption and loss of running jobs.

Maybe we can just add a feature gate instead of using spec.upgradeStrategy.type field in RayCluster to enable the recreate behavior. WDYT?

@andrewsykim
Member

> Maybe we can just add a feature gate instead of using spec.upgradeStrategy.type field in RayCluster to enable the recreate behavior. WDYT?

Feature gates are used to gate features that are in early development and not ready for wider adoption. They shouldn't be used to change the behavior of RayCluster, because a gated feature will eventually be on by default (and forced on).

@andrewsykim
Member

I think both of those concerns are valid, but I don't think this is a problem with separation of concerns, as RayCluster is a building block for both RayService and RayJob. For the cases you mentioned, we should have validation to ensure the RayCluster upgrade strategy cannot be set when the cluster is used with RayJob or RayService.
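
A validation along these lines might look like the sketch below. It is only an illustration under stated assumptions: the `RayCluster` struct is a hypothetical, simplified type, and the `OriginatedFrom` field is an invented placeholder for however the operator determines that a cluster is managed by a RayJob or RayService. The thread does not specify that mechanism.

```go
package main

import "fmt"

// UpgradeStrategyType mirrors the two values proposed in the PR description.
type UpgradeStrategyType string

const (
	Recreate UpgradeStrategyType = "Recreate"
	None     UpgradeStrategyType = "None"
)

// RayCluster is a hypothetical, simplified stand-in for the real CRD type.
type RayCluster struct {
	Name string
	// OriginatedFrom is a hypothetical field naming the parent resource kind:
	// "RayJob", "RayService", or "" for a standalone RayCluster.
	OriginatedFrom string
	// UpgradeStrategy is nil when spec.upgradeStrategy.type is not set.
	UpgradeStrategy *UpgradeStrategyType
}

// ValidateUpgradeStrategy rejects unknown strategy values and rejects any
// explicitly set strategy on a cluster managed by a RayJob or RayService.
func ValidateUpgradeStrategy(c RayCluster) error {
	if c.UpgradeStrategy == nil {
		return nil // nothing set, nothing to validate
	}
	switch *c.UpgradeStrategy {
	case Recreate, None:
		// known value
	default:
		return fmt.Errorf("raycluster %q: unsupported upgradeStrategy.type %q", c.Name, *c.UpgradeStrategy)
	}
	if c.OriginatedFrom == "RayJob" || c.OriginatedFrom == "RayService" {
		return fmt.Errorf("raycluster %q: upgradeStrategy cannot be set when the cluster is managed by a %s", c.Name, c.OriginatedFrom)
	}
	return nil
}

func main() {
	strategy := Recreate
	standalone := RayCluster{Name: "raycluster-kuberay", UpgradeStrategy: &strategy}
	managed := RayCluster{Name: "rayjob-cluster", OriginatedFrom: "RayJob", UpgradeStrategy: &strategy}

	fmt.Println("standalone:", ValidateUpgradeStrategy(standalone))
	fmt.Println("managed:", ValidateUpgradeStrategy(managed))
}
```

Rejecting the field outright for managed clusters, rather than silently ignoring it, would surface the RayJob interruption concern at admission time instead of mid-run.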
