@andrewsykim
Member

@andrewsykim andrewsykim commented Nov 5, 2025

Why are these changes needed?

TODO:

  • unit / integration tests
  • support for referencing an existing secret

Related issue number

N/A

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@andrewsykim andrewsykim force-pushed the auth-support branch 2 times, most recently from 5f7e0bb to 132bea2 Compare November 6, 2025 23:02
RAY_ENABLE_AUTOSCALER_V2 = "RAY_enable_autoscaler_v2"

// RAY_AUTH_MODE_ENV_VAR is the Ray environment variable for configuring the authentication mode
RAY_AUTH_MODE_ENV_VAR = "RAY_auth_mode" // TODO: change to RAY_AUTH_MODE once updated in Ray nightly
Member Author

Signed-off-by: Andrew Sy Kim <[email protected]>
verifyAuthTokenEnvVars(t, rayCluster, workerPod)
}

// TODO(andrewsykim): add job submission test with and without token once a Ray version with token support is released.
Member Author

@andrewsykim andrewsykim changed the title [WIP] Add support for Ray token auth Add support for Ray token auth Nov 7, 2025
@andrewsykim andrewsykim marked this pull request as ready for review November 7, 2025 03:35
@andrewsykim andrewsykim requested review from ryanaoleary and removed request for MortalHappiness and kevin85421 November 7, 2025 03:39
ValueFrom: &corev1.EnvVarSource{
SecretKeyRef: &corev1.SecretKeySelector{
LocalObjectReference: corev1.LocalObjectReference{Name: secretName},
Key: "auth_token",
Collaborator

Suggested change
- Key: "auth_token",
+ Key: utils.RAY_AUTH_TOKEN_SECRET_KEY,

Member Author

fixed, thanks

}

// IsAuthEnabled returns whether Ray auth is enabled.
func IsAuthEnabled(spec *rayv1.RayClusterSpec) bool {
Collaborator

Suggested change
- func IsAuthEnabled(spec *rayv1.RayClusterSpec) bool {
+ func IsTokenAuthEnabled(spec *rayv1.RayClusterSpec) bool {

Member Author

done

Collaborator

Weird, I'm still seeing the old name, IsAuthEnabled.

Member Author

sorry, I probably forgot to push the latest commits

}

// setAuthEnvVars sets environment variables required for Ray token authentication
func setAuthEnvVars(clusterName string, podTemplate *corev1.PodTemplateSpec) {
Collaborator

Suggested change
- func setAuthEnvVars(clusterName string, podTemplate *corev1.PodTemplateSpec) {
+ func setTokenAuthEnvVars(clusterName string, podTemplate *corev1.PodTemplateSpec) {

Collaborator

nit: a bit more explicit.

Member Author

done
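
For readers following this thread, here is a minimal sketch of what the renamed helper might do, assuming it injects the auth mode plus a token sourced from a Secret key. It is not the PR's implementation: the types are stdlib-only stand-ins for the `k8s.io/api/core/v1` types, the env var name `RAY_AUTH_TOKEN` is an assumption (it does not appear in the excerpt), and the simplified signature (secret name plus a single container) and the replace-or-append behaviour are choices made for this sketch.

```go
package main

import "fmt"

// Stand-ins for the k8s.io/api/core/v1 types referenced in the diff, kept
// minimal so the sketch builds with the standard library alone.
type SecretKeySelector struct {
	Name string // stands in for LocalObjectReference.Name
	Key  string
}

type EnvVarSource struct {
	SecretKeyRef *SecretKeySelector
}

type EnvVar struct {
	Name      string
	Value     string
	ValueFrom *EnvVarSource
}

type Container struct {
	Name string
	Env  []EnvVar
}

const (
	rayAuthModeEnvVar     = "RAY_auth_mode"  // RAY_AUTH_MODE_ENV_VAR in the diff
	rayAuthTokenEnvVar    = "RAY_AUTH_TOKEN" // assumed name, not shown in the excerpt
	rayAuthTokenSecretKey = "auth_token"     // RAY_AUTH_TOKEN_SECRET_KEY in the diff
	authModeToken         = "token"
)

// upsertEnv replaces an existing env var with the same name, or appends it.
// Replacing keeps repeated calls from piling up duplicate entries.
func upsertEnv(c *Container, ev EnvVar) {
	for i := range c.Env {
		if c.Env[i].Name == ev.Name {
			c.Env[i] = ev
			return
		}
	}
	c.Env = append(c.Env, ev)
}

// setTokenAuthEnvVars (hypothetical, simplified signature) sets the auth mode
// and points the token env var at the given Secret key.
func setTokenAuthEnvVars(secretName string, c *Container) {
	upsertEnv(c, EnvVar{Name: rayAuthModeEnvVar, Value: authModeToken})
	upsertEnv(c, EnvVar{
		Name: rayAuthTokenEnvVar,
		ValueFrom: &EnvVarSource{
			SecretKeyRef: &SecretKeySelector{Name: secretName, Key: rayAuthTokenSecretKey},
		},
	})
}

func main() {
	head := &Container{Name: "ray-head"}
	setTokenAuthEnvVars("example-secret", head)
	for _, ev := range head.Env {
		fmt.Printf("%s fromSecret=%t\n", ev.Name, ev.ValueFrom != nil)
	}
}
```

Referencing the key through one constant, as suggested earlier in the review, keeps the Secret writer and the env var reader from drifting apart.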

@Future-Outlier Future-Outlier self-assigned this Nov 7, 2025
Signed-off-by: Andrew Sy Kim <[email protected]>
Member

@Future-Outlier Future-Outlier left a comment

Would it be possible to include an example (under folder ray-operator/config/samples/) using rayproject/ray:nightly to test this PR prior to merging?

Comment on lines +164 to +165
RAY_AUTH_TOKEN_SECRET_KEY = "auth_token"

Member

@Future-Outlier Future-Outlier Nov 10, 2025

I'm wondering whether ray_auth_token would be a better key name than auth_token.

Member

@Future-Outlier Future-Outlier left a comment

I used the following example to test this PR and found that the RayCluster's head pod never becomes READY. Do you know why?

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-kuberay-auth-token
spec:
  rayVersion: "2.51.0"
  authOptions:
    mode: "token"
  headGroupSpec:
    # rayStartParams is optional with RayCluster CRD from KubeRay 1.4.0 or later but required in earlier versions.
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:nightly
          resources:
            # limits:
            #   cpu: 1
            #   memory: 2G
            requests:
              cpu: 1
              memory: 2G
          ports:
          - containerPort: 6379
            name: gcs-server
          - containerPort: 8265 # Ray dashboard
            name: dashboard
          - containerPort: 10001
            name: client
  workerGroupSpecs:
  # Pod replicas for this worker group
  - replicas: 1
    minReplicas: 1
    maxReplicas: 5
    # Logical group name (here "workergroup"); it can also describe the group's function
    groupName: workergroup
    # rayStartParams is optional with RayCluster CRD from KubeRay 1.4.0 or later but required in earlier versions.
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
          image: rayproject/ray:nightly
          resources:
            # limits:
            #   cpu: 1
            #   memory: 1G
            requests:
              cpu: 1
              memory: 1G

Comment on lines +247 to +248
Value: "token",
})
Member

Suggested change
- Value: "token",
- })
+ Value: rayv1.AuthModeToken,
+ })

Member

@Future-Outlier Future-Outlier left a comment

I have two questions related to RBAC.

  1. Our controller only calls Get() and Create() on Secrets; it never deletes, patches, or updates them. Should we remove the unused verbs from the RBAC definition to follow the principle of least privilege?
  2. Should we add Owns(&corev1.Secret{}) to SetupWithManager? We currently grant list and watch permissions in RBAC, but without Owns() those permissions aren't used.

(That said, I'm not sure whether watching and listing Secrets is good practice.)

@andrewsykim
Member Author

We'll need watch/list for the informer cache. Good call on delete: we probably don't need it, since we can rely on garbage collection via the owner reference.

@Future-Outlier
Member

I might try to build a Ray image with this PR and test it tomorrow.

ray-project/ray#58368

@Future-Outlier
Member

> We'll need watch/list for the informer cache. Good call on delete: we probably don't need it, since we can rely on garbage collection via the owner reference.

Sorry, my earlier comment wasn't clear. I agree that we need watch/list for the informer cache. My question is whether we should also add Owns(&corev1.Secret{}) to SetupWithManager.

@andrewsykim
Member Author

@Future-Outlier FYI, if you're testing, you need to use a nightly build, and there are a lot of ongoing changes for this feature.

@Future-Outlier
Member

> @Future-Outlier FYI, if you're testing, you need to use a nightly build, and there are a lot of ongoing changes for this feature.

Thanks @andrewsykim. I'll test more tomorrow and will probably provide a video or screenshot.

@Future-Outlier
Member

just successfully built the image

cd ray-project/ray
docker build --progress=plain --build-arg BUILD_DATE="$(date +%Y-%m-%d:%H:%M:%S)" -t $RAY_IMAGE -f ./python/ray/tests/kuberay/Dockerfile .

},
})

// Configure auth token for wait-gcs-ready init container if it exists
Member Author

@sampan-s-nayak is `ray health-check` expected to require an auth token?

Collaborator

Does the Ray autoscaler also require an auth token?

Member Author

Good catch, I'll add this logic shortly


| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `mode` _[AuthMode](#authmode)_ | Mode specifies the authentication mode.<br />Supported values are "disabled" and "token".<br />Defaults to "token". | | Enum: [disabled token] <br /> |
Collaborator

I think it’s better to set the default to disabled, as token authentication requires Ray >= 2.51.0 and some users may still be on older pinned versions.

Member Author

It's worth mentioning the mode only defaults to token when authOptions != nil
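
To make the defaulting behaviour described in this thread concrete, here is a small sketch, assuming the rule is "no authOptions means auth stays off; authOptions with an empty mode means token". The types are simplified stand-ins for the `rayv1` API types, `AuthModeDisabled` and `effectiveAuthMode` are names invented for this sketch, and in the real CRD the default may be applied by the API server rather than in Go code.

```go
package main

import "fmt"

// AuthMode mirrors the enum documented in the table above.
type AuthMode string

const (
	AuthModeDisabled AuthMode = "disabled" // assumed constant name
	AuthModeToken    AuthMode = "token"    // rayv1.AuthModeToken in the review
)

// Simplified stand-ins for the rayv1 types.
type AuthOptions struct {
	Mode AuthMode
}

type RayClusterSpec struct {
	AuthOptions *AuthOptions
}

// effectiveAuthMode (hypothetical helper) resolves the mode in use:
// a missing authOptions block disables auth, and an empty mode inside an
// authOptions block falls back to token.
func effectiveAuthMode(spec *RayClusterSpec) AuthMode {
	if spec == nil || spec.AuthOptions == nil {
		return AuthModeDisabled
	}
	if spec.AuthOptions.Mode == "" {
		return AuthModeToken
	}
	return spec.AuthOptions.Mode
}

// IsTokenAuthEnabled reports whether token auth applies to the spec.
func IsTokenAuthEnabled(spec *RayClusterSpec) bool {
	return effectiveAuthMode(spec) == AuthModeToken
}

func main() {
	specs := map[string]*RayClusterSpec{
		"no authOptions":  {},
		"empty mode":      {AuthOptions: &AuthOptions{}},
		"mode: disabled":  {AuthOptions: &AuthOptions{Mode: AuthModeDisabled}},
		"mode: token":     {AuthOptions: &AuthOptions{Mode: AuthModeToken}},
	}
	for name, spec := range specs {
		fmt.Printf("%s -> token auth enabled: %t\n", name, IsTokenAuthEnabled(spec))
	}
}
```

Under this rule, clusters that never set authOptions keep their current behaviour, which addresses the concern about users pinned to older Ray versions.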

Member

@Future-Outlier Future-Outlier left a comment

My current thinking: before this PR gets merged, I think we should

  1. test token mode
  2. test k8s mode
  3. test cases 1 and 2 with the autoscaler

Given that building a Ray image is extremely hard, I'm going to wait for the latest Ray image release.

@andrewsykim
Member Author

@Future-Outlier let's keep this PR scoped to token mode only. The k8s auth mode will be an undocumented feature for a couple of Ray releases since it needs more testing. We can add KubeRay support once the public docs are ready (probably before KubeRay v1.6, though).

}

if IsAuthEnabled(spec) {
if spec.RayVersion == "" {
Collaborator

Kinda an out-there edge case but is there any validation that the RayVersion specified matches the image actually being used in the container?
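
As context for this question, a version gate on the declared RayVersion could look roughly like the sketch below. It assumes the 2.51.0 minimum mentioned earlier in the review and a strict MAJOR.MINOR.PATCH format; `validateTokenAuthVersion` and its helpers are hypothetical names, and the excerpt does not show what the PR actually does inside the `RayVersion == ""` branch. Note that a check like this only sees the declared version: an image tag such as `nightly` carries no version to compare against, which is the gap the question points at.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// minTokenAuthVersion is the minimum Ray version cited in this review.
var minTokenAuthVersion = [3]int{2, 51, 0}

// parseVersion parses a strict "MAJOR.MINOR.PATCH" string.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(v, ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("invalid version %q: want MAJOR.MINOR.PATCH", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return out, fmt.Errorf("invalid version %q: bad component %q", v, p)
		}
		out[i] = n
	}
	return out, nil
}

// atLeast reports whether have >= want, comparing component by component.
func atLeast(have, want [3]int) bool {
	for i := range have {
		if have[i] != want[i] {
			return have[i] > want[i]
		}
	}
	return true
}

// validateTokenAuthVersion (hypothetical) rejects empty, malformed, or
// too-old Ray versions when token auth is requested.
func validateTokenAuthVersion(rayVersion string) error {
	if rayVersion == "" {
		return fmt.Errorf("rayVersion must be set when token auth is enabled")
	}
	have, err := parseVersion(rayVersion)
	if err != nil {
		return err
	}
	if !atLeast(have, minTokenAuthVersion) {
		return fmt.Errorf("token auth requires Ray >= 2.51.0, got %s", rayVersion)
	}
	return nil
}

func main() {
	for _, v := range []string{"2.51.0", "2.50.1", "", "nightly"} {
		fmt.Printf("%q -> %v\n", v, validateTokenAuthVersion(v))
	}
}
```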
