-
Notifications
You must be signed in to change notification settings - Fork 650
Add support for Ray token auth #4179
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
5f7e0bb to
132bea2
Compare
| RAY_ENABLE_AUTOSCALER_V2 = "RAY_enable_autoscaler_v2" | ||
|
|
||
| // RAY_AUTH_MODE_ENV_VAR is the Ray environment variable for configuring the authentication mode | ||
| RAY_AUTH_MODE_ENV_VAR = "RAY_auth_mode" // TODO: change to RAY_AUTH_MODE once updated in Ray nightly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO after ray-project/ray#58408
Signed-off-by: Andrew Sy Kim <[email protected]>
132bea2 to
0b3ce04
Compare
| verifyAuthTokenEnvVars(t, rayCluster, workerPod) | ||
| } | ||
|
|
||
| // TODO(andrewsykim): add job submission test with and without token once a Ray version with token support is released. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO after ray-project/ray#58408
8861359 to
e009bc0
Compare
Signed-off-by: Andrew Sy Kim <[email protected]>
e009bc0 to
8808456
Compare
| ValueFrom: &corev1.EnvVarSource{ | ||
| SecretKeyRef: &corev1.SecretKeySelector{ | ||
| LocalObjectReference: corev1.LocalObjectReference{Name: secretName}, | ||
| Key: "auth_token", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Key: "auth_token", | |
| Key: utils.RAY_AUTH_TOKEN_SECRET_KEY, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed, thanks
| } | ||
|
|
||
| // IsAuthEnabled returns whether Ray auth is enabled. | ||
| func IsAuthEnabled(spec *rayv1.RayClusterSpec) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| func IsAuthEnabled(spec *rayv1.RayClusterSpec) bool { | |
| func IsTokenAuthEnabled(spec *rayv1.RayClusterSpec) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Weird. I am still seeing it remains the old IsAuthEnabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, I probably forgot to push the latest commits
| } | ||
|
|
||
| // setAuthEnvVars sets environment variables required for Ray token authentication | ||
| func setAuthEnvVars(clusterName string, podTemplate *corev1.PodTemplateSpec) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| func setAuthEnvVars(clusterName string, podTemplate *corev1.PodTemplateSpec) { | |
| func setTokenAuthEnvVars(clusterName string, podTemplate *corev1.PodTemplateSpec) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: a bit more explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Signed-off-by: Andrew Sy Kim <[email protected]>
Future-Outlier
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to include an example (under folder ray-operator/config/samples/) using rayproject/ray:nightly to test this PR prior to merging?
| RAY_AUTH_TOKEN_SECRET_KEY = "auth_token" | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am thinking is ray_auth_token better than auth_token?
Future-Outlier
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I use the following example to test this PR.
And I found that the RayCluster's head pod is never READY, do you know why?
apiVersion: ray.io/v1
kind: RayCluster
metadata:
name: raycluster-kuberay-auth-token
spec:
rayVersion: "2.51.0"
authOptions:
mode: "token"
headGroupSpec:
# rayStartParams is optional with RayCluster CRD from KubeRay 1.4.0 or later but required in earlier versions.
rayStartParams: {}
template:
spec:
containers:
- name: ray-head
image: rayproject/ray:nightly
resources:
# limits:
# cpu: 1
# memory: 2G
requests:
cpu: 1
memory: 2G
ports:
- containerPort: 6379
name: gcs-server
- containerPort: 8265 # Ray dashboard
name: dashboard
- containerPort: 10001
name: client
workerGroupSpecs:
# the pod replicas in this group typed worker
- replicas: 1
minReplicas: 1
maxReplicas: 5
# logical group name, for this called small-group, also can be functional
groupName: workergroup
# rayStartParams is optional with RayCluster CRD from KubeRay 1.4.0 or later but required in earlier versions.
rayStartParams: {}
template:
spec:
containers:
- name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc'
image: rayproject/ray:nightly
resources:
# limits:
# cpu: 1
# memory: 1G
requests:
cpu: 1
memory: 1G
| Value: "token", | ||
| }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Value: "token", | |
| }) | |
| Value: rayv1.AuthModeToken, | |
| }) |
Future-Outlier
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have 2 questions relate to RBAC.
- Our controller only calls
Get()andCreate()on Secrets, neverdelete,patch, orupdate. Should we remove these unused verbs from the RBAC definition to follow the principle of least privilege? - Should we add
Owns(&corev1.Secret{})toSetupWithManager? Currently we havelistandwatchpermissions in RBAC, but withoutOwns(), these permissions aren’t being used.
(but I'm not sure watch and list secret is a good practice or not)
|
We'll need watch/list for informer cache, good call on delete, we probably don't need that as we can rely on garbage collection from owner ref |
|
I might try to build a ray image with this PR and test it tmr. |
sorry I didn't write my comments correctly, I agreed we need watch/list for informer cache, and should we add Owns(&corev1.Secret{}) to SetupWithManager? |
|
@Future-Outlier FYI if you're testing, you need to use nightly build and there's a lot of on-going changes for this feature |
thanks @andrewsykim |
|
just successfully built the image cd ray-project/ray
docker build --progress=plain --build-arg BUILD_DATE="$(date +%Y-%m-%d:%H:%M:%S)" -t $RAY_IMAGE -f ./python/ray/tests/kuberay/Dockerfile . |
| }, | ||
| }) | ||
|
|
||
| // Configure auth token for wait-gcs-ready init container if it exists |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sampan-s-nayak is it expected for "ray healh-check` to require auth token?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is ray autoscaler also require auth token?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, I'll add this logic shortly
|
|
||
| | Field | Description | Default | Validation | | ||
| | --- | --- | --- | --- | | ||
| | `mode` _[AuthMode](#authmode)_ | Mode specifies the authentication mode.<br />Supported values are "disabled" and "token".<br />Defaults to "token". | | Enum: [disabled token] <br /> | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it’s better to set the default to disabled, as token authentication requires Ray >= 2.51.0 and some users may still be on older pinned versions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's worth mentioning the mode only defaults to token when authOptions != nil
| }, | ||
| }) | ||
|
|
||
| // Configure auth token for wait-gcs-ready init container if it exists |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is ray autoscaler also require auth token?
Future-Outlier
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my current thought:
before this PR get merged
I think we should
- test token mode
- test k8s mode
- test case with 1 and 2 + autoscaler
given that build a ray image is extremely hard, I'm going to wait for the latest ray image release.
|
@Future-Outlier let's keep this PR scoped to |
| } | ||
|
|
||
| if IsAuthEnabled(spec) { | ||
| if spec.RayVersion == "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kinda an out-there edge case but is there any validation that the RayVersion specified matches the image actually being used in the container?
Why are these changes needed?
TODO:
Related issue number
N/A
Checks