RHOAI 2.22 GPU as a Service Lab overlay #152
Conversation
```diff
@@ -0,0 +1,11 @@
+apiVersion: kustomize.config.k8s.io/v1alpha1
```
This may be a duplicate file?
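For context on the duplication question, a Kustomize component is declared with a minimal `kustomization.yaml` like the sketch below (the resource filename is hypothetical; the actual files live in this overlay). Two components that list the same resources would effectively be duplicates:

```yaml
apiVersion: kustomize.config.k8s.io/v1alpha1
kind: Component
resources:
  # hypothetical file name; replace with the overlay's actual manifests
  - time-slicing-configmap.yaml
```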
```diff
@@ -0,0 +1,40 @@
+---
+apiVersion: batch/v1
+kind: Job
```
Is this different from the aws-gpu-machineset component?
It looks like it is deploying the same GPU instance type and my quick check doesn't seem to deploy something different.
If it is different I would prefer to move this portion into its own component.
The main reason is that the aws-gpu job is really a workaround to get GPUs into the demo environment, not something we would generally use in a customer environment. We want to be able to easily remove the job that creates the GPU machinesets while still being able to use time-slicing.
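To illustrate what the time-slicing portion looks like on its own if it were split into a separate component: the NVIDIA GPU Operator reads a sharing config from a ConfigMap along the lines of the sketch below (names and the replica count are assumptions, not taken from this PR):

```yaml
# Hedged sketch of an NVIDIA GPU Operator time-slicing config.
# ConfigMap name, key, and replica count are assumptions for illustration.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: nvidia-gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # each physical GPU advertised as 4 schedulable GPUs
```

Keeping this in its own component means the machineset-creating Job can be dropped for customer environments without losing the time-slicing configuration.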
Everything looks pretty good.
I made a few minor nitpicks; I'd appreciate your thoughts on them to help keep our options flexible in the future.
```diff
@@ -0,0 +1,25 @@
+# components-distributed-compute
```
Don't forget to update the readme
```yaml
components:
  codeflare:
    managementState: Managed
  kueue:
```
I'm wondering if it would make more sense to call this something besides kueue-operator since this is kind of the opposite.
I wouldn't mind updating the current component-distributed-compute to disable the kueue operator and including the Red Hat build of Kueue in the AI Accelerator by default since that will be the preferred method moving forward.
I don't think we have any backwards-compatibility concerns, since kueue/distributed compute was not really used in any of the existing examples.
Thoughts?
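To make the suggestion concrete, disabling the embedded Kueue in the DataScienceCluster would look roughly like the sketch below (a hedged example: the CR name is assumed, and whether `Removed` or another management state is the right value when pairing with the Red Hat build of Kueue should be checked against the RHOAI docs for the target version):

```yaml
# Hypothetical sketch: turn off the RHOAI-embedded Kueue so the
# separately installed Red Hat build of Kueue manages queues instead.
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc   # assumed name
spec:
  components:
    codeflare:
      managementState: Managed
    kueue:
      managementState: Removed   # assumption; verify the state RHOAI expects
```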
What does this PR do?
Adds a new overlay for the GPU as a Service Lab, using RHOAI 2.22 and OCP 4.19.
Test Plan
Deploy a new demo cluster --> execute bootstrap --> select the rhoai-stable-2.22-aws-gpu-time-sliced bootstrap.