
Conversation

@shebistar

What does this PR do?

Add a new overlay for the GPU as a Service Lab, using RHOAI 2.22 and OCP 4.19
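
Roughly speaking, the new overlay composes an existing RHOAI base with a time-slicing component. As a sketch only (the paths and component names below are illustrative, not the actual files in this PR):

```yaml
# kustomization.yaml for the new overlay -- a sketch; paths are hypothetical
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - ../rhoai-stable-2.22-aws-gpu        # hypothetical base overlay with the GPU machineset job

components:
  - ../../components/gpu-time-slicing   # hypothetical component enabling GPU time-slicing
```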

Test Plan

Deploy a new demo cluster --> execute the bootstrap script --> select the `rhoai-stable-2.22-aws-gpu-time-sliced` bootstrap.

@shebistar requested a review from a team as a code owner on September 19, 2025 at 12:57
@@ -0,0 +1,11 @@
apiVersion: kustomize.config.k8s.io/v1alpha1

This may be a duplicate file?

@@ -0,0 +1,40 @@
---
apiVersion: batch/v1
kind: Job

Is this different from the aws-gpu-machineset component?

It looks like it is deploying the same GPU instance type, and from my quick check it doesn't seem to deploy anything different.

If it is different I would prefer to move this portion into its own component.

The main reason is that the aws-gpu job is really a workaround to get GPUs into the demo environment, not something we would generally use in a customer environment. We want to be able to easily remove the job that creates the GPU machinesets while still being able to use time-slicing.
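
As a sketch of what a standalone time-slicing component could look like (the names, paths, and replica count here are assumptions, not taken from this PR):

```yaml
# kustomization.yaml -- hypothetical standalone time-slicing component
apiVersion: kustomize.config.k8s.io/v1alpha1
kind: Component

resources:
  - time-slicing-configmap.yaml

patches:
  - target:
      kind: ClusterPolicy
      name: gpu-cluster-policy          # assumes the GPU operator's default ClusterPolicy name
    patch: |-
      - op: add
        path: /spec/devicePlugin/config
        value:
          name: time-slicing-config
          default: time-sliced
---
# time-slicing-configmap.yaml -- the NVIDIA device-plugin sharing config
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: nvidia-gpu-operator
data:
  time-sliced: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4                 # each physical GPU is advertised as 4 schedulable GPUs
```

That way the aws-gpu machineset job and the time-slicing config could be added or removed independently.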


@strangiato left a comment


Everything looks pretty good.

I made a few minor nitpicks and would appreciate your thoughts on them, to help make sure we keep our options flexible in the future.

@@ -0,0 +1,25 @@
# components-distributed-compute

Don't forget to update the README.

components:
  codeflare:
    managementState: Managed
  kueue:

I'm wondering if it would make more sense to call this something besides `kueue-operator`, since this is kind of the opposite.

I wouldn't mind updating the current component-distributed-compute to disable the kueue operator and include the Red Hat build of Kueue in the AI Accelerator by default, since that will be the preferred method moving forward.

I don't think we have any backwards-compatibility concerns, since Kueue/distributed compute was not really used in any of the existing examples.
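
Concretely, I'd picture the default DSC ending up something like this (a sketch; the exact managementState for pairing with the Red Hat build of Kueue may differ by RHOAI version):

```yaml
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    codeflare:
      managementState: Managed
    kueue:
      # Sketch: disable the embedded Kueue component so the separately
      # installed Red Hat build of Kueue can own the workload queues instead.
      managementState: Removed
```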

Thoughts?

