Feature/gpu as a service #146

shebistar · 2025-09-02T15:17:15Z

What does this PR do?

Add GPU as a Service content and overlay for NVIDIA GPU with OCP 4.19 and RHOAI 2.22

Test Plan

Deploy new cluster
Use bootstrap
Select option for rhoai-stable-2.22-aws-gpu-time-sliced

...operators/openshift-ai/instance/components/model-server-pod-sizes/patch-rhoai-dashboard.yaml

...nts/operators/openshift-ai/instance/components/notebook-pod-sizes/patch-rhoai-dashboard.yaml

shebistar · 2025-09-02T16:00:03Z

...rator-certified/operator/components/console-plugin-25/nvidia-dcgm-exporter-dashboard-cm.yaml

+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: nvidia-dcgm-exporter-dashboard


I don't see any difference between the two, but this ConfigMap was not applied when I applied the overlay. Maybe I can include the monitoring-dashboard in this resource.

components/operators/gpu-operator-certified/operator/base/kustomization.yaml

strangiato

There is a lot going on here...

My big concern is that we are actively trying to remove the existing tenants folder and move its content here:
https://github.com/redhat-ai-services/ai-accelerator-examples/

I'm also not sure how the solutions folder fits into the long-term strategy or what it's used for, but it might live better in the examples repo as well, or perhaps in a totally separate repo. Happy to discuss what your goals are here and how it can be maintained for the bootcamp long term.

The parasol stuff should also live outside of the ai-accelerator. Possibly in the examples repo, but ideally upstream with the official parasol resources.

For the time-slicing overlay, is the intention simply to provide an easy way to get "more" GPUs in the cluster for the GPU as a Service exercises?

My general recommendation for most customers is that they should NEVER use time slicing, because workloads are going to crash and be unstable. There is a very narrow use case for time slicing, but I understand if we have a somewhat contrived scenario where this is the easiest way to set up the resources we need to demonstrate the GPU as a Service capabilities.

shebistar · 2025-09-03T11:59:06Z

I prefer the idea of using https://github.com/redhat-ai-services/ai-accelerator-examples/ to host the solution for the lab. We are trying to build an independent folder that is not tied to the tenants folder used before. I have removed it from this repo and moved it with this PR: redhat-ai-services/ai-accelerator-examples#10

About time slicing: we were not able to get a GPU with MIG support from the demo team. We only got A10 GPUs, and the only GPU with MIG support in the AWS region we are currently working in is the A100, which is too expensive; we would need to build a business case around it. So during the Bootcamp we explain the different ways to partition GPUs, and that because of the lack of resources in our environment we will test only with time slicing. Our initial idea was to go directly with MIG, because it makes more sense for production use cases.
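
For context, here is a minimal sketch of what a time-slicing setup for the NVIDIA GPU Operator typically looks like. It is not copied from this PR. The ConfigMap name `time-slicing-config`, the `any` key, the ClusterPolicy name `gpu-cluster-policy`, and the replica count of 4 are all illustrative assumptions.

```yaml
# Sketch only: names and the replica count are assumptions, not taken from this PR.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: nvidia-gpu-operator
data:
  # "any" is an arbitrary profile name referenced from the ClusterPolicy below
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
          # Each physical GPU is advertised as 4 schedulable nvidia.com/gpu resources
          - name: nvidia.com/gpu
            replicas: 4
---
# Partial ClusterPolicy (patch-style) that points the device plugin at the config above
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  devicePlugin:
    config:
      name: time-slicing-config
      default: any
```

With a configuration like this, the replicas share the GPU by time multiplexing and have no memory or fault isolation between them, which is why time slicing is suggested here only for the lab and not for production.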

shebistar

Regarding the Console Plugin, I removed the duplicated ConfigMap, and we will use the one in components/operators/gpu-operator-certified/instance/components/monitoring-dashboard.
About Hardware Profiles: they are still Tech Preview in RHOAI 2.2. We plan to include them with the next release, RHOAI 2.3, and will then update all the code to use Hardware Profiles.
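
As a rough illustration of reusing the existing dashboard instead of a duplicated ConfigMap, an overlay can pull it in through the Kustomize `components` field. This assumes that monitoring-dashboard is a Kustomize Component and that the overlay sits two levels below the instance folder. The relative paths are hypothetical.

```yaml
# Hypothetical overlay kustomization.yaml; the relative paths are assumptions.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - ../../base

components:
  # Reuses the existing dashboard ConfigMap rather than keeping a second copy
  - ../../components/monitoring-dashboard
```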

andifg and others added 30 commits May 7, 2025 21:39

automatic update to repo by bootstrap script

7934a2d

Add parasol-insurance-dev overlay

d3f08ef

update rhoai culler config

f3b6e4d

define workbench and model server sizes

dd59222

add custom workbench

52f0bcb

Merge branch 'redhat-ai-services:main' into main

1356ab4

update openshift-ai version

fbe241f

add minio to parasol insurance

2c48919

fix

fae1908

add standard workbench with s3 access

5dc56d1

add inference service

03d276f

add missing kustomization

6b37691

Merge branch 'redhat-ai-services:main' into main

c621f4b

add data science pipeline application

916cdcf

add rhoai 2.22 overlay

f3f736b

remove ai examples for rhoai-stable 2.22 overlay

d82cc44

tmp change repo + branch

d8c9cb5

added llama-stack distro & playground & ocp-mcp

2801121

activate tenants again

af57a43

based64 the change_me

1cb4463

added llamastack-operator

f8bf3fc

fixed sa

e723434

activate auto sync again

eb8e101

remove common overlay

6c7f31b

remove auto sync for applicationset tenants

b28f8ba

add workbench

4450571

add jupyter notebook

5602821

update llama stack notebook

a5055f3

fix notebook

6063830

Add time-slice overlay

cedfe76

Rodriguez Isaziga, Sebastian (ext) (DI IT DEMA ALM 1) and others added 15 commits August 20, 2025 11:28

Create overlay for rhoai-stable-2.22-aws-gpu-time-sliced

38b6386

Add Machineset and patch for rbac

9556790

update replicas for Machineset to 2

b8b74f8

Add kueue as Unmanaged on the overlay stable-2.22-nvidia-gpu to install RHBoK (kueue) operator

177b03d

Fix Kueue for only the rhoai-stable-2.22-aws-gpu-time-sliced overlay

298638a

remove kueue as managed in DSC

bba483e

Change from Unmanaged to Managed in DSC for kueue

85ed401

Change from Unmanaged to Managed in DSC for kueue

6e51234

Add gpu-as-a-service tenant

0781c49

update repo to feature/gpu-as-a-service branch

6e600bd

removed agentic-ai folder and fix yammlint on gpu-as-a-service

91dd43e

Create main.yml workflow validation

1bea1f5

Merge branch 'main' into feature/gpu-as-a-service

26261ec

Fix linter for aws-time-sliced-2 and aws-time-sliced-4

966fcff

move tenants/gpu-as-a-service to solutions/gpu-as-a-service

981a9a7

shebistar requested a review from a team as a code owner September 2, 2025 15:17

Rodriguez Isaziga, Sebastian (ext) (DI IT DEMA ALM 1) added 3 commits September 2, 2025 17:19

delete workflow duplicated by mistake

54971e2

Update to main branch, main repo

758d90d

Fix YAML linter errors

f73800f

strangiato reviewed Sep 2, 2025

...operators/openshift-ai/instance/components/model-server-pod-sizes/patch-rhoai-dashboard.yaml

strangiato reviewed Sep 2, 2025

...nts/operators/openshift-ai/instance/components/notebook-pod-sizes/patch-rhoai-dashboard.yaml

shebistar commented Sep 2, 2025

strangiato reviewed Sep 2, 2025

components/operators/gpu-operator-certified/operator/base/kustomization.yaml

strangiato requested changes Sep 2, 2025

shebistar requested a review from strangiato September 3, 2025 11:59

remove components/operators/gpu-operator-certified/operator/components/console-plugin-25/, as it was duplicated from components/operators/gpu-operator-certified/instance/components/monitoring-dashboard/

8dae0c4

shebistar commented Sep 3, 2025

Rodriguez Isaziga, Sebastian (ext) (DI IT DEMA ALM 1) added 2 commits September 5, 2025 14:30

automatic update to branch by bootstrap script

06a5ef6

automatic update to repo by bootstrap script

489ae8d
