@shebistar

What does this PR do?

Add GPU as a Service content and overlay for NVIDIA GPU with OCP 4.19 and RHOAI 2.22

Test Plan

  • Deploy new cluster
  • Use bootstrap
  • Select option for rhoai-stable-2.22-aws-gpu-time-sliced

@shebistar shebistar requested a review from a team as a code owner September 2, 2025 15:17
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-dcgm-exporter-dashboard
Author

I don't see any difference, but this ConfigMap was not applied when I applied the overlay. Maybe I can include the monitoring-dashboard component in this resource.

Member

@strangiato strangiato left a comment

There is a lot going on here...

My big concern is that we are actively trying to remove the existing tenants folder and move its content here:
https://github.com/redhat-ai-services/ai-accelerator-examples/

I'm also not sure how the solutions content fits into the long-term strategy or what it's used for, but it may be better off living in the examples repo as well, or perhaps in a totally separate repo. Happy to discuss what your goals are here and how it can be maintained for the bootcamp long term.

The parasol stuff should also live outside of the ai-accelerator. Possibly in the examples repo, but ideally upstream with the official parasol resources.

For the time-slicing overlay, is the intention just that this is an easy way to get "more" GPUs in the cluster for the GPU as a Service exercises?

My general recommendation for most customers is that they should NEVER use time slicing, because workloads are going to crash and be unstable. There is a very narrow use case for time slicing, but I understand if we have a somewhat contrived scenario where this is the easiest way to set up the resources we need to demonstrate the GPU as a Service capabilities.

@shebistar
Author

I prefer the idea of using https://github.com/redhat-ai-services/ai-accelerator-examples/ to host the solution for the lab. We are trying to build an independent folder that is not tied to the tenants folder used before. I have removed it from this repo and moved it with this PR: redhat-ai-services/ai-accelerator-examples#10

Regarding time slicing, we were not able to get a GPU with MIG support from the demo team. We only got an A10 GPU, and the only MIG-capable GPU available in the AWS region we are currently working in is the A100, which is too expensive; we would need to build a business case around it. So during the Bootcamp we explain the different ways to partition a GPU, and that because of the lack of resources in our environment we will test only with time slicing. Our initial idea was to go directly with MIG, because it makes more sense for production use cases.

@shebistar shebistar requested a review from strangiato September 3, 2025 11:59
…s/console-plugin-25/, as it was duplicated from components/operators/gpu-operator-certified/instance/components/monitoring-dashboard/
Author

@shebistar shebistar left a comment

  • Regarding the Console Plugin, I removed the duplicated ConfigMap, and we will use the one in components/operators/gpu-operator-certified/instance/components/monitoring-dashboard instead.

  • Regarding Hardware Profiles, they are still Tech Preview in RHOAI 2.22. We plan to include them with the next release, RHOAI 2.23, and will then update all the code to use Hardware Profiles.

4 participants