[Docs] update docs for setting up gpu metrics on nebius #8026

rohansonecha · 2025-11-20T05:14:35Z

The 1-click Deploy Prometheus Operator Application on Nebius no longer works because it relies on pulling Bitnami images that are no longer served publicly on DockerHub. This PR updates the instructions in our docs for setting up GPU metrics on Nebius to manually deploy the prometheus-community/kube-prometheus-stack chart with a command instead.

I used this command to set up GPU metrics on a test API server. Running it instead of the Nebius 1-click Deploy Prometheus Operator Application, and then following the rest of the existing doc, worked as expected.

Tested (run the relevant ones):

Code formatting: install pre-commit (auto-check on commit) or bash format.sh
Any manual or new tests for this PR (please specify below)
All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

rohansonecha · 2025-11-20T05:15:07Z

/build-docs

github-actions · 2025-11-20T05:15:19Z

✅ ReadTheDocs build triggered for branch update-nebius-docs

The documentation will be available at: https://docs.skypilot.co/en/update-nebius-docs/

DanielZhangQD

Thanks! @rohansonecha

DanielZhangQD · 2025-11-20T06:13:02Z

docs/source/reference/api-server/examples/example-deploy-gke-nebius-okta.rst

+  helm upgrade --install kube-prometheus prometheus-community/kube-prometheus-stack \
+    --namespace skypilot \
+    --create-namespace \
+    --set prometheus.enabled=false \


Prometheus is required for the external Nebius cluster. The cluster is assumed to be an external cluster in this doc, right? For in-cluster configurations, we may need to include the instructions in the doc here as well?

This is a great catch; I will update the command. Also, it seems that Nebius is working on a fix for their Prometheus Operator application. I wonder if we should hold off on merging this PR until that work is done, or just update our docs with the change in this PR to use the Prometheus community chart so they are future-proof.

Updated the command to enable prometheus @DanielZhangQD

I also updated the in-cluster doc.
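
For readers following the thread, here is a minimal sketch of what the updated command might look like for the external-cluster case. It is assembled only from the diff excerpts quoted in this conversation: the repo URL, release name, chart, namespace, and `--create-namespace` flag come from those excerpts, and `prometheus.enabled=true` reflects the change described above. Any flags that followed in the actual PR are not shown in the thread and are omitted here. The `run` helper and the `DRY_RUN` variable are hypothetical additions for illustration; by default the script only prints the commands.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Values taken from the diff excerpts in this thread.
RELEASE="kube-prometheus"
NAMESPACE="skypilot"
CHART="prometheus-community/kube-prometheus-stack"
REPO_URL="https://prometheus-community.github.io/helm-charts"

# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

install_kube_prometheus_stack() {
  run helm repo add prometheus-community "$REPO_URL"
  run helm repo update
  # Assumption: Prometheus stays enabled for the external Nebius cluster,
  # as discussed in the review. Further flags from the PR are not shown
  # in the thread and are intentionally left out.
  run helm upgrade --install "$RELEASE" "$CHART" \
    --namespace "$NAMESPACE" \
    --create-namespace \
    --set prometheus.enabled=true
}

install_kube_prometheus_stack
```

Running the script as-is prints the three helm commands; setting `DRY_RUN=0` would execute them against the current kubectl context, which assumes helm is installed and the context points at the Nebius cluster.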

DanielZhangQD · 2025-11-21T02:16:05Z

docs/source/reference/api-server/examples/api-server-gpu-metrics-setup.rst

 requirements:

 * **NVIDIA GPUs** are available on your worker nodes.
+* The Prometheus Operator is installed.


This is not required; users can also deploy Prometheus without the operator.

DanielZhangQD

Thanks! @rohansonecha

DanielZhangQD · 2025-11-21T02:18:30Z

docs/source/reference/api-server/examples/example-deploy-gke-nebius-okta.rst

-    :alt: Deploy Prometheus Operator
-    :align: center
-    :width: 60%
+.. code-block:: bash


I think we can keep the application installation method, since users need to install the device plugin application on the console anyway. Or add these commands as a backup? Tell the users that if the console method fails, they can install it manually.

Yep, I think it makes sense to add the commands as a backup. It is easier to install the Prometheus Operator via the Nebius console, but I think there is still value in including the instructions for how to install the operator manually using the community chart in case the Nebius 1-click deploy is down.

DanielZhangQD · 2025-11-21T02:20:49Z

docs/source/reference/api-server/examples/api-server-gpu-metrics-setup.rst

+    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
+    helm repo update
+
+    helm upgrade --install kube-prometheus prometheus-community/kube-prometheus-stack \


How about separating the command into the "Check the node exporter setup" and "Prometheus setup" sections as example commands?

I added a section before "Check the node exporter setup" and "Prometheus setup" with instructions for deploying the Prometheus Operator and Node Exporter.

DanielZhangQD · 2025-11-21T02:24:12Z

docs/source/reference/api-server/examples/api-server-gpu-metrics-setup.rst

+Installing the Prometheus Operator and Node Exporter
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The Prometheus Operator is necessary for the DCGM-Exporter to start properly. The Prometheus Operator and Node Exporter can be


"The Prometheus Operator is necessary for the DCGM-Exporter to start properly." This is not true; these two components are independent. Let's just add the commands as example commands, as described in the comment for L38.

Removed this line

rohansonecha · 2025-11-24T18:58:18Z

@DanielZhangQD this PR is ready for a final review. I ended up removing the in-cluster instructions from this page, as I felt it was adding additional complexity/confusion to the docs which are already quite clear. Please take a look when you get a chance!

update docs for setting up gpu metrics on nebius

3bd9526

rohansonecha requested a review from DanielZhangQD November 20, 2025 05:14

rohansonecha self-assigned this Nov 20, 2025

github-actions bot added the rtd-preview label Nov 20, 2025

DanielZhangQD reviewed Nov 20, 2025

rohansonecha added 4 commits November 20, 2025 12:04

enable prometheus for external cluster

61f4f2a

update in cluster docs

9d0a192

proper noun

fda6a1c

node exporter

0f345d0

rohansonecha force-pushed the update-nebius-docs branch from d173632 to 0f345d0 Compare November 20, 2025 21:05

DanielZhangQD reviewed Nov 21, 2025

rohansonecha force-pushed the update-nebius-docs branch 3 times, most recently from 2ddf520 to b8c2227 Compare November 24, 2025 18:50

accuracy

081e0e0

rohansonecha force-pushed the update-nebius-docs branch from b8c2227 to 081e0e0 Compare November 24, 2025 18:50

rohansonecha added 2 commits November 24, 2025 10:51

remove white space

0615815

minimize change radius

d9ef4b7

rohansonecha requested a review from DanielZhangQD November 24, 2025 18:58
