@rohansonecha (Collaborator) commented Nov 20, 2025

The 1-click Deploy Prometheus Operator Application on Nebius no longer works because it relies on pulling Bitnami images that are no longer served publicly on DockerHub. This PR updates the instructions in our docs for setting up GPU metrics on Nebius to use a command that manually deploys the prometheus-community/kube-prometheus-stack chart instead.

[Screenshot: 2025-11-19, 9:17:37 PM]

I used this command to set up GPU metrics on a test API server in place of the Nebius 1-click Deploy Prometheus Operator Application, then followed the rest of the existing doc, and verified that everything worked as expected.

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

@rohansonecha self-assigned this on Nov 20, 2025

@rohansonecha (Collaborator Author):

/build-docs

@github-actions (Contributor):

✅ ReadTheDocs build triggered for branch update-nebius-docs

The documentation will be available at: https://docs.skypilot.co/en/update-nebius-docs/

@DanielZhangQD (Collaborator) left a comment

Thanks! @rohansonecha

helm upgrade --install kube-prometheus prometheus-community/kube-prometheus-stack \
--namespace skypilot \
--create-namespace \
--set prometheus.enabled=false \

Collaborator:

Prometheus is required for an external Nebius cluster, and this doc assumes the cluster is external, right? Should we also cover in-cluster configurations in the doc here?

@rohansonecha (Collaborator Author), Nov 20, 2025:

This is a great catch; I will update the command. It also seems that Nebius is working on a fix for their Prometheus Operator application. Should we hold off on merging this PR until that work is done, or update our docs now with the change in this PR to use the prometheus-community chart so that they are future-proof?

@rohansonecha (Collaborator Author), Nov 20, 2025:

Updated the command to enable Prometheus, @DanielZhangQD.
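
For readers following this thread, a minimal sketch of what such a command could look like is below. The release name, chart, and namespace come from the snippet under review; `prometheus.enabled=true` is an assumption about the update (the final command is not shown in this thread), and the `DRY_RUN` switch is a hypothetical addition so the script prints the command by default instead of touching a cluster.

```shell
#!/usr/bin/env bash
# Sketch only: install kube-prometheus-stack with Prometheus left enabled.
# Release name, chart, and namespace are taken from the reviewed snippet.
# DRY_RUN is a hypothetical switch added for illustration (default: print only).
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"

HELM_ARGS=(
  upgrade --install kube-prometheus prometheus-community/kube-prometheus-stack
  --namespace skypilot
  --create-namespace
  --set prometheus.enabled=true   # assumption: flipped from "false" in the reviewed snippet
)

if [ "$DRY_RUN" = "1" ]; then
  # Print the command instead of running it.
  echo "helm ${HELM_ARGS[*]}"
else
  # Register the community chart repository, then install or upgrade the release.
  helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
  helm repo update
  helm "${HELM_ARGS[@]}"
fi
```

Running the script with `DRY_RUN=0` would execute the helm commands against the current kubectl context.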

Collaborator Author:

I also updated the in-cluster doc.

requirements:

* **NVIDIA GPUs** are available on your worker nodes.
* The Prometheus Operator is installed.

Collaborator:

This is not required; users can also deploy Prometheus without the operator.
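
As one illustration of an operator-free deployment, the sketch below uses the standalone prometheus-community/prometheus chart, which does not install the Prometheus Operator. The release name `prometheus`, the reuse of the `skypilot` namespace, and the `DRY_RUN` switch are assumptions made for this example, not something the PR prescribes.

```shell
#!/usr/bin/env bash
# Sketch only: deploy Prometheus without the Prometheus Operator by using the
# standalone prometheus-community/prometheus chart.
# The release name "prometheus" and the DRY_RUN switch are hypothetical.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"

STANDALONE_ARGS=(
  upgrade --install prometheus prometheus-community/prometheus
  --namespace skypilot
  --create-namespace
)

if [ "$DRY_RUN" = "1" ]; then
  # Print the command instead of running it.
  echo "helm ${STANDALONE_ARGS[*]}"
else
  # Register the community chart repository, then install or upgrade the release.
  helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
  helm repo update
  helm "${STANDALONE_ARGS[@]}"
fi
```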

Collaborator Author:

Addressed

@DanielZhangQD (Collaborator) left a comment

Thanks! @rohansonecha

:alt: Deploy Prometheus Operator
:align: center
:width: 60%
.. code-block:: bash

Collaborator:

I think we can keep the application installation method, since users need to install the device plugin application from the console anyway. Alternatively, we could add these commands as a backup and tell users that if the console installation fails, they can install it manually.

Collaborator Author:

Yep, I think it makes sense to add the commands as a backup. It is easier to install the Prometheus Operator via the Nebius console, but I think there is still value in including the instructions for how to install the operator manually using the community chart in case the Nebius 1-click deploy is down.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm upgrade --install kube-prometheus prometheus-community/kube-prometheus-stack \

Collaborator:

How about splitting the command across the "Check the node exporter setup" and "Prometheus setup" sections as example commands?

Collaborator Author:

I added a section before "Check the node exporter setup" and "Prometheus setup" with instructions for deploying the Prometheus Operator and Node Exporter.

Installing the Prometheus Operator and Node Exporter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Prometheus Operator is necessary for the DCGM-Exporter to start properly. The Prometheus Operator and Node Exporter can be

Collaborator:

"The Prometheus Operator is necessary for the DCGM-Exporter to start properly." This is not true; these two components are independent. Let's just add the commands as examples, as described in the comment for L38.

Collaborator Author:

Removed this line

@rohansonecha force-pushed the update-nebius-docs branch 3 times, most recently from 2ddf520 to b8c2227 (November 24, 2025 18:50)

@rohansonecha (Collaborator Author):

@DanielZhangQD this PR is ready for a final review. I ended up removing the in-cluster instructions from this page, as I felt they added complexity and confusion to docs that are already quite clear. Please take a look when you get a chance!
