-
Notifications
You must be signed in to change notification settings - Fork 851
[Docs] update docs for setting up gpu metrics on nebius #8026
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
/build-docs |
|
✅ ReadTheDocs build triggered for branch The documentation will be available at: https://docs.skypilot.co/en/update-nebius-docs/ |
DanielZhangQD
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! @rohansonecha
| helm upgrade --install kube-prometheus prometheus-community/kube-prometheus-stack \ | ||
| --namespace skypilot \ | ||
| --create-namespace \ | ||
| --set prometheus.enabled=false \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prometheus is required for the external Nebius cluster, the cluster is assumed to be an external cluster in this doc, right? For in-cluster configurations, we may need to include them in the doc here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great catch, I will update the command. Also, it seems that Nebius is working on a fix for their Prometheus Operator application. I wonder if we should hold off on merging this PR until that work is done? Or just update our docs with the change in this PR to use the prometheus community chart to be future proof.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the command to enable prometheus @DanielZhangQD
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also updated the in-cluster doc.
d173632 to
0f345d0
Compare
| requirements: | ||
|
|
||
| * **NVIDIA GPUs** are available on your worker nodes. | ||
| * The Prometheus Operator is installed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not required, users can also deploy Prom without operator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed
DanielZhangQD
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! @rohansonecha
| :alt: Deploy Prometheus Operator | ||
| :align: center | ||
| :width: 60% | ||
| .. code-block:: bash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can keep the application installation method, users need to install the device plugin application on the console anyway. Or add these commands as a backup? Tell the users that if this way fails, they can install it manually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I think it makes sense to add the commands as a backup. It is easier to install the Prometheus Operator via the Nebius console, but I think there is still value in including the instructions for how to install the operator manually using the community chart in case the Nebius 1-click deploy is down.
| helm repo add prometheus-community https://prometheus-community.github.io/helm-charts | ||
| helm repo update | ||
| helm upgrade --install kube-prometheus prometheus-community/kube-prometheus-stack \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about we separate the command into the Check the node exporter setup and Prometheus setup sections as example commands.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a section before Check the node exporter setup and Prometheus setup with instructions for deploying the prometheus operator and node exporter.
| Installing the Prometheus Operator and Node Exporter | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| The Prometheus Operator is necessary for the DCGM-Exporter to start properly. The Prometheus Operator and Node Exporter can be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Prometheus Operator is necessary for the DCGM-Exporter to start properly. This is not true, these two components are independent. Let's just add the command as example commands as described in the comment for L38.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed this line
2ddf520 to
b8c2227
Compare
b8c2227 to
081e0e0
Compare
|
@DanielZhangQD this PR is ready for a final review. I ended up removing the in-cluster instructions from this page, as I felt it was adding additional complexity/confusion to the docs which are already quite clear. Please take a look when you get a chance! |
The 1-click Deploy Prometheus Operator Application on Nebius no longer works because it relies on pulling Bitnami images that are not longer being served publicly on DockerHub. This PR updates the instructions in our docs for setting up gpu metrics on nebius to use a command to manually deploy the
prometheus-community/kube-prometheus-stackchart instead.I used this command to set up GPU metrics on a test api server and verified that using this command instead of the Nebius 1-click Deploy Prometheus Operator Application and then proceeding to follow the rest of the existing doc worked as expected.
Tested (run the relevant ones):
bash format.sh/smoke-test(CI) orpytest tests/test_smoke.py(local)/smoke-test -k test_name(CI) orpytest tests/test_smoke.py::test_name(local)/quicktest-core(CI) orpytest tests/smoke_tests/test_backward_compat.py(local)