Closed
2 changes: 2 additions & 0 deletions docs/source/extension/linting.py
@@ -88,6 +88,8 @@
'Prime Intellect',
'Cloudflare Zero Trust',
'CoreWeave Object Storage',
'Prometheus Operator',
'Node Exporter',
}


@@ -455,6 +455,24 @@ You can also check the Pod status to verify the installation.

kubectl get pods -n skypilot

If the installation runs into issues, such as pods stuck in ``ErrImagePull`` or ``ImagePullBackOff``,
you can install the Prometheus Operator manually using the community Helm chart:

.. code-block:: bash
Collaborator

I think we can keep the application installation method, since users need to install the device plugin application on the console anyway. Or should we add these commands as a backup? We could tell users that if the console installation fails, they can install it manually.

Collaborator Author

Yep, I think it makes sense to add the commands as a backup. It is easier to install the Prometheus Operator via the Nebius console, but I think there is still value in including the instructions for how to install the operator manually using the community chart in case the Nebius 1-click deploy is down.

Collaborator

In this case, do we need to delete the application first, and does the following command need to include the customized images?


helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm upgrade --install kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace skypilot \
  --create-namespace \
  --set prometheus.enabled=true \
  --set alertmanager.enabled=false \
  --set grafana.enabled=false \
  --set kubeStateMetrics.enabled=false \
  --set nodeExporter.enabled=true \
  --set prometheusOperator.enabled=true

By default, the CPU and memory metrics exported by Node Exporter do not include the ``node`` label, which the SkyPilot dashboard requires to display the metrics. To add the ``node`` label, apply the following config to the Node Exporter service monitor resource:

.. code-block:: bash