Commit b8c2227

committed
accuracy
1 parent 0f345d0 commit b8c2227

2 files changed

+49
-26
lines changed

docs/source/reference/api-server/examples/api-server-gpu-metrics-setup.rst

Lines changed: 21 additions & 22 deletions
@@ -18,33 +18,11 @@ Before you begin, make sure your Kubernetes cluster meets the following
1818
requirements:
1919

2020
* **NVIDIA GPUs** are available on your worker nodes.
21-
* The Prometheus Operator is installed.
2221
* The `NVIDIA device plugin <https://github.com/NVIDIA/k8s-device-plugin>`_ or the NVIDIA **GPU Operator** is installed.
2322
* **DCGM-Exporter** is running on the cluster and exposes metrics on
2423
port ``9400``. Most GPU Operator installations already deploy DCGM-Exporter for you.
2524
* `Node Exporter <https://prometheus.io/docs/guides/node-exporter/>`_ is running on the cluster and exposes metrics on port ``9100``. This is required only if you want to monitor the CPU and memory metrics.
2625

27-
Installing the Prometheus Operator and Node Exporter
28-
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
29-
30-
The Prometheus Operator is necessary for the DCGM-Exporter to start properly. The Prometheus Operator and Node Exporter can be
31-
deployed using the prometheus community helm chart:
32-
33-
.. code-block:: bash
34-
35-
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
36-
helm repo update
37-
38-
helm upgrade --install kube-prometheus prometheus-community/kube-prometheus-stack \
39-
--namespace skypilot \
40-
--create-namespace \
41-
--set prometheus.enabled=false \
42-
--set alertmanager.enabled=false \
43-
--set grafana.enabled=false \
44-
--set kubeStateMetrics.enabled=false \
45-
--set nodeExporter.enabled=true \
46-
--set prometheusOperator.enabled=true
47-
4826
Check the dcgm exporter setup
4927
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
5028

@@ -104,6 +82,27 @@ If any are missing, edit the Service to add them.
10482
10583
where ``$NAMESPACE`` is the DCGM-Exporter namespace.
10684

85+
Deploying the Prometheus Operator and Node Exporter
86+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
87+
88+
The Prometheus Operator and Node Exporter can be
89+
deployed using the Prometheus community Helm chart:
90+
91+
.. code-block:: bash
92+
93+
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
94+
helm repo update
95+
96+
helm upgrade --install kube-prometheus prometheus-community/kube-prometheus-stack \
97+
--namespace skypilot \
98+
--create-namespace \
99+
--set prometheus.enabled=false \
100+
--set alertmanager.enabled=false \
101+
--set grafana.enabled=false \
102+
--set kubeStateMetrics.enabled=false \
103+
--set nodeExporter.enabled=true \
104+
--set prometheusOperator.enabled=true
105+
107106
Check the node exporter setup
108107
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
109108

docs/source/reference/api-server/examples/example-deploy-gke-nebius-okta.rst

Lines changed: 28 additions & 4 deletions
@@ -430,6 +430,34 @@ If you are using Nebius Kubernetes cluster, you can setup GPU metrics in the clu
430430
431431
1. Install the Prometheus operator.
432432
433+
In the Nebius console, on the details page of the Nebius Kubernetes cluster, go to ``Applications`` -> Search for ``Prometheus Operator`` -> ``Deploy`` -> Enter ``skypilot`` for the ``Namespace`` -> ``Deploy application``.
434+
435+
.. image:: ../../../images/metrics/search-prom-operator.png
436+
:alt: Search for Prometheus Operator
437+
:align: center
438+
:width: 60%
439+
440+
.. image:: ../../../images/metrics/deploy-prom-operator.png
441+
:alt: Deploy Prometheus Operator
442+
:align: center
443+
:width: 60%
444+
445+
Wait for the Prometheus operator to be installed; the status badge will change to ``Deployed``.
446+
447+
.. image:: ../../../images/metrics/status-prom-operator.png
448+
:alt: Status of Prometheus Operator
449+
:align: center
450+
:width: 60%
451+
452+
You can also check the Pod status to verify the installation.
453+
454+
.. code-block:: bash
455+
456+
kubectl get pods -n skypilot
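Image pull failures appear in the ``STATUS`` column of that output as ``ErrImagePull`` or ``ImagePullBackOff``. As a minimal sketch, the helper below filters the output for them. ``find_image_pull_failures`` is a hypothetical name, not part of kubectl or SkyPilot, and it assumes the default ``kubectl get pods`` column layout (``NAME READY STATUS RESTARTS AGE``).

```shell
# Hypothetical helper (not part of kubectl or SkyPilot).
# Reads `kubectl get pods` output on stdin and prints the names of pods
# whose STATUS column mentions ErrImagePull or ImagePullBackOff.
# Assumes the default column layout: NAME READY STATUS RESTARTS AGE.
find_image_pull_failures() {
  # NR > 1 skips the header row; $3 is the STATUS column.
  # A regex match also catches init-container states such as
  # "Init:ImagePullBackOff".
  awk 'NR > 1 && $3 ~ /ErrImagePull|ImagePullBackOff/ { print $1 }'
}

# Usage (assumes kubectl is configured for the cluster):
#   kubectl get pods -n skypilot | find_image_pull_failures
```

If the helper prints nothing, no pod in the namespace is currently reporting an image pull failure.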
457+
458+
If the installation runs into issues, such as pods stuck in ``ErrImagePull`` or ``ImagePullBackOff``,
459+
you can install the Prometheus operator manually using the community Helm chart:
460+
433461
.. code-block:: bash
434462
435463
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
@@ -445,11 +473,7 @@ If you are using Nebius Kubernetes cluster, you can setup GPU metrics in the clu
445473
--set nodeExporter.enabled=true \
446474
--set prometheusOperator.enabled=true
447475
448-
You can check the Pod status to verify the installation.
449476
450-
.. code-block:: bash
451-
452-
kubectl get pods -n skypilot
453477
454478
By default, the CPU and memory metrics exported by node exporter do not include the ``node`` label, which is required for the SkyPilot dashboard to display the metrics. You can add the ``node`` label to the metrics by applying the following config to the node exporter service monitor resource:
455479
