A comprehensive solution for deploying Prometheus, Grafana, node_exporter, and DCGM for monitoring compute nodes and GPU metrics at NERSC.
This toolkit provides a set of scripts to easily deploy a complete metrics collection and visualization stack on NERSC systems. It's designed to work within the NERSC JupyterHub environment and supports dynamic registration of monitoring targets through HTTP service discovery.
Figure 1: Example deployment showing how the NERSC Metrics scripts can be used to monitor multiple compute nodes, with each node running node_exporter and DCGM collectors that register with the central Prometheus instance
- Prometheus for metrics collection and storage
- Grafana for metrics visualization and dashboarding
- node_exporter for system-level metrics (CPU, memory, network, etc.)
- DCGM exporter for NVIDIA GPU metrics
- HTTP Service Discovery API for dynamic target registration (expected response format shown below)
- Works within NERSC JupyterHub environment
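Prometheus's HTTP service discovery consumes a JSON array of target groups; the shape shown in the comment below is the format Prometheus documents for `http_sd_configs`. The `/targets` path on the toolkit's API is an assumption here; check the FastAPI app for the actual route.

```bash
# Query the discovery endpoint (the /targets path is illustrative; the real
# route is defined by the toolkit's FastAPI app). Prometheus expects a JSON
# array of target groups like the one shown in the comment below.
curl -s http://hostname:8080/targets
# [
#   {"targets": ["nid001234:9100"], "labels": {"job": "node_exporter"}}
# ]
```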
- Clone the repository to your local machine:

```bash
git clone https://github.com/yourusername/nersc-metrics-scripts.git
cd nersc-metrics-scripts
```

- Install dependencies and download node_exporter:

```bash
make all
```
This will:
- Create a Python user base directory in your SCRATCH space
- Install required Python packages (FastAPI, uvicorn)
- Download and extract node_exporter
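Roughly, `make all` automates steps like the following. This is a sketch only; the exact directory name, package list, and node_exporter version are defined in the Makefile.

```bash
# Sketch of what `make all` automates (assumed paths and version tag)
export PYTHONUSERBASE="$SCRATCH/python-user-base"   # assumed directory name in SCRATCH
pip install --user fastapi uvicorn                  # HTTP service discovery API deps
# download and unpack node_exporter (version tag here is illustrative)
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz
tar xzf node_exporter-1.8.2.linux-amd64.tar.gz
```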
Launch the core monitoring infrastructure within JupyterHub:
```bash
./start_grafana_prometheus.sh
```

This script:
- Starts the HTTP service discovery API
- Deploys Prometheus in a container
- Deploys Grafana in a container
- Configures all services to work within JupyterHub
After running, you'll see URLs to access Prometheus and Grafana through the JupyterHub proxy.
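Once the script prints the proxy URLs, you can sanity-check the services from the same terminal using the standard health endpoints that Prometheus and Grafana expose (adjust the ports if you've changed the defaults):

```bash
# Prometheus readiness endpoint (returns "Prometheus Server is Ready." when up)
curl -s http://localhost:9090/-/ready
# Grafana health endpoint (returns a small JSON status document)
curl -s http://localhost:3000/api/health
```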
To monitor a compute node, run one or both of these scripts:
For system metrics:

```bash
./start_node_exporter_collector.sh http://hostname:8080 &
```

For NVIDIA GPU metrics:

```bash
./start_dcgm_collector.sh http://hostname:8080 &
```

Replace `hostname:8080` with the address of your HTTP service discovery API.
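For example, within a Slurm allocation you could launch a collector on every node. This is a sketch; the node count and the discovery API address are placeholders:

```bash
# Assumes the HTTP service discovery API is reachable from the compute nodes
SD_URL="http://hostname:8080"
# one collector task per node; adjust -N to match your allocation
srun -N 2 --ntasks-per-node=1 ./start_node_exporter_collector.sh "$SD_URL" &
srun -N 2 --ntasks-per-node=1 ./start_dcgm_collector.sh "$SD_URL" &
```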
After deployment, you can access:
- Prometheus: https://jupyter.nersc.gov/user/<username>/proxy/9090/query
- Grafana: https://jupyter.nersc.gov/user/<username>/proxy/3000/login
Default Grafana login is admin/admin. You'll be prompted to change the password on first login.
- Log in to Grafana
- Add Prometheus as a data source:
  - URL: http://localhost:9090
  - Access: Server (default)
- Import the provided dashboard or create your own (an API-based alternative is sketched below)
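As an alternative to the UI steps above, Grafana's HTTP API can create the data source. This sketch assumes the default `admin`/`admin` credentials and port; `"access": "proxy"` is the API's name for the Server access mode:

```bash
curl -s -u admin:admin \
  -H 'Content-Type: application/json' \
  -X POST http://localhost:3000/api/datasources \
  -d '{"name": "Prometheus", "type": "prometheus", "url": "http://localhost:9090", "access": "proxy"}'
```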
Here’s a preview of the Grafana dashboard:
- Services not starting:
  - Check if ports are already in use (see the commands after this list)
  - Verify you have the necessary permissions
- Cannot access dashboards:
  - Ensure JupyterHub is running
  - Check the service prefix in the URLs
- No metrics appearing:
  - Verify collectors are running
  - Check the Prometheus targets page for errors
  - Check the terminal output for error messages
  - Prometheus logs are available in the Prometheus web UI under Status > Runtime & Build Information
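A few commands that help with the checks above (the ports shown are the defaults):

```bash
# Are the default ports already taken by another process?
ss -tlnp | grep -E ':(9090|3000|8080)\b'
# Are the collectors registered and healthy? Query Prometheus's targets API.
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[a-z]*"'
```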
To stop all services, press Ctrl+C in the terminal where you started them.
To remove all installed components:
```bash
make clean
```

Edit `prometheus_template.yml` to modify the Prometheus configuration, such as:
- Scrape intervals
- Retention policies
- Alert rules
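For example, a scrape-interval tweak might look like the following. This is a sketch: the actual template's structure, job names, and the discovery endpoint path may differ, so compare against `prometheus_template.yml` itself.

```yaml
global:
  scrape_interval: 15s        # how often Prometheus scrapes targets
  evaluation_interval: 15s    # how often alert/recording rules are evaluated

scrape_configs:
  - job_name: compute_nodes
    http_sd_configs:
      # endpoint path is illustrative; use the toolkit's actual discovery URL
      - url: http://localhost:8080/targets
        refresh_interval: 30s
```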
You can modify the container images used in `start_grafana_prometheus.sh`:

- Change `PROM_IMAGE` to use a different Prometheus version
- Change `GF_IMAGE` to use a different Grafana version
- Adjust container parameters as needed
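For instance, pinning specific upstream versions might look like this, assuming the variables hold full image references (the tags shown are illustrative):

```bash
# in start_grafana_prometheus.sh
PROM_IMAGE="prom/prometheus:v2.53.0"   # official Prometheus image, pinned tag
GF_IMAGE="grafana/grafana:11.1.0"      # official Grafana image, pinned tag
```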
If the default ports conflict with other services, you can modify:
- `PROM_PORT` (default: 9090) for Prometheus
- `GF_PORT` (default: 3000) for Grafana
- `HTTP_SD_PORT` (default: 8080) for the HTTP service discovery API
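If the script reads these variables from the environment (check the script; this is an assumption, and otherwise you can edit the values at its top), an override could look like:

```bash
# hypothetical environment override of the default ports
PROM_PORT=9091 GF_PORT=3001 HTTP_SD_PORT=8081 ./start_grafana_prometheus.sh
```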
In `start_node_exporter_collector.sh`, you can enable or disable specific collectors:

- Add or remove `--collector.<name>` flags to customize which system metrics are collected (example after this list)
- Currently enabled collectors include CPU, memory, network devices, InfiniBand, and network stats
- See the node_exporter documentation for all available collectors
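As a concrete illustration of the flag syntax (the toolkit's script may pass a different set; note that node_exporter enables many collectors by default, `--no-collector.<name>` disables one, and `--collector.disable-defaults` lets you opt in to an explicit set):

```bash
# run node_exporter with only an explicit set of collectors enabled
./node_exporter \
  --collector.disable-defaults \
  --collector.cpu \
  --collector.meminfo \
  --collector.netdev \
  --collector.infiniband \
  --collector.netstat
```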
For GPU metrics in `start_dcgm_collector.sh`:

- Modify the `dcgm_metrics.csv` file to select which GPU metrics to collect (example after this list)
- Adjust the collection interval with the `-c` parameter (in milliseconds)
- See the DCGM Exporter documentation for more options
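In dcgm-exporter's documented counters format, the metrics file lists one DCGM field per line as `<field>, <prometheus type>, <help text>`. A minimal example, plus an invocation using the flags dcgm-exporter documents (`-f` for the collectors file, `-c` for the interval in milliseconds); the toolkit's script may wrap this differently:

```bash
# write a minimal metrics file: field, prometheus type, help text per line
cat > dcgm_metrics.csv <<'EOF'
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
EOF
# collect the listed fields every 1000 ms
dcgm-exporter -f dcgm_metrics.csv -c 1000
```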
- Create a new script similar to the existing collector scripts (see `start_dcgm_collector.sh` for reference)
- Register your exporter with the HTTP service discovery API using the appropriate labels (a hypothetical example follows)
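A hypothetical registration request; the actual route, payload schema, and port are defined by the toolkit's FastAPI service, so inspect its source for the real interface:

```bash
# /register, the payload shape, and port 9100 are assumptions for illustration
curl -s -X POST "http://hostname:8080/register" \
  -H 'Content-Type: application/json' \
  -d "{\"targets\": [\"$(hostname):9100\"], \"labels\": {\"job\": \"my_exporter\"}}"
```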
