A comprehensive solution for deploying Prometheus, Grafana, node_exporter, and DCGM for monitoring compute nodes and GPU metrics at NERSC.
This toolkit provides a set of scripts to easily deploy a complete metrics collection and visualization stack on NERSC systems. It's designed to work within the NERSC JupyterHub environment and supports dynamic registration of monitoring targets through HTTP service discovery.
Figure 1: Example deployment showing how the NERSC Metrics scripts can be used to monitor multiple compute nodes, with each node running node_exporter and DCGM collectors that register with the central Prometheus instance
- Prometheus for metrics collection and storage
- Grafana for metrics visualization and dashboarding
- node_exporter for system-level metrics (CPU, memory, network, etc.)
- DCGM exporter for NVIDIA GPU metrics
- HTTP Service Discovery API for dynamic target registration (expected response format shown below)
- Works within NERSC JupyterHub environment
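Prometheus's HTTP service discovery consumes a JSON array of target groups; the shape shown in the comment below is the format Prometheus documents for `http_sd_configs`. The `/targets` path on the toolkit's API is an assumption here; check the FastAPI app for the actual route.

```bash
# Query the discovery endpoint (the /targets path is illustrative; the real
# route is defined by the toolkit's FastAPI app). Prometheus expects a JSON
# array of target groups like the one shown in the comment below.
curl -s http://hostname:8080/targets
# [
#   {"targets": ["nid001234:9100"], "labels": {"job": "node_exporter"}}
# ]
```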
- Clone the repository to your local machine:

```bash
git clone https://github.com/yourusername/nersc-metrics-scripts.git
cd nersc-metrics-scripts
```

- Install dependencies and download node_exporter:

```bash
make all
```
This will:
- Create a Python user base directory in your SCRATCH space
- Install required Python packages (FastAPI, uvicorn)
- Download and extract node_exporter
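Roughly, `make all` automates steps like the following. This is a sketch only; the exact directory name, package list, and node_exporter version are defined in the Makefile.

```bash
# Sketch of what `make all` automates (assumed paths and version tag)
export PYTHONUSERBASE="$SCRATCH/python-user-base"   # assumed directory name in SCRATCH
pip install --user fastapi uvicorn                  # HTTP service discovery API deps
# download and unpack node_exporter (version tag here is illustrative)
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz
tar xzf node_exporter-1.8.2.linux-amd64.tar.gz
```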
Launch the core monitoring infrastructure within JupyterHub:
```bash
./start_grafana_prometheus.sh
```

This script:
- Starts the HTTP service discovery API
- Deploys Prometheus in a container
- Deploys Grafana in a container
- Configures all services to work within JupyterHub
After running, you'll see URLs to access Prometheus and Grafana through the JupyterHub proxy.
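Once the script prints the proxy URLs, you can sanity-check the services from the same terminal using the standard health endpoints that Prometheus and Grafana expose (adjust the ports if you've changed the defaults):

```bash
# Prometheus readiness endpoint (returns "Prometheus Server is Ready." when up)
curl -s http://localhost:9090/-/ready
# Grafana health endpoint (returns a small JSON status document)
curl -s http://localhost:3000/api/health
```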
To monitor a compute node, run one or both of these scripts:
For system metrics:

```bash
./start_node_exporter_collector.sh http://hostname:8080 &
```

For NVIDIA GPU metrics:

```bash
./start_dcgm_collector.sh http://hostname:8080 &
```

Replace `hostname:8080` with the address of your HTTP service discovery API.
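For example, within a Slurm allocation you could launch a collector on every node. This is a sketch; the node count and the discovery API address are placeholders:

```bash
# Assumes the HTTP service discovery API is reachable from the compute nodes
SD_URL="http://hostname:8080"
# one collector task per node; adjust -N to match your allocation
srun -N 2 --ntasks-per-node=1 ./start_node_exporter_collector.sh "$SD_URL" &
srun -N 2 --ntasks-per-node=1 ./start_dcgm_collector.sh "$SD_URL" &
```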
After deployment, you can access:
- Prometheus: https://jupyter.nersc.gov/user/<username>/proxy/9090/query
- Grafana: https://jupyter.nersc.gov/user/<username>/proxy/3000/login
Default Grafana login is admin/admin. You'll be prompted to change the password on first login.
- Log in to Grafana
- Add Prometheus as a data source:
  - URL: http://localhost:9090
  - Access: Server (default)
- Import the provided dashboard or create your own (an API-based alternative is sketched below)
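As an alternative to the UI steps above, Grafana's HTTP API can create the data source. This sketch assumes the default `admin`/`admin` credentials and port; `"access": "proxy"` is the API's name for the Server access mode:

```bash
curl -s -u admin:admin \
  -H 'Content-Type: application/json' \
  -X POST http://localhost:3000/api/datasources \
  -d '{"name": "Prometheus", "type": "prometheus", "url": "http://localhost:9090", "access": "proxy"}'
```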
Here’s a preview of the Grafana dashboard:
- Services not starting:
  - Check if ports are already in use (see the commands after this list)
  - Verify you have the necessary permissions
- Cannot access dashboards:
  - Ensure JupyterHub is running
  - Check the service prefix in the URLs
- No metrics appearing:
  - Verify collectors are running
  - Check the Prometheus targets page for errors
  - Check the terminal output for error messages
  - Prometheus logs are available in the Prometheus web UI under Status > Runtime & Build Information
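A few commands that help with the checks above (the ports shown are the defaults):

```bash
# Are the default ports already taken by another process?
ss -tlnp | grep -E ':(9090|3000|8080)\b'
# Are the collectors registered and healthy? Query Prometheus's targets API.
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[a-z]*"'
```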
To stop all services, press Ctrl+C in the terminal where you started them.
To remove all installed components:
```bash
make clean
```

Edit `prometheus_template.yml` to modify the Prometheus configuration, such as:
- Scrape intervals
- Retention policies
- Alert rules
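For example, a scrape-interval tweak might look like the following. This is a sketch: the actual template's structure, job names, and the discovery endpoint path may differ, so compare against `prometheus_template.yml` itself.

```yaml
global:
  scrape_interval: 15s        # how often Prometheus scrapes targets
  evaluation_interval: 15s    # how often alert/recording rules are evaluated

scrape_configs:
  - job_name: compute_nodes
    http_sd_configs:
      # endpoint path is illustrative; use the toolkit's actual discovery URL
      - url: http://localhost:8080/targets
        refresh_interval: 30s
```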
You can modify the container images used in `start_grafana_prometheus.sh`:

- Change `PROM_IMAGE` to use a different Prometheus version
- Change `GF_IMAGE` to use a different Grafana version
- Adjust container parameters as needed
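For instance, pinning specific upstream versions might look like this, assuming the variables hold full image references (the tags shown are illustrative):

```bash
# in start_grafana_prometheus.sh
PROM_IMAGE="prom/prometheus:v2.53.0"   # official Prometheus image, pinned tag
GF_IMAGE="grafana/grafana:11.1.0"      # official Grafana image, pinned tag
```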
If the default ports conflict with other services, you can modify:
- `PROM_PORT` (default: 9090) for Prometheus
- `GF_PORT` (default: 3000) for Grafana
- `HTTP_SD_PORT` (default: 8080) for the HTTP service discovery API
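If the script reads these variables from the environment (check the script; this is an assumption, and otherwise you can edit the values at its top), an override could look like:

```bash
# hypothetical environment override of the default ports
PROM_PORT=9091 GF_PORT=3001 HTTP_SD_PORT=8081 ./start_grafana_prometheus.sh
```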
In `start_node_exporter_collector.sh`, you can enable or disable specific collectors:

- Add or remove `--collector.<name>` flags to customize which system metrics are collected (example after this list)
- Currently enabled collectors include CPU, memory, network devices, InfiniBand, and network stats
- See the node_exporter documentation for all available collectors
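As a concrete illustration of the flag syntax (the toolkit's script may pass a different set; note that node_exporter enables many collectors by default, `--no-collector.<name>` disables one, and `--collector.disable-defaults` lets you opt in to an explicit set):

```bash
# run node_exporter with only an explicit set of collectors enabled
./node_exporter \
  --collector.disable-defaults \
  --collector.cpu \
  --collector.meminfo \
  --collector.netdev \
  --collector.infiniband \
  --collector.netstat
```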
For GPU metrics in `start_dcgm_collector.sh`:

- Modify the `dcgm_metrics.csv` file to select which GPU metrics to collect (example after this list)
- Adjust the collection interval with the `-c` parameter (in milliseconds)
- See the DCGM Exporter documentation for more options
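In dcgm-exporter's documented counters format, the metrics file lists one DCGM field per line as `<field>, <prometheus type>, <help text>`. A minimal example, plus an invocation using the flags dcgm-exporter documents (`-f` for the collectors file, `-c` for the interval in milliseconds); the toolkit's script may wrap this differently:

```bash
# write a minimal metrics file: field, prometheus type, help text per line
cat > dcgm_metrics.csv <<'EOF'
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
EOF
# collect the listed fields every 1000 ms
dcgm-exporter -f dcgm_metrics.csv -c 1000
```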
- Create a new script similar to the existing collector scripts (see `start_dcgm_collector.sh` for reference)
- Register your exporter with the HTTP service discovery API using the appropriate labels (a hypothetical example follows)
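A hypothetical registration request; the actual route, payload schema, and port are defined by the toolkit's FastAPI service, so inspect its source for the real interface:

```bash
# /register, the payload shape, and port 9100 are assumptions for illustration
curl -s -X POST "http://hostname:8080/register" \
  -H 'Content-Type: application/json' \
  -d "{\"targets\": [\"$(hostname):9100\"], \"labels\": {\"job\": \"my_exporter\"}}"
```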
