Monitor a single machine or an entire cluster with the same Docker image.
Single machine:
docker run -d --gpus all -p 1312:1312 ghcr.io/psalias2006/gpu-hot:latest
Multiple machines:
# On each GPU server
docker run -d --gpus all -p 1312:1312 -e NODE_NAME=$(hostname) ghcr.io/psalias2006/gpu-hot:latest
# On a hub machine (no GPU required)
docker run -d -p 1312:1312 -e GPU_HOT_MODE=hub -e NODE_URLS=http://server1:1312,http://server2:1312,http://server3:1312 ghcr.io/psalias2006/gpu-hot:latest
Open http://localhost:1312
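The hub's NODE_URLS value is a comma-separated list of node base URLs, which gets tedious to type for larger clusters. A minimal Python sketch for generating and checking that string follows. The helper names `build_node_urls` and `parse_node_urls` are hypothetical and not part of gpu-hot; only the URL format and the default port 1312 come from the commands above.

```python
def build_node_urls(hosts, port=1312, scheme="http"):
    """Join host names into a comma-separated string for NODE_URLS."""
    # Drop empty entries and surrounding whitespace before formatting.
    cleaned = [h.strip() for h in hosts if h and h.strip()]
    if not cleaned:
        raise ValueError("at least one host is required")
    return ",".join(f"{scheme}://{h}:{port}" for h in cleaned)


def parse_node_urls(value):
    """Split a NODE_URLS string back into a list of base URLs."""
    # Tolerates spaces after commas and trailing slashes on each URL.
    return [u.strip().rstrip("/") for u in value.split(",") if u.strip()]


if __name__ == "__main__":
    # Example host names taken from the hub command above.
    print(build_node_urls(["server1", "server2", "server3"]))
```

The printed string can be passed to the hub container as `-e NODE_URLS=<value>`.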
Older GPUs: Add -e NVIDIA_SMI=true if metrics don't appear.
Process monitoring: Add --init --pid=host to see process names. Note: This allows the container to access host process information.
From source:
git clone https://github.com/psalias2006/gpu-hot
cd gpu-hot
docker-compose up --build
Requirements: Docker + NVIDIA Container Toolkit
Features:
- Real-time metrics (sub-second)
- Automatic multi-GPU detection
- Process monitoring (PID, memory usage)
- Historical charts (utilization, temperature, power, clocks)
- System metrics (CPU, RAM)
- Scale from 1 to 100+ GPUs
Metrics: Utilization, temperature, memory, power draw, fan speed, clock speeds, PCIe info, P-State, throttle status, encoder/decoder sessions
Environment variables:
NVIDIA_VISIBLE_DEVICES=0,1 # Specific GPUs (default: all)
NVIDIA_SMI=true # Force nvidia-smi mode for older GPUs
GPU_HOT_MODE=hub # Set to 'hub' for multi-node aggregation (default: single node)
NODE_NAME=gpu-server-1 # Node display name (default: hostname)
NODE_URLS=http://host:1312... # Comma-separated node URLs (required for hub mode)
Backend (core/config.py):
UPDATE_INTERVAL = 0.5 # Polling interval
PORT = 1312 # Server port
HTTP endpoints:
GET / # Dashboard
GET /api/gpu-data # JSON metrics
WebSocket:
socket.on('gpu_data', (data) => {
// Updates every 0.5s (configurable)
// Contains: data.gpus, data.processes, data.system
});
Project structure:
gpu-hot/
├── app.py # Flask + WebSocket server
├── core/
│   ├── config.py # Configuration
│   ├── monitor.py # NVML GPU monitoring
│   ├── handlers.py # WebSocket handlers
│   ├── routes.py # HTTP routes
│   └── metrics/
│       ├── collector.py # Metrics collection
│       └── utils.py # Metric utilities
├── static/
│   ├── js/
│   │   ├── charts.js # Chart configs
│   │   ├── gpu-cards.js # UI components
│   │   ├── socket-handlers.js # WebSocket + rendering
│   │   ├── ui.js # View management
│   │   └── app.js # Init
│   └── css/styles.css
├── templates/index.html
├── Dockerfile
└── docker-compose.yml
No GPUs detected:
nvidia-smi # Verify drivers work
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi # Test Docker GPU access
Hub can't connect to nodes:
curl http://node-ip:1312/api/gpu-data # Test connectivity
sudo ufw allow 1312/tcp # Allow port 1312 through the firewall (ufw)
Performance issues: Increase UPDATE_INTERVAL in core/config.py
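The curl connectivity check above can also be scripted when several nodes need testing at once. The following is a rough sketch using only the Python standard library. The function names are hypothetical, and because the response schema of /api/gpu-data is not documented here, the sketch only reports the top-level JSON keys and the round-trip time.

```python
import json
import time
import urllib.request

API_PATH = "/api/gpu-data"  # JSON metrics endpoint listed in this README


def endpoint_for(base_url):
    """Return the metrics URL for a node base URL such as http://node-ip:1312."""
    return base_url.rstrip("/") + API_PATH


def describe_payload(payload):
    """Summarize a decoded JSON payload without assuming its schema."""
    if isinstance(payload, dict):
        return "object with keys: " + ", ".join(sorted(payload))
    return "JSON value of type " + type(payload).__name__


def check_node(base_url, timeout=3.0):
    """Fetch the metrics endpoint once and return (ok, detail)."""
    url = endpoint_for(base_url)
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            payload = json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError) as exc:
        # OSError covers refused connections, DNS failures and timeouts;
        # ValueError covers a response body that is not valid JSON.
        return False, f"{url}: {exc}"
    elapsed_ms = (time.monotonic() - started) * 1000
    return True, f"{url}: {describe_payload(payload)} ({elapsed_ms:.0f} ms)"
```

Calling `check_node("http://node-ip:1312")` from the hub machine returns a success flag and a one-line description. A failure here points to networking or firewall problems rather than the hub configuration.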
PRs welcome. Open an issue for major changes.
MIT - see LICENSE
