GPU Hot

Real-time NVIDIA GPU monitoring dashboard. Web-based, no SSH required.


Screenshot: GPU Hot dashboard

Usage

Monitor a single machine or an entire cluster with the same Docker image.

Single machine:

docker run -d --gpus all -p 1312:1312 ghcr.io/psalias2006/gpu-hot:latest

Multiple machines:

# On each GPU server
docker run -d --gpus all -p 1312:1312 -e NODE_NAME=$(hostname) ghcr.io/psalias2006/gpu-hot:latest

# On a hub machine (no GPU required)
docker run -d -p 1312:1312 -e GPU_HOT_MODE=hub -e NODE_URLS=http://server1:1312,http://server2:1312,http://server3:1312 ghcr.io/psalias2006/gpu-hot:latest

Open http://localhost:1312

Older GPUs: Add -e NVIDIA_SMI=true if metrics don't appear.

Process monitoring: Add --init --pid=host to see process names. Note: This allows the container to access host process information.

From source:

git clone https://github.com/psalias2006/gpu-hot
cd gpu-hot
docker-compose up --build

Requirements: Docker + NVIDIA Container Toolkit


Features

  • Real-time metrics (sub-second)
  • Automatic multi-GPU detection
  • Process monitoring (PID, memory usage)
  • Historical charts (utilization, temperature, power, clocks)
  • System metrics (CPU, RAM)
  • Scale from 1 to 100+ GPUs

Metrics: Utilization, temperature, memory, power draw, fan speed, clock speeds, PCIe info, P-State, throttle status, encoder/decoder sessions


Configuration

Environment variables:

NVIDIA_VISIBLE_DEVICES=0,1     # Specific GPUs (default: all)
NVIDIA_SMI=true                # Force nvidia-smi mode for older GPUs
GPU_HOT_MODE=hub               # Set to 'hub' for multi-node aggregation (default: single node)
NODE_NAME=gpu-server-1         # Node display name (default: hostname)
NODE_URLS=http://host:1312...  # Comma-separated node URLs (required for hub mode)

Backend (core/config.py):

UPDATE_INTERVAL = 0.5  # Polling interval
PORT = 1312            # Server port

API

HTTP

GET /              # Dashboard
GET /api/gpu-data  # JSON metrics
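
A minimal polling client in Python (a sketch; the URL is the default local address, and the gpus/processes/system keys are assumed from the WebSocket example below):

import time
import requests  # pip install requests

URL = "http://localhost:1312/api/gpu-data"  # adjust host/port for your setup

while True:
    # Fetch the JSON metrics; the payload is assumed to carry the
    # gpus/processes/system keys shown in the WebSocket example below.
    data = requests.get(URL, timeout=5).json()
    print(data.get("gpus"))
    time.sleep(0.5)  # matches the default UPDATE_INTERVAL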

WebSocket

socket.on('gpu_data', (data) => {
  // Updates every 0.5s (configurable)
  // Contains: data.gpus, data.processes, data.system
});
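
For a non-browser consumer, a hedged Python sketch using the python-socketio client (assuming the server exposes a standard Socket.IO endpoint, as the JavaScript handler above suggests):

import socketio  # pip install "python-socketio[client]"

sio = socketio.Client()

@sio.on('gpu_data')
def on_gpu_data(data):
    # Same payload as the browser handler above.
    print(data.get('gpus'))

sio.connect('http://localhost:1312')  # adjust host/port for your setup
sio.wait()  # block and keep receiving updates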

Project Structure

gpu-hot/
├── app.py                      # Flask + WebSocket server
├── core/
│   ├── config.py               # Configuration
│   ├── monitor.py              # NVML GPU monitoring
│   ├── handlers.py             # WebSocket handlers
│   ├── routes.py               # HTTP routes
│   └── metrics/
│       ├── collector.py        # Metrics collection
│       └── utils.py            # Metric utilities
├── static/
│   ├── js/
│   │   ├── charts.js           # Chart configs
│   │   ├── gpu-cards.js        # UI components
│   │   ├── socket-handlers.js  # WebSocket + rendering
│   │   ├── ui.js               # View management
│   │   └── app.js              # Init
│   └── css/styles.css
├── templates/index.html
├── Dockerfile
└── docker-compose.yml

Troubleshooting

No GPUs detected:

nvidia-smi  # Verify drivers work
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi  # Test Docker GPU access

Hub can't connect to nodes:

curl http://node-ip:1312/api/gpu-data  # Test connectivity
sudo ufw allow 1312/tcp                # Check firewall

Performance issues: Increase UPDATE_INTERVAL in core/config.py


Contributing

PRs welcome. Open an issue for major changes.

License

MIT - see LICENSE