This repository contains research projects focused on analyzing and reproducing performance benchmarks for distributed deep learning frameworks on high-performance computing (HPC) clusters, with particular emphasis on NVIDIA H100 GPU architectures.
This repository serves as a comprehensive resource for:
- Performance analysis of state-of-the-art GPU architectures
- Benchmarking frameworks for deep learning workloads
- Reproducible experiments on HPC clusters
- Scaling studies for distributed training
```
ai_high_performance_computing/
├── 1_performance_analysis_cnn_marcel_2025/
│   └── Large scale performance analysis of distributed deep learning frameworks for convolutional neural networks.pdf
└── 2_h100_scaling_performance_jack_2023/
    ├── NVIDIA_Hopper_H100_GPU_Scaling_Performance.pdf
    ├── analysis_plan.md
    ├── cluster_setup.py
    ├── benchmark_resnet50.py
    └── requirements.txt
```
Project 1 (1_performance_analysis_cnn_marcel_2025)
Focus: Large-scale performance analysis of distributed deep learning frameworks for convolutional neural networks
Key Areas:
- Framework comparison (PyTorch, TensorFlow, JAX)
- Distributed training efficiency
- Memory optimization strategies
- Communication overhead analysis
Project 2 (2_h100_scaling_performance_jack_2023)
Focus: NVIDIA H100 Hopper architecture scaling performance analysis and reproduction
Key Features:
- Multi-GPU scaling benchmarks
- Memory bandwidth utilization
- NVLink interconnect performance
- Power efficiency analysis
H100 Key Specifications:
- Architecture: Hopper (4th Gen Tensor Cores)
- Memory: 80GB HBM3 (3.35 TB/s bandwidth)
- Compute: 989 TFLOPS (FP16), 1979 TOPS (INT8)
- Interconnect: NVLink 4.0 (900 GB/s)
- Process Node: TSMC 4N (4nm)
H100 Architectural Highlights:
- Transformer Engine: Optimized for large language models
- Multi-Instance GPU (MIG): Up to 7 isolated instances
- Confidential Computing: Hardware-level security
- Enhanced Memory: HBM3 with roughly 1.6-2x the bandwidth of the A100's HBM2e (3.35 TB/s vs. 1.6-2.0 TB/s); a rough measurement sketch follows below
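As a rough sanity check of the memory-bandwidth figure above, a large device-to-device copy can be timed from PyTorch. This is an illustrative sketch (assuming PyTorch with a CUDA GPU), not part of the repository's scripts; achieved bandwidth will always fall short of the 3.35 TB/s peak.

```python
import time
import torch

# Rough effective HBM bandwidth check (illustrative sketch, assumes a CUDA GPU)
n = 1 << 28                                   # ~268M float32 elements, about 1 GiB
src = torch.empty(n, dtype=torch.float32, device="cuda")
dst = torch.empty_like(src)

torch.cuda.synchronize()
start = time.perf_counter()
iters = 20
for _ in range(iters):
    dst.copy_(src)                            # device-to-device copy: one read + one write
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

bytes_moved = iters * 2 * src.numel() * src.element_size()
print(f"Effective bandwidth: ~{bytes_moved / elapsed / 1e12:.2f} TB/s")
```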
High Performance Computing (HPC) refers to the practice of aggregating computing power to deliver higher performance than typical desktop computers and workstations. In AI/ML contexts, HPC enables:
- Parallel Processing: Distributing workloads across multiple GPUs/nodes
- Scalable Training: Training larger models with massive datasets
- Reduced Time-to-Solution: Faster experimentation and iteration
- Resource Optimization: Efficient utilization of expensive hardware
Compute Nodes:
- High-performance servers with multiple GPUs
- Typically 4-8 H100 GPUs per node
- High-bandwidth memory and fast storage

Interconnects:
- InfiniBand: Low-latency, high-bandwidth networking (200-400 Gbps)
- NVLink: Direct GPU-to-GPU communication (900 GB/s per H100; used by NCCL in the sketch below)
- Ethernet: Cost-effective option for less demanding workloads

Storage:
- Parallel File Systems: Lustre, GPFS for high-throughput data access
- NVMe Storage: Fast local storage for temporary data
- Object Storage: S3-compatible systems for dataset storage

Job Scheduling:
- SLURM: Most common HPC job scheduler
- PBS/Torque: Alternative scheduling systems
- Kubernetes: Container orchestration for ML workloads
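In training code, these pieces come together when each GPU process joins a communication group. The sketch below shows a minimal PyTorch DistributedDataParallel setup with the NCCL backend, assuming a torchrun or SLURM launcher that exports RANK, WORLD_SIZE, and LOCAL_RANK; it is illustrative, not code shipped in this repository.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_distributed():
    """Join the NCCL process group; NCCL routes traffic over NVLink/InfiniBand when present."""
    dist.init_process_group(backend="nccl")   # reads RANK/WORLD_SIZE from the environment
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return local_rank

# Typical usage inside a training script (model defined elsewhere):
#   local_rank = setup_distributed()
#   model = DDP(model.to(local_rank), device_ids=[local_rank])
#   ... training loop ...
#   dist.destroy_process_group()
```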
Ideal (Linear) Scaling:
```
1 GPU:  100 samples/sec
2 GPUs: 200 samples/sec (2.0x speedup)
4 GPUs: 400 samples/sec (4.0x speedup)
8 GPUs: 800 samples/sec (8.0x speedup)
```

Realistic Scaling (see the efficiency sketch below):
```
1 GPU:  100 samples/sec
2 GPUs: 190 samples/sec (1.9x speedup) - 95% efficiency
4 GPUs: 360 samples/sec (3.6x speedup) - 90% efficiency
8 GPUs: 680 samples/sec (6.8x speedup) - 85% efficiency
```
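The speedup and efficiency figures above follow directly from throughput ratios. The small helper below (an illustrative sketch, not a repository script) reproduces the 8-GPU numbers.

```python
def scaling_stats(throughput, baseline, n_gpus):
    """Speedup and parallel efficiency relative to a single-GPU baseline."""
    speedup = throughput / baseline
    return speedup, speedup / n_gpus

# Realistic 8-GPU case from the figures above: 680 vs. 100 samples/sec
speedup, efficiency = scaling_stats(680.0, 100.0, 8)
print(f"{speedup:.1f}x speedup, {efficiency:.0%} efficiency")  # 6.8x speedup, 85% efficiency
```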
Scaling Bottlenecks:
- Communication overhead between GPUs
- Memory bandwidth limitations
- Load balancing inefficiencies
- Framework-specific overheads and optimization differences
Prerequisites:
- NVIDIA H100 GPUs (recommended) or compatible CUDA GPUs
- CUDA Toolkit 11.8+
- Python 3.8+
- High-speed interconnect (NVLink/InfiniBand preferred)
- Clone the repository:
```bash
git clone https://github.com/mirjunaid26/ai_high_performance_computing.git
cd ai_high_performance_computing
```
- Navigate to the H100 project:
```bash
cd 2_h100_scaling_performance_jack_2023
```
- Install dependencies:
```bash
pip install -r requirements.txt
```
- Check the cluster configuration:
```bash
python cluster_setup.py
```
- Run the ResNet-50 scaling benchmark:
```bash
python benchmark_resnet50.py --batch-size 256 --epochs 5
```

Available Benchmarks:
- ResNet-50: Image classification scaling performance (see the model sketch after this list)
- BERT: Natural language processing transformer scaling
- GPT Models: Large language model training efficiency
- Vision Transformer: Computer vision transformer scaling
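For orientation, the ResNet-50 model used in this kind of benchmark is typically the torchvision reference implementation. The snippet below is a hedged sketch of a single forward pass (assuming a recent torchvision), not an excerpt of benchmark_resnet50.py.

```python
import torch
from torchvision.models import resnet50

# Illustrative forward pass only (hypothetical snippet, not benchmark_resnet50.py)
model = resnet50(weights=None).cuda().eval()            # random weights: benchmarks measure speed, not accuracy
inputs = torch.randn(256, 3, 224, 224, device="cuda")   # one batch of 256 ImageNet-sized images

with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    outputs = model(inputs)
print(outputs.shape)  # torch.Size([256, 1000])
```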
Performance Metrics:
- Throughput: Samples/second, tokens/second (see the timing sketch after this list)
- Memory Utilization: GPU memory usage patterns
- Scaling Efficiency: Speedup relative to single GPU
- Power Consumption: Energy efficiency analysis
- Communication Overhead: Inter-GPU data transfer costs
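For throughput and memory utilization, a timing wrapper around the training step is usually sufficient. The following is an illustrative sketch (assuming PyTorch and a user-supplied `step_fn` callable), not the repository's measurement code.

```python
import time
import torch

def measure_throughput(step_fn, batch_size, warmup=10, iters=50):
    """Return samples/sec for a callable that runs one training step on the GPU."""
    for _ in range(warmup):                   # let cuDNN autotuning and allocator caching settle
        step_fn()
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    torch.cuda.synchronize()                  # wait for queued GPU work before stopping the clock
    elapsed = time.perf_counter() - start

    peak_gb = torch.cuda.max_memory_allocated() / 1e9
    throughput = iters * batch_size / elapsed
    print(f"{throughput:.0f} samples/s, peak GPU memory {peak_gb:.1f} GB")
    return throughput
```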
The cluster_setup.py script automatically detects:
- Available H100 GPUs
- System memory and CPU specifications
- CUDA compatibility
- Network topology
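The GPU-related checks amount to querying CUDA device properties. A minimal sketch of that kind of inventory (illustrative only, not the actual contents of cluster_setup.py) is shown below.

```python
import torch

# Illustrative GPU inventory check (not the actual cluster_setup.py logic)
if not torch.cuda.is_available():
    raise SystemExit("No CUDA-capable GPU detected")

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    is_h100 = "H100" in props.name
    print(f"GPU {i}: {props.name} | {props.total_memory / 1e9:.0f} GB | "
          f"{props.multi_processor_count} SMs | H100: {is_h100}")

print(f"CUDA version reported by PyTorch: {torch.version.cuda}")
```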
Key parameters for reproducible results:
- Batch Size: Optimized for GPU memory capacity
- Model Size: Scaled appropriately for available resources
- Data Pipeline: Efficient data loading and preprocessing
- Mixed Precision: FP16/BF16 for optimal performance (see the sketch below)
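Mixed precision is typically enabled with PyTorch autocast plus gradient scaling for FP16. The step below is a hedged sketch in which model, optimizer, and loss_fn are assumed to exist; it is not the benchmark's actual training loop.

```python
import torch

scaler = torch.cuda.amp.GradScaler()          # scales FP16 gradients to avoid underflow

def train_step(model, optimizer, loss_fn, inputs, targets):
    optimizer.zero_grad(set_to_none=True)
    # Run the forward pass in FP16 (BF16 also works on H100 and needs no GradScaler)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()             # backward pass on the scaled loss
    scaler.step(optimizer)                    # unscales gradients, then runs the optimizer step
    scaler.update()
    return loss.item()
```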
Based on NVIDIA specifications and research literature:
| Model | Single H100 | 2x H100 | 4x H100 | 8x H100 |
|---|---|---|---|---|
| ResNet-50 | ~2,000 img/s | ~3,800 img/s | ~7,200 img/s | ~13,600 img/s |
| BERT-Base | ~1,200 seq/s | ~2,280 seq/s | ~4,320 seq/s | ~8,160 seq/s |
| GPT-3 175B | Memory bound | Distributed | Multi-node | Multi-node |
Contributions are welcome:
- Fork the repository
- Create a feature branch (`git checkout -b feature/new-benchmark`)
- Commit your changes (`git commit -am 'Add new benchmark'`)
- Push to the branch (`git push origin feature/new-benchmark`)
- Create a Pull Request
- "Large scale performance analysis of distributed deep learning frameworks for convolutional neural networks" (2025)
- "NVIDIA Hopper H100 GPU Scaling Performance" (2023)
This project is licensed under the MIT License - see the LICENSE file for details.
For questions about this research or collaboration opportunities:
- Repository: ai_high_performance_computing
- Issues: Use GitHub Issues for bug reports and feature requests
Note: This repository is designed for research and educational purposes. Performance results may vary based on hardware configuration, software versions, and system optimization.