A practical study of GPU kernel optimization using matrix multiplication as a case study, with progressive CUDA kernels and profiling-driven performance comparisons between RTX 3090 and RTX 4050 GPUs.
Walkthrough, profiling, and findings
Matrix Size: 4096 × 4096
| Kernel | GFLOPs/s (RTX 3090) | GFLOPs/s (RTX 4050) | % of cuBLAS (RTX 3090) | % of cuBLAS (RTX 4050) |
|---|---|---|---|---|
| 1: Naive | 309.0 | 103.6 | 1.3% | 1.8% |
| 2: GMEM Coalescing | 1986.5 | 671.3 | 8.5% | 12.5% |
| 3: SMEM Caching | 2980.3 | 929.8 | 12.8% | 15.8% |
| 4: 1D Blocktiling | 8474.7 | 2484.8 | 36.5% | 42.1% |
| 5: 2D Blocktiling | 15971.7 | 5347.2 | 68.7% | 90.6% |
| 6: Vectorized Mem Access | 18237.3 | 5353.9 | 78.4% | 90.7% |
| 0: cuBLAS | 23249.6 | 5899.5 | 100.0% | 100.0% |
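
The throughput and percentage columns can be reproduced with a few lines of host-side C++. This is a minimal sketch, assuming the common SGEMM convention of counting 2·N³ floating-point operations for an N × N multiply. The benchmarking harness in `src/run.cu` is not shown here, so the helper names `sgemm_gflops` and `percent_of_cublas` are hypothetical.

```cpp
#include <cassert>

// Hypothetical helper: GFLOP/s for an N x N x N SGEMM, assuming the usual
// count of 2*N^3 floating-point operations (one multiply and one add per term).
double sgemm_gflops(long long n, double seconds) {
    assert(n > 0 && seconds > 0.0);
    double flops = 2.0 * static_cast<double>(n) * n * n;
    return flops / seconds / 1e9;
}

// Hypothetical helper: kernel throughput as a percentage of the cuBLAS reference.
double percent_of_cublas(double kernel_gflops, double cublas_gflops) {
    assert(cublas_gflops > 0.0);
    return 100.0 * kernel_gflops / cublas_gflops;
}
```

For example, `percent_of_cublas(18237.3, 23249.6)` evaluates to roughly 78.4, matching the RTX 3090 entry for kernel 6. An illustrative (not measured) runtime of 10 ms at N = 4096 would correspond to about 13,744 GFLOP/s under this convention.
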
- Achieved 90.7% of cuBLAS on RTX 4050 vs. 78.4% on RTX 3090 (kernel 6). The constrained mobile GPU narrows the performance gap, while the desktop GPU’s higher peak capabilities amplify remaining kernel inefficiencies relative to cuBLAS.
- RTX 3090 scales as expected across matrix sizes from 128 to 4096, whereas RTX 4050 peaks at 2048 and drops at 4096 due to architectural limits; detailed analysis is provided in ANALYSIS.md.
- Identified memory bandwidth as the primary bottleneck on the mobile GPU.
- Hardware Specifications - GPU comparison
- Nsight Compute Evidence - Profiling screenshots used as supporting evidence for the analysis
build/ # Generated CMake build directory (ignored in VCS)
docs/
├── nsight_images/ # Nsight Compute screenshots used as visual evidence
└── hardware.md # Detailed GPU hardware specifications and comparison
results/ # Per-kernel outputs and logs
src/
├── kernels/ # Progressive SGEMM kernel implementations
│ ├── 1_naive.cuh # Baseline naive global memory kernel
│ ├── 2_global_mem.cuh # Improved global memory access patterns
│ ├── 3_shared_mem.cuh # Shared memory tiling
│ ├── 4_1D_blocktiling.cuh # 1D block tiling optimization
│ ├── 5_2D_blocktiling.cuh # 2D block tiling for higher arithmetic intensity
│ └── 6_vectorize.cuh # Vectorized loads/stores (float4)
├── kernels.cuh # Kernel registration and dispatch logic
├── run.cu # Kernel launcher + benchmarking harness
└── run.cuh # Launcher declarations
main/
├── sgemm.cu # Program entry point (custom kernels)
└── cuBLAS_sgemm.cu # cuBLAS SGEMM reference implementation
ANALYSIS.md # Detailed performance + architectural analysis
README.md # High-level project overview and key findings
LICENSE # License
CMakeLists.txt # CMake build configuration
build.sh # Convenience build script
- Mobile GPUs reach architectural bottlenecks earlier than desktop GPUs, causing performance to saturate or decline at large problem sizes despite similar kernel optimizations.
- Optimization strategies that reduce global memory traffic, such as tiling, data reuse, and register blocking, are disproportionately effective on bandwidth- and power-constrained devices.
- Performance scaling behavior differs significantly across architectures, making profiling-based analysis essential to distinguish kernel inefficiencies from hardware-imposed limits.
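
To see why tiling and data reuse cut global memory traffic, a rough back-of-the-envelope model helps. The sketch below is an idealized estimate, not code from this repository: it assumes each thread block computes a `tile` × `tile` patch of C, reads the matching strips of A and B from global memory exactly once, and ignores caches and the writes to C. The function names and tile sizes are hypothetical and chosen only for illustration.

```cpp
#include <cassert>

// Idealized count of floats read from global memory for an N x N SGEMM when
// each thread block computes a tile x tile patch of C and stages the matching
// strips of A and B in on-chip memory. tile == 1 corresponds to no reuse.
// Assumption: caches and the writes to C are ignored.
double global_loads(long long n, long long tile) {
    assert(n > 0 && tile > 0 && n % tile == 0);
    double blocks = static_cast<double>(n / tile) * static_cast<double>(n / tile);
    // Each block reads a (tile x n) strip of A and an (n x tile) strip of B.
    double loads_per_block = 2.0 * static_cast<double>(tile) * static_cast<double>(n);
    return blocks * loads_per_block;
}

// FLOPs performed per byte fetched from global memory (4-byte floats),
// using the usual 2*N^3 FLOP count for SGEMM.
double arithmetic_intensity(long long n, long long tile) {
    double flops = 2.0 * static_cast<double>(n) * n * n;
    return flops / (4.0 * global_loads(n, tile));
}
```

In this simplified model the intensity works out to `tile / 4` FLOPs per byte: 0.25 with no reuse and 8 for a 32 × 32 tile at N = 4096. Each increase in tile size lowers the bandwidth needed to sustain a given FLOP rate, which is consistent with the larger relative gains observed on the bandwidth-constrained GPU.
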
Base implementations adapted from https://github.com/siboehm/SGEMM_CUDA. This project focuses on analysis, profiling, and cross-architecture experimentation, providing architectural insights through progressive SGEMM kernel development on the RTX 4050.