Reyansh00/CUDA_Matrix_Multiplication_Cross_Architecture_Study


CUDA_Matrix_Multiplication_Cross_Architecture_Study

A practical study of GPU kernel optimization using matrix multiplication as a case study, with progressive CUDA kernels and profiling-driven performance comparisons between RTX 3090 and RTX 4050 GPUs.

Walkthrough, profiling, and findings

🎯 Performance Comparison

Matrix Size: 4096 × 4096

| Kernel | GFLOPs/s (RTX 3090) | GFLOPs/s (RTX 4050) | % of cuBLAS (RTX 3090) | % of cuBLAS (RTX 4050) |
|---|---|---|---|---|
| 1: Naive | 309.0 | 103.6 | 1.3% | 1.8% |
| 2: GMEM Coalescing | 1986.5 | 671.3 | 8.5% | 12.5% |
| 3: SMEM Caching | 2980.3 | 929.8 | 12.8% | 15.8% |
| 4: 1D Blocktiling | 8474.7 | 2484.8 | 36.5% | 42.1% |
| 5: 2D Blocktiling | 15971.7 | 5347.2 | 68.7% | 90.6% |
| 6: Vectorized Mem Access | 18237.3 | 5353.9 | 78.4% | 90.7% |
| 0: cuBLAS | 23249.6 | 5899.5 | 100.0% | 100.0% |
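
For readers who want to reproduce the two derived columns, here is a minimal host-side sketch. It assumes the usual SGEMM convention of counting 2·N³ floating-point operations for an N × N multiplication; whether the benchmarking harness in `src/run.cu` counts FLOPs exactly this way is an assumption, and the function names are hypothetical.

```cpp
// Host-only helpers; plain C++ that also compiles inside a .cu file.

// FLOPs for C = A * B with square N x N matrices:
// N^3 multiply-add pairs, counted as 2 * N^3 operations (assumed convention).
double sgemm_flops(long long n) {
    return 2.0 * static_cast<double>(n) * static_cast<double>(n) * static_cast<double>(n);
}

// Throughput in GFLOPs/s for one kernel run that took `seconds`.
double gflops_per_sec(long long n, double seconds) {
    return sgemm_flops(n) / seconds * 1e-9;
}

// Throughput of a custom kernel as a percentage of the cuBLAS throughput
// measured on the same GPU and matrix size.
double percent_of_cublas(double kernel_gflops, double cublas_gflops) {
    return 100.0 * kernel_gflops / cublas_gflops;
}
```

For example, `percent_of_cublas(18237.3, 23249.6)` takes the kernel 6 and cuBLAS throughputs measured on the RTX 3090 and yields roughly 78.4.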

🎯 Key Results & Observations

  • Kernel 6 reaches 90.7% of cuBLAS throughput on the RTX 4050 but only 78.4% on the RTX 3090. The constrained mobile GPU leaves less room between the custom kernel and cuBLAS, while the desktop GPU's higher peak capabilities amplify the remaining kernel inefficiencies.

  • Performance on the RTX 3090 keeps scaling as the matrix size grows from 128 to 4096, whereas the RTX 4050 peaks at 2048 and drops at 4096 due to architectural limits; ANALYSIS.md provides the detailed analysis.

  • Memory bandwidth is identified as the primary bottleneck on the mobile GPU (RTX 4050).
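
One way to reason about a memory-bandwidth bottleneck is a simple roofline estimate. The sketch below is not taken from this repository: it assumes a block-tiled SGEMM in which each thread block computes a BM × BN tile of C and walks the K dimension in chunks of BK, and it ignores caches and the write-back of C. All function and parameter names are hypothetical.

```cpp
#include <algorithm>

// Arithmetic intensity (FLOPs per byte of global-memory reads) for one K-chunk
// of a block tile: the block reads BM*BK floats of A and BK*BN floats of B
// (4 bytes each) and performs 2*BM*BN*BK FLOPs.
// The ratio simplifies to BM*BN / (2*(BM+BN)), so it does not depend on BK.
double tile_arithmetic_intensity(int bm, int bn, int bk) {
    double flops = 2.0 * bm * bn * bk;
    double bytes = 4.0 * (static_cast<double>(bm) * bk + static_cast<double>(bk) * bn);
    return flops / bytes;
}

// Roofline bound: attainable throughput is capped either by the compute peak
// or by memory bandwidth multiplied by arithmetic intensity.
// Units: GFLOPs/s for the peak, GB/s for the bandwidth, FLOPs/byte for intensity.
double roofline_gflops(double peak_gflops, double bandwidth_gbs, double intensity) {
    return std::min(peak_gflops, bandwidth_gbs * intensity);
}
```

With made-up numbers, `roofline_gflops(1000.0, 100.0, 4.0)` returns 400 (bandwidth-bound), while raising the intensity to 16 returns the 1000 GFLOPs/s compute peak. To apply this to a real device, substitute the peak throughput and bandwidth from docs/hardware.md.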

📁 Project Structure

build/                  # Generated CMake build directory (ignored in VCS)
docs/
├── nsight_images/      # Nsight Compute screenshots used as visual evidence
└── hardware.md         # Detailed GPU hardware specifications and comparison
results/                # Per-kernel outputs and logs
src/
├── kernels/            # Progressive SGEMM kernel implementations
│   ├── 1_naive.cuh            # Baseline naive global memory kernel
│   ├── 2_global_mem.cuh       # Improved global memory access patterns
│   ├── 3_shared_mem.cuh       # Shared memory tiling
│   ├── 4_1D_blocktiling.cuh   # 1D block tiling optimization
│   ├── 5_2D_blocktiling.cuh   # 2D block tiling (each thread computes a 2D tile of outputs)
│   └── 6_vectorize.cuh        # Vectorized loads/stores (float4)
├── kernels.cuh         # Kernel registration and dispatch logic
├── run.cu              # Kernel launcher + benchmarking harness
└── run.cuh             # Launcher declarations
main/
├── sgemm.cu            # Program entry point (custom kernels)
└── cuBLAS_sgemm.cu     # cuBLAS SGEMM reference implementation
ANALYSIS.md             # Detailed performance + architectural analysis
README.md               # High-level project overview and key findings
LICENSE                 # License
CMakeLists.txt          # CMake build configuration
build.sh                # Convenience build script

🔑 Key Learnings

  1. Mobile GPUs reach architectural bottlenecks earlier than desktop GPUs, causing performance to saturate or decline at large problem sizes despite similar kernel optimizations.

  2. Optimization strategies that reduce global memory traffic, such as tiling, data reuse, and register blocking, are disproportionately effective on bandwidth- and power-constrained devices.

  3. Performance scaling behavior differs significantly across architectures, making profiling-based analysis essential to distinguish kernel inefficiencies from hardware-imposed limits.
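
To make the second point concrete, the following back-of-the-envelope model counts how many floats are read from global memory with and without shared-memory tiling. It is an idealized sketch rather than a measurement from this project: it ignores hardware caches and coalescing, assumes square T × T tiles with N divisible by T, and uses hypothetical function names.

```cpp
// Naive kernel model: each of the N*N output elements reads a full row of A
// and a full column of B from global memory, i.e. 2*N floats per output.
double naive_global_loads(long long n) {
    return 2.0 * static_cast<double>(n) * static_cast<double>(n) * static_cast<double>(n);
}

// Tiled kernel model: the output is split into (N/T)^2 blocks. Each block
// walks the K dimension in N/T steps and stages one T x T tile of A and one
// T x T tile of B in shared memory per step (2*T*T floats).
// Assumes n is divisible by t.
double tiled_global_loads(long long n, long long t) {
    double blocks = static_cast<double>(n / t) * static_cast<double>(n / t);
    double steps = static_cast<double>(n / t);
    return blocks * steps * 2.0 * static_cast<double>(t) * static_cast<double>(t);
}
```

Under this model the tiled count is 2·N³/T, so for N = 4096 and T = 32 the global-memory reads drop by a factor of 32. Register blocking applies the same reuse idea one level lower, between shared memory and registers.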

🙏 Acknowledgments

Base implementations adapted from https://github.com/siboehm/SGEMM_CUDA. This project focuses on analysis, profiling, and cross-architecture experimentation, providing architectural insights through progressive SGEMM kernel development on the RTX 4050.
