CUDA_Matrix_Multiplication_Cross_Architecture_Study

A practical study of GPU kernel optimization using matrix multiplication as a case study, with progressive CUDA kernels and profiling-driven performance comparisons between RTX 3090 and RTX 4050 GPUs.

📊 Full Analysis & Report →

Walkthrough, profiling, and findings

🎯 Performance Comparison

Matrix Size: 4096 × 4096

Kernel	GFLOP/s (RTX 3090)	GFLOP/s (RTX 4050)	% of cuBLAS (RTX 3090)	% of cuBLAS (RTX 4050)
1: Naive	309.0	103.6	1.3%	1.8%
2: GMEM Coalescing	1986.5	671.3	8.5%	12.5%
3: SMEM Caching	2980.3	929.8	12.8%	15.8%
4: 1D Blocktiling	8474.7	2484.8	36.5%	42.1%
5: 2D Blocktiling	15971.7	5347.2	68.7%	90.6%
6: Vectorized Mem Access	18237.3	5353.9	78.4%	90.7%
0: cuBLAS	23249.6	5899.5	100.0%	100.0%
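The two metrics in the table can be reproduced with a few lines of host code. The sketch below assumes the common SGEMM convention of counting 2·N³ floating-point operations for an N × N product; the README does not state the exact formula used by the benchmarking harness, and the function names are illustrative rather than taken from the repository. As a spot check, 18237.3 / 23249.6 gives the 78.4% shown for kernel 6 on the RTX 3090.

```cpp
#include <cassert>
#include <cstdint>

// FLOPs for C = A * B with square N x N matrices: N^3 multiplies + N^3 adds.
// Assumption: the lower-order O(N^2) work from alpha/beta scaling is ignored.
double sgemm_flops(std::int64_t n) {
    assert(n > 0);
    const double nd = static_cast<double>(n);
    return 2.0 * nd * nd * nd;
}

// Throughput in GFLOP/s given the average time of one kernel run in seconds.
double gflops_per_second(std::int64_t n, double seconds) {
    assert(seconds > 0.0);
    return sgemm_flops(n) / seconds / 1e9;
}

// Throughput relative to a reference (for example cuBLAS), as a percentage.
double percent_of_reference(double kernel_gflops, double reference_gflops) {
    assert(reference_gflops > 0.0);
    return 100.0 * kernel_gflops / reference_gflops;
}
```

Calling `percent_of_reference(5353.9, 5899.5)` with the kernel 6 and cuBLAS values for the RTX 4050 yields roughly 90.7, in line with the table.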

🎯 Key Results & Observations

Kernel 6 reaches 90.7% of cuBLAS throughput on the RTX 4050 versus 78.4% on the RTX 3090. The constrained mobile GPU narrows the gap to cuBLAS, while the desktop GPU's higher peak capability amplifies the remaining kernel inefficiencies.
RTX 3090 throughput scales steadily with matrix size from 128 to 4096, whereas the RTX 4050 peaks at 2048 and drops at 4096 due to architectural limits; ANALYSIS.md provides the detailed analysis.
Memory bandwidth was identified as the primary bottleneck on the mobile GPU (RTX 4050).
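One common way to reason about a bandwidth bottleneck is a roofline-style bound. This is offered only as a simplified model: the project's own conclusion rests on Nsight Compute profiling, not on this calculation. The function names and parameters below are hypothetical, and real peak throughput and bandwidth figures would have to come from the hardware specifications rather than from this sketch.

```cpp
#include <algorithm>
#include <cassert>

// Roofline bound: attainable throughput is capped either by the compute peak
// or by memory bandwidth times arithmetic intensity (FLOP per byte of DRAM traffic).
double roofline_gflops(double peak_gflops, double bandwidth_gb_per_s,
                       double flop_per_byte) {
    assert(peak_gflops > 0.0 && bandwidth_gb_per_s > 0.0 && flop_per_byte > 0.0);
    return std::min(peak_gflops, bandwidth_gb_per_s * flop_per_byte);
}

// True when the bandwidth ceiling, not the compute ceiling, sets the bound.
bool is_bandwidth_bound(double peak_gflops, double bandwidth_gb_per_s,
                        double flop_per_byte) {
    return bandwidth_gb_per_s * flop_per_byte < peak_gflops;
}
```

Under this model, a kernel on a device with lower bandwidth needs a higher arithmetic intensity before it stops being bandwidth-bound, which is one way to read why memory-traffic optimizations matter more on the mobile part.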

Quick Links:

Hardware Specifications - GPU comparison
Nsight Compute Evidence - Profiling screenshots used as supporting evidence for the analysis

📁 Project Structure

build/                  # Generated CMake build directory (ignored in VCS)
docs/
├── nsight_images/      # Nsight Compute screenshots used as visual evidence
└── hardware.md         # Detailed GPU hardware specifications and comparison
results/                # Per-kernel outputs and logs
src/
├── kernels/            # Progressive SGEMM kernel implementations
│   ├── 1_naive.cuh            # Baseline naive global memory kernel
│   ├── 2_global_mem.cuh       # Improved global memory access patterns
│   ├── 3_shared_mem.cuh       # Shared memory tiling
│   ├── 4_1D_blocktiling.cuh   # 1D block tiling optimization
│   ├── 5_2D_blocktiling.cuh   # 2D block tiling for higher arithmetic intensity
│   └── 6_vectorize.cuh        # Vectorized loads/stores (float4)
├── kernels.cuh         # Kernel registration and dispatch logic
├── run.cu              # Kernel launcher + benchmarking harness
└── run.cuh             # Launcher declarations
main/
├── sgemm.cu            # Program entry point (custom kernels)
└── cuBLAS_sgemm.cu     # cuBLAS SGEMM reference implementation
ANALYSIS.md             # Detailed performance + architectural analysis
README.md               # High-level project overview and key findings
LICENSE                 # License
CMakeLists.txt          # CMake build configuration
build.sh                # Convenience build script

🔑 Key Learnings

Mobile GPUs reach architectural bottlenecks earlier than desktop GPUs, causing performance to saturate or decline at large problem sizes despite similar kernel optimizations.
Optimization strategies that reduce global memory traffic, such as tiling, data reuse, and register blocking, are disproportionately effective on bandwidth- and power-constrained devices.
Performance scaling behavior differs significantly across architectures, making profiling-based analysis essential to distinguish kernel inefficiencies from hardware-imposed limits.
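The effect of tiling on global memory traffic can be estimated with simple counting. The sketch below is an idealized model, not the repository's kernels: it ignores hardware caches, coalescing, and the writes to C, and the tile size of 32 used in the usage note is a hypothetical choice. All function names are illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Global-memory loads (in floats) for an N x N product when every output
// element reads its full row of A and column of B with no reuse.
std::int64_t loads_without_reuse(std::int64_t n) {
    assert(n > 0);
    return 2 * n * n * n;
}

// Loads when each thread block stages T x T tiles of A and B in shared
// memory, so each loaded value is reused T times within the block.
std::int64_t loads_with_tiling(std::int64_t n, std::int64_t tile) {
    assert(n > 0 && tile > 0 && n % tile == 0);
    const std::int64_t blocks = (n / tile) * (n / tile);   // output tiles
    const std::int64_t steps_per_block = n / tile;         // steps along K
    const std::int64_t floats_per_step = 2 * tile * tile;  // one tile of A, one of B
    return blocks * steps_per_block * floats_per_step;
}

// Arithmetic intensity in FLOP per byte of global loads (4 bytes per float).
double flop_per_byte(std::int64_t n, std::int64_t loads) {
    assert(loads > 0);
    const double nd = static_cast<double>(n);
    return (2.0 * nd * nd * nd) / (4.0 * static_cast<double>(loads));
}
```

For N = 4096 and a 32 × 32 tile, the modeled load count falls by a factor of 32 and the arithmetic intensity rises from 0.25 to 8 FLOP per byte. Register blocking extends the same idea by reusing values held in registers.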

🙏 Acknowledgments

Base implementations adapted from https://github.com/siboehm/SGEMM_CUDA. This project focuses on analysis, profiling, and cross-architecture experimentation, providing architectural insights through progressive SGEMM kernel development on the RTX 4050.
