Role Simulation: Software System Designer - Compilers (AMD) Objective: Analyze the performance of C++ workloads under different GCC optimization levels and identify compiler shortcomings in memory management.
This project implements a Matrix Multiplication benchmark to evaluate the efficiency of the GCC auto-vectorizer (-O3). Using Valgrind Cachegrind and Linux Perf, I identified severe cache thrashing in the compiler-generated code and implemented a Loop Tiling optimization that outperformed the compiler by 420%.
This project simulates the workflow of a Workload Performance Analyst at AMD.
The objective was to evaluate the code generation quality of the GCC compiler (v11.4) at maximum optimization (-O3) for memory-bound workloads. Using Matrix Multiplication as a proxy for intensive scientific computing, I analyzed the generated assembly and memory behavior.
Key Finding: While GCC successfully auto-vectorized the arithmetic instructions (using AVX2), it failed to account for L1 Data Cache latency. The default memory access pattern caused severe cache thrashing, stalling the CPU.
By implementing a manual Loop Tiling (Cache Blocking) algorithm, I achieved a 4.2x speedup over the compiler's best attempt, proving that memory locality optimization is critical for maximizing IPC (Instructions Per Cycle) on x86 architectures.
| Implementation | Compiler Flags | Execution Time | Speedup | Observation |
|---|---|---|---|---|
| Naive (Baseline) | -O0 |
11.72s | 1.0x | No optimization. High overhead. |
| Naive (Compiler) | -O3 |
3.76s | 3.1x | Auto-vectorization active (AVX), but poor cache locality. |
| Tiled (Optimized) | -O3 |
0.88s | 13.3x | L1 Cache Blocking implemented. 4.2x faster than GCC -O3. |
Profiling the Naive -O3 implementation revealed that despite using SIMD instructions, the CPU was stalling on memory fetches.
- Problem: The standard
i-j-kloop access pattern causes non-contiguous memory access for matrix B (column-major traversal). - Consequence: This evicts data from the L1 Data Cache before it can be reused, leading to a high D1 Miss Rate.
I refactored the kernel to process the matrix in small 32x32 blocks.
- Why it works: A 32x32 block of
doubles(8 bytes) takes 8KB, which fits comfortably inside the standard 32KB L1 Data Cache. - Impact: The CPU reuses data already in the L1 cache, significantly reducing latency and increasing IPC (Instructions Per Cycle).
Using valgrind --tool=cachegrind, I verified the improvement in cache behavior:
- Naive Approach: High L1 Data Read Misses.
- Tiled Approach: Near-zero L1 Data Read Misses.
Using valgrind --tool=cachegrind, I isolated the performance hotspot in the Naive implementation:
Ir (Instructions) Source Line
----------------------------------------------------------------
4,296,015,872 (57%) for (int k = 0; k < N; k++) {
3,221,225,472 (42%) sum += A[i * N + k] * B[k * N + j];
Prerequisites:
- Linux (Ubuntu 20.04+) or WSL2
g++(GCC)valgrind(for cache simulation)
Run the Analysis Script:
chmod +x scripts/benchmark.sh
./scripts/benchmark.sh