High-Performance Compiler Workload Analysis

Project Overview

Role Simulation: Software System Designer - Compilers (AMD) Objective: Analyze the performance of C++ workloads under different GCC optimization levels and identify compiler shortcomings in memory management.

This project implements a Matrix Multiplication benchmark to evaluate the efficiency of the GCC auto-vectorizer (-O3). Using Valgrind Cachegrind and Linux Perf, I identified severe cache thrashing in the compiler-generated code and implemented a Loop Tiling optimization that outperformed the compiler by 420%.

Executive Summary

This project simulates the workflow of a Workload Performance Analyst at AMD.

The objective was to evaluate the code generation quality of the GCC compiler (v11.4) at maximum optimization (-O3) for memory-bound workloads. Using Matrix Multiplication as a proxy for intensive scientific computing, I analyzed the generated assembly and memory behavior.

Key Finding: While GCC successfully auto-vectorized the arithmetic instructions (using AVX2), it failed to account for L1 Data Cache latency. The default memory access pattern caused severe cache thrashing, stalling the CPU.

By implementing a manual Loop Tiling (Cache Blocking) algorithm, I achieved a 4.2x speedup over the compiler's best attempt, proving that memory locality optimization is critical for maximizing IPC (Instructions Per Cycle) on x86 architectures.

Performance Results (AMD Ryzen / Intel Core)

Implementation	Compiler Flags	Execution Time	Speedup	Observation
Naive (Baseline)	`-O0`	11.72s	1.0x	No optimization. High overhead.
Naive (Compiler)	`-O3`	3.76s	3.1x	Auto-vectorization active (AVX), but poor cache locality.
Tiled (Optimized)	`-O3`	0.88s	13.3x	L1 Cache Blocking implemented. 4.2x faster than GCC -O3.

Technical Analysis

1. The Bottleneck: Cache Thrashing

Profiling the Naive -O3 implementation revealed that despite using SIMD instructions, the CPU was stalling on memory fetches.

Problem: The standard i-j-k loop access pattern causes non-contiguous memory access for matrix B (column-major traversal).
Consequence: This evicts data from the L1 Data Cache before it can be reused, leading to a high D1 Miss Rate.

2. The Solution: Loop Tiling (Blocking)

I refactored the kernel to process the matrix in small 32x32 blocks.

Why it works: A 32x32 block of doubles (8 bytes) takes 8KB, which fits comfortably inside the standard 32KB L1 Data Cache.
Impact: The CPU reuses data already in the L1 cache, significantly reducing latency and increasing IPC (Instructions Per Cycle).

3. Verification

Using valgrind --tool=cachegrind, I verified the improvement in cache behavior:

Naive Approach: High L1 Data Read Misses.
Tiled Approach: Near-zero L1 Data Read Misses.

Deep Dive Analysis

Using valgrind --tool=cachegrind, I isolated the performance hotspot in the Naive implementation:

Ir (Instructions)   Source Line
----------------------------------------------------------------
4,296,015,872 (57%)   for (int k = 0; k < N; k++) {
3,221,225,472 (42%)       sum += A[i * N + k] * B[k * N + j];

How to Run

Prerequisites:

Linux (Ubuntu 20.04+) or WSL2
g++ (GCC)
valgrind (for cache simulation)

Run the Analysis Script:

chmod +x scripts/benchmark.sh
./scripts/benchmark.sh

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
scripts		scripts
src		src
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

High-Performance Compiler Workload Analysis

Project Overview

Executive Summary

Performance Results (AMD Ryzen / Intel Core)

Technical Analysis

1. The Bottleneck: Cache Thrashing

2. The Solution: Loop Tiling (Blocking)

3. Verification

Deep Dive Analysis

How to Run

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

High-Performance Compiler Workload Analysis

Project Overview

Executive Summary

Performance Results (AMD Ryzen / Intel Core)

Technical Analysis

1. The Bottleneck: Cache Thrashing

2. The Solution: Loop Tiling (Blocking)

3. Verification

Deep Dive Analysis

How to Run

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages