Skip to content

manohar6317/AMD-Compiler-Performance-Analysis

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 

Repository files navigation

High-Performance Compiler Workload Analysis

Project Overview

Role Simulation: Software System Designer - Compilers (AMD) Objective: Analyze the performance of C++ workloads under different GCC optimization levels and identify compiler shortcomings in memory management.

This project implements a Matrix Multiplication benchmark to evaluate the efficiency of the GCC auto-vectorizer (-O3). Using Valgrind Cachegrind and Linux Perf, I identified severe cache thrashing in the compiler-generated code and implemented a Loop Tiling optimization that outperformed the compiler by 420%.

Executive Summary

This project simulates the workflow of a Workload Performance Analyst at AMD.

The objective was to evaluate the code generation quality of the GCC compiler (v11.4) at maximum optimization (-O3) for memory-bound workloads. Using Matrix Multiplication as a proxy for intensive scientific computing, I analyzed the generated assembly and memory behavior.

Key Finding: While GCC successfully auto-vectorized the arithmetic instructions (using AVX2), it failed to account for L1 Data Cache latency. The default memory access pattern caused severe cache thrashing, stalling the CPU.

By implementing a manual Loop Tiling (Cache Blocking) algorithm, I achieved a 4.2x speedup over the compiler's best attempt, proving that memory locality optimization is critical for maximizing IPC (Instructions Per Cycle) on x86 architectures.

Performance Results (AMD Ryzen / Intel Core)

Implementation Compiler Flags Execution Time Speedup Observation
Naive (Baseline) -O0 11.72s 1.0x No optimization. High overhead.
Naive (Compiler) -O3 3.76s 3.1x Auto-vectorization active (AVX), but poor cache locality.
Tiled (Optimized) -O3 0.88s 13.3x L1 Cache Blocking implemented. 4.2x faster than GCC -O3.

Technical Analysis

1. The Bottleneck: Cache Thrashing

Profiling the Naive -O3 implementation revealed that despite using SIMD instructions, the CPU was stalling on memory fetches.

  • Problem: The standard i-j-k loop access pattern causes non-contiguous memory access for matrix B (column-major traversal).
  • Consequence: This evicts data from the L1 Data Cache before it can be reused, leading to a high D1 Miss Rate.

2. The Solution: Loop Tiling (Blocking)

I refactored the kernel to process the matrix in small 32x32 blocks.

  • Why it works: A 32x32 block of doubles (8 bytes) takes 8KB, which fits comfortably inside the standard 32KB L1 Data Cache.
  • Impact: The CPU reuses data already in the L1 cache, significantly reducing latency and increasing IPC (Instructions Per Cycle).

3. Verification

Using valgrind --tool=cachegrind, I verified the improvement in cache behavior:

  • Naive Approach: High L1 Data Read Misses.
  • Tiled Approach: Near-zero L1 Data Read Misses.

Deep Dive Analysis

Using valgrind --tool=cachegrind, I isolated the performance hotspot in the Naive implementation:

Ir (Instructions)   Source Line
----------------------------------------------------------------
4,296,015,872 (57%)   for (int k = 0; k < N; k++) {
3,221,225,472 (42%)       sum += A[i * N + k] * B[k * N + j];

How to Run

Prerequisites:

  • Linux (Ubuntu 20.04+) or WSL2
  • g++ (GCC)
  • valgrind (for cache simulation)

Run the Analysis Script:

chmod +x scripts/benchmark.sh
./scripts/benchmark.sh

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors