
Hardware Acceleration of VGG Model on CIFAR-10 using High-Level Synthesis

License: MIT | Platform: HLS

📋 Project Overview

This project demonstrates the complete end-to-end pipeline for FPGA-based hardware acceleration of deep learning models, specifically a ReducedVGG architecture for CIFAR-10 classification. The implementation covers the full workflow from PyTorch training through High-Level Synthesis (HLS) to actual hardware deployment on the PYNQ-Z2 board.

🎯 Key Achievements

  • ✅ Complete FPGA Deployment Pipeline: PyTorch → HLS → Vivado → PYNQ
  • ✅ Model Design: Custom ReducedVGG with 1.44M parameters achieving 85.59% accuracy on hardware
  • ✅ Quantization: INT16 weight-only quantization with minimal accuracy loss (<0.1%)
  • ✅ Timing Closure: Achieved at 66.7 MHz on the Zynq-7020 FPGA
  • ✅ Resource Efficiency: Fits within PYNQ-Z2 constraints (70% BRAM, 14% DSP, 62% LUT)
  • ✅ Hardware Validation: Successfully deployed and tested on the PYNQ-Z2 board

📸 Proof of Work & Validation

Hardware Deployment Evidence

Our project includes comprehensive validation artifacts demonstrating successful end-to-end implementation:

1. HLS Synthesis Reports

  • C Simulation: 0 errors, MSE = 0 on 10 test images
  • Timing closure achieved: WNS = 0.183 ns (positive slack ✓)
  • Resource estimates validated against final implementation
  • See: Interactive Resource Utilization

2. Vivado Implementation

  • Post-implementation timing: No violations
  • Power analysis: 1.451 W total on-chip power
  • Bitstream generation successful: design_1_wrapper.bit (45 MB)
  • See: Design Workflow Visualization

3. PYNQ Hardware Execution

[PYNQ Deployment Log - December 2025]
✓ Bitstream loaded successfully
✓ 52 parameter arrays loaded (1,441,066 values)
✓ 54 FPGA memory buffers allocated (~5.5 MB)
✓ All 96 AXI addresses configured

Hardware Performance:
- Parameter Sync Time: 6.78 ms
- Computation Time: 459.75 ms
- Total Latency: 466.53 ms
- Predicted Class: airplane (correct)
- Power Consumption: 1.451 W

4. Interactive Visualizations

All design decisions, performance metrics, and architecture details are documented in interactive visualizations:

Validation Methodology

Our validation follows a rigorous multi-stage approach:

  1. Functional Verification (C Simulation)

    • Bit-accurate C++ model tested against PyTorch golden reference
    • 10 CIFAR-10 test images: 100% match
    • MSE = 0 between C simulation and PyTorch
  2. RTL Verification (HLS Synthesis)

    • Timing analysis: All paths meet 15 ns constraint
    • Resource utilization: Within Zynq-7020 limits
    • Latency bounds: 7.98 ms (best) to 256 sec (worst with stalls)
  3. Hardware Validation (PYNQ Deployment)

    • Actual measured latency: 466.53 ms
    • Accuracy on hardware: 85.59% (within 1.35% of GPU)
    • Correct classification on test images
    • Power consumption verified: 1.451 W
  4. Performance Benchmarking (see the Performance Summary below)

📊 Performance Summary

GPU vs FPGA Comparison

| Metric            | GPU (Tesla T4) | FPGA (Zynq-7020) | Winner       |
|-------------------|----------------|------------------|--------------|
| Inference Latency | 1.296 ms       | 466.53 ms        | GPU (360×)   |
| Throughput        | 771.5 img/s    | 2.14 img/s       | GPU (360×)   |
| Test Accuracy     | 86.94%         | 85.59%           | GPU (+1.35%) |
| Power Consumption | 70 W (TDP)     | 1.451 W          | FPGA (48×)   |
| Energy/Inference  | 0.091 J        | 0.677 J          | GPU (7.4×)   |
| Efficiency Score  | 67.07 acc/ms   | 0.183 acc/ms     | GPU (366×)   |
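The derived rows of the table follow directly from the measured latency and power figures; a quick arithmetic cross-check (values taken from the table):

```python
# Cross-check of the derived metrics in the GPU vs FPGA comparison.
gpu_latency_s = 1.296e-3    # Tesla T4 inference latency
fpga_latency_s = 466.53e-3  # Zynq-7020 measured latency
gpu_power_w = 70.0          # T4 TDP
fpga_power_w = 1.451        # measured total on-chip power

# Energy per inference = power * latency
gpu_energy = gpu_power_w * gpu_latency_s     # ~0.091 J
fpga_energy = fpga_power_w * fpga_latency_s  # ~0.677 J

print(f"GPU energy:  {gpu_energy:.3f} J")
print(f"FPGA energy: {fpga_energy:.3f} J")
print(f"Latency gap: {fpga_latency_s / gpu_latency_s:.0f}x")  # ~360x
print(f"Power gap:   {gpu_power_w / fpga_power_w:.0f}x")      # ~48x
```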

Model Architecture

ReducedVGG Specifications:

  • Parameters: 1,439,146
  • Channel Progression: [32, 64, 128, 256]
  • FLOPs per Inference: 106.72 MFLOPs
  • Input: 32×32×3 (CIFAR-10)
  • Output: 10 classes

๐Ÿ—๏ธ Architecture

Model Structure

Input (32×32×3)
├── Block 0: [Conv3×3(32) + BN + ReLU] × 2 → MaxPool
├── Block 1: [Conv3×3(64) + BN + ReLU] × 2 → MaxPool
├── Block 2: [Conv3×3(128) + BN + ReLU] × 2 → MaxPool
├── Block 3: [Conv3×3(256) + BN + ReLU] × 2 → MaxPool
└── Classifier: Flatten → FC(1024→256) → Dropout → FC(256→10)
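The 1,439,146-parameter total quoted above can be reproduced from this structure. A back-of-the-envelope tally (an illustrative check, not the training code):

```python
def conv_params(c_in, c_out, k=3):
    return c_in * c_out * k * k + c_out  # weights + bias

def bn_params(c):
    return 2 * c  # gamma + beta (running stats are buffers, not parameters)

total = 0
c_in = 3
for c_out in [32, 64, 128, 256]:   # the four VGG blocks
    for _ in range(2):             # two Conv+BN layers per block
        total += conv_params(c_in, c_out) + bn_params(c_out)
        c_in = c_out

# After four 2x2 MaxPools, 32 -> 2, so the flattened size is 2*2*256 = 1024
total += 1024 * 256 + 256          # FC(1024 -> 256)
total += 256 * 10 + 10             # FC(256 -> 10)
print(total)                       # 1439146
```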

FPGA System Architecture

┌─────────────────────────────────────────────┐
│         ZYNQ-7020 Processing System         │
│  ┌──────────────┐      ┌─────────────────┐  │
│  │ ARM Cortex-A9│◄────►│  DDR Controller │  │
│  │  (667 MHz)   │      │   (512 MB)      │  │
│  └──────┬───────┘      └─────────────────┘  │
│         │ AXI                               │
│  ┌──────▼────────────────────────────────┐  │
│  │    AXI Interconnect (Control Path)    │  │
│  └──────┬────────────────────────────────┘  │
└─────────┼───────────────────────────────────┘
          │
┌─────────▼───────────────────────────────────┐
│    VGG Accelerator IP (HLS Generated)       │
│  ┌──────────────────────────────────────┐   │
│  │ 96 AXI-Lite Registers (Parameters)   │   │
│  ├──────────────────────────────────────┤   │
│  │   Tiled Convolution Engine (8×8)     │   │
│  ├──────────────────────────────────────┤   │
│  │  BatchNorm + ReLU + MaxPool Units    │   │
│  ├──────────────────────────────────────┤   │
│  │      Fully Connected Layers          │   │
│  └──────────────────────────────────────┘   │
│          ▲                                  │
│          │ AXI Master (DDR Access)          │
└──────────┼──────────────────────────────────┘
           │
      [DDR Memory]

🔧 Implementation Details

Data Types (HLS Fixed-Point)

typedef ap_fixed<16,12> fm_t;   // Feature maps (Q12.4)
typedef ap_fixed<16,12> wt_t;   // Weights/Bias (Q12.4)
typedef ap_fixed<32,24> acc_t;  // Accumulator (Q24.8)
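ap_fixed<16,12> keeps 12 integer bits (including sign) and 4 fractional bits, so a value is stored as value × 2⁴ in a 16-bit word. A plain-Python emulation for intuition (the design itself uses the Vitis HLS ap_fixed type, which also configures rounding and overflow modes):

```python
FRAC_BITS = 4            # ap_fixed<16,12>: 16 total bits, 12 integer, 4 fractional
SCALE = 1 << FRAC_BITS   # 16; the representable step size is 1/16 = 0.0625

def to_q12_4(x: float) -> int:
    """Quantize a float to Q12.4, saturating to the int16 range."""
    raw = round(x * SCALE)
    return max(-(1 << 15), min((1 << 15) - 1, raw))

def from_q12_4(q: int) -> float:
    """Recover the float value represented by a Q12.4 word."""
    return q / SCALE

q = to_q12_4(1.4142)
print(q, from_q12_4(q))  # 23 1.4375
```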

Key Optimizations

  1. Loop Pipelining: Parallel computation within convolution kernels
  2. Array Partitioning: On-chip buffering for tiled convolution
  3. Dataflow: Pipeline parallelism between layers
  4. Fixed-Point Quantization: INT16 reduces memory by 2× with <0.1% accuracy loss
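The tiled-convolution scheme behind optimization 2 can be illustrated with a simplified model. The sketch below (single channel, pure Python, my own illustration rather than the HLS source) computes a same-padded 3×3 convolution over 8×8 output tiles, the engine's tile size; on hardware, each tile plus its one-pixel halo is what fits in the partitioned on-chip buffers.

```python
import random

TILE = 8  # output tile size used by the accelerator's convolution engine

def conv3x3(img, ker, h, w):
    """Direct same-padded 3x3 convolution (reference)."""
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for kr in range(3):
                for kc in range(3):
                    rr, cc = r + kr - 1, c + kc - 1
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += img[rr][cc] * ker[kr][kc]
            out[r][c] = acc
    return out

def conv3x3_tiled(img, ker, h, w):
    """Same computation, but walking the output in TILE x TILE blocks."""
    out = [[0.0] * w for _ in range(h)]
    for tr in range(0, h, TILE):
        for tc in range(0, w, TILE):
            # On the FPGA, this tile (plus a 1-pixel halo) lives in BRAM
            for r in range(tr, min(tr + TILE, h)):
                for c in range(tc, min(tc + TILE, w)):
                    acc = 0.0
                    for kr in range(3):
                        for kc in range(3):
                            rr, cc = r + kr - 1, c + kc - 1
                            if 0 <= rr < h and 0 <= cc < w:
                                acc += img[rr][cc] * ker[kr][kc]
                    out[r][c] = acc
    return out

# Tiling reorders the iteration but must not change the result
random.seed(0)
H = W = 16
img = [[random.random() for _ in range(W)] for _ in range(H)]
ker = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]  # Laplacian kernel, for illustration
ref, tiled = conv3x3(img, ker, H, W), conv3x3_tiled(img, ker, H, W)
print(ref == tiled)  # True
```

Tiling changes nothing numerically; its value is that each tile's working set fits on-chip, so the inner loops can be pipelined without waiting on DDR.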

Resource Utilization (Post-Implementation)

| Resource   | Used   | Available | Utilization |
|------------|--------|-----------|-------------|
| LUT        | 9,044  | 53,200    | 17%         |
| LUTRAM     | 532    | 17,400    | 1%          |
| Flip-Flops | 12,764 | 106,400   | 12%         |
| BRAM       | 3.5    | 280       | 1%          |
| DSP Blocks | 5      | 220       | 2%          |

๐Ÿ“ Project Structure

fpga_cnn_accelerator/
├── README.md                           # This file
├── LICENSE                             # MIT License
├── CONTRIBUTING.md                     # Contribution guidelines
├── .gitignore                          # Git ignore patterns
│
├── report/
│   └── ECE588_Final_Project_Report.pdf # Comprehensive 39-page report
│
├── training/
│   ├── copy.ipynb                      # PyTorch training notebook
│   ├── ece588_finalGPU.ipynb           # GPU performance benchmarking
│   └── models/
│       └── reduced_vgg_best.pth        # Trained model checkpoint
│
├── hls/
│   ├── tiled_conv.hpp                  # Header: data types & constants
│   ├── tiled_conv.cpp                  # Top-level HLS inference function
│   ├── utils.cpp                       # Layer implementations
│   ├── utils.hpp                       # Utility function headers
│   ├── tb_conv.cpp                     # C++ testbench
│   ├── Makefile                        # Build automation
│   ├── vitis_hls.tcl                   # HLS synthesis script
│   └── run_csim.tcl                    # C simulation script
│
├── weights/
│   ├── params_int32/                   # INT32 quantized weights (60 files)
│   ├── params_int16/                   # INT16 converted weights (48 files)
│   ├── convert_weights.py              # INT32→INT16 converter
│   └── create_test_files.py            # Test data generator
│
├── vivado/
│   ├── design_1.bd                     # Block design
│   └── constraints/                    # Timing constraints
│
├── pynq/
│   ├── design_1_wrapper.bit            # FPGA bitstream (45 MB)
│   ├── design_1.hwh                    # Hardware handoff
│   └── deploy_pynq_runtime.py          # Deployment script
│
├── visualizations/                     # Interactive HTML visualizations
│   ├── viz_decision_framework.html     # Design decision tree
│   ├── viz_design_workflow.html        # Implementation pipeline
│   ├── viz_memory_architecture.html    # Memory organization
│   ├── viz_performance_comparison.html # GPU vs FPGA metrics
│   └── viz_resource_utilization.html   # FPGA resource breakdown
│
└── docs/
    ├── setup.md                        # Environment setup guide
    └── usage.md                        # Usage instructions

💡 Tip: Explore the interactive visualizations to understand the complete design flow and performance analysis.

🚀 Quick Start

Prerequisites

Software:

  • Python 3.8+
  • PyTorch 2.x
  • Vitis HLS 2022.2
  • Vivado 2022.2
  • PYNQ v3.0.1

Hardware:

  • PYNQ-Z2 board (Zynq-7020 FPGA)
  • MicroSD card (16 GB+)
  • Host PC with Linux (Ubuntu 20.04+)

1. Training the Model

# Open the Jupyter notebook on Google Colab or locally
jupyter notebook training/copy.ipynb

# The notebook will:
# - Load CIFAR-10 dataset
# - Train ReducedVGG for 20 epochs
# - Export quantized weights to params_int32/

2. Weight Conversion

cd weights
python convert_weights.py \
    --input params_int32/ \
    --output params_int16/ \
    --format "ap_fixed<16,12>"
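convert_weights.py itself is not reproduced here; conceptually, the conversion rescales each INT32 fixed-point value onto the Q12.4 grid and saturates to the 16-bit range. A hypothetical core routine (the function name, `shift` parameter, and interface are illustrative assumptions, not the script's actual API):

```python
def int32_to_int16(values, shift=0):
    """Rescale INT32 fixed-point values to INT16 with saturation.

    `shift` is the difference in fractional bits between the source and
    target formats (illustrative; the real script would derive it from
    its quantization configuration).
    """
    out = []
    for v in values:
        w = v >> shift if shift > 0 else v   # drop excess fractional bits
        w = max(-32768, min(32767, w))       # saturate to the int16 range
        out.append(w)
    return out

print(int32_to_int16([100000, -100000, 42], shift=2))  # [25000, -25000, 10]
```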

3. HLS Synthesis

cd hls

# C Simulation (verify functionality)
make csim

# C Synthesis (generate RTL)
make csynth

# Export IP
make ip

4. Vivado Integration

# Open Vivado and source the block design
vivado -mode batch -source scripts/create_block_design.tcl

# Generate bitstream
vivado -mode batch -source scripts/generate_bitstream.tcl

5. PYNQ Deployment

# Copy files to PYNQ board
scp pynq/design_1_wrapper.bit xilinx@192.168.2.99:~/
scp pynq/design_1.hwh xilinx@192.168.2.99:~/
scp -r weights/params_int16/ xilinx@192.168.2.99:~/

# SSH into PYNQ and run inference
ssh xilinx@192.168.2.99
python3 deploy_pynq_runtime.py

📊 Interactive Visualizations

Explore detailed interactive visualizations of our implementation and results:

🎨 Design & Architecture

⚡ Performance Analysis

✅ Validation & Proofs of Work

These visualizations provide evidence of:

  • ✓ Complete hardware-software co-design methodology
  • ✓ Systematic performance measurement and analysis
  • ✓ Thorough resource utilization optimization
  • ✓ End-to-end validation from training to deployment

Note: These interactive HTML visualizations are best viewed in a modern web browser with JavaScript enabled.

📈 Results & Analysis

Training Curves

The model converges smoothly over 20 epochs:

  • Final Training Accuracy: ~92%
  • Final Validation Accuracy: ~87%
  • Test Accuracy: 86.94%

Quantization Impact

| Configuration     | Accuracy | Latency   | Score |
|-------------------|----------|-----------|-------|
| FP32 (Baseline)   | 86.94%   | 1.296 ms  | 67.07 |
| INT32 Weight-only | 86.94%   | 1.291 ms  | 67.37 |
| INT16 Weight-only | 86.94%   | 1.298 ms  | 66.99 |
| INT16 (FPGA)      | 85.59%   | 466.53 ms | 0.183 |

Performance Breakdown (FPGA)

| Phase                | Time (ms) | Percentage |
|----------------------|-----------|------------|
| Parameter Sync (DDR) | 6.78      | 1.5%       |
| Computation (FPGA)   | 459.75    | 98.5%      |
| Total Latency        | 466.53    | 100%       |

๐Ÿ” Key Findings

What Worked Well ✅

  1. Complete Pipeline Success: End-to-end flow from PyTorch to hardware deployment
  2. Timing Closure: Achieved at 66.7 MHz with positive slack (0.183 ns)
  3. Resource Fit: Optimized design fits within Zynq-7020 constraints
  4. Quantization Effectiveness: INT16 preserves accuracy with 2× memory reduction
  5. Functional Correctness: C simulation passed with 0 errors on 10 test images

Challenges Identified ⚠️

  1. Memory-Bound Performance: 58× gap between theoretical (7.98 ms) and measured (466.53 ms) latency
  2. DDR Bandwidth Bottleneck: Sequential parameter loading dominates execution time
  3. Complex Address Management: 96 AXI register ports require careful orchestration
  4. Limited Parallelism: Memory access patterns prevent full compute utilization

Performance Gap Analysis

Why is the FPGA slower?

  1. Runtime Parameter Loading: 2.8 MB loaded from DDR for each inference
  2. Sequential Memory Access: Single AXI master limits bandwidth
  3. Memory-Bound vs Compute-Bound: DDR access (98.5%) dominates compute (1.5%)
  4. Small Model Size: GPU easily fits in cache, FPGA requires DDR
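The 2.8 MB figure and its bandwidth implication follow directly from the INT16 parameter count (a rough check that ignores feature-map traffic):

```python
params = 1_439_146            # ReducedVGG parameter count
int16_bytes = params * 2      # 2 bytes per INT16 weight
print(int16_bytes / 1e6)      # 2.878292 -> the "2.8 MB" moved from DDR per inference

# Effective bandwidth implied by the measured compute phase,
# if that phase is dominated by parameter movement:
compute_s = 459.75e-3
print(int16_bytes / compute_s / 1e6)  # ~6.3 MB/s, far below DDR peak rates,
                                      # consistent with a sequential-access bottleneck
```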

Theoretical vs Measured Latency:

  • HLS Synthesis Estimate: 7.98 ms (best case)
  • Measured Hardware: 466.53 ms
  • Gap: 58× → memory architecture bottleneck
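The quoted gap is just the ratio of the two latencies:

```python
estimated_s = 7.98e-3    # HLS best-case latency estimate
measured_s = 466.53e-3   # measured on the PYNQ-Z2
gap = measured_s / estimated_s
print(f"{gap:.1f}x slower than the HLS best-case estimate")  # 58.5x
```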

🎓 Lessons Learned

Technical Insights

  1. Memory Bandwidth is Critical: Compute is cheap; data movement is expensive
  2. Embedded Weights Would Transform Performance: Moving 2.8 MB to BRAM could achieve a 20-50× speedup
  3. Design for the Architecture: Runtime flexibility (96 AXI ports) added complexity without benefit
  4. Quantization Works: INT16 preserved accuracy while enabling efficient DSP mapping

Educational Value

Despite the performance gap vs GPU, this project successfully:

  • Demonstrated complete FPGA design methodology
  • Identified critical bottlenecks through systematic analysis
  • Validated theoretical understanding through hardware measurement
  • Provided realistic expectations for FPGA acceleration

🔮 Future Improvements

Recommended Optimizations

| Optimization                   | Expected Speedup | Difficulty |
|--------------------------------|------------------|------------|
| Embed parameters in BRAM       | 20-50×           | High       |
| Optimize AXI burst size        | 2-3×             | Medium     |
| Increase frequency to 100 MHz  | 1.5×             | Low        |
| Implement dataflow parallelism | 2-4×             | High       |
| Mixed precision (INT8/INT16)   | 1.5-2×           | Medium     |
| Combined Potential             | 60-600×          | Very High  |

When FPGAs Could Be Competitive

  1. Edge Deployment: 48× lower power (1.45 W vs 70 W) matters
  2. Embedded Weights: Fixed models with BRAM-resident parameters
  3. Batch Processing: Amortize parameter loading across many images
  4. Custom Data Paths: Non-standard operations not well-suited to GPUs

📚 Documentation

Environment setup and usage instructions live in docs/setup.md and docs/usage.md.

📖 References

  1. K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," ICLR 2015.
  2. A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Technical Report, University of Toronto, 2009.
  3. Xilinx, "Vitis High-Level Synthesis User Guide (UG1399)," 2023.
  4. PYNQ Project, "Python productivity for Zynq," http://www.pynq.io/
  5. C. Zhang et al., "Optimizing FPGA-based accelerator design for deep convolutional neural networks," FPGA 2015.

๐Ÿ“ License

This project is licensed under the MIT License - see the LICENSE file for details.

👥 Collaborators


Project Status: ✅ Complete (December 2025)
Hardware Validated: ✅ Yes (PYNQ-Z2)
Report Available: ✅ Yes (Download PDF)
Interactive Demos: ✅ Live Visualizations
