This repository was archived by the owner on Jan 30, 2026. It is now read-only.

mikhailbahdashych/3d-convolution-grad-cam-exercises


Exercise Recognition with 3D Convolutional Neural Networks and GradCAM

A comprehensive implementation of video-based exercise recognition using 3D Convolutional Neural Networks with GradCAM visualization for model interpretability.


Table of Contents

  1. Overview
  2. Technical Architecture
  3. Dataset
  4. Installation
  5. Quick Start Guide
  6. Usage
  7. Project Structure
  8. Configuration
  9. Implementation Details
  10. Expected Results

Overview

This project implements a complete pipeline for recognizing exercises from video data. The system uses a 3D Convolutional Neural Network (C3D architecture) to learn spatio-temporal features from video clips and provides interpretability through GradCAM heatmap visualizations.

Key Features

  • 3D CNN (C3D) architecture for video classification
  • Focal Loss for handling severe class imbalance
  • Class 1 downsampling option for balanced training
  • Segmentation mask support for background removal
  • Skeleton-guided attention using YOLO pose detection to focus on key body joints
  • GradCAM visualization for model interpretability with dynamic probability charts
  • Support for multiple devices: CUDA (NVIDIA GPU), MPS (Apple Silicon), and CPU
  • Multi-GPU training support with DataParallel
  • Comprehensive evaluation metrics and confusion matrix generation
  • Modular and extensible codebase

Use Case

The system is designed for automatic exercise recognition from video footage, capable of classifying 17 different exercise types from short video clips.


Technical Architecture

Model: C3D (3D Convolutional Network)

Architecture Overview:

  • Input: RGB video clip of shape (3, 16, 112, 112)
    • 3 color channels
    • 16 temporal frames
    • 112x112 spatial resolution
  • 5 convolutional blocks with 3x3x3 kernels
  • Batch normalization and ReLU activation
  • 3D max pooling for temporal and spatial downsampling
  • Global average pooling
  • Fully connected classifier
  • Output: 17-class predictions

Model Statistics:

  • Total parameters: 27,797,137 (approximately 27.8 million)
  • Trainable parameters: 27,797,137

Network Details:

Block 1: Conv3D(3->64)   + BatchNorm + ReLU + MaxPool(1,2,2)
Block 2: Conv3D(64->128) + BatchNorm + ReLU + MaxPool(2,2,2)
Block 3: Conv3D(128->256)x2 + BatchNorm + ReLU + MaxPool(2,2,2)
Block 4: Conv3D(256->512)x2 + BatchNorm + ReLU + MaxPool(2,2,2)
Block 5: Conv3D(512->512)x2 + BatchNorm + ReLU + MaxPool(2,2,2)
Global Average Pooling -> Dropout(0.5) -> Linear(512->256) -> ReLU -> Dropout(0.3) -> Linear(256->17)
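
The block structure above can be sketched in PyTorch as follows. This is a reconstruction from the listed layers, not the project's exact src/models/cnn3d.py, but it reproduces the stated parameter count of 27,797,137 and the per-block spatial resolutions (14x14 at block 3, 7x7 at block 4, 3x3 at block 5):

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, n_convs, pool):
    # n_convs Conv3D(3x3x3) layers, each with BatchNorm + ReLU, then one MaxPool3d
    layers = []
    for i in range(n_convs):
        layers += [
            nn.Conv3d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
        ]
    layers.append(nn.MaxPool3d(kernel_size=pool))
    return nn.Sequential(*layers)

class C3D(nn.Module):
    def __init__(self, num_classes=17):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 64, 1, (1, 2, 2)),     # Block 1: no temporal pooling
            conv_block(64, 128, 1, (2, 2, 2)),   # Block 2
            conv_block(128, 256, 2, (2, 2, 2)),  # Block 3
            conv_block(256, 512, 2, (2, 2, 2)),  # Block 4
            conv_block(512, 512, 2, (2, 2, 2)),  # Block 5
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),  # global average pooling
            nn.Flatten(),
            nn.Dropout(0.5),
            nn.Linear(512, 256),
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):  # x: (batch, 3, 16, 112, 112)
        return self.classifier(self.features(x))

model = C3D()
print(sum(p.numel() for p in model.parameters()))  # 27797137
with torch.no_grad():
    out = model(torch.randn(1, 3, 16, 112, 112))
print(tuple(out.shape))  # (1, 17)
```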

Dataset

Data Statistics

  • Total subjects: 60
  • Train/test split: 30 subjects each
  • Exercise classes: 17 (labeled 0-16 in the label files)
  • Video format: MP4, 30 FPS
  • Original resolution: approximately 400x550 pixels
  • Total training clips: 28,413 (with temporal sliding window)
  • Total test clips: approximately 23,000

Data Structure

dataset/
├── dataset/
│   ├── anon/              # Anonymized video files (.mp4)
│   ├── mask/              # Segmentation masks (.png)
│   └── skeleton/          # YOLO pose detection outputs
│       ├── yolo/          # Pose videos
│       └── yolo_pose_csv/ # Pose keypoints in CSV format
├── label/                 # Frame-level labels (.csv)
└── split.csv             # Train/test split specification

Label Format

Labels are stored as CSV files with format:

frame_number, column1, column2
  • column2 contains the exercise class: -1 for background, 0-16 for exercises
  • One label file per subject
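
A per-subject label file in this format can be parsed with pandas roughly as below; whether the files carry a header row is an assumption, and the column names follow the description above:

```python
from io import StringIO

import pandas as pd

def load_frame_labels(csv_source):
    """Read a per-subject label CSV; the last column holds the class:
    -1 = background, 0-16 = exercise classes."""
    return pd.read_csv(csv_source, header=None,
                       names=["frame_number", "column1", "column2"])

def exercise_frames(df):
    # Drop background frames, keep (frame_number, class) pairs
    return df.loc[df["column2"] >= 0, ["frame_number", "column2"]]

# Hypothetical three-row example in the documented format
sample = StringIO("0,0,-1\n1,0,3\n2,0,3\n")
labels = exercise_frames(load_frame_labels(sample))
print(labels["column2"].tolist())  # [3, 3]
```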

Class Distribution

The dataset exhibits severe class imbalance:

  • Class 1: approximately 64,000 training frames (dominant class)
  • Classes 2-16: approximately 9,000-11,000 frames each

Installation

Prerequisites

  • Python 3.12 or higher
  • Virtual environment tool (venv)
  • uv package manager

Setup Instructions

  1. Clone the repository and navigate to the project directory

  2. Activate the virtual environment:

source .venv/bin/activate

  3. Install dependencies using uv:

uv add torch torchvision
uv add opencv-python pandas scikit-learn
uv add matplotlib seaborn tensorboard
uv add tqdm pyyaml grad-cam

Note: The optional decord package has no build for macOS ARM; OpenCV is used for video loading instead.


Quick Start Guide

This section provides the recommended sequence for implementing and running the complete pipeline.

Step 1: Verify Installation

Test that all components are properly installed:

python scripts/quick_test.py

This will verify:

  • Dataset loading functionality
  • Model initialization
  • Forward and backward passes
  • Device selection (CPU/MPS/CUDA)

Step 2: Analyze the Dataset

Explore the dataset statistics and distribution:

python scripts/data_analysis.py

This generates:

  • Train/test split statistics
  • Class distribution analysis
  • Video property statistics
  • Visualization plots saved to outputs/analysis/

Step 3: Run Initial Training Test

Perform a quick training test (2 epochs) to verify the pipeline:

python scripts/train.py --epochs 2 --batch-size 4

Expected duration: 5-10 minutes on MPS/GPU

This validates:

  • Data loading pipeline
  • Model training loop
  • Checkpoint saving
  • Metrics tracking

Step 4: Full Training

Launch full training run:

# Default configuration (100 epochs)
python scripts/train.py

# Custom configuration
python scripts/train.py --epochs 50 --batch-size 8

# With class balancing (recommended for better per-class performance)
python scripts/train.py --epochs 50 --batch-size 8 --downsample-class1

# With background removal using segmentation masks
python scripts/train.py --epochs 50 --batch-size 8 --use-masks

# Combined: balanced training with background removal
python scripts/train.py --epochs 50 --batch-size 8 --downsample-class1 --use-masks

# With skeleton-guided attention (focuses on body joints)
python scripts/train.py --epochs 50 --batch-size 8 --use-skeleton-attention

# Full preprocessing: downsampling + masks + skeleton attention
python scripts/train.py --epochs 30 --batch-size 8 --downsample-class1 --use-masks --use-skeleton-attention

# Specify device
python scripts/train.py --epochs 50 --device cuda
python scripts/train.py --epochs 50 --device mps

Expected duration: 2-4 hours for 50 epochs on MPS/GPU

Training artifacts saved to:

  • Checkpoints: outputs/checkpoints/
  • Logs: outputs/logs/
  • TensorBoard logs: outputs/logs/

Step 5: Monitor Training Progress

In a separate terminal, launch TensorBoard:

tensorboard --logdir outputs/logs

Access the dashboard at: http://localhost:6006

Metrics tracked:

  • Training and validation loss
  • Training and validation accuracy
  • Per-class accuracy
  • Learning rate schedule

Step 5b: Plot Training Metrics

The training process automatically saves metrics to CSV files after each epoch in outputs/logs/metrics/. Generate publication-ready plots from these metrics:

# Generate all plots with default settings
python scripts/plot_training.py

# Specify custom directories
python scripts/plot_training.py --metrics-dir outputs/logs/metrics --save-dir outputs/plots

# Plot only top 5 classes by accuracy
python scripts/plot_training.py --top-classes 5

Generated plots:

  • loss_curves.png - Training and validation loss over time
  • accuracy_curves.png - Training and validation accuracy over time
  • learning_rate.png - Learning rate schedule
  • combined_metrics.png - All metrics in a single figure including generalization gap
  • per_class_accuracy.png - Per-class accuracy evolution (all classes)
  • per_class_accuracy_top10.png - Top 10 performing classes
  • per_class_heatmap.png - Heatmap showing per-class accuracy over epochs
  • final_class_comparison.png - Bar chart of final accuracy per class

CSV files saved to outputs/logs/metrics/:

  • training_metrics.csv - Epoch-level loss, accuracy, and learning rate
  • per_class_accuracy.csv - Per-class accuracy for every epoch

Step 6: Evaluate Model

After training completes, evaluate on test set:

python scripts/evaluate.py --checkpoint outputs/checkpoints/best_model_acc.pth

This generates:

  • Overall accuracy metrics
  • Per-class precision, recall, F1-score
  • Confusion matrix (counts and normalized)
  • Predictions CSV
  • Error analysis

Results saved to: outputs/results/

Step 7: Generate GradCAM Visualizations

Visualize model attention with GradCAM:

# Visualize 20 random test samples
python scripts/visualize_gradcam.py --checkpoint outputs/checkpoints/best_model_acc.pth --num-samples 20

# Visualize only misclassified samples
python scripts/visualize_gradcam.py --checkpoint outputs/checkpoints/best_model_acc.pth --num-samples 20 --misclassified-only

# Custom visualization settings
python scripts/visualize_gradcam.py \
    --checkpoint outputs/checkpoints/best_model_acc.pth \
    --num-samples 50 \
    --layer block5 \
    --alpha 0.6 \
    --fps 10

Outputs for each sample:

  • original.mp4: Original video clip
  • heatmap.mp4: GradCAM heatmap
  • overlay.mp4: Heatmap overlaid on video
  • overlay_with_probs.mp4: Overlay with class probability bar chart (scaled up for readable text, default 336x336 + chart)
  • side_by_side.mp4: Original and overlay side-by-side
  • metadata.txt: Sample information

Visualizations saved to: outputs/visualizations/


Usage

Training Options

python scripts/train.py [OPTIONS]

Options:
  --config PATH         Path to configuration YAML file
  --seed INT           Random seed for reproducibility (default: 42)
  --device DEVICE      Device to use: cuda/mps/cpu (auto-detected if not specified)
  --epochs INT         Number of training epochs (default: 100)
  --batch-size INT     Batch size for training (default: 8)
  --downsample-class1  Downsample Class 1 to match the median count of other classes (mitigates class imbalance)
  --use-masks          Apply segmentation masks to remove background from frames during training
  --use-skeleton-attention  Apply skeleton-guided attention masks to focus on body joints
  --skeleton-sigma FLOAT    Gaussian sigma for skeleton attention blobs (default: 10.0)

Evaluation Options

python scripts/evaluate.py [OPTIONS]

Required:
  --checkpoint PATH    Path to model checkpoint (.pth file)

Optional:
  --config PATH        Path to configuration file
  --split SPLIT        Dataset split to evaluate: train/test (default: test)
  --batch-size INT     Batch size for evaluation (default: 16)
  --device DEVICE      Device to use: cuda/mps/cpu
  --save-dir PATH      Directory to save results (default: outputs/results)
  --use-masks          Apply segmentation masks to remove background (use if model was trained with masks)

GradCAM Visualization Options

python scripts/visualize_gradcam.py [OPTIONS]

Required:
  --checkpoint PATH    Path to model checkpoint (.pth file)

Optional:
  --config PATH           Path to configuration file
  --split SPLIT           Dataset split: train/test (default: test)
  --num-samples INT       Number of samples to visualize (default: 20)
  --device DEVICE         Device to use: cuda/mps/cpu
  --save-dir PATH         Directory to save visualizations (default: outputs/visualizations)
  --layer LAYER           Target layer for GradCAM: block3/block4/block5 (default: block4)
  --alpha FLOAT           Overlay transparency: 0.0-1.0 (default: 0.5)
  --fps INT               Output video frame rate (default: 10)
  --misclassified-only    Only visualize misclassified samples
  --use-masks             Apply segmentation masks to remove background (use if model was trained with masks)
  --use-skeleton-attention  Apply skeleton-guided attention (use if model was trained with skeleton attention)
  --skeleton-sigma FLOAT  Gaussian sigma for skeleton attention blobs (default: 10.0)
  --scale-factor INT      Scale factor for overlay_with_probs video (default: 3, scales 112x112 to 336x336)
  --long-sample           Generate longer video samples with dynamic probability updates
  --sample-duration INT   Duration of long samples in frames (default: 300)
  --start-frame INT       Starting frame for long sample mode (default: 0)
  --subject-id STR        Specific subject ID to visualize in long sample mode

Project Structure

PROJECT/
├── src/                              # Source code
│   ├── config/
│   │   ├── __init__.py
│   │   └── config.py                 # Configuration management
│   ├── data/
│   │   ├── __init__.py
│   │   ├── dataset.py                # PyTorch Dataset for video clips
│   │   ├── transforms.py             # Video augmentation transforms
│   │   └── utils.py                  # Video loading utilities
│   ├── models/
│   │   ├── __init__.py
│   │   └── cnn3d.py                  # C3D architecture implementation
│   ├── training/
│   │   ├── __init__.py
│   │   ├── trainer.py                # Training loop and logic
│   │   └── losses.py                 # Focal Loss implementation
│   ├── evaluation/
│   │   ├── __init__.py
│   │   ├── metrics.py                # Evaluation metrics
│   │   └── evaluator.py              # Evaluation pipeline
│   ├── visualization/
│   │   ├── __init__.py
│   │   └── gradcam.py                # GradCAM wrapper for 3D CNN
│   └── utils/
│       ├── __init__.py
│       ├── device.py                 # Device selection (CPU/MPS/CUDA)
│       ├── logging.py                # Logging utilities
│       └── checkpointing.py          # Model save/load utilities
├── scripts/                          # Executable scripts
│   ├── train.py                      # Main training script
│   ├── evaluate.py                   # Model evaluation script
│   ├── visualize_gradcam.py          # GradCAM visualization script
│   ├── data_analysis.py              # Dataset analysis script
│   ├── test_dataset.py               # Dataset loading test
│   ├── test_training_init.py         # Training initialization test
│   └── quick_test.py                 # Quick system test
├── outputs/                          # Generated outputs
│   ├── checkpoints/                  # Model checkpoints (.pth files)
│   ├── logs/                         # Training logs and TensorBoard
│   ├── results/                      # Evaluation results
│   ├── visualizations/               # GradCAM visualizations
│   └── analysis/                     # Dataset analysis plots
├── dataset/                          # Dataset directory (not in repo)
├── .venv/                            # Virtual environment
├── .gitignore                        # Git ignore file
├── pyproject.toml                    # Project dependencies
└── README.md                         # This file

Configuration

Default Configuration

The default configuration is defined in src/config/config.py. Key parameters include:

Data Processing:

  • Clip length: 16 frames
  • Temporal stride: 8 frames (for sliding window during training)
  • Spatial size: 112x112 pixels
  • FPS: 30 (original video framerate)

Model:

  • Architecture: C3D
  • Number of classes: 17
  • Dropout: 0.5
  • Pretrained weights: False

Training:

  • Batch size: 8
  • Number of epochs: 100
  • Number of workers: 4
  • Pin memory: True

Optimizer:

  • Type: AdamW
  • Learning rate: 0.0001
  • Weight decay: 0.00001

Learning Rate Scheduler:

  • Type: ReduceLROnPlateau
  • Mode: min (reduce on validation loss)
  • Factor: 0.5
  • Patience: 5 epochs
  • Minimum LR: 0.0000001
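
The optimizer and scheduler settings above translate directly into PyTorch; a minimal sketch (using a stand-in model in place of C3D):

```python
import torch

model = torch.nn.Linear(8, 2)  # stand-in for the C3D model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5, min_lr=1e-7)

# Step on validation loss after each epoch; a stalled loss triggers a reduction
for val_loss in [1.0] * 8:
    scheduler.step(val_loss)
print(optimizer.param_groups[0]["lr"])  # 5e-05 after patience is exceeded
```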

Loss Function:

  • Type: Focal Loss
  • Gamma: 2.0
  • Class weights: Computed automatically from training data

Data Augmentation (Training Only):

  • Horizontal flip: 50% probability
  • Rotation: +/- 10 degrees
  • Color jitter: brightness, contrast, saturation, hue
  • Normalization: ImageNet mean and std

Early Stopping:

  • Patience: 10 epochs
  • Minimum delta: 0.001

Checkpointing:

  • Save frequency: Every 10 epochs
  • Keep best 3 checkpoints
  • Save best model by validation loss
  • Save best model by validation accuracy

Device:

  • Preference order: CUDA > MPS > CPU
  • Automatic fallback to CPU if GPU unavailable
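
The CUDA > MPS > CPU preference order can be implemented as below (a sketch; the project's version lives in src/utils/device.py and may differ in detail):

```python
import torch

def select_device(preferred=None):
    """Pick CUDA > MPS > CPU, honoring an explicit request when given."""
    if preferred is not None:
        return torch.device(preferred)
    if torch.cuda.is_available():
        return torch.device("cuda")
    if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = select_device()
print(device.type)  # one of: cuda / mps / cpu
```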

Custom Configuration

You can create a custom YAML configuration file and use it:

python scripts/train.py --config path/to/config.yaml

Implementation Details

Data Pipeline

Video Processing:

  1. Videos are loaded using OpenCV
  2. Clips of 16 consecutive frames are extracted using a sliding window
  3. Frames are resized to 112x112 pixels
  4. Clips are normalized using ImageNet statistics
  5. Optional: Segmentation masks can be applied to remove background (--use-masks)
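
Steps 3-5 (masking and normalization; resizing is done with OpenCV in the project) can be sketched in NumPy as follows, assuming frames are already decoded and resized:

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess_clip(frames, masks=None):
    """frames: (T, H, W, 3) uint8 RGB, already resized to 112x112.
    Returns a channels-first (3, T, H, W) float32 clip."""
    clip = frames.astype(np.float32) / 255.0
    if masks is not None:
        # Zero out background pixels (nonzero mask marks the person)
        clip = clip * (masks[..., None] > 0)
    clip = (clip - IMAGENET_MEAN) / IMAGENET_STD
    return clip.transpose(3, 0, 1, 2)

frames = np.full((16, 112, 112, 3), 128, dtype=np.uint8)
masks = np.ones((16, 112, 112), dtype=np.uint8)
clip = preprocess_clip(frames, masks)
print(clip.shape)  # (3, 16, 112, 112)
```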

Temporal Sampling:

  • Training: Overlapping clips with stride=8 frames
  • Validation/Testing: Non-overlapping clips with stride=16 frames
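
The temporal sampling above amounts to generating clip start indices with a stride; a minimal sketch:

```python
def clip_start_indices(num_frames, clip_len=16, stride=8):
    """Start frames for sliding-window clips: stride=8 gives overlapping
    training clips, stride=16 non-overlapping clips for evaluation."""
    if num_frames < clip_len:
        return []
    return list(range(0, num_frames - clip_len + 1, stride))

print(clip_start_indices(48, stride=8))   # [0, 8, 16, 24, 32]
print(clip_start_indices(48, stride=16))  # [0, 16, 32]
```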

Class Imbalance Handling:

  1. Class 1 Downsampling: Optional flag to downsample the dominant class to match other classes (--downsample-class1)
  2. Weighted Random Sampling: Minority classes are oversampled during training
  3. Focal Loss: Focuses learning on hard-to-classify examples
  4. Class Weights: Loss is weighted by inverse class frequency
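
A common formulation of class-weighted Focal Loss, as a sketch of points 3 and 4 (the project's implementation is in src/training/losses.py and may differ in detail):

```python
import torch
import torch.nn.functional as F

class FocalLoss(torch.nn.Module):
    """Focal loss: down-weights easy examples by (1 - p_t)^gamma so that
    hard-to-classify samples dominate the gradient."""
    def __init__(self, gamma=2.0, weight=None):
        super().__init__()
        self.gamma = gamma
        self.weight = weight  # per-class weights, e.g. inverse frequency

    def forward(self, logits, targets):
        ce = F.cross_entropy(logits, targets, weight=self.weight,
                             reduction="none")
        p_t = torch.exp(-ce)  # (approximate) probability of the true class
        return ((1.0 - p_t) ** self.gamma * ce).mean()

# Confident correct predictions contribute almost nothing to the loss
logits = torch.tensor([[10.0, 0.0], [0.0, 10.0]])
targets = torch.tensor([0, 1])
focal = FocalLoss(gamma=2.0)(logits, targets)
plain_ce = F.cross_entropy(logits, targets)
print(focal.item() < plain_ce.item())  # True
```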

Training Process

Training Loop:

  1. Forward pass through model
  2. Compute Focal Loss with class weights
  3. Backward pass with gradient computation
  4. Gradient clipping (max norm = 1.0)
  5. Optimizer step (AdamW)
  6. Metrics tracking (loss, accuracy)
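
The six steps above correspond to a standard PyTorch optimization step; a minimal sketch with a stand-in model (the project's loop lives in src/training/trainer.py):

```python
import torch

def train_step(model, clips, labels, criterion, optimizer, max_norm=1.0):
    """One optimization step with gradient clipping (max norm 1.0)."""
    optimizer.zero_grad()
    logits = model(clips)           # 1. forward pass
    loss = criterion(logits, labels)  # 2. loss (Focal Loss in the project)
    loss.backward()                 # 3. backward pass
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)  # 4. clip
    optimizer.step()                # 5. AdamW step
    acc = (logits.argmax(dim=1) == labels).float().mean()  # 6. metrics
    return loss.item(), acc.item()

# Tiny stand-in model and batch, for illustration only
model = torch.nn.Linear(10, 17)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-5)
loss, acc = train_step(model, torch.randn(4, 10), torch.tensor([0, 1, 2, 3]),
                       torch.nn.CrossEntropyLoss(), optimizer)
```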

Validation Loop:

  1. No gradient computation
  2. Forward pass only
  3. Compute validation loss and accuracy
  4. Per-class accuracy tracking

Checkpointing Strategy:

  • Save best model by validation loss
  • Save best model by validation accuracy
  • Save periodic checkpoints every 10 epochs
  • Each checkpoint includes:
    • Model state dict
    • Optimizer state dict
    • Epoch number
    • Metrics
    • Configuration
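
A checkpoint with those five components can be bundled into a single .pth file; a sketch (the project's version is in src/utils/checkpointing.py):

```python
import os
import tempfile

import torch

def save_checkpoint(path, model, optimizer, epoch, metrics, config):
    """Bundle everything needed to resume training into one .pth file."""
    torch.save({
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "epoch": epoch,
        "metrics": metrics,
        "config": config,
    }, path)

model = torch.nn.Linear(4, 2)  # stand-in model
optimizer = torch.optim.AdamW(model.parameters())
path = os.path.join(tempfile.mkdtemp(), "ckpt.pth")
save_checkpoint(path, model, optimizer, epoch=3,
                metrics={"val_acc": 0.7}, config={"batch_size": 8})
ckpt = torch.load(path, map_location="cpu")
print(ckpt["epoch"], ckpt["metrics"]["val_acc"])  # 3 0.7
```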

Early Stopping:

  • Monitors validation accuracy
  • Stops if no improvement for 10 consecutive epochs
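
Early stopping with patience 10 and minimum delta 0.001 reduces to a small bookkeeping class; a sketch (shown with a shorter patience for brevity):

```python
class EarlyStopping:
    """Stop when the monitored metric (validation accuracy) has not
    improved by min_delta for `patience` consecutive epochs."""
    def __init__(self, patience=10, min_delta=0.001):
        self.patience, self.min_delta = patience, min_delta
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, value):
        if value > self.best + self.min_delta:
            self.best, self.bad_epochs = value, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience  # True -> stop training

stopper = EarlyStopping(patience=3)
history = [0.50, 0.60, 0.61, 0.61, 0.61, 0.61]
stops = [stopper.step(v) for v in history]
print(stops)  # [False, False, False, False, False, True]
```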

GradCAM Implementation

GradCAM Process:

  1. Forward pass through model with target class
  2. Extract activations from target layer (default: block4 for better spatial resolution)
  3. Compute gradients of target class with respect to activations
  4. Weight activations by gradients
  5. Generate heatmap by averaging across channels
  6. Resize heatmap to input spatial dimensions
  7. Overlay on original frames using colormap
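
The project wraps the pytorch-grad-cam library (src/visualization/gradcam.py); steps 1-6 can be sketched from first principles for a 3D CNN, shown here on a tiny stand-in model rather than the full C3D:

```python
import torch
import torch.nn.functional as F

def gradcam_3d(model, target_layer, clip, class_idx=None):
    """Grad-CAM for a 3D CNN: weight target-layer activations by the
    spatio-temporally averaged gradients of the target class score."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(g=go[0]))
    try:
        logits = model(clip)                        # (1, num_classes)
        if class_idx is None:
            class_idx = logits.argmax(dim=1).item()
        model.zero_grad()
        logits[0, class_idx].backward()
        # Channel weights: average gradient over (T, H, W)
        w = grads["g"].mean(dim=(2, 3, 4), keepdim=True)
        cam = F.relu((w * acts["a"]).sum(dim=1))    # (1, T', H', W')
        # Resize heatmap to the input's temporal/spatial dimensions
        cam = F.interpolate(cam.unsqueeze(1), size=clip.shape[2:],
                            mode="trilinear", align_corners=False).squeeze(1)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
        return cam.detach(), class_idx
    finally:
        h1.remove()
        h2.remove()

# Tiny 3D CNN stand-in; in the project the target layer is block4 of C3D
model = torch.nn.Sequential(
    torch.nn.Conv3d(3, 8, 3, padding=1), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool3d(1), torch.nn.Flatten(), torch.nn.Linear(8, 17))
cam, cls = gradcam_3d(model, model[0], torch.randn(1, 3, 16, 112, 112))
print(tuple(cam.shape))  # (1, 16, 112, 112)
```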

Target Layer Options:

  • block3: 14x14 spatial resolution (highest detail, less semantic)
  • block4: 7x7 spatial resolution (default, balanced detail and semantics)
  • block5: 3x3 spatial resolution (most semantic, lowest spatial detail)

Visualization:

  • Heatmaps are generated for each frame in the clip
  • Multiple output formats: original, heatmap, overlay, overlay with probability chart, side-by-side
  • Probability chart shows class prediction confidence with true and predicted labels highlighted
  • Saved as MP4 videos for temporal analysis

Evaluation Metrics

Overall Metrics:

  • Accuracy: Percentage of correct predictions
  • Top-3 Accuracy: Percentage where true class is in top 3 predictions
  • Mean Class Accuracy: Average per-class accuracy (accounts for imbalance)
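
These three overall metrics can be computed from a probability matrix in a few lines of NumPy; a sketch (function name is illustrative):

```python
import numpy as np

def overall_metrics(probs, labels, k=3):
    """probs: (N, C) class probabilities; labels: (N,) true classes.
    Returns (accuracy, top-k accuracy, mean class accuracy)."""
    preds = probs.argmax(axis=1)
    accuracy = (preds == labels).mean()
    topk = np.argsort(probs, axis=1)[:, -k:]  # top-k class indices per row
    topk_acc = np.any(topk == labels[:, None], axis=1).mean()
    # Mean class accuracy: per-class recall averaged over observed classes
    per_class = [(preds[labels == c] == c).mean() for c in np.unique(labels)]
    return accuracy, topk_acc, float(np.mean(per_class))

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.4, 0.5, 0.1]])
labels = np.array([0, 1, 1])
acc, topk, mca = overall_metrics(probs, labels, k=2)
print(round(acc, 2), round(topk, 2), round(mca, 2))  # 0.67 1.0 0.75
```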

Per-Class Metrics:

  • Precision: TP / (TP + FP)
  • Recall: TP / (TP + FN)
  • F1-Score: Harmonic mean of precision and recall
  • Support: Number of samples per class

Confusion Matrix:

  • Generated in both count and normalized forms
  • Visualized as heatmap
  • Saved as CSV and PNG

Expected Results

Performance Targets

Based on the dataset characteristics and model architecture, expected performance metrics are:

Overall Performance:

  • Overall Accuracy: 70% or higher
  • Mean Class Accuracy: 60% or higher (accounts for class imbalance)
  • Top-3 Accuracy: 85% or higher

Per-Class Performance:

  • Precision: 50% or higher for all classes
  • Recall: 50% or higher for all classes
  • F1-Score: 50% or higher for all classes

Training Time Estimates

Approximate training times per epoch (with batch size 8):

  • NVIDIA GPU (CUDA): 1-2 minutes per epoch
  • Apple Silicon (MPS): 2-3 minutes per epoch
  • CPU: 15-20 minutes per epoch (not recommended)

Full training (50 epochs):

  • GPU: 1-2 hours
  • MPS: 2-3 hours
  • CPU: 12-16 hours

Known Limitations

  1. Class Imbalance: Despite mitigation strategies, the dominant class (Class 1) may still achieve higher accuracy than minority classes

  2. Temporal Context: 16-frame clips may not capture complete exercise movements for some exercise types

  3. Spatial Resolution: 112x112 resolution is relatively low and may miss fine-grained details

  4. Device Compatibility: MPS (Apple Silicon) support may have limitations for some operations compared to CUDA


Experimental Results

This section presents the results from training the C3D model for 30 epochs on the exercise recognition dataset using advanced preprocessing techniques. The analysis provides an honest assessment of model performance, including both strengths and areas for improvement.

Training Configuration

  • Total epochs: 30
  • Batch size: 8
  • Initial learning rate: 0.0001
  • Learning rate schedule: ReduceLROnPlateau (reduced at epochs 17 and 24)
  • Loss function: Focal Loss with automatic class weights
  • Optimizer: AdamW
  • Device: Multi-GPU training with 3x NVIDIA A40 GPUs
  • Training samples: 21,287 clips (after Class 1 downsampling)
  • Validation samples: 11,861 clips

Advanced Preprocessing Techniques

This experiment incorporated three key preprocessing enhancements:

  1. Class 1 Downsampling: Reduced the dominant Class 1 from ~64,000 frames to match the median count of other classes, creating a more balanced training distribution.

  2. Segmentation Masks: Applied person segmentation masks to remove background noise, helping the model focus on the exercising person rather than environmental features.

  3. Skeleton-Guided Attention: Used YOLO pose detection to create spatial attention masks centered on key body joints (shoulders, elbows, wrists, hips), guiding the model to focus on relevant body parts during exercise recognition.
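
One plausible form of the skeleton-guided attention mask is a sum of Gaussian blobs centered on the detected joints; the sketch below illustrates the idea with hypothetical joint positions, and the project's actual mask generation may differ:

```python
import numpy as np

def skeleton_attention_mask(keypoints, height=112, width=112, sigma=10.0):
    """Sum of Gaussian blobs centered on (x, y) joint keypoints
    (shoulders, elbows, wrists, hips), normalized to [0, 1].
    sigma corresponds to --skeleton-sigma (default 10.0)."""
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float32)
    mask = np.zeros((height, width), dtype=np.float32)
    for x, y in keypoints:
        mask += np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return mask / (mask.max() + 1e-8)

# Two hypothetical joint positions in a 112x112 frame
mask = skeleton_attention_mask([(40, 30), (70, 80)], sigma=10.0)
print(mask.shape)  # (112, 112); peaks near the two joints
```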

Overall Performance

The model achieved the following metrics on the test set:

Metric               Value
Overall Accuracy     73.11%
Mean Class Accuracy  64.68%
Top-3 Accuracy       92.56%
Mean Precision       73.53%
Mean Recall          60.87%
Mean F1-Score        62.36%

The validation accuracy reached 89.73% during training, indicating strong learning capability. The gap between validation and test accuracy suggests some domain shift between the validation and test sets.

Training Dynamics

Loss Curves Figure 1: Training and validation loss over 30 epochs.

Accuracy Curves Figure 2: Training and validation accuracy over 30 epochs.

Combined Metrics Figure 3: Comprehensive view of training metrics including generalization gap.

Observations

Convergence Behavior:

  • Training accuracy increased rapidly from 20.98% (epoch 1) to 99.82% (epoch 30)
  • Validation accuracy increased from 20.54% (epoch 1) to 89.73% (epoch 30)
  • The model showed fast initial learning, reaching 74.27% training accuracy by epoch 3

Overfitting Analysis:

  • A generalization gap emerged after epoch 10, where training accuracy continued climbing while validation accuracy plateaued around 88-90%
  • The gap between training and validation accuracy at epoch 30 is approximately 10 percentage points
  • This indicates moderate overfitting, which is common in video classification tasks

Learning Rate Impact:

  • The learning rate was reduced from 0.0001 to 0.00005 at epoch 17, and further to 0.000025 at epoch 24
  • These reductions helped stabilize validation accuracy in the 88-90% range
  • Best validation accuracy of 89.73% was achieved at epoch 30

Per-Class Performance

Per-Class Accuracy Heatmap Figure 4: Heatmap showing per-class accuracy evolution across epochs.

Final Class Comparison Figure 5: Final accuracy comparison across all 16 exercise classes.

High-Performing Classes (F1-Score > 80%)

Class  Precision  Recall  F1-Score  Support
1      82.42%     95.37%  88.42%    3,668
3      80.65%     96.10%  87.70%    564
13     94.59%     75.71%  84.10%    601
15     79.02%     86.02%  82.37%    565
14     96.67%     71.00%  81.87%    531

These classes represent exercises with distinctive motion patterns that the model with skeleton-guided attention successfully captured.

Moderate-Performing Classes (F1-Score 55-80%)

Class  Precision  Recall  F1-Score  Support
5      79.34%     71.91%  75.44%    598
7      93.93%     63.12%  75.50%    564
4      73.62%     71.79%  72.69%    521
6      94.31%     56.85%  70.94%    496
11     95.59%     44.38%  60.61%    489
8      51.84%     67.83%  58.76%    603
10     93.67%     42.45%  58.42%    523
2      54.07%     58.45%  56.18%    568

These classes show variable performance, often with high precision but lower recall, indicating the model is conservative in its predictions.

Challenging Classes (F1-Score < 50%)

Class  Precision  Recall  F1-Score  Support
9      78.97%     31.71%  45.25%    533
16     29.87%     92.73%  45.19%    509
12     71.43%     9.47%   16.72%    528

These classes present the greatest challenges. Class 12 in particular has very low recall (9.47%), indicating the model rarely predicts this class. Class 16 shows the opposite pattern with high recall but low precision, suggesting it's being over-predicted.

Confusion Matrix Analysis

Confusion Matrix (Normalized) Figure 6: Normalized confusion matrix showing classification patterns.

Confusion Matrix (Counts) Figure 7: Confusion matrix with absolute counts.

Common Misclassification Patterns

The most frequent misclassifications include:

  1. Class 12 → Class 16 (399 instances): 75.6% of Class 12 samples misclassified, the most severe confusion
  2. Class 9 → Class 8 (313 instances): 58.7% of Class 9 samples, indicating strong similarity between these exercises
  3. Class 11 → Class 16 (230 instances): 47.0% of Class 11 samples
  4. Class 8 → Class 1 (178 instances): 29.5% of Class 8 samples
  5. Class 5 → Class 1 (156 instances): 26.1% of Class 5 samples

The skeleton-guided attention approach, while helpful for focusing on body movements, may have reduced the model's ability to distinguish exercises that differ primarily in subtle arm or hand positions. The strong confusion between Classes 9, 10, and 11 with Class 8 suggests these exercises share similar upper body postures.

Learning Stability Analysis

Per-Class Accuracy Evolution Figure 8: Per-class accuracy evolution showing learning stability across all classes.

Stability Observations

Analysis of per-class accuracy across epochs reveals varied learning patterns:

  • Stable high performers: Classes 1, 3, and 15 showed consistent improvement and stable high accuracy throughout training
  • Variable performers: Classes 9, 10, and 12 exhibited significant fluctuations, reflecting the difficulty in learning these exercise patterns
  • Late learners: Some classes showed improvement only after learning rate reductions at epochs 17 and 24

The skeleton-guided attention mechanism appears to help stabilize learning for exercises with clear body postures, while exercises requiring finer motion discrimination remain challenging.

Key factors affecting stability:

  1. Skeleton attention coverage: Exercises well-captured by the 8 attention keypoints (shoulders, elbows, wrists, hips) showed more stable learning
  2. Motion similarity: Classes with overlapping movement patterns (8-9-10-11) showed correlated fluctuations
  3. Class balancing effect: Downsampling Class 1 allowed better gradient distribution to minority classes

Class Distribution Context

Class Distribution Figure 9: Training set class distribution after Class 1 downsampling.

The original dataset exhibited severe class imbalance with Class 1 containing approximately 64,000 frames while other classes contained 9,000-11,000 frames each. Through Class 1 downsampling, the training distribution was balanced to approximately 21,287 total clips.

The combination of mitigation strategies showed mixed results:

  • Weighted Random Sampling + Focal Loss: Helped prevent Class 1 dominance
  • Class 1 Downsampling: Reduced training data but improved class balance
  • Skeleton Attention: Focused learning on body movements but may have limited fine-grained discrimination

GradCAM Visualizations

GradCAM heatmaps were generated using skeleton-guided attention to understand model focus patterns. The visualizations reveal:

  1. Joint-focused attention: The model strongly attends to the 8 keypoint regions (shoulders, elbows, wrists, hips) as guided by the skeleton attention masks
  2. Temporal consistency: Attention patterns follow body movements across the 16-frame clips
  3. Exercise-specific patterns: Different exercises show distinct attention distributions based on which body parts are most active

The visualization includes real-time probability charts showing the model's confidence across all 16 classes as the video progresses.

Example visualizations are available in outputs/visualizations/ with:

  • Original video clips
  • GradCAM heatmap overlays
  • Side-by-side comparisons
  • Dynamic probability charts showing predictions over time

Critical Assessment

Strengths

  1. High validation accuracy (89.73%) demonstrates the model's learning capability
  2. Excellent top-3 accuracy (92.56%) indicates predictions are meaningful even when the top-1 is incorrect
  3. Effective class balancing: Downsampling and Focal Loss prevented Class 1 dominance
  4. Interpretable attention: Skeleton-guided attention provides clear visualization of model focus
  5. Strong performance on key classes: Classes 1, 3, 13, 14, and 15 achieved >80% F1-score

Weaknesses

  1. Test-validation gap: Significant drop from 89.73% validation to 73.11% test accuracy suggests domain shift
  2. Class confusion clusters: Classes 8-9-10-11-12 show severe mutual confusion
  3. Low recall on several classes: Classes 9, 10, 11, and 12 have <50% recall
  4. Skeleton attention limitations: May not capture fine-grained hand/finger movements important for some exercises
  5. Class 16 over-prediction: High recall (92.73%) but low precision (29.87%) indicates systematic bias

Recommendations for Improvement

  1. Expand skeleton keypoints: Include hand and finger keypoints for exercises requiring fine motor discrimination
  2. Increase temporal window: Extend from 16 to 32 frames to capture complete movement cycles
  3. Address class confusion: Apply class-specific augmentation to confused class pairs (8-9-10-11-12)
  4. Investigate domain shift: Analyze differences between validation and test sets to understand the accuracy gap
  5. Ensemble approach: Combine skeleton attention model with a full-frame model for complementary features
  6. Adjust attention sigma: Experiment with different Gaussian sigma values for skeleton attention masks

Computational Performance

Training on 3x NVIDIA A40 GPUs with video caching and skeleton data preprocessing:

  • Time per epoch: Approximately 4-5 minutes (with multi-GPU DataParallel)
  • Total training time (30 epochs): Approximately 2.5 hours
  • GPU memory usage: ~12-15 GB per GPU
  • Data preprocessing: Skeleton attention masks computed on-the-fly during data loading
  • Video caching: All videos cached in RAM for fast access

The multi-GPU setup with DataParallel provided efficient training, though skeleton attention computation added overhead compared to baseline training.

Conclusion

This experiment explored advanced preprocessing techniques for exercise recognition, combining Class 1 downsampling, segmentation masks, and skeleton-guided attention. The C3D model achieved 89.73% validation accuracy and 73.11% test accuracy with 92.56% top-3 accuracy.

Key findings:

  1. Skeleton-guided attention successfully focuses the model on relevant body parts, providing interpretable visualizations of model decisions
  2. Class balancing through downsampling and Focal Loss prevents dominant class bias
  3. Segmentation masks help remove background distractions but may be insufficient alone
  4. Test-validation gap indicates potential domain shift between data splits that warrants further investigation

The model performs well on exercises with distinctive upper body movements (Classes 1, 3, 13-15) but struggles with exercises that require fine motor discrimination or have similar postures (Classes 8-12). Future work should explore expanding skeleton keypoints to include hands, increasing temporal windows, and investigating the source of the validation-test accuracy gap.

The GradCAM visualizations with dynamic probability charts provide valuable interpretability, showing how model confidence changes across exercise transitions in real-time.


Troubleshooting

Common Issues

Out of Memory (OOM) Errors:

  • Reduce batch size: --batch-size 4 or --batch-size 2
  • Reduce spatial resolution in config
  • Reduce number of data loader workers

Slow Data Loading:

  • Reduce number of workers if experiencing bottlenecks
  • Ensure dataset is on fast storage (SSD preferred)

Training Not Converging:

  • Check learning rate (may need adjustment)
  • Verify data augmentation is not too aggressive
  • Monitor gradient norms for exploding/vanishing gradients

GradCAM Errors on MPS:

  • Try running GradCAM on CPU: --device cpu
  • Some operations in pytorch-grad-cam may not be fully MPS-compatible
