A comprehensive implementation of video-based exercise recognition using 3D Convolutional Neural Networks with GradCAM visualization for model interpretability.
- Overview
- Technical Architecture
- Dataset
- Installation
- Quick Start Guide
- Usage
- Project Structure
- Configuration
- Implementation Details
- Expected Results
This project implements a complete pipeline for recognizing exercises from video data. The system uses a 3D Convolutional Neural Network (C3D architecture) to learn spatio-temporal features from video clips and provides interpretability through GradCAM heatmap visualizations.
- 3D CNN (C3D) architecture for video classification
- Focal Loss for handling severe class imbalance
- Class 1 downsampling option for balanced training
- Segmentation mask support for background removal
- Skeleton-guided attention using YOLO pose detection to focus on key body joints
- GradCAM visualization for model interpretability with dynamic probability charts
- Support for multiple devices: CUDA (NVIDIA GPU), MPS (Apple Silicon), and CPU
- Multi-GPU training support with DataParallel
- Comprehensive evaluation metrics and confusion matrix generation
- Modular and extensible codebase
The system is designed for automatic exercise recognition from video footage, capable of classifying 17 different exercise types from short video clips.
Architecture Overview:
- Input: RGB video clip of shape (3, 16, 112, 112)
- 3 color channels
- 16 temporal frames
- 112x112 spatial resolution
- 5 convolutional blocks with 3x3x3 kernels
- Batch normalization and ReLU activation
- 3D max pooling for temporal and spatial downsampling
- Global average pooling
- Fully connected classifier
- Output: 17-class predictions
Model Statistics:
- Total parameters: 27,797,137 (approximately 27.8 million)
- Trainable parameters: 27,797,137
Network Details:
Block 1: Conv3D(3->64) + BatchNorm + ReLU + MaxPool(1,2,2)
Block 2: Conv3D(64->128) + BatchNorm + ReLU + MaxPool(2,2,2)
Block 3: Conv3D(128->256)x2 + BatchNorm + ReLU + MaxPool(2,2,2)
Block 4: Conv3D(256->512)x2 + BatchNorm + ReLU + MaxPool(2,2,2)
Block 5: Conv3D(512->512)x2 + BatchNorm + ReLU + MaxPool(2,2,2)
Global Average Pooling -> Dropout(0.5) -> Linear(512->256) -> ReLU -> Dropout(0.3) -> Linear(256->17)
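The block structure above can be sketched in PyTorch. This is an illustrative reconstruction from the description, not the project's exact src/models/cnn3d.py; layer grouping and names are assumptions.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, n_convs, pool):
    """One C3D stage: n_convs Conv3D+BatchNorm+ReLU layers, then 3D max pooling."""
    layers = []
    for i in range(n_convs):
        layers += [
            nn.Conv3d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
        ]
    layers.append(nn.MaxPool3d(kernel_size=pool, stride=pool))
    return nn.Sequential(*layers)

class C3D(nn.Module):
    def __init__(self, num_classes=17):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 64, 1, (1, 2, 2)),     # Block 1: spatial pooling only
            conv_block(64, 128, 1, (2, 2, 2)),   # Block 2
            conv_block(128, 256, 2, (2, 2, 2)),  # Block 3
            conv_block(256, 512, 2, (2, 2, 2)),  # Block 4
            conv_block(512, 512, 2, (2, 2, 2)),  # Block 5 -> (B, 512, 1, 3, 3)
        )
        self.pool = nn.AdaptiveAvgPool3d(1)      # global average pooling
        self.classifier = nn.Sequential(
            nn.Dropout(0.5), nn.Linear(512, 256), nn.ReLU(inplace=True),
            nn.Dropout(0.3), nn.Linear(256, num_classes),
        )

    def forward(self, x):                # x: (B, 3, 16, 112, 112)
        x = self.pool(self.features(x))  # -> (B, 512, 1, 1, 1)
        return self.classifier(x.flatten(1))
```

Feeding a (B, 3, 16, 112, 112) clip through these pooling stages yields a 3x3 spatial map after Block 5, matching the resolutions listed under GradCAM target layers below.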
- Total subjects: 60
- Train/test split: 30 subjects each
- Exercise classes: 17 (labeled 0-16)
- Video format: MP4, 30 FPS
- Original resolution: approximately 400x550 pixels
- Total training clips: 28,413 (with temporal sliding window)
- Total test clips: approximately 23,000
dataset/
├── dataset/
│ ├── anon/ # Anonymized video files (.mp4)
│ ├── mask/ # Segmentation masks (.png)
│ └── skeleton/ # YOLO pose detection outputs
│ ├── yolo/ # Pose videos
│ └── yolo_pose_csv/ # Pose keypoints in CSV format
├── label/ # Frame-level labels (.csv)
└── split.csv # Train/test split specification
Labels are stored as CSV files with format:
frame_number, column1, column2
- column2 contains the exercise class: -1 for background, 0-16 for exercises
- One label file per subject
The dataset exhibits severe class imbalance:
- Class 1: approximately 64,000 training frames (dominant class)
- Classes 2-16: approximately 9,000-11,000 frames each
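A sketch of parsing these per-subject label files; only the third column's meaning is documented, so the column names here are placeholders.

```python
import io
import pandas as pd

def load_frame_labels(csv_source):
    """Read a per-subject label file; the third column holds the class
    (-1 = background, 0-16 = exercise). Background frames are dropped."""
    df = pd.read_csv(csv_source, header=None,
                     names=["frame", "col1", "label"])
    return df[df["label"] >= 0]

# Hypothetical three-frame excerpt of one subject's label file.
sample = io.StringIO("0,0,-1\n1,0,3\n2,0,3\n")
labels = load_frame_labels(sample)
print(labels["label"].tolist())  # -> [3, 3]
```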
- Python 3.12 or higher
- Virtual environment tool (venv)
- uv package manager
- Clone the repository and navigate to the project directory
- Activate the virtual environment:
source .venv/bin/activate
- Install dependencies using uv:
uv add torch torchvision
uv add opencv-python pandas scikit-learn
uv add matplotlib seaborn tensorboard
uv add tqdm pyyaml grad-cam
Note: The decord package is not available for macOS ARM and is optional; OpenCV is used for video loading instead.
This section provides the recommended sequence for implementing and running the complete pipeline.
Test that all components are properly installed:
python scripts/quick_test.py
This will verify:
- Dataset loading functionality
- Model initialization
- Forward and backward passes
- Device selection (CPU/MPS/CUDA)
Explore the dataset statistics and distribution:
python scripts/data_analysis.py
This generates:
- Train/test split statistics
- Class distribution analysis
- Video property statistics
- Visualization plots saved to outputs/analysis/
Perform a quick training test (2 epochs) to verify the pipeline:
python scripts/train.py --epochs 2 --batch-size 4
Expected duration: 5-10 minutes on MPS/GPU
This validates:
- Data loading pipeline
- Model training loop
- Checkpoint saving
- Metrics tracking
Launch full training run:
# Default configuration (100 epochs)
python scripts/train.py
# Custom configuration
python scripts/train.py --epochs 50 --batch-size 8
# With class balancing (recommended for better per-class performance)
python scripts/train.py --epochs 50 --batch-size 8 --downsample-class1
# With background removal using segmentation masks
python scripts/train.py --epochs 50 --batch-size 8 --use-masks
# Combined: balanced training with background removal
python scripts/train.py --epochs 50 --batch-size 8 --downsample-class1 --use-masks
# With skeleton-guided attention (focuses on body joints)
python scripts/train.py --epochs 50 --batch-size 8 --use-skeleton-attention
# Full preprocessing: downsampling + masks + skeleton attention
python scripts/train.py --epochs 30 --batch-size 8 --downsample-class1 --use-masks --use-skeleton-attention
# Specify device
python scripts/train.py --epochs 50 --device cuda
python scripts/train.py --epochs 50 --device mps
Expected duration: 2-4 hours for 50 epochs on MPS/GPU
Training artifacts saved to:
- Checkpoints: outputs/checkpoints/
- Logs: outputs/logs/
- TensorBoard logs: outputs/logs/
In a separate terminal, launch TensorBoard:
tensorboard --logdir outputs/logs
Access the dashboard at: http://localhost:6006
Metrics tracked:
- Training and validation loss
- Training and validation accuracy
- Per-class accuracy
- Learning rate schedule
The training process automatically saves metrics to CSV files after each epoch in outputs/logs/metrics/. Generate publication-ready plots from these metrics:
# Generate all plots with default settings
python scripts/plot_training.py
# Specify custom directories
python scripts/plot_training.py --metrics-dir outputs/logs/metrics --save-dir outputs/plots
# Plot only top 5 classes by accuracy
python scripts/plot_training.py --top-classes 5
Generated plots:
- loss_curves.png - Training and validation loss over time
- accuracy_curves.png - Training and validation accuracy over time
- learning_rate.png - Learning rate schedule
- combined_metrics.png - All metrics in a single figure, including generalization gap
- per_class_accuracy.png - Per-class accuracy evolution (all classes)
- per_class_accuracy_top10.png - Top 10 performing classes
- per_class_heatmap.png - Heatmap of per-class accuracy over epochs
- final_class_comparison.png - Bar chart of final accuracy per class
CSV files saved to outputs/logs/metrics/:
- training_metrics.csv - Epoch-level loss, accuracy, and learning rate
- per_class_accuracy.csv - Per-class accuracy for every epoch
After training completes, evaluate on test set:
python scripts/evaluate.py --checkpoint outputs/checkpoints/best_model_acc.pth
This generates:
- Overall accuracy metrics
- Per-class precision, recall, F1-score
- Confusion matrix (counts and normalized)
- Predictions CSV
- Error analysis
Results saved to: outputs/results/
Visualize model attention with GradCAM:
# Visualize 20 random test samples
python scripts/visualize_gradcam.py --checkpoint outputs/checkpoints/best_model_acc.pth --num-samples 20
# Visualize only misclassified samples
python scripts/visualize_gradcam.py --checkpoint outputs/checkpoints/best_model_acc.pth --num-samples 20 --misclassified-only
# Custom visualization settings
python scripts/visualize_gradcam.py \
--checkpoint outputs/checkpoints/best_model_acc.pth \
--num-samples 50 \
--layer block5 \
--alpha 0.6 \
    --fps 10
Outputs for each sample:
- original.mp4: Original video clip
- heatmap.mp4: GradCAM heatmap
- overlay.mp4: Heatmap overlaid on video
- overlay_with_probs.mp4: Overlay with class probability bar chart (scaled up for readable text, default 336x336 + chart)
- side_by_side.mp4: Original and overlay side-by-side
- metadata.txt: Sample information
Visualizations saved to: outputs/visualizations/
python scripts/train.py [OPTIONS]
Options:
--config PATH Path to configuration YAML file
--seed INT Random seed for reproducibility (default: 42)
--device DEVICE Device to use: cuda/mps/cpu (auto-detected if not specified)
--epochs INT Number of training epochs (default: 100)
--batch-size INT Batch size for training (default: 8)
--downsample-class1 Downsample Class 1 to match median count of other classes (fixes class imbalance)
--use-masks Apply segmentation masks to remove background from frames during training
--use-skeleton-attention Apply skeleton-guided attention masks to focus on body joints
--skeleton-sigma FLOAT   Gaussian sigma for skeleton attention blobs (default: 10.0)
python scripts/evaluate.py [OPTIONS]
Required:
--checkpoint PATH Path to model checkpoint (.pth file)
Optional:
--config PATH Path to configuration file
--split SPLIT Dataset split to evaluate: train/test (default: test)
--batch-size INT Batch size for evaluation (default: 16)
--device DEVICE Device to use: cuda/mps/cpu
--save-dir PATH Directory to save results (default: outputs/results)
--use-masks              Apply segmentation masks to remove background (use if model was trained with masks)
python scripts/visualize_gradcam.py [OPTIONS]
Required:
--checkpoint PATH Path to model checkpoint (.pth file)
Optional:
--config PATH Path to configuration file
--split SPLIT Dataset split: train/test (default: test)
--num-samples INT Number of samples to visualize (default: 20)
--device DEVICE Device to use: cuda/mps/cpu
--save-dir PATH Directory to save visualizations (default: outputs/visualizations)
--layer LAYER Target layer for GradCAM: block3/block4/block5 (default: block4)
--alpha FLOAT Overlay transparency: 0.0-1.0 (default: 0.5)
--fps INT Output video frame rate (default: 10)
--misclassified-only Only visualize misclassified samples
--use-masks Apply segmentation masks to remove background (use if model was trained with masks)
--use-skeleton-attention Apply skeleton-guided attention (use if model was trained with skeleton attention)
--skeleton-sigma FLOAT Gaussian sigma for skeleton attention blobs (default: 10.0)
--scale-factor INT Scale factor for overlay_with_probs video (default: 3, scales 112x112 to 336x336)
--long-sample Generate longer video samples with dynamic probability updates
--sample-duration INT Duration of long samples in frames (default: 300)
--start-frame INT Starting frame for long sample mode (default: 0)
--subject-id STR         Specific subject ID to visualize in long sample mode
PROJECT/
├── src/ # Source code
│ ├── config/
│ │ ├── __init__.py
│ │ └── config.py # Configuration management
│ ├── data/
│ │ ├── __init__.py
│ │ ├── dataset.py # PyTorch Dataset for video clips
│ │ ├── transforms.py # Video augmentation transforms
│ │ └── utils.py # Video loading utilities
│ ├── models/
│ │ ├── __init__.py
│ │ └── cnn3d.py # C3D architecture implementation
│ ├── training/
│ │ ├── __init__.py
│ │ ├── trainer.py # Training loop and logic
│ │ └── losses.py # Focal Loss implementation
│ ├── evaluation/
│ │ ├── __init__.py
│ │ ├── metrics.py # Evaluation metrics
│ │ └── evaluator.py # Evaluation pipeline
│ ├── visualization/
│ │ ├── __init__.py
│ │ └── gradcam.py # GradCAM wrapper for 3D CNN
│ └── utils/
│ ├── __init__.py
│ ├── device.py # Device selection (CPU/MPS/CUDA)
│ ├── logging.py # Logging utilities
│ └── checkpointing.py # Model save/load utilities
├── scripts/ # Executable scripts
│ ├── train.py # Main training script
│ ├── evaluate.py # Model evaluation script
│ ├── visualize_gradcam.py # GradCAM visualization script
│ ├── data_analysis.py # Dataset analysis script
│ ├── test_dataset.py # Dataset loading test
│ ├── test_training_init.py # Training initialization test
│ └── quick_test.py # Quick system test
├── outputs/ # Generated outputs
│ ├── checkpoints/ # Model checkpoints (.pth files)
│ ├── logs/ # Training logs and TensorBoard
│ ├── results/ # Evaluation results
│ ├── visualizations/ # GradCAM visualizations
│ └── analysis/ # Dataset analysis plots
├── dataset/ # Dataset directory (not in repo)
├── .venv/ # Virtual environment
├── .gitignore # Git ignore file
├── pyproject.toml # Project dependencies
└── README.md # This file
The default configuration is defined in src/config/config.py. Key parameters include:
Data Processing:
- Clip length: 16 frames
- Temporal stride: 8 frames (for sliding window during training)
- Spatial size: 112x112 pixels
- FPS: 30 (original video framerate)
Model:
- Architecture: C3D
- Number of classes: 17
- Dropout: 0.5
- Pretrained weights: False
Training:
- Batch size: 8
- Number of epochs: 100
- Number of workers: 4
- Pin memory: True
Optimizer:
- Type: AdamW
- Learning rate: 0.0001
- Weight decay: 0.00001
Learning Rate Scheduler:
- Type: ReduceLROnPlateau
- Mode: min (reduce on validation loss)
- Factor: 0.5
- Patience: 5 epochs
- Minimum LR: 0.0000001
Loss Function:
- Type: Focal Loss
- Gamma: 2.0
- Class weights: Computed automatically from training data
Data Augmentation (Training Only):
- Horizontal flip: 50% probability
- Rotation: +/- 10 degrees
- Color jitter: brightness, contrast, saturation, hue
- Normalization: ImageNet mean and std
Early Stopping:
- Patience: 10 epochs
- Minimum delta: 0.001
Checkpointing:
- Save frequency: Every 10 epochs
- Keep best 3 checkpoints
- Save best model by validation loss
- Save best model by validation accuracy
Device:
- Preference order: CUDA > MPS > CPU
- Automatic fallback to CPU if GPU unavailable
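A minimal helper following this preference order; this is a sketch of the behavior described above, and the project's src/utils/device.py may differ.

```python
import torch

def select_device(preferred=None):
    """Pick the best available device (CUDA > MPS > CPU) unless overridden."""
    if preferred is not None:
        return torch.device(preferred)
    if torch.cuda.is_available():
        return torch.device("cuda")
    mps_backend = getattr(torch.backends, "mps", None)
    if mps_backend is not None and mps_backend.is_available():
        return torch.device("mps")
    return torch.device("cpu")  # automatic fallback

device = select_device()
```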
You can create a custom YAML configuration file and use it:
python scripts/train.py --config path/to/config.yaml
Video Processing:
- Videos are loaded using OpenCV
- Clips of 16 consecutive frames are extracted using a sliding window
- Frames are resized to 112x112 pixels
- Clips are normalized using ImageNet statistics
- Optional: Segmentation masks can be applied to remove background (--use-masks)
Temporal Sampling:
- Training: Overlapping clips with stride=8 frames
- Validation/Testing: Non-overlapping clips with stride=16 frames
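The sliding-window indexing can be illustrated as:

```python
def clip_start_indices(num_frames, clip_len=16, stride=8):
    """Start frames for sliding-window clips: stride=8 yields overlapping
    training clips, stride=16 yields non-overlapping validation/test clips."""
    return list(range(0, num_frames - clip_len + 1, stride))

print(clip_start_indices(48, stride=8))   # -> [0, 8, 16, 24, 32]
print(clip_start_indices(48, stride=16))  # -> [0, 16, 32]
```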
Class Imbalance Handling:
- Class 1 Downsampling: Optional flag (--downsample-class1) to downsample the dominant class to match the other classes
- Weighted Random Sampling: Minority classes are oversampled during training
- Focal Loss: Focuses learning on hard-to-classify examples
- Class Weights: Loss is weighted by inverse class frequency
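A compact Focal Loss sketch consistent with the configuration above (gamma=2.0, optional inverse-frequency class weights); the project's src/training/losses.py may differ in details.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, class_weights=None):
    """Focal Loss: scales cross-entropy by (1 - p_t)^gamma so easy,
    well-classified examples contribute little and hard examples dominate."""
    log_p = F.log_softmax(logits, dim=1)
    log_p_t = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log-prob of true class
    loss = (1.0 - log_p_t.exp()) ** gamma * (-log_p_t)
    if class_weights is not None:            # per-class weights, e.g. inverse frequency
        loss = loss * class_weights[targets]
    return loss.mean()
```

With gamma=0 and no weights this reduces to plain cross-entropy, which is a convenient sanity check.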
Training Loop:
- Forward pass through model
- Compute Focal Loss with class weights
- Backward pass with gradient computation
- Gradient clipping (max norm = 1.0)
- Optimizer step (AdamW)
- Metrics tracking (loss, accuracy)
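The steps above can be condensed into a single training-step function (a sketch, not the project's trainer):

```python
import torch

def train_step(model, clips, labels, criterion, optimizer, max_norm=1.0):
    """One optimization step: forward, loss, backward, clip gradients, update."""
    optimizer.zero_grad()
    logits = model(clips)
    loss = criterion(logits, labels)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)  # max norm = 1.0
    optimizer.step()
    acc = (logits.argmax(1) == labels).float().mean().item()
    return loss.item(), acc
```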
Validation Loop:
- No gradient computation
- Forward pass only
- Compute validation loss and accuracy
- Per-class accuracy tracking
Checkpointing Strategy:
- Save best model by validation loss
- Save best model by validation accuracy
- Save periodic checkpoints every 10 epochs
- Each checkpoint includes:
- Model state dict
- Optimizer state dict
- Epoch number
- Metrics
- Configuration
Early Stopping:
- Monitors validation accuracy
- Stops if no improvement for 10 consecutive epochs
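A minimal early-stopping helper implementing this policy (illustrative, using the configured patience=10 and min_delta=0.001 as defaults):

```python
class EarlyStopping:
    """Stop when validation accuracy fails to improve by at least min_delta
    for `patience` consecutive epochs."""
    def __init__(self, patience=10, min_delta=0.001):
        self.patience, self.min_delta = patience, min_delta
        self.best, self.bad_epochs = float("-inf"), 0

    def step(self, val_acc):
        if val_acc > self.best + self.min_delta:
            self.best, self.bad_epochs = val_acc, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience  # True -> stop training

stopper = EarlyStopping(patience=2)
print([stopper.step(a) for a in [0.5, 0.6, 0.6, 0.6]])  # -> [False, False, False, True]
```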
GradCAM Process:
- Forward pass through model with target class
- Extract activations from target layer (default: block4 for better spatial resolution)
- Compute gradients of target class with respect to activations
- Weight activations by gradients
- Generate heatmap by averaging across channels
- Resize heatmap to input spatial dimensions
- Overlay on original frames using colormap
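The steps above can be sketched from scratch with forward/backward hooks; the project wraps the grad-cam package instead, so treat this stand-alone version as illustrative.

```python
import torch
import torch.nn.functional as F

def gradcam_3d(model, target_layer, clip, class_idx):
    """Grad-CAM for a 3D CNN: capture activations and gradients at
    target_layer, weight channels by their average gradient, and
    upsample the heatmap to the clip's frame size."""
    acts, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    try:
        logits = model(clip)               # clip: (1, 3, T, H, W)
        model.zero_grad()
        logits[0, class_idx].backward()    # gradients of the target class
    finally:
        h1.remove(); h2.remove()
    a, g = acts[0], grads[0]               # both (1, C, t, h, w)
    weights = g.mean(dim=(2, 3, 4), keepdim=True)          # per-channel weight
    cam = F.relu((weights * a).sum(dim=1, keepdim=True))   # (1, 1, t, h, w)
    cam = F.interpolate(cam, size=clip.shape[2:], mode="trilinear",
                        align_corners=False)
    cam = cam - cam.min()
    return (cam / cam.max().clamp(min=1e-8)).squeeze()     # (T, H, W) in [0, 1]
```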
Target Layer Options:
- block3: 14x14 spatial resolution (highest detail, less semantic)
- block4: 7x7 spatial resolution (default, balanced detail and semantics)
- block5: 3x3 spatial resolution (most semantic, lowest spatial detail)
Visualization:
- Heatmaps are generated for each frame in the clip
- Multiple output formats: original, heatmap, overlay, overlay with probability chart, side-by-side
- Probability chart shows class prediction confidence with true and predicted labels highlighted
- Saved as MP4 videos for temporal analysis
Overall Metrics:
- Accuracy: Percentage of correct predictions
- Top-3 Accuracy: Percentage where true class is in top 3 predictions
- Mean Class Accuracy: Average per-class accuracy (accounts for imbalance)
Per-Class Metrics:
- Precision: TP / (TP + FP)
- Recall: TP / (TP + FN)
- F1-Score: Harmonic mean of precision and recall
- Support: Number of samples per class
Confusion Matrix:
- Generated in both count and normalized forms
- Visualized as heatmap
- Saved as CSV and PNG
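These metrics can be computed with scikit-learn, for example:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

def evaluate_predictions(y_true, y_pred):
    """Overall accuracy, per-class precision/recall/F1/support, and the
    confusion matrix in both count and row-normalized forms."""
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, support = precision_recall_fscore_support(
        y_true, y_pred, zero_division=0)
    cm = confusion_matrix(y_true, y_pred)
    cm_norm = cm / cm.sum(axis=1, keepdims=True).clip(min=1)
    return {"accuracy": acc, "precision": prec, "recall": rec,
            "f1": f1, "support": support, "cm": cm, "cm_norm": cm_norm}

results = evaluate_predictions([0, 0, 1, 1, 2], [0, 1, 1, 1, 2])
print(round(results["accuracy"], 2))  # -> 0.8
```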
Based on the dataset characteristics and model architecture, expected performance metrics are:
Overall Performance:
- Overall Accuracy: 70% or higher
- Mean Class Accuracy: 60% or higher (accounts for class imbalance)
- Top-3 Accuracy: 85% or higher
Per-Class Performance:
- Precision: 50% or higher for all classes
- Recall: 50% or higher for all classes
- F1-Score: 50% or higher for all classes
Approximate training times per epoch (with batch size 8):
- NVIDIA GPU (CUDA): 1-2 minutes per epoch
- Apple Silicon (MPS): 2-3 minutes per epoch
- CPU: 15-20 minutes per epoch (not recommended)
Full training (50 epochs):
- GPU: 1-2 hours
- MPS: 2-3 hours
- CPU: 12-16 hours
- Class Imbalance: Despite mitigation strategies, the dominant class (Class 1) may still achieve higher accuracy than minority classes
- Temporal Context: 16-frame clips may not capture complete exercise movements for some exercise types
- Spatial Resolution: 112x112 is relatively low and may miss fine-grained details
- Device Compatibility: MPS (Apple Silicon) support may have limitations for some operations compared to CUDA
This section presents the results from training the C3D model for 30 epochs on the exercise recognition dataset using advanced preprocessing techniques. The analysis provides an honest assessment of model performance, including both strengths and areas for improvement.
- Total epochs: 30
- Batch size: 8
- Initial learning rate: 0.0001
- Learning rate schedule: ReduceLROnPlateau (reduced at epochs 17 and 24)
- Loss function: Focal Loss with automatic class weights
- Optimizer: AdamW
- Device: Multi-GPU training with 3x NVIDIA A40 GPUs
- Training samples: 21,287 clips (after Class 1 downsampling)
- Validation samples: 11,861 clips
This experiment incorporated three key preprocessing enhancements:
- Class 1 Downsampling: Reduced the dominant Class 1 from ~64,000 frames to match the median count of the other classes, creating a more balanced training distribution.
- Segmentation Masks: Applied person segmentation masks to remove background noise, helping the model focus on the exercising person rather than environmental features.
- Skeleton-Guided Attention: Used YOLO pose detection to create spatial attention masks centered on key body joints (shoulders, elbows, wrists, hips), guiding the model to focus on relevant body parts during exercise recognition.
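A sketch of building such an attention mask from 2D joint keypoints; the coordinates below are hypothetical, and sigma follows the CLI default of 10.0.

```python
import numpy as np

def skeleton_attention_mask(keypoints, height=112, width=112, sigma=10.0):
    """Spatial attention map with a Gaussian blob centered on each detected
    joint (x, y); blobs are merged with max so overlaps don't saturate."""
    ys, xs = np.mgrid[0:height, 0:width]
    mask = np.zeros((height, width), dtype=np.float32)
    for x, y in keypoints:
        blob = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        mask = np.maximum(mask, blob)
    return mask

# Hypothetical shoulder/hip keypoints in 112x112 frame coordinates.
mask = skeleton_attention_mask([(40, 30), (72, 30), (45, 70), (67, 70)])
print(mask.shape, round(float(mask[30, 40]), 2))  # -> (112, 112) 1.0
```

Multiplying each frame by this mask (or blending it in) emphasizes joint regions before the clip enters the network.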
The model achieved the following metrics on the test set:
| Metric | Value |
|---|---|
| Overall Accuracy | 73.11% |
| Mean Class Accuracy | 64.68% |
| Top-3 Accuracy | 92.56% |
| Mean Precision | 73.53% |
| Mean Recall | 60.87% |
| Mean F1-Score | 62.36% |
The validation accuracy reached 89.73% during training, indicating strong learning capability. The gap between validation and test accuracy suggests some domain shift between the validation and test sets.
Figure 1: Training and validation loss over 30 epochs.
Figure 2: Training and validation accuracy over 30 epochs.
Figure 3: Comprehensive view of training metrics including generalization gap.
Convergence Behavior:
- Training accuracy increased rapidly from 20.98% (epoch 1) to 99.82% (epoch 30)
- Validation accuracy increased from 20.54% (epoch 1) to 89.73% (epoch 30)
- The model showed fast initial learning, reaching 74.27% training accuracy by epoch 3
Overfitting Analysis:
- A generalization gap emerged after epoch 10, where training accuracy continued climbing while validation accuracy plateaued around 88-90%
- The gap between training and validation accuracy at epoch 30 is approximately 10 percentage points
- This indicates moderate overfitting, which is common in video classification tasks
Learning Rate Impact:
- The learning rate was reduced from 0.0001 to 0.00005 at epoch 17, and further to 0.000025 at epoch 24
- These reductions helped stabilize validation accuracy in the 88-90% range
- Best validation accuracy of 89.73% was achieved at epoch 30
Figure 4: Heatmap showing per-class accuracy evolution across epochs.
Figure 5: Final accuracy comparison across all 16 exercise classes.
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 1 | 82.42% | 95.37% | 88.42% | 3,668 |
| 3 | 80.65% | 96.10% | 87.70% | 564 |
| 13 | 94.59% | 75.71% | 84.10% | 601 |
| 15 | 79.02% | 86.02% | 82.37% | 565 |
| 14 | 96.67% | 71.00% | 81.87% | 531 |
These classes represent exercises with distinctive motion patterns that the model with skeleton-guided attention successfully captured.
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 5 | 79.34% | 71.91% | 75.44% | 598 |
| 7 | 93.93% | 63.12% | 75.50% | 564 |
| 4 | 73.62% | 71.79% | 72.69% | 521 |
| 6 | 94.31% | 56.85% | 70.94% | 496 |
| 11 | 95.59% | 44.38% | 60.61% | 489 |
| 8 | 51.84% | 67.83% | 58.76% | 603 |
| 10 | 93.67% | 42.45% | 58.42% | 523 |
| 2 | 54.07% | 58.45% | 56.18% | 568 |
These classes show variable performance, often with high precision but lower recall, indicating the model is conservative in its predictions.
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 9 | 78.97% | 31.71% | 45.25% | 533 |
| 16 | 29.87% | 92.73% | 45.19% | 509 |
| 12 | 71.43% | 9.47% | 16.72% | 528 |
These classes present the greatest challenges. Class 12 in particular has very low recall (9.47%), indicating the model rarely predicts this class. Class 16 shows the opposite pattern with high recall but low precision, suggesting it's being over-predicted.
Figure 6: Normalized confusion matrix showing classification patterns.
Figure 7: Confusion matrix with absolute counts.
The most frequent misclassifications include:
- Class 12 → Class 16 (399 instances): 75.6% of Class 12 samples misclassified, the most severe confusion
- Class 9 → Class 8 (313 instances): 58.7% of Class 9 samples, indicating strong similarity between these exercises
- Class 11 → Class 16 (230 instances): 47.0% of Class 11 samples
- Class 8 → Class 1 (178 instances): 29.5% of Class 8 samples
- Class 5 → Class 1 (156 instances): 26.1% of Class 5 samples
The skeleton-guided attention approach, while helpful for focusing on body movements, may have reduced the model's ability to distinguish exercises that differ primarily in subtle arm or hand positions. The strong confusion between Classes 9, 10, and 11 with Class 8 suggests these exercises share similar upper body postures.
Figure 8: Per-class accuracy evolution showing learning stability across all classes.
Analysis of per-class accuracy across epochs reveals varied learning patterns:
- Stable high performers: Classes 1, 3, and 15 showed consistent improvement and stable high accuracy throughout training
- Variable performers: Classes 9, 10, and 12 exhibited significant fluctuations, reflecting the difficulty in learning these exercise patterns
- Late learners: Some classes showed improvement only after learning rate reductions at epochs 17 and 24
The skeleton-guided attention mechanism appears to help stabilize learning for exercises with clear body postures, while exercises requiring finer motion discrimination remain challenging.
Key factors affecting stability:
- Skeleton attention coverage: Exercises well-captured by the 8 attention keypoints (shoulders, elbows, wrists, hips) showed more stable learning
- Motion similarity: Classes with overlapping movement patterns (8-9-10-11) showed correlated fluctuations
- Class balancing effect: Downsampling Class 1 allowed better gradient distribution to minority classes
Figure 9: Training set class distribution after Class 1 downsampling.
The original dataset exhibited severe class imbalance with Class 1 containing approximately 64,000 frames while other classes contained 9,000-11,000 frames each. Through Class 1 downsampling, the training distribution was balanced to approximately 21,287 total clips.
The combination of mitigation strategies showed mixed results:
- Weighted Random Sampling + Focal Loss: Helped prevent Class 1 dominance
- Class 1 Downsampling: Reduced training data but improved class balance
- Skeleton Attention: Focused learning on body movements but may have limited fine-grained discrimination
GradCAM heatmaps were generated using skeleton-guided attention to understand model focus patterns. The visualizations reveal:
- Joint-focused attention: The model strongly attends to the 8 keypoint regions (shoulders, elbows, wrists, hips) as guided by the skeleton attention masks
- Temporal consistency: Attention patterns follow body movements across the 16-frame clips
- Exercise-specific patterns: Different exercises show distinct attention distributions based on which body parts are most active
The visualization includes real-time probability charts showing the model's confidence across all 16 classes as the video progresses.
Example visualizations are available in outputs/visualizations/ with:
- Original video clips
- GradCAM heatmap overlays
- Side-by-side comparisons
- Dynamic probability charts showing predictions over time
- High validation accuracy (89.73%) demonstrates the model's learning capability
- Excellent top-3 accuracy (92.56%) indicates predictions are meaningful even when the top-1 is incorrect
- Effective class balancing: Downsampling and Focal Loss prevented Class 1 dominance
- Interpretable attention: Skeleton-guided attention provides clear visualization of model focus
- Strong performance on key classes: Classes 1, 3, 13, 14, and 15 achieved >80% F1-score
- Test-validation gap: Significant drop from 89.73% validation to 73.11% test accuracy suggests domain shift
- Class confusion clusters: Classes 8-9-10-11-12 show severe mutual confusion
- Low recall on several classes: Classes 9, 10, 11, and 12 have <50% recall
- Skeleton attention limitations: May not capture fine-grained hand/finger movements important for some exercises
- Class 16 over-prediction: High recall (92.73%) but low precision (29.87%) indicates systematic bias
- Expand skeleton keypoints: Include hand and finger keypoints for exercises requiring fine motor discrimination
- Increase temporal window: Extend from 16 to 32 frames to capture complete movement cycles
- Address class confusion: Apply class-specific augmentation to confused class pairs (8-9-10-11-12)
- Investigate domain shift: Analyze differences between validation and test sets to understand the accuracy gap
- Ensemble approach: Combine skeleton attention model with a full-frame model for complementary features
- Adjust attention sigma: Experiment with different Gaussian sigma values for skeleton attention masks
Training on 3x NVIDIA A40 GPUs with video caching and skeleton data preprocessing:
- Time per epoch: Approximately 4-5 minutes (with multi-GPU DataParallel)
- Total training time (30 epochs): Approximately 2.5 hours
- GPU memory usage: ~12-15 GB per GPU
- Data preprocessing: Skeleton attention masks computed on-the-fly during data loading
- Video caching: All videos cached in RAM for fast access
The multi-GPU setup with DataParallel provided efficient training, though skeleton attention computation added overhead compared to baseline training.
This experiment explored advanced preprocessing techniques for exercise recognition, combining Class 1 downsampling, segmentation masks, and skeleton-guided attention. The C3D model achieved 89.73% validation accuracy and 73.11% test accuracy with 92.56% top-3 accuracy.
Key findings:
- Skeleton-guided attention successfully focuses the model on relevant body parts, providing interpretable visualizations of model decisions
- Class balancing through downsampling and Focal Loss prevents dominant class bias
- Segmentation masks help remove background distractions but may be insufficient alone
- Test-validation gap indicates potential domain shift between data splits that warrants further investigation
The model performs well on exercises with distinctive upper body movements (Classes 1, 3, 13-15) but struggles with exercises that require fine motor discrimination or have similar postures (Classes 8-12). Future work should explore expanding skeleton keypoints to include hands, increasing temporal windows, and investigating the source of the validation-test accuracy gap.
The GradCAM visualizations with dynamic probability charts provide valuable interpretability, showing how model confidence changes across exercise transitions in real-time.
Out of Memory (OOM) Errors:
- Reduce batch size: --batch-size 4 or --batch-size 2
- Reduce spatial resolution in config
- Reduce number of data loader workers
Slow Data Loading:
- Reduce number of workers if experiencing bottlenecks
- Ensure dataset is on fast storage (SSD preferred)
Training Not Converging:
- Check learning rate (may need adjustment)
- Verify data augmentation is not too aggressive
- Monitor gradient norms for exploding/vanishing gradients
GradCAM Errors on MPS:
- Try running GradCAM on CPU: --device cpu
- Some operations in pytorch-grad-cam may not be fully MPS-compatible