A comprehensive implementation of video-based exercise recognition using 3D Convolutional Neural Networks with GradCAM visualization for model interpretability.
- Overview
- Technical Architecture
- Dataset
- Installation
- Quick Start Guide
- Usage
- Project Structure
- Configuration
- Implementation Details
- Expected Results
This project implements a complete pipeline for recognizing exercises from video data. The system uses a 3D Convolutional Neural Network (C3D architecture) to learn spatio-temporal features from video clips and provides interpretability through GradCAM heatmap visualizations.
- 3D CNN (C3D) architecture for video classification
- Focal Loss for handling severe class imbalance
- Class 1 downsampling option for balanced training
- Segmentation mask support for background removal
- Skeleton-guided attention using YOLO pose detection to focus on key body joints
- GradCAM visualization for model interpretability with dynamic probability charts
- Support for multiple devices: CUDA (NVIDIA GPU), MPS (Apple Silicon), and CPU
- Multi-GPU training support with DataParallel
- Comprehensive evaluation metrics and confusion matrix generation
- Modular and extensible codebase
The system is designed for automatic exercise recognition from video footage, capable of classifying 17 different exercise types from short video clips.
Architecture Overview:
- Input: RGB video clip of shape (3, 16, 112, 112)
- 3 color channels
- 16 temporal frames
- 112x112 spatial resolution
- 5 convolutional blocks with 3x3x3 kernels
- Batch normalization and ReLU activation
- 3D max pooling for temporal and spatial downsampling
- Global average pooling
- Fully connected classifier
- Output: 17-class predictions
Model Statistics:
- Total parameters: 27,797,137 (approximately 27.8 million)
- Trainable parameters: 27,797,137
Network Details:
Block 1: Conv3D(3->64) + BatchNorm + ReLU + MaxPool(1,2,2)
Block 2: Conv3D(64->128) + BatchNorm + ReLU + MaxPool(2,2,2)
Block 3: Conv3D(128->256)x2 + BatchNorm + ReLU + MaxPool(2,2,2)
Block 4: Conv3D(256->512)x2 + BatchNorm + ReLU + MaxPool(2,2,2)
Block 5: Conv3D(512->512)x2 + BatchNorm + ReLU + MaxPool(2,2,2)
Global Average Pooling -> Dropout(0.5) -> Linear(512->256) -> ReLU -> Dropout(0.3) -> Linear(256->17)
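The block structure above can be sketched in PyTorch. This is an illustrative reconstruction from the description, not the project's exact src/models/cnn3d.py; layer grouping and names are assumptions.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, n_convs, pool):
    """One C3D stage: n_convs Conv3D+BatchNorm+ReLU layers, then 3D max pooling."""
    layers = []
    for i in range(n_convs):
        layers += [
            nn.Conv3d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
        ]
    layers.append(nn.MaxPool3d(kernel_size=pool, stride=pool))
    return nn.Sequential(*layers)

class C3D(nn.Module):
    def __init__(self, num_classes=17):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 64, 1, (1, 2, 2)),     # Block 1: spatial pooling only
            conv_block(64, 128, 1, (2, 2, 2)),   # Block 2
            conv_block(128, 256, 2, (2, 2, 2)),  # Block 3
            conv_block(256, 512, 2, (2, 2, 2)),  # Block 4
            conv_block(512, 512, 2, (2, 2, 2)),  # Block 5 -> (B, 512, 1, 3, 3)
        )
        self.pool = nn.AdaptiveAvgPool3d(1)      # global average pooling
        self.classifier = nn.Sequential(
            nn.Dropout(0.5), nn.Linear(512, 256), nn.ReLU(inplace=True),
            nn.Dropout(0.3), nn.Linear(256, num_classes),
        )

    def forward(self, x):                # x: (B, 3, 16, 112, 112)
        x = self.pool(self.features(x))  # -> (B, 512, 1, 1, 1)
        return self.classifier(x.flatten(1))
```

Feeding a (B, 3, 16, 112, 112) clip through these pooling stages yields a 3x3 spatial map after Block 5, matching the resolutions listed under GradCAM target layers below.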
- Total subjects: 60
- Train/test split: 30 subjects each
- Exercise classes: 17 (labeled 0-16)
- Video format: MP4, 30 FPS
- Original resolution: approximately 400x550 pixels
- Total training clips: 28,413 (with temporal sliding window)
- Total test clips: approximately 23,000
dataset/
├── dataset/
│ ├── anon/ # Anonymized video files (.mp4)
│ ├── mask/ # Segmentation masks (.png)
│ └── skeleton/ # YOLO pose detection outputs
│ ├── yolo/ # Pose videos
│ └── yolo_pose_csv/ # Pose keypoints in CSV format
├── label/ # Frame-level labels (.csv)
└── split.csv # Train/test split specification
Labels are stored as CSV files with format:
frame_number, column1, column2
- column2 contains the exercise class: -1 for background, 0-16 for exercises
- One label file per subject
The dataset exhibits severe class imbalance:
- Class 1: approximately 64,000 training frames (dominant class)
- Classes 2-16: approximately 9,000-11,000 frames each
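A sketch of parsing these per-subject label files; only the third column's meaning is documented, so the column names here are placeholders.

```python
import io
import pandas as pd

def load_frame_labels(csv_source):
    """Read a per-subject label file; the third column holds the class
    (-1 = background, 0-16 = exercise). Background frames are dropped."""
    df = pd.read_csv(csv_source, header=None,
                     names=["frame", "col1", "label"])
    return df[df["label"] >= 0]

# Hypothetical three-frame excerpt of one subject's label file.
sample = io.StringIO("0,0,-1\n1,0,3\n2,0,3\n")
labels = load_frame_labels(sample)
print(labels["label"].tolist())  # -> [3, 3]
```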
- Python 3.12 or higher
- Virtual environment tool (venv)
- uv package manager
- Clone the repository and navigate to the project directory
- Activate the virtual environment:
source .venv/bin/activate
- Install dependencies using uv:
uv add torch torchvision
uv add opencv-python pandas scikit-learn
uv add matplotlib seaborn tensorboard
uv add tqdm pyyaml grad-cam
Note: The decord package is not available for macOS ARM and is optional; OpenCV is used for video loading instead.
This section provides the recommended sequence for implementing and running the complete pipeline.
Test that all components are properly installed:
python scripts/quick_test.py
This will verify:
- Dataset loading functionality
- Model initialization
- Forward and backward passes
- Device selection (CPU/MPS/CUDA)
Explore the dataset statistics and distribution:
python scripts/data_analysis.py
This generates:
- Train/test split statistics
- Class distribution analysis
- Video property statistics
- Visualization plots saved to outputs/analysis/
Perform a quick training test (2 epochs) to verify the pipeline:
python scripts/train.py --epochs 2 --batch-size 4
Expected duration: 5-10 minutes on MPS/GPU
This validates:
- Data loading pipeline
- Model training loop
- Checkpoint saving
- Metrics tracking
Launch full training run:
# Default configuration (100 epochs)
python scripts/train.py
# Custom configuration
python scripts/train.py --epochs 50 --batch-size 8
# With class balancing (recommended for better per-class performance)
python scripts/train.py --epochs 50 --batch-size 8 --downsample-class1
# With background removal using segmentation masks
python scripts/train.py --epochs 50 --batch-size 8 --use-masks
# Combined: balanced training with background removal
python scripts/train.py --epochs 50 --batch-size 8 --downsample-class1 --use-masks
# With skeleton-guided attention (focuses on body joints)
python scripts/train.py --epochs 50 --batch-size 8 --use-skeleton-attention
# Full preprocessing: downsampling + masks + skeleton attention
python scripts/train.py --epochs 30 --batch-size 8 --downsample-class1 --use-masks --use-skeleton-attention
# Specify device
python scripts/train.py --epochs 50 --device cuda
python scripts/train.py --epochs 50 --device mps
Expected duration: 2-4 hours for 50 epochs on MPS/GPU
Training artifacts saved to:
- Checkpoints: outputs/checkpoints/
- Logs: outputs/logs/
- TensorBoard logs: outputs/logs/
In a separate terminal, launch TensorBoard:
tensorboard --logdir outputs/logs
Access the dashboard at: http://localhost:6006
Metrics tracked:
- Training and validation loss
- Training and validation accuracy
- Per-class accuracy
- Learning rate schedule
The training process automatically saves metrics to CSV files after each epoch in outputs/logs/metrics/. Generate publication-ready plots from these metrics:
# Generate all plots with default settings
python scripts/plot_training.py
# Specify custom directories
python scripts/plot_training.py --metrics-dir outputs/logs/metrics --save-dir outputs/plots
# Plot only top 5 classes by accuracy
python scripts/plot_training.py --top-classes 5
Generated plots:
- loss_curves.png - Training and validation loss over time
- accuracy_curves.png - Training and validation accuracy over time
- learning_rate.png - Learning rate schedule
- combined_metrics.png - All metrics in a single figure, including generalization gap
- per_class_accuracy.png - Per-class accuracy evolution (all classes)
- per_class_accuracy_top10.png - Top 10 performing classes
- per_class_heatmap.png - Heatmap of per-class accuracy over epochs
- final_class_comparison.png - Bar chart of final accuracy per class
CSV files saved to outputs/logs/metrics/:
- training_metrics.csv - Epoch-level loss, accuracy, and learning rate
- per_class_accuracy.csv - Per-class accuracy for every epoch
After training completes, evaluate on test set:
python scripts/evaluate.py --checkpoint outputs/checkpoints/best_model_acc.pth
This generates:
- Overall accuracy metrics
- Per-class precision, recall, F1-score
- Confusion matrix (counts and normalized)
- Predictions CSV
- Error analysis
Results saved to: outputs/results/
Visualize model attention with GradCAM:
# Visualize 20 random test samples
python scripts/visualize_gradcam.py --checkpoint outputs/checkpoints/best_model_acc.pth --num-samples 20
# Visualize only misclassified samples
python scripts/visualize_gradcam.py --checkpoint outputs/checkpoints/best_model_acc.pth --num-samples 20 --misclassified-only
# Custom visualization settings
python scripts/visualize_gradcam.py \
--checkpoint outputs/checkpoints/best_model_acc.pth \
--num-samples 50 \
--layer block5 \
--alpha 0.6 \
    --fps 10
Outputs for each sample:
- original.mp4: Original video clip
- heatmap.mp4: GradCAM heatmap
- overlay.mp4: Heatmap overlaid on video
- overlay_with_probs.mp4: Overlay with class probability bar chart (scaled up for readable text, default 336x336 + chart)
- side_by_side.mp4: Original and overlay side-by-side
- metadata.txt: Sample information
Visualizations saved to: outputs/visualizations/
python scripts/train.py [OPTIONS]
Options:
--config PATH Path to configuration YAML file
--seed INT Random seed for reproducibility (default: 42)
--device DEVICE Device to use: cuda/mps/cpu (auto-detected if not specified)
--epochs INT Number of training epochs (default: 100)
--batch-size INT Batch size for training (default: 8)
--downsample-class1 Downsample Class 1 to match median count of other classes (fixes class imbalance)
--use-masks Apply segmentation masks to remove background from frames during training
--use-skeleton-attention Apply skeleton-guided attention masks to focus on body joints
--skeleton-sigma FLOAT   Gaussian sigma for skeleton attention blobs (default: 10.0)
python scripts/evaluate.py [OPTIONS]
Required:
--checkpoint PATH Path to model checkpoint (.pth file)
Optional:
--config PATH Path to configuration file
--split SPLIT Dataset split to evaluate: train/test (default: test)
--batch-size INT Batch size for evaluation (default: 16)
--device DEVICE Device to use: cuda/mps/cpu
--save-dir PATH Directory to save results (default: outputs/results)
--use-masks              Apply segmentation masks to remove background (use if model was trained with masks)
python scripts/visualize_gradcam.py [OPTIONS]
Required:
--checkpoint PATH Path to model checkpoint (.pth file)
Optional:
--config PATH Path to configuration file
--split SPLIT Dataset split: train/test (default: test)
--num-samples INT Number of samples to visualize (default: 20)
--device DEVICE Device to use: cuda/mps/cpu
--save-dir PATH Directory to save visualizations (default: outputs/visualizations)
--layer LAYER Target layer for GradCAM: block3/block4/block5 (default: block4)
--alpha FLOAT Overlay transparency: 0.0-1.0 (default: 0.5)
--fps INT Output video frame rate (default: 10)
--misclassified-only Only visualize misclassified samples
--use-masks Apply segmentation masks to remove background (use if model was trained with masks)
--use-skeleton-attention Apply skeleton-guided attention (use if model was trained with skeleton attention)
--skeleton-sigma FLOAT Gaussian sigma for skeleton attention blobs (default: 10.0)
--scale-factor INT Scale factor for overlay_with_probs video (default: 3, scales 112x112 to 336x336)
--long-sample Generate longer video samples with dynamic probability updates
--sample-duration INT Duration of long samples in frames (default: 300)
--start-frame INT Starting frame for long sample mode (default: 0)
--subject-id STR         Specific subject ID to visualize in long sample mode
PROJECT/
├── src/ # Source code
│ ├── config/
│ │ ├── __init__.py
│ │ └── config.py # Configuration management
│ ├── data/
│ │ ├── __init__.py
│ │ ├── dataset.py # PyTorch Dataset for video clips
│ │ ├── transforms.py # Video augmentation transforms
│ │ └── utils.py # Video loading utilities
│ ├── models/
│ │ ├── __init__.py
│ │ └── cnn3d.py # C3D architecture implementation
│ ├── training/
│ │ ├── __init__.py
│ │ ├── trainer.py # Training loop and logic
│ │ └── losses.py # Focal Loss implementation
│ ├── evaluation/
│ │ ├── __init__.py
│ │ ├── metrics.py # Evaluation metrics
│ │ └── evaluator.py # Evaluation pipeline
│ ├── visualization/
│ │ ├── __init__.py
│ │ └── gradcam.py # GradCAM wrapper for 3D CNN
│ └── utils/
│ ├── __init__.py
│ ├── device.py # Device selection (CPU/MPS/CUDA)
│ ├── logging.py # Logging utilities
│ └── checkpointing.py # Model save/load utilities
├── scripts/ # Executable scripts
│ ├── train.py # Main training script
│ ├── evaluate.py # Model evaluation script
│ ├── visualize_gradcam.py # GradCAM visualization script
│ ├── data_analysis.py # Dataset analysis script
│ ├── test_dataset.py # Dataset loading test
│ ├── test_training_init.py # Training initialization test
│ └── quick_test.py # Quick system test
├── outputs/ # Generated outputs
│ ├── checkpoints/ # Model checkpoints (.pth files)
│ ├── logs/ # Training logs and TensorBoard
│ ├── results/ # Evaluation results
│ ├── visualizations/ # GradCAM visualizations
│ └── analysis/ # Dataset analysis plots
├── dataset/ # Dataset directory (not in repo)
├── .venv/ # Virtual environment
├── .gitignore # Git ignore file
├── pyproject.toml # Project dependencies
└── README.md # This file
The default configuration is defined in src/config/config.py. Key parameters include:
Data Processing:
- Clip length: 16 frames
- Temporal stride: 8 frames (for sliding window during training)
- Spatial size: 112x112 pixels
- FPS: 30 (original video framerate)
Model:
- Architecture: C3D
- Number of classes: 17
- Dropout: 0.5
- Pretrained weights: False
Training:
- Batch size: 8
- Number of epochs: 100
- Number of workers: 4
- Pin memory: True
Optimizer:
- Type: AdamW
- Learning rate: 0.0001
- Weight decay: 0.00001
Learning Rate Scheduler:
- Type: ReduceLROnPlateau
- Mode: min (reduce on validation loss)
- Factor: 0.5
- Patience: 5 epochs
- Minimum LR: 0.0000001
Loss Function:
- Type: Focal Loss
- Gamma: 2.0
- Class weights: Computed automatically from training data
Data Augmentation (Training Only):
- Horizontal flip: 50% probability
- Rotation: +/- 10 degrees
- Color jitter: brightness, contrast, saturation, hue
- Normalization: ImageNet mean and std
Early Stopping:
- Patience: 10 epochs
- Minimum delta: 0.001
Checkpointing:
- Save frequency: Every 10 epochs
- Keep best 3 checkpoints
- Save best model by validation loss
- Save best model by validation accuracy
Device:
- Preference order: CUDA > MPS > CPU
- Automatic fallback to CPU if GPU unavailable
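A minimal helper following this preference order; this is a sketch of the behavior described above, and the project's src/utils/device.py may differ.

```python
import torch

def select_device(preferred=None):
    """Pick the best available device (CUDA > MPS > CPU) unless overridden."""
    if preferred is not None:
        return torch.device(preferred)
    if torch.cuda.is_available():
        return torch.device("cuda")
    mps_backend = getattr(torch.backends, "mps", None)
    if mps_backend is not None and mps_backend.is_available():
        return torch.device("mps")
    return torch.device("cpu")  # automatic fallback

device = select_device()
```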
You can create a custom YAML configuration file and use it:
python scripts/train.py --config path/to/config.yaml
Video Processing:
- Videos are loaded using OpenCV
- Clips of 16 consecutive frames are extracted using a sliding window
- Frames are resized to 112x112 pixels
- Clips are normalized using ImageNet statistics
- Optional: Segmentation masks can be applied to remove background (--use-masks)
Temporal Sampling:
- Training: Overlapping clips with stride=8 frames
- Validation/Testing: Non-overlapping clips with stride=16 frames
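The sliding-window indexing can be illustrated as:

```python
def clip_start_indices(num_frames, clip_len=16, stride=8):
    """Start frames for sliding-window clips: stride=8 yields overlapping
    training clips, stride=16 yields non-overlapping validation/test clips."""
    return list(range(0, num_frames - clip_len + 1, stride))

print(clip_start_indices(48, stride=8))   # -> [0, 8, 16, 24, 32]
print(clip_start_indices(48, stride=16))  # -> [0, 16, 32]
```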
Class Imbalance Handling:
- Class 1 Downsampling: Optional flag (--downsample-class1) to downsample the dominant class to match the other classes
- Weighted Random Sampling: Minority classes are oversampled during training
- Focal Loss: Focuses learning on hard-to-classify examples
- Class Weights: Loss is weighted by inverse class frequency
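A compact Focal Loss sketch consistent with the configuration above (gamma=2.0, optional inverse-frequency class weights); the project's src/training/losses.py may differ in details.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, class_weights=None):
    """Focal Loss: scales cross-entropy by (1 - p_t)^gamma so easy,
    well-classified examples contribute little and hard examples dominate."""
    log_p = F.log_softmax(logits, dim=1)
    log_p_t = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log-prob of true class
    loss = (1.0 - log_p_t.exp()) ** gamma * (-log_p_t)
    if class_weights is not None:            # per-class weights, e.g. inverse frequency
        loss = loss * class_weights[targets]
    return loss.mean()
```

With gamma=0 and no weights this reduces to plain cross-entropy, which is a convenient sanity check.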
Training Loop:
- Forward pass through model
- Compute Focal Loss with class weights
- Backward pass with gradient computation
- Gradient clipping (max norm = 1.0)
- Optimizer step (AdamW)
- Metrics tracking (loss, accuracy)
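The steps above can be condensed into a single training-step function (a sketch, not the project's trainer):

```python
import torch

def train_step(model, clips, labels, criterion, optimizer, max_norm=1.0):
    """One optimization step: forward, loss, backward, clip gradients, update."""
    optimizer.zero_grad()
    logits = model(clips)
    loss = criterion(logits, labels)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)  # max norm = 1.0
    optimizer.step()
    acc = (logits.argmax(1) == labels).float().mean().item()
    return loss.item(), acc
```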
Validation Loop:
- No gradient computation
- Forward pass only
- Compute validation loss and accuracy
- Per-class accuracy tracking
Checkpointing Strategy:
- Save best model by validation loss
- Save best model by validation accuracy
- Save periodic checkpoints every 10 epochs
- Each checkpoint includes:
- Model state dict
- Optimizer state dict
- Epoch number
- Metrics
- Configuration
Early Stopping:
- Monitors validation accuracy
- Stops if no improvement for 10 consecutive epochs
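A minimal early-stopping helper implementing this policy (illustrative, using the configured patience=10 and min_delta=0.001 as defaults):

```python
class EarlyStopping:
    """Stop when validation accuracy fails to improve by at least min_delta
    for `patience` consecutive epochs."""
    def __init__(self, patience=10, min_delta=0.001):
        self.patience, self.min_delta = patience, min_delta
        self.best, self.bad_epochs = float("-inf"), 0

    def step(self, val_acc):
        if val_acc > self.best + self.min_delta:
            self.best, self.bad_epochs = val_acc, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience  # True -> stop training

stopper = EarlyStopping(patience=2)
print([stopper.step(a) for a in [0.5, 0.6, 0.6, 0.6]])  # -> [False, False, False, True]
```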
GradCAM Process:
- Forward pass through model with target class
- Extract activations from target layer (default: block4 for better spatial resolution)
- Compute gradients of target class with respect to activations
- Weight activations by gradients
- Generate heatmap by averaging across channels
- Resize heatmap to input spatial dimensions
- Overlay on original frames using colormap
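The steps above can be sketched from scratch with forward/backward hooks; the project wraps the grad-cam package instead, so treat this stand-alone version as illustrative.

```python
import torch
import torch.nn.functional as F

def gradcam_3d(model, target_layer, clip, class_idx):
    """Grad-CAM for a 3D CNN: capture activations and gradients at
    target_layer, weight channels by their average gradient, and
    upsample the heatmap to the clip's frame size."""
    acts, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    try:
        logits = model(clip)               # clip: (1, 3, T, H, W)
        model.zero_grad()
        logits[0, class_idx].backward()    # gradients of the target class
    finally:
        h1.remove(); h2.remove()
    a, g = acts[0], grads[0]               # both (1, C, t, h, w)
    weights = g.mean(dim=(2, 3, 4), keepdim=True)          # per-channel weight
    cam = F.relu((weights * a).sum(dim=1, keepdim=True))   # (1, 1, t, h, w)
    cam = F.interpolate(cam, size=clip.shape[2:], mode="trilinear",
                        align_corners=False)
    cam = cam - cam.min()
    return (cam / cam.max().clamp(min=1e-8)).squeeze()     # (T, H, W) in [0, 1]
```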
Target Layer Options:
- block3: 14x14 spatial resolution (highest detail, less semantic)
- block4: 7x7 spatial resolution (default, balanced detail and semantics)
- block5: 3x3 spatial resolution (most semantic, lowest spatial detail)
Visualization:
- Heatmaps are generated for each frame in the clip
- Multiple output formats: original, heatmap, overlay, overlay with probability chart, side-by-side
- Probability chart shows class prediction confidence with true and predicted labels highlighted
- Saved as MP4 videos for temporal analysis
Overall Metrics:
- Accuracy: Percentage of correct predictions
- Top-3 Accuracy: Percentage where true class is in top 3 predictions
- Mean Class Accuracy: Average per-class accuracy (accounts for imbalance)
Per-Class Metrics:
- Precision: TP / (TP + FP)
- Recall: TP / (TP + FN)
- F1-Score: Harmonic mean of precision and recall
- Support: Number of samples per class
Confusion Matrix:
- Generated in both count and normalized forms
- Visualized as heatmap
- Saved as CSV and PNG
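These metrics can be computed with scikit-learn, for example:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

def evaluate_predictions(y_true, y_pred):
    """Overall accuracy, per-class precision/recall/F1/support, and the
    confusion matrix in both count and row-normalized forms."""
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, support = precision_recall_fscore_support(
        y_true, y_pred, zero_division=0)
    cm = confusion_matrix(y_true, y_pred)
    cm_norm = cm / cm.sum(axis=1, keepdims=True).clip(min=1)
    return {"accuracy": acc, "precision": prec, "recall": rec,
            "f1": f1, "support": support, "cm": cm, "cm_norm": cm_norm}

results = evaluate_predictions([0, 0, 1, 1, 2], [0, 1, 1, 1, 2])
print(round(results["accuracy"], 2))  # -> 0.8
```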
Based on the dataset characteristics and model architecture, expected performance metrics are:
Overall Performance:
- Overall Accuracy: 70% or higher
- Mean Class Accuracy: 60% or higher (accounts for class imbalance)
- Top-3 Accuracy: 85% or higher
Per-Class Performance:
- Precision: 50% or higher for all classes
- Recall: 50% or higher for all classes
- F1-Score: 50% or higher for all classes
Approximate training times per epoch (with batch size 8):
- NVIDIA GPU (CUDA): 1-2 minutes per epoch
- Apple Silicon (MPS): 2-3 minutes per epoch
- CPU: 15-20 minutes per epoch (not recommended)
Full training (50 epochs):
- GPU: 1-2 hours
- MPS: 2-3 hours
- CPU: 12-16 hours
- Class Imbalance: Despite mitigation strategies, the dominant class (Class 1) may still achieve higher accuracy than minority classes
- Temporal Context: 16-frame clips may not capture complete exercise movements for some exercise types
- Spatial Resolution: 112x112 is relatively low and may miss fine-grained details
- Device Compatibility: MPS (Apple Silicon) support may have limitations for some operations compared to CUDA
This section presents the results from training the C3D model for 30 epochs on the exercise recognition dataset using advanced preprocessing techniques. The analysis provides an honest assessment of model performance, including both strengths and areas for improvement.
- Total epochs: 30
- Batch size: 8
- Initial learning rate: 0.0001
- Learning rate schedule: ReduceLROnPlateau (reduced at epochs 17 and 24)
- Loss function: Focal Loss with automatic class weights
- Optimizer: AdamW
- Device: Multi-GPU training with 3x NVIDIA A40 GPUs
- Training samples: 21,287 clips (after Class 1 downsampling)
- Validation samples: 11,861 clips
This experiment incorporated three key preprocessing enhancements:
- Class 1 Downsampling: Reduced the dominant Class 1 from ~64,000 frames to match the median count of the other classes, creating a more balanced training distribution.
- Segmentation Masks: Applied person segmentation masks to remove background noise, helping the model focus on the exercising person rather than environmental features.
- Skeleton-Guided Attention: Used YOLO pose detection to create spatial attention masks centered on key body joints (shoulders, elbows, wrists, hips), guiding the model to focus on relevant body parts during exercise recognition.
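A sketch of building such an attention mask from 2D joint keypoints; the coordinates below are hypothetical, and sigma follows the CLI default of 10.0.

```python
import numpy as np

def skeleton_attention_mask(keypoints, height=112, width=112, sigma=10.0):
    """Spatial attention map with a Gaussian blob centered on each detected
    joint (x, y); blobs are merged with max so overlaps don't saturate."""
    ys, xs = np.mgrid[0:height, 0:width]
    mask = np.zeros((height, width), dtype=np.float32)
    for x, y in keypoints:
        blob = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        mask = np.maximum(mask, blob)
    return mask

# Hypothetical shoulder/hip keypoints in 112x112 frame coordinates.
mask = skeleton_attention_mask([(40, 30), (72, 30), (45, 70), (67, 70)])
print(mask.shape, round(float(mask[30, 40]), 2))  # -> (112, 112) 1.0
```

Multiplying each frame by this mask (or blending it in) emphasizes joint regions before the clip enters the network.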
The model achieved the following metrics on the test set:
| Metric | Value |
|---|---|
| Overall Accuracy | 73.11% |
| Mean Class Accuracy | 64.68% |
| Top-3 Accuracy | 92.56% |
| Mean Precision | 73.53% |
| Mean Recall | 60.87% |
| Mean F1-Score | 62.36% |
The validation accuracy reached 89.73% during training, indicating strong learning capability. The gap between validation and test accuracy suggests some domain shift between the validation and test sets.
Figure 1: Training and validation loss over 30 epochs.
Figure 2: Training and validation accuracy over 30 epochs.
Figure 3: Comprehensive view of training metrics including generalization gap.
Convergence Behavior:
- Training accuracy increased rapidly from 20.98% (epoch 1) to 99.82% (epoch 30)
- Validation accuracy increased from 20.54% (epoch 1) to 89.73% (epoch 30)
- The model showed fast initial learning, reaching 74.27% training accuracy by epoch 3
Overfitting Analysis:
- A generalization gap emerged after epoch 10, where training accuracy continued climbing while validation accuracy plateaued around 88-90%
- The gap between training and validation accuracy at epoch 30 is approximately 10 percentage points
- This indicates moderate overfitting, which is common in video classification tasks
Learning Rate Impact:
- The learning rate was reduced from 0.0001 to 0.00005 at epoch 17, and further to 0.000025 at epoch 24
- These reductions helped stabilize validation accuracy in the 88-90% range
- Best validation accuracy of 89.73% was achieved at epoch 30
Figure 4: Heatmap showing per-class accuracy evolution across epochs.
Figure 5: Final accuracy comparison across all 16 exercise classes.
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 1 | 82.42% | 95.37% | 88.42% | 3,668 |
| 3 | 80.65% | 96.10% | 87.70% | 564 |
| 13 | 94.59% | 75.71% | 84.10% | 601 |
| 15 | 79.02% | 86.02% | 82.37% | 565 |
| 14 | 96.67% | 71.00% | 81.87% | 531 |
These classes represent exercises with distinctive motion patterns that the model with skeleton-guided attention successfully captured.
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 5 | 79.34% | 71.91% | 75.44% | 598 |
| 7 | 93.93% | 63.12% | 75.50% | 564 |
| 4 | 73.62% | 71.79% | 72.69% | 521 |
| 6 | 94.31% | 56.85% | 70.94% | 496 |
| 11 | 95.59% | 44.38% | 60.61% | 489 |
| 8 | 51.84% | 67.83% | 58.76% | 603 |
| 10 | 93.67% | 42.45% | 58.42% | 523 |
| 2 | 54.07% | 58.45% | 56.18% | 568 |
These classes show variable performance, often with high precision but lower recall, indicating the model is conservative in its predictions.
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 9 | 78.97% | 31.71% | 45.25% | 533 |
| 16 | 29.87% | 92.73% | 45.19% | 509 |
| 12 | 71.43% | 9.47% | 16.72% | 528 |
These classes present the greatest challenges. Class 12 in particular has very low recall (9.47%), indicating the model rarely predicts this class. Class 16 shows the opposite pattern with high recall but low precision, suggesting it's being over-predicted.
Figure 6: Normalized confusion matrix showing classification patterns.
Figure 7: Confusion matrix with absolute counts.
The most frequent misclassifications include:
- Class 12 → Class 16 (399 instances): 75.6% of Class 12 samples misclassified, the most severe confusion
- Class 9 → Class 8 (313 instances): 58.7% of Class 9 samples, indicating strong similarity between these exercises
- Class 11 → Class 16 (230 instances): 47.0% of Class 11 samples
- Class 8 → Class 1 (178 instances): 29.5% of Class 8 samples
- Class 5 → Class 1 (156 instances): 26.1% of Class 5 samples
The skeleton-guided attention approach, while helpful for focusing on body movements, may have reduced the model's ability to distinguish exercises that differ primarily in subtle arm or hand positions. The strong confusion between Classes 9, 10, and 11 with Class 8 suggests these exercises share similar upper body postures.
Figure 8: Per-class accuracy evolution showing learning stability across all classes.
Analysis of per-class accuracy across epochs reveals varied learning patterns:
- Stable high performers: Classes 1, 3, and 15 showed consistent improvement and stable high accuracy throughout training
- Variable performers: Classes 9, 10, and 12 exhibited significant fluctuations, reflecting the difficulty in learning these exercise patterns
- Late learners: Some classes showed improvement only after learning rate reductions at epochs 17 and 24
The skeleton-guided attention mechanism appears to help stabilize learning for exercises with clear body postures, while exercises requiring finer motion discrimination remain challenging.
Key factors affecting stability:
- Skeleton attention coverage: Exercises well-captured by the 8 attention keypoints (shoulders, elbows, wrists, hips) showed more stable learning
- Motion similarity: Classes with overlapping movement patterns (8-9-10-11) showed correlated fluctuations
- Class balancing effect: Downsampling Class 1 allowed better gradient distribution to minority classes
Figure 9: Training set class distribution after Class 1 downsampling.
The original dataset exhibited severe class imbalance with Class 1 containing approximately 64,000 frames while other classes contained 9,000-11,000 frames each. Through Class 1 downsampling, the training distribution was balanced to approximately 21,287 total clips.
The combination of mitigation strategies showed mixed results:
- Weighted Random Sampling + Focal Loss: Helped prevent Class 1 dominance
- Class 1 Downsampling: Reduced training data but improved class balance
- Skeleton Attention: Focused learning on body movements but may have limited fine-grained discrimination
GradCAM heatmaps were generated using skeleton-guided attention to understand model focus patterns. The visualizations reveal:
- Joint-focused attention: The model strongly attends to the 8 keypoint regions (shoulders, elbows, wrists, hips) as guided by the skeleton attention masks
- Temporal consistency: Attention patterns follow body movements across the 16-frame clips
- Exercise-specific patterns: Different exercises show distinct attention distributions based on which body parts are most active
The visualization includes real-time probability charts showing the model's confidence across all 16 classes as the video progresses.
Example visualizations are available in outputs/visualizations/ with:
- Original video clips
- GradCAM heatmap overlays
- Side-by-side comparisons
- Dynamic probability charts showing predictions over time
- High validation accuracy (89.73%) demonstrates the model's learning capability
- Excellent top-3 accuracy (92.56%) indicates predictions are meaningful even when the top-1 is incorrect
- Effective class balancing: Downsampling and Focal Loss prevented Class 1 dominance
- Interpretable attention: Skeleton-guided attention provides clear visualization of model focus
- Strong performance on key classes: Classes 1, 3, 13, 14, and 15 achieved >80% F1-score
- Test-validation gap: Significant drop from 89.73% validation to 73.11% test accuracy suggests domain shift
- Class confusion clusters: Classes 8-9-10-11-12 show severe mutual confusion
- Low recall on several classes: Classes 9, 10, 11, and 12 have <50% recall
- Skeleton attention limitations: May not capture fine-grained hand/finger movements important for some exercises
- Class 16 over-prediction: High recall (92.73%) but low precision (29.87%) indicates systematic bias
- Expand skeleton keypoints: Include hand and finger keypoints for exercises requiring fine motor discrimination
- Increase temporal window: Extend from 16 to 32 frames to capture complete movement cycles
- Address class confusion: Apply class-specific augmentation to confused class pairs (8-9-10-11-12)
- Investigate domain shift: Analyze differences between validation and test sets to understand the accuracy gap
- Ensemble approach: Combine skeleton attention model with a full-frame model for complementary features
- Adjust attention sigma: Experiment with different Gaussian sigma values for skeleton attention masks
Training on 3x NVIDIA A40 GPUs with video caching and skeleton data preprocessing:
- Time per epoch: Approximately 4-5 minutes (with multi-GPU DataParallel)
- Total training time (30 epochs): Approximately 2.5 hours
- GPU memory usage: ~12-15 GB per GPU
- Data preprocessing: Skeleton attention masks computed on-the-fly during data loading
- Video caching: All videos cached in RAM for fast access
The multi-GPU setup with DataParallel provided efficient training, though skeleton attention computation added overhead compared to baseline training.
This experiment explored advanced preprocessing techniques for exercise recognition, combining Class 1 downsampling, segmentation masks, and skeleton-guided attention. The C3D model achieved 89.73% validation accuracy and 73.11% test accuracy with 92.56% top-3 accuracy.
Key findings:
- Skeleton-guided attention successfully focuses the model on relevant body parts, providing interpretable visualizations of model decisions
- Class balancing through downsampling and Focal Loss prevents dominant class bias
- Segmentation masks help remove background distractions but may be insufficient alone
- Test-validation gap indicates potential domain shift between data splits that warrants further investigation
The model performs well on exercises with distinctive upper body movements (Classes 1, 3, 13-15) but struggles with exercises that require fine motor discrimination or have similar postures (Classes 8-12). Future work should explore expanding skeleton keypoints to include hands, increasing temporal windows, and investigating the source of the validation-test accuracy gap.
The GradCAM visualizations with dynamic probability charts provide valuable interpretability, showing how model confidence changes across exercise transitions in real-time.
Out of Memory (OOM) Errors:
- Reduce batch size: --batch-size 4 or --batch-size 2
- Reduce spatial resolution in config
- Reduce number of data loader workers
Slow Data Loading:
- Reduce number of workers if experiencing bottlenecks
- Ensure dataset is on fast storage (SSD preferred)
Training Not Converging:
- Check learning rate (may need adjustment)
- Verify data augmentation is not too aggressive
- Monitor gradient norms for exploding/vanishing gradients
GradCAM Errors on MPS:
- Try running GradCAM on CPU: --device cpu
- Some operations in pytorch-grad-cam may not be fully MPS-compatible