A novel approach to video captioning that fuses audio (Wav2Vec2) and visual (BLIP-2) features, using a custom "Dual-Target MixUp" augmentation strategy to mitigate overfitting on low-resource datasets.
MixCap is a deep learning research project designed to generate accurate natural language descriptions for videos. While traditional models often rely solely on visual data, I designed MixCap to process both sight and sound to understand context (e.g., hearing a car engine helps identify a car even if the video is blurry).
To address the challenge of training on limited datasets (like MSR-VTT), this project introduces a Dual-Target MixUp Strategy applied at the feature level, significantly improving generalization without breaking the temporal sequence of the video.
This repository contains the research code, model architecture, and training pipeline. To see the deployed model in a user-friendly Web Application (React + Flask), check out the MixCap Web Platform:
View the Web App Repository Here
Standard data augmentation (like flipping images) doesn't work well for video sequences or fused audio-visual data.
The Solution: I implemented a custom Dual-Target MixUp strategy that operates on the fused feature space (2432-dimensional vectors).
- Feature Mixing: Linearly interpolates between the audio-visual features of Video A and Video B.
- Label Mixing: Simultaneously mixes the target caption embeddings.
- Result: The model learns to predict "between" concepts, preventing memorization of the training data.
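A minimal sketch of the idea is shown below. The function name, tensor shapes, and the Beta-distributed mixing coefficient are illustrative assumptions, not a copy of the notebook code:

```python
import torch

def dual_target_mixup(feat_a, feat_b, cap_a, cap_b, alpha=0.4):
    """Mix fused audio-visual features and caption targets with a single lambda.

    feat_*: (batch, frames, 2432) fused audio-visual feature sequences
    cap_*:  (batch, cap_len, embed_dim) target caption embeddings
    """
    lam = torch.distributions.Beta(alpha, alpha).sample()
    mixed_feat = lam * feat_a + (1 - lam) * feat_b   # element-wise mix at each time step
    mixed_cap = lam * cap_a + (1 - lam) * cap_b      # labels mixed with the same lambda
    return mixed_feat, mixed_cap, lam
```

Because the interpolation is applied element-wise at every time step, the frame order of each video is untouched, which is why the augmentation does not break the temporal sequence.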
The model was evaluated on two standard MSR-VTT splits: the 1k-A Split (standard academic benchmark) and the Full Test Set (2,990 videos) to test robustness.
| Metric | 1k-A Split (Benchmark) | Full Test Set (Robustness) |
|---|---|---|
| BLEU-4 | 0.43 | 0.44 |
| CIDEr | 0.53 | 0.55 |
| BLEU-1 | 0.84 | 0.84 |
| BLEU-2 | 0.71 | 0.71 |
| BLEU-3 | 0.57 | 0.57 |
| ROUGE-L | 0.62 | 0.63 |
| METEOR | 0.29 | 0.30 |
Note: The 1k-A split results are provided for direct comparison with state-of-the-art methods found in literature.
The architecture follows an Encoder-Decoder design:
- Visual Encoder: BLIP-2 (frozen) extracts high-level semantic visual features.
- Audio Encoder: Wav2Vec2 (frozen) extracts audio signal features.
- Fusion Layer: Concatenates features into a unified 2432D representation (see the sketch after this list).
- Decoder: A custom Transformer with Bidirectional Cross-Attention.
- Tokenizer: SentencePiece (BPE) for efficient text generation.
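For illustration, the sketch below concatenates time-aligned visual and audio features into the 2432D fused representation and feeds it to a stand-in decoder. The 1408 + 1024 split between the BLIP-2 and Wav2Vec2 feature sizes, and the use of PyTorch's built-in TransformerDecoder in place of the custom bidirectional cross-attention module, are assumptions made for the example:

```python
import torch
import torch.nn as nn

# Assumed split: 1408-D visual features + 1024-D audio features = 2432-D fused vectors.
VISUAL_DIM, AUDIO_DIM, FUSED_DIM, MODEL_DIM = 1408, 1024, 2432, 512

class FusionEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(FUSED_DIM, MODEL_DIM)  # project fused features for the decoder

    def forward(self, visual, audio):
        # visual: (batch, frames, 1408), audio: (batch, frames, 1024), time-aligned
        fused = torch.cat([visual, audio], dim=-1)   # (batch, frames, 2432)
        return self.proj(fused)

# Stand-in decoder that cross-attends over the fused memory (not the custom module).
decoder_layer = nn.TransformerDecoderLayer(d_model=MODEL_DIM, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=4)

memory = FusionEncoder()(torch.randn(2, 20, VISUAL_DIM), torch.randn(2, 20, AUDIO_DIM))
tgt = torch.randn(2, 15, MODEL_DIM)        # embedded caption tokens (teacher forcing)
decoded = decoder(tgt, memory)             # (2, 15, 512), then projected to the vocabulary
```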
Libraries: PyTorch, Transformers (Hugging Face), Pandas, NumPy, Scikit-Learn.
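The SentencePiece tokenizer is trained separately (see spm_tokanizer.ipynb in the project structure below). A rough training sketch, where the caption file, vocabulary size, and model prefix are placeholders:

```python
import sentencepiece as spm

# Train a BPE tokenizer on the training captions (paths and sizes are illustrative).
spm.SentencePieceTrainer.train(
    input="msrvtt_train_captions.txt",
    model_prefix="mixcap_bpe",
    vocab_size=8000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="mixcap_bpe.model")
print(sp.encode("a man is playing a guitar", out_type=str))
```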
├── model/
│ ├── spm_tokanizer.ipynb
│ ├── mixcap-final-best-model.ipynb
│ ├── mixcap-final-test-1ka.ipynb
│ └── mixcap-final-test-full.ipynb
│
├── feature extraction/
│ ├── mixcap-video-feature-extraction-trainset.ipynb
│ ├── mixcap-audio-feature-extraction-trainset.ipynb
│ ├── mixcap-video-feature-extraction-testset.ipynb
│ └── mixcap-audio-feature-extraction-testset.ipynb
│
├── frame and audio extraction/
│ ├── mixcap-frame-audio-extraction.ipynb
│ └── mixcap-frame-and-audio-extraction-testset.ipynb
│
└── README.md

Ensure you have a GPU-enabled environment (NVIDIA CUDA recommended).
pip install torch transformers pandas sentencepiece

Note: This project automatically fetches pre-trained models. It does not include the 10GB BLIP-2 model in the repo.
Run the notebooks in the feature extraction/ directory.
The script will automatically download BLIP-2 and Wav2Vec2 from Hugging Face.
This generates the .npy feature files required for training.
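For orientation, the sketch below shows roughly what the extraction notebooks do with the Hugging Face models. The exact checkpoints (Salesforce/blip2-opt-2.7b, facebook/wav2vec2-large-960h), pooling choices, and output file names are assumptions, not a copy of the notebook code:

```python
import numpy as np
import torch
from transformers import Blip2Processor, Blip2Model, Wav2Vec2FeatureExtractor, Wav2Vec2Model

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen encoders, downloaded from Hugging Face on first use.
blip_processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
blip = Blip2Model.from_pretrained("Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16).to(device).eval()
w2v_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-large-960h")
w2v = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-960h").to(device).eval()

@torch.no_grad()
def extract_features(frames, waveform, sample_rate=16000):
    """frames: list of PIL images sampled from one video; waveform: 1-D float audio array."""
    pixels = blip_processor(images=frames, return_tensors="pt")["pixel_values"].to(device, torch.float16)
    visual = blip.vision_model(pixel_values=pixels).pooler_output        # (num_frames, visual_dim)

    audio_in = w2v_extractor(waveform, sampling_rate=sample_rate, return_tensors="pt").to(device)
    audio = w2v(**audio_in).last_hidden_state.mean(dim=1)                # (1, audio_dim), mean-pooled

    return visual.float().cpu().numpy(), audio.float().cpu().numpy()

# visual, audio = extract_features(frames, waveform)
# np.save("video0001_visual.npy", visual); np.save("video0001_audio.npy", audio)
```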
Open model/mixcap-final-best-model.ipynb:
- Point the dataloader to your extracted features (a loading sketch follows below).
- Run the training cells. The Dual-Target MixUp augmentation is applied automatically within the training loop.
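A hypothetical dataset sketch for loading the step-1 features; the file naming, the broadcasting of the clip-level audio vector over frames, and the caption lookup are assumptions to adapt to your own extraction output:

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class MixCapFeatureDataset(Dataset):
    """Serves the pre-extracted .npy features to the training loop.

    Hypothetical layout: one <video_id>_visual.npy and one <video_id>_audio.npy
    per clip, plus a dict of pre-tokenized caption tensors.
    """
    def __init__(self, video_ids, feature_dir, captions):
        self.video_ids, self.feature_dir, self.captions = video_ids, feature_dir, captions

    def __len__(self):
        return len(self.video_ids)

    def __getitem__(self, idx):
        vid = self.video_ids[idx]
        visual = np.load(f"{self.feature_dir}/{vid}_visual.npy")   # (frames, visual_dim)
        audio = np.load(f"{self.feature_dir}/{vid}_audio.npy")     # (1, audio_dim)
        fused = np.concatenate([visual, np.repeat(audio, len(visual), axis=0)], axis=-1)
        return torch.from_numpy(fused).float(), self.captions[vid]

# Inside the training loop, MixUp pairs each batch with a shuffled copy of itself:
# perm = torch.randperm(feats.size(0))
# feats, caps, lam = dual_target_mixup(feats, feats[perm], caps, caps[perm])
```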
This project builds upon the following datasets and pre-trained models.
MSR-VTT (Original Dataset):
Xu, J., Mei, T., Yao, T., & Rui, Y. (2016). MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. CVPR.
1k-A Split (JSFusion Split):
The project uses the specific 1,000-video test split introduced by Yu et al., which is the standard benchmark for this task.
Yu, Y., Kim, J., & Kim, G. (2018). A Joint Sequence Fusion Model for Video Question Answering and Retrieval. ECCV.
State-of-the-art pre-trained models are utilized for feature extraction. These models are downloaded automatically via the Hugging Face transformers library.
Visual Encoder: BLIP-2 (Salesforce)
Li, J., et al. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. ICML.
Audio Encoder: Wav2Vec2 (Meta AI)
Baevski, A., et al. (2020). wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. NeurIPS.
- Ravindu Layanga
- BSc (Hons) Computer Science
- University of Westminster / Informatics Institute of Technology (IIT)