A novel approach to video captioning that fuses audio (Wav2Vec2) and visual (BLIP-2) features, using a custom "Dual-Target MixUp" augmentation strategy to mitigate overfitting on low-resource datasets.
MixCap is a deep learning research project designed to generate accurate natural language descriptions for videos. While traditional models often rely solely on visual data, I designed MixCap to process both sight and sound to understand context (e.g., hearing a car engine helps identify a car even if the video is blurry).
To address the challenge of training on limited datasets (like MSR-VTT), this project introduces a Dual-Target MixUp Strategy applied at the feature level, significantly improving generalization without breaking the temporal sequence of the video.
This repository contains the research code, model architecture, and training pipeline. To see the deployed model in a user-friendly Web Application (React + Flask), check out the MixCap Web Platform:
View the Web App Repository Here
Standard data augmentation (like flipping images) doesn't work well for video sequences or fused audio-visual data.
The Solution: I implemented a custom Dual-Target MixUp strategy that operates on the fused feature space (2432-dimensional vectors).
- Feature Mixing: Linearly interpolates between the audio-visual features of Video A and Video B.
- Label Mixing: Simultaneously mixes the target caption embeddings.
- Result: The model learns to predict "between" concepts, preventing memorization of the training data.
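A minimal sketch of the idea is shown below. The function name, tensor shapes, and the Beta-distributed mixing coefficient are illustrative assumptions, not a copy of the notebook code:

```python
import torch

def dual_target_mixup(feat_a, feat_b, cap_a, cap_b, alpha=0.4):
    """Mix fused audio-visual features and caption targets with a single lambda.

    feat_*: (batch, frames, 2432) fused audio-visual feature sequences
    cap_*:  (batch, cap_len, embed_dim) target caption embeddings
    """
    lam = torch.distributions.Beta(alpha, alpha).sample()
    mixed_feat = lam * feat_a + (1 - lam) * feat_b   # element-wise mix at each time step
    mixed_cap = lam * cap_a + (1 - lam) * cap_b      # labels mixed with the same lambda
    return mixed_feat, mixed_cap, lam
```

Because the interpolation is applied element-wise at every time step, the frame order of each video is untouched, which is why the augmentation does not break the temporal sequence.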
The model was evaluated on two standard MSR-VTT splits: the 1k-A Split (standard academic benchmark) and the Full Test Set (2,990 videos) to test robustness.
| Metric | 1k-A Split (Benchmark) | Full Test Set (Robustness) |
|---|---|---|
| BLEU-4 | 0.43 | 0.44 |
| CIDEr | 0.53 | 0.55 |
| BLEU-1 | 0.84 | 0.84 |
| BLEU-2 | 0.71 | 0.71 |
| BLEU-3 | 0.57 | 0.57 |
| ROUGE-L | 0.62 | 0.63 |
| METEOR | 0.29 | 0.30 |
Note: The 1k-A split results are provided for direct comparison with state-of-the-art methods found in literature.
The architecture follows an Encoder-Decoder design:
- Visual Encoder: BLIP-2 (frozen) extracts high-level semantic visual features.
- Audio Encoder: Wav2Vec2 (frozen) extracts audio signal features.
- Fusion Layer: Concatenates features into a unified 2432D representation (see the sketch after this list).
- Decoder: A custom Transformer with Bidirectional Cross-Attention.
- Tokenizer: SentencePiece (BPE) for efficient text generation.
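For illustration, the sketch below concatenates time-aligned visual and audio features into the 2432D fused representation and feeds it to a stand-in decoder. The 1408 + 1024 split between the BLIP-2 and Wav2Vec2 feature sizes, and the use of PyTorch's built-in TransformerDecoder in place of the custom bidirectional cross-attention module, are assumptions made for the example:

```python
import torch
import torch.nn as nn

# Assumed split: 1408-D visual features + 1024-D audio features = 2432-D fused vectors.
VISUAL_DIM, AUDIO_DIM, FUSED_DIM, MODEL_DIM = 1408, 1024, 2432, 512

class FusionEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(FUSED_DIM, MODEL_DIM)  # project fused features for the decoder

    def forward(self, visual, audio):
        # visual: (batch, frames, 1408), audio: (batch, frames, 1024), time-aligned
        fused = torch.cat([visual, audio], dim=-1)   # (batch, frames, 2432)
        return self.proj(fused)

# Stand-in decoder that cross-attends over the fused memory (not the custom module).
decoder_layer = nn.TransformerDecoderLayer(d_model=MODEL_DIM, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=4)

memory = FusionEncoder()(torch.randn(2, 20, VISUAL_DIM), torch.randn(2, 20, AUDIO_DIM))
tgt = torch.randn(2, 15, MODEL_DIM)        # embedded caption tokens (teacher forcing)
decoded = decoder(tgt, memory)             # (2, 15, 512), then projected to the vocabulary
```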
Libraries: PyTorch, Transformers (Hugging Face), Pandas, NumPy, Scikit-Learn.
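The SentencePiece tokenizer is trained separately (see spm_tokanizer.ipynb in the project structure below). A rough training sketch, where the caption file, vocabulary size, and model prefix are placeholders:

```python
import sentencepiece as spm

# Train a BPE tokenizer on the training captions (paths and sizes are illustrative).
spm.SentencePieceTrainer.train(
    input="msrvtt_train_captions.txt",
    model_prefix="mixcap_bpe",
    vocab_size=8000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="mixcap_bpe.model")
print(sp.encode("a man is playing a guitar", out_type=str))
```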
├── model/
│ ├── spm_tokanizer.ipynb
│ ├── mixcap-final-best-model.ipynb
│ ├── mixcap-final-test-1ka.ipynb
│ └── mixcap-final-test-full.ipynb
│
├── feature extraction/
│ ├── mixcap-video-feature-extraction-trainset.ipynb
│ ├── mixcap-audio-feature-extraction-trainset.ipynb
│ ├── mixcap-video-feature-extraction-testset.ipynb
│ └── mixcap-audio-feature-extraction-testset.ipynb
│
├── frame and audio extraction/
│ ├── mixcap-frame-audio-extraction.ipynb
│ └── mixcap-frame-and-audio-extraction-testset.ipynb
│
└── README.md

Ensure you have a GPU-enabled environment (NVIDIA CUDA recommended).
pip install torch transformers pandas sentencepiece

Note: This project automatically fetches pre-trained models. It does not include the 10GB BLIP-2 model in the repo.
Run the notebooks in the feature extraction/ directory.
The script will automatically download BLIP-2 and Wav2Vec2 from Hugging Face.
This generates the .npy feature files required for training.
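For orientation, the sketch below shows roughly what the extraction notebooks do with the Hugging Face models. The exact checkpoints (Salesforce/blip2-opt-2.7b, facebook/wav2vec2-large-960h), pooling choices, and output file names are assumptions, not a copy of the notebook code:

```python
import numpy as np
import torch
from transformers import Blip2Processor, Blip2Model, Wav2Vec2FeatureExtractor, Wav2Vec2Model

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen encoders, downloaded from Hugging Face on first use.
blip_processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
blip = Blip2Model.from_pretrained("Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16).to(device).eval()
w2v_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-large-960h")
w2v = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-960h").to(device).eval()

@torch.no_grad()
def extract_features(frames, waveform, sample_rate=16000):
    """frames: list of PIL images sampled from one video; waveform: 1-D float audio array."""
    pixels = blip_processor(images=frames, return_tensors="pt")["pixel_values"].to(device, torch.float16)
    visual = blip.vision_model(pixel_values=pixels).pooler_output        # (num_frames, visual_dim)

    audio_in = w2v_extractor(waveform, sampling_rate=sample_rate, return_tensors="pt").to(device)
    audio = w2v(**audio_in).last_hidden_state.mean(dim=1)                # (1, audio_dim), mean-pooled

    return visual.float().cpu().numpy(), audio.float().cpu().numpy()

# visual, audio = extract_features(frames, waveform)
# np.save("video0001_visual.npy", visual); np.save("video0001_audio.npy", audio)
```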
Open model/mixcap-final-best-model.ipynb:
- Point the dataloader to your extracted features (a loading sketch follows below).
- Run the training cells. The Dual-Target MixUp augmentation is applied automatically within the training loop.
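A hypothetical dataset sketch for loading the step-1 features; the file naming, the broadcasting of the clip-level audio vector over frames, and the caption lookup are assumptions to adapt to your own extraction output:

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class MixCapFeatureDataset(Dataset):
    """Serves the pre-extracted .npy features to the training loop.

    Hypothetical layout: one <video_id>_visual.npy and one <video_id>_audio.npy
    per clip, plus a dict of pre-tokenized caption tensors.
    """
    def __init__(self, video_ids, feature_dir, captions):
        self.video_ids, self.feature_dir, self.captions = video_ids, feature_dir, captions

    def __len__(self):
        return len(self.video_ids)

    def __getitem__(self, idx):
        vid = self.video_ids[idx]
        visual = np.load(f"{self.feature_dir}/{vid}_visual.npy")   # (frames, visual_dim)
        audio = np.load(f"{self.feature_dir}/{vid}_audio.npy")     # (1, audio_dim)
        fused = np.concatenate([visual, np.repeat(audio, len(visual), axis=0)], axis=-1)
        return torch.from_numpy(fused).float(), self.captions[vid]

# Inside the training loop, MixUp pairs each batch with a shuffled copy of itself:
# perm = torch.randperm(feats.size(0))
# feats, caps, lam = dual_target_mixup(feats, feats[perm], caps, caps[perm])
```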
This project builds upon the following datasets and pre-trained models.
MSR-VTT (Original Dataset):
Xu, J., Mei, T., Yao, T., & Rui, Y. (2016). MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. CVPR.
1k-A Split (JSFusion Split):
The project uses the specific 1,000-video test split introduced by Yu et al., which is the standard benchmark for this task.
Yu, Y., Kim, J., & Kim, G. (2018). A Joint Sequence Fusion Model for Video Question Answering and Retrieval. ECCV.
State-of-the-art pre-trained models are utilized for feature extraction. These models are downloaded automatically via the Hugging Face transformers library.
Visual Encoder: BLIP-2 (Salesforce)
Li, J., et al. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. ICML.
Audio Encoder: Wav2Vec2 (Meta AI)
Baevski, A., et al. (2020). wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. NeurIPS.
- Ravindu Layanga
- BSc (Hons) Computer Science
- University of Westminster / Informatics Institute of Technology (IIT)