Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. This project implements a working solution inspired by the research paper Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.
Image captioning involves:
- Building networks capable of perceiving contextual subtleties in images.
- Relating observations to both the scene and the real world.
- Producing succinct and accurate image descriptions.
The task of image captioning is divided into two key modules:
- Encoder - Image-Based Model:
  - Task: Extract features from the image and generate feature vectors.
  - Input: Source images from the dataset.
  - Output: Flattened vectors of image features for the language model.
  - Model/Technique: Transfer Learning using Inception-ResNet-V2 (see the encoder sketch after this list).
- Decoder - Language-Based Model:
  - Task: Translate the extracted features into a natural language caption.
  - Input: Flattened image feature vectors.
  - Output: Caption for the image.
  - Model/Technique: Bidirectional Long Short-Term Memory (LSTM) with Bahdanau Attention (see the attention sketch after this list).
  - Embeddings: GloVe Embeddings (6B.300d).
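The encoder's feature-extraction step can be sketched roughly as follows. This is a minimal example assuming a TensorFlow/Keras setup; the exact preprocessing and output reshaping used in the notebook may differ.

```python
import tensorflow as tf

def build_feature_extractor():
    # Pretrained Inception-ResNet-V2 without its classification head, so the
    # model returns the final convolutional feature map instead of class logits.
    base = tf.keras.applications.InceptionResNetV2(include_top=False, weights="imagenet")
    return tf.keras.Model(inputs=base.input, outputs=base.output)

def extract_features(image_path, extractor):
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))  # expected input size for Inception-ResNet-V2
    img = tf.keras.applications.inception_resnet_v2.preprocess_input(img)
    features = extractor(tf.expand_dims(img, 0))              # shape (1, 8, 8, 1536)
    # Flatten the 8x8 spatial grid into 64 feature vectors for the decoder.
    return tf.reshape(features, (1, -1, features.shape[-1]))  # shape (1, 64, 1536)
```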
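The Bahdanau (additive) attention used by the decoder scores each spatial feature vector against the current decoder hidden state and returns a weighted context vector. The sketch below is a minimal TensorFlow implementation; layer sizes and variable names are illustrative assumptions, not the notebook's exact configuration.

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)  # projects the image features
        self.W2 = tf.keras.layers.Dense(units)  # projects the decoder hidden state
        self.V = tf.keras.layers.Dense(1)       # scores each spatial location

    def call(self, features, hidden):
        # features: (batch, 64, feature_dim); hidden: (batch, units)
        hidden_with_time = tf.expand_dims(hidden, 1)
        scores = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time)))
        attention_weights = tf.nn.softmax(scores, axis=1)              # one weight per location
        context = tf.reduce_sum(attention_weights * features, axis=1)  # weighted sum of features
        return context, attention_weights
```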
- Environment:
  - Anaconda3 or Miniconda3.
  - Python 3.9+ (the code should also be compatible with Python 3.6+).
- Libraries:
  - All libraries listed in requirements.txt.
- Datasets and Pretrained Models:
  - Dataset: Flickr30k dataset.
  - Word Embeddings: GloVe (6B, 42B, or 840B); see the loading sketch below.
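The GloVe vectors are typically wired into the decoder through an embedding matrix indexed by the tokenizer vocabulary. The sketch below assumes a Keras-style `word_index` dictionary and the 300-dimensional 6B file; the names are illustrative, not the notebook's exact code.

```python
import numpy as np

def load_glove_matrix(glove_path, word_index, embedding_dim=300):
    # Parse the GloVe text file into {word: vector}.
    embeddings = {}
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            values = line.split()
            embeddings[values[0]] = np.asarray(values[1:], dtype="float32")
    # Rows stay zero for vocabulary words that have no GloVe vector.
    matrix = np.zeros((len(word_index) + 1, embedding_dim))
    for word, idx in word_index.items():
        vector = embeddings.get(word)
        if vector is not None:
            matrix[idx] = vector
    return matrix

# Example (paths/objects assumed):
# embedding_matrix = load_glove_matrix("notebooks/glove.6B.300d.txt", tokenizer.word_index)
```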
├── data
│   ├── flickr30k_images                <- Flickr30k Dataset.
│   └── captions.csv                    <- Captions for the Flickr30k Dataset. Downloaded by default.
│
├── notebooks
│   ├── glove.xB.xxxd.txt               <- GloVe Embeddings.
│   └── neural_image_captioning.ipynb   <- Main Notebook to run.
│
├── models
│   └── checkpoint                      <- Model checkpoints for reuse.
│
└── docs
    ├── system_flow.png                 <- System Flow Diagram.
    ├── attention_explain.png           <- Attention Mechanism Example.
    ├── sample_input.png                <- Sample input from the dataset.
    ├── sample_output.png               <- Predicted caption output.
    └── belu.png                        <- BLEU Performance Metric Visualization.
- Install all dependencies listed in requirements.txt.
- Organize the data and embeddings into the folder structure shown above.
- Run the neural_image_captioning.ipynb notebook in a jupyter-notebook or jupyter-lab session.
- Use the model checkpoints in the models/checkpoint folder to export or improve the model (see the checkpoint sketch below).
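Restoring a saved checkpoint might look roughly like the sketch below. The component names (encoder, decoder, optimizer) are assumptions about how the notebook groups its trainable parts; substitute the actual objects used there.

```python
import tensorflow as tf

# Placeholder components; in the notebook these would be the trained
# encoder/decoder models and their optimizer (assumed names).
encoder = tf.keras.Sequential([tf.keras.layers.Dense(256)])
decoder = tf.keras.Sequential([tf.keras.layers.Dense(256)])
optimizer = tf.keras.optimizers.Adam()

ckpt = tf.train.Checkpoint(encoder=encoder, decoder=decoder, optimizer=optimizer)
manager = tf.train.CheckpointManager(ckpt, "models/checkpoint", max_to_keep=5)

if manager.latest_checkpoint:
    ckpt.restore(manager.latest_checkpoint)  # resume training or run inference from here
```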
The dataset includes 31,783 images and 158,915 captions, with 5 captions per image.
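A quick way to sanity-check those numbers is to load the captions file with pandas; the exact column layout of captions.csv is an assumption here, since it depends on how the file was exported.

```python
import pandas as pd

captions = pd.read_csv("data/captions.csv")
print(len(captions))     # expected around 158,915 rows (5 captions per image)
print(captions.head())   # inspect the image-name / caption columns
```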
The output contains:
- Predicted Caption
- Attention Map Over Image
- Original Image
The BLEU (Bilingual Evaluation Understudy) score measures the similarity between the predicted and reference captions. BLEU is calculated by comparing the n-grams of candidate captions with those of reference captions, producing a score between 0 and 1, where values closer to 1 indicate higher similarity.
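For reference, a sentence-level BLEU score can be computed with NLTK as in the sketch below; the captions here are made-up examples, not outputs of the model.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a dog runs across the grassy field".split(),
    "a brown dog is running on the grass".split(),
]
candidate = "a dog running through the grass".split()

# Smoothing avoids zero scores when higher-order n-grams have no matches.
score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")  # between 0 and 1; closer to 1 means closer to the references
```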
Possible improvements to the current model include:
- Adding more Bidirectional LSTM layers.
- Training the model for more epochs (e.g., 50 epochs as mentioned in the research paper).
- Incorporating state-of-the-art NLP techniques such as transformers (e.g., BERT).
- Using a larger dataset like MSCOCO or applying data augmentation.
- Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
- Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation
- Long-term Recurrent Convolutional Networks for Visual Recognition and Description
- Deep Visual-Semantic Alignments for Generating Image Descriptions




