Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. This project implements a working solution inspired by the research paper Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.
Image captioning involves:
- Building networks capable of perceiving contextual subtleties in images.
- Relating observations to both the scene and the real world.
- Producing succinct and accurate image descriptions.
The task of image captioning is divided into two key modules:
- Encoder - Image-Based Model:
  - Task: Extract features from the image and generate feature vectors.
  - Input: Source images from the dataset.
  - Output: Flattened vectors of image features for the language model.
  - Model/Technique: Transfer Learning using Inception-ResNet-V2 (see the encoder sketch after this list).
- Decoder - Language-Based Model:
  - Task: Translate the extracted features into a natural language caption.
  - Input: Flattened image feature vectors.
  - Output: Caption for the image.
  - Model/Technique: Bidirectional Long Short-Term Memory (LSTM) with Bahdanau Attention (see the attention sketch after this list).
  - Embeddings: GloVe Embeddings (6B.300d).
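The encoder's feature-extraction step can be sketched roughly as follows. This is a minimal example assuming a TensorFlow/Keras setup; the exact preprocessing and output reshaping used in the notebook may differ.

```python
import tensorflow as tf

def build_feature_extractor():
    # Pretrained Inception-ResNet-V2 without its classification head, so the
    # model returns the final convolutional feature map instead of class logits.
    base = tf.keras.applications.InceptionResNetV2(include_top=False, weights="imagenet")
    return tf.keras.Model(inputs=base.input, outputs=base.output)

def extract_features(image_path, extractor):
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))  # expected input size for Inception-ResNet-V2
    img = tf.keras.applications.inception_resnet_v2.preprocess_input(img)
    features = extractor(tf.expand_dims(img, 0))              # shape (1, 8, 8, 1536)
    # Flatten the 8x8 spatial grid into 64 feature vectors for the decoder.
    return tf.reshape(features, (1, -1, features.shape[-1]))  # shape (1, 64, 1536)
```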
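The Bahdanau (additive) attention used by the decoder scores each spatial feature vector against the current decoder hidden state and returns a weighted context vector. The sketch below is a minimal TensorFlow implementation; layer sizes and variable names are illustrative assumptions, not the notebook's exact configuration.

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)  # projects the image features
        self.W2 = tf.keras.layers.Dense(units)  # projects the decoder hidden state
        self.V = tf.keras.layers.Dense(1)       # scores each spatial location

    def call(self, features, hidden):
        # features: (batch, 64, feature_dim); hidden: (batch, units)
        hidden_with_time = tf.expand_dims(hidden, 1)
        scores = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time)))
        attention_weights = tf.nn.softmax(scores, axis=1)              # one weight per location
        context = tf.reduce_sum(attention_weights * features, axis=1)  # weighted sum of features
        return context, attention_weights
```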
- Environment:
  - Anaconda3 or Miniconda3.
  - Python 3.9+ (the code should also be compatible with Python 3.6+).
- Libraries:
  - All libraries listed in requirements.txt.
- Datasets and Pretrained Models:
  - Dataset: Flickr30k dataset.
  - Word Embeddings: GloVe (6B, 42B, or 840B); see the loading sketch below.
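The GloVe vectors are typically wired into the decoder through an embedding matrix indexed by the tokenizer vocabulary. The sketch below assumes a Keras-style `word_index` dictionary and the 300-dimensional 6B file; the names are illustrative, not the notebook's exact code.

```python
import numpy as np

def load_glove_matrix(glove_path, word_index, embedding_dim=300):
    # Parse the GloVe text file into {word: vector}.
    embeddings = {}
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            values = line.split()
            embeddings[values[0]] = np.asarray(values[1:], dtype="float32")
    # Rows stay zero for vocabulary words that have no GloVe vector.
    matrix = np.zeros((len(word_index) + 1, embedding_dim))
    for word, idx in word_index.items():
        vector = embeddings.get(word)
        if vector is not None:
            matrix[idx] = vector
    return matrix

# Example (paths/objects assumed):
# embedding_matrix = load_glove_matrix("notebooks/glove.6B.300d.txt", tokenizer.word_index)
```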
├── data
│   ├── flickr30k_images                <- Flickr30k Dataset.
│   └── captions.csv                    <- Captions for the Flickr30k Dataset. Downloaded by default.
│
├── notebooks
│   ├── glove.xB.xxxd.txt               <- GloVe Embeddings.
│   └── neural_image_captioning.ipynb   <- Main Notebook to run.
│
├── models
│   └── checkpoint                      <- Model checkpoints for reuse.
│
└── docs
    ├── system_flow.png                 <- System Flow Diagram.
    ├── attention_explain.png           <- Attention Mechanism Example.
    ├── sample_input.png                <- Sample input from the dataset.
    ├── sample_output.png               <- Predicted caption output.
    └── belu.png                        <- BLEU Performance Metric Visualization.
- Install all dependencies listed in requirements.txt.
- Organize the data and embeddings into the folder structure shown above.
- Run the neural_image_captioning.ipynb notebook in a jupyter-notebook or jupyter-lab session.
- Use the model checkpoints in the models/checkpoint folder to export or improve the model (see the checkpoint sketch below).
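Restoring a saved checkpoint might look roughly like the sketch below. The component names (encoder, decoder, optimizer) are assumptions about how the notebook groups its trainable parts; substitute the actual objects used there.

```python
import tensorflow as tf

# Placeholder components; in the notebook these would be the trained
# encoder/decoder models and their optimizer (assumed names).
encoder = tf.keras.Sequential([tf.keras.layers.Dense(256)])
decoder = tf.keras.Sequential([tf.keras.layers.Dense(256)])
optimizer = tf.keras.optimizers.Adam()

ckpt = tf.train.Checkpoint(encoder=encoder, decoder=decoder, optimizer=optimizer)
manager = tf.train.CheckpointManager(ckpt, "models/checkpoint", max_to_keep=5)

if manager.latest_checkpoint:
    ckpt.restore(manager.latest_checkpoint)  # resume training or run inference from here
```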
The dataset includes 31,783 images and 158,915 captions, with 5 captions per image.
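A quick way to sanity-check those numbers is to load the captions file with pandas; the exact column layout of captions.csv is an assumption here, since it depends on how the file was exported.

```python
import pandas as pd

captions = pd.read_csv("data/captions.csv")
print(len(captions))     # expected around 158,915 rows (5 captions per image)
print(captions.head())   # inspect the image-name / caption columns
```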
The output contains:
- Predicted Caption
- Attention Map Over Image
- Original Image
The BLEU (Bilingual Evaluation Understudy) score measures the similarity between the predicted and reference captions. BLEU is calculated by comparing the n-grams of candidate captions with those of reference captions, producing a score between 0 and 1, where values closer to 1 indicate higher similarity.
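For reference, a sentence-level BLEU score can be computed with NLTK as in the sketch below; the captions here are made-up examples, not outputs of the model.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a dog runs across the grassy field".split(),
    "a brown dog is running on the grass".split(),
]
candidate = "a dog running through the grass".split()

# Smoothing avoids zero scores when higher-order n-grams have no matches.
score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")  # between 0 and 1; closer to 1 means closer to the references
```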
Possible improvements to the current model include:
- Adding more Bidirectional LSTM layers.
- Training the model for more epochs (e.g., 50 epochs as mentioned in the research paper).
- Incorporating state-of-the-art NLP techniques such as transformers (e.g., BERT).
- Using a larger dataset like MSCOCO or applying data augmentation.
- Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
- Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation
- Long-term Recurrent Convolutional Networks for Visual Recognition and Description
- Deep Visual-Semantic Alignments for Generating Image Descriptions




