Neural Image Captioning

Problem Statement

Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. This project implements a working solution inspired by the research paper Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.


What is Image Captioning?

Image captioning involves:

  • Building networks capable of perceiving contextual subtleties in images.
  • Relating observations to both the scene and the real world.
  • Producing succinct and accurate image descriptions.

Methodology

The task of image captioning is divided into two key modules; a minimal code sketch of both follows the list:

  1. Encoder - Image-Based Model:

    • Task: Extract features from the image and generate feature vectors.
    • Input: Source images from the dataset.
    • Output: Flattened vectors of image features for the language model.
    • Model/Technique: Transfer Learning using Inception-ResNet-V2.
  2. Decoder - Language-Based Model:

    • Task: Translate extracted features into a natural language caption.
    • Input: Flattened image feature vectors.
    • Output: Caption for the image.
    • Model/Technique: Bi-Directional Long Short-Term Memory (LSTM) with Bahdanau Attention.
    • Embeddings: GloVe Embeddings (6B.300d).
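Below is a minimal, illustrative sketch of the two modules, assuming TensorFlow 2.x / Keras. The layer sizes, the 299×299 input resolution, and the class/function names are assumptions for illustration, not the exact values used in the notebook.

    # Minimal sketch of the encoder-decoder architecture (assumes TensorFlow 2.x / Keras;
    # sizes and names are illustrative, not the notebook's exact values).
    import tensorflow as tf

    # --- Encoder: Inception-ResNet-V2 feature extractor (transfer learning) ---
    def build_encoder():
        base = tf.keras.applications.InceptionResNetV2(include_top=False, weights="imagenet")
        base.trainable = False                              # freeze the pretrained weights
        inputs = tf.keras.Input(shape=(299, 299, 3))
        features = base(inputs)                             # (batch, 8, 8, 1536) feature map
        # flatten the spatial grid into a sequence of 64 feature vectors for attention
        features = tf.keras.layers.Reshape((-1, 1536))(features)
        return tf.keras.Model(inputs, features)

    # --- Bahdanau (additive) attention over the image feature sequence ---
    class BahdanauAttention(tf.keras.layers.Layer):
        def __init__(self, units):
            super().__init__()
            self.W1 = tf.keras.layers.Dense(units)
            self.W2 = tf.keras.layers.Dense(units)
            self.V = tf.keras.layers.Dense(1)

        def call(self, features, hidden):
            # features: (batch, 64, 1536); hidden: current decoder state (batch, hidden_dim)
            score = self.V(tf.nn.tanh(self.W1(features) + self.W2(tf.expand_dims(hidden, 1))))
            weights = tf.nn.softmax(score, axis=1)          # attention map over the 64 regions
            context = tf.reduce_sum(weights * features, axis=1)
            return context, weights

    # --- Decoder: GloVe-initialised embedding + Bi-LSTM + attention ---
    class Decoder(tf.keras.Model):
        def __init__(self, vocab_size, embedding_matrix, units=512):
            super().__init__()
            self.embedding = tf.keras.layers.Embedding(
                vocab_size, 300,
                embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
                trainable=False)                            # GloVe 6B.300d vectors
            self.bilstm = tf.keras.layers.Bidirectional(
                tf.keras.layers.LSTM(units, return_sequences=True, return_state=True))
            self.attention = BahdanauAttention(units)
            self.fc = tf.keras.layers.Dense(vocab_size)

        def call(self, word_ids, features, hidden):
            context, weights = self.attention(features, hidden)
            x = self.embedding(word_ids)                             # (batch, 1, 300)
            x = tf.concat([tf.expand_dims(context, 1), x], axis=-1)  # prepend image context
            out, fwd_h, _, bwd_h, _ = self.bilstm(x)
            hidden = tf.concat([fwd_h, bwd_h], axis=-1)              # next decoder state
            logits = self.fc(tf.squeeze(out, axis=1))                # scores over the vocabulary
            return logits, hidden, weights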

System Flow Diagram

(see docs/system_flow.png)

Attention Working Example

(see docs/attention_explain.png)


Dependencies

  1. Environment:
    • Anaconda3 or Miniconda3.
    • Python 3.9+ (the code should also run on Python 3.6+).
  2. Libraries:
    • Install everything listed in requirements.txt.
  3. Datasets and Pretrained Models:
    • Flickr30k dataset and its captions (data/flickr30k_images and data/captions.csv).
    • GloVe embeddings (6B.300d); a minimal loading sketch follows this list.
    • Inception-ResNet-V2 pretrained weights (used for transfer learning in the encoder).
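The GloVe file can be turned into an embedding matrix along these lines. This is a minimal sketch: the glove.6B.300d.txt file name and the Keras-style tokenizer.word_index mapping are assumptions for illustration.

    # Minimal sketch: build an embedding matrix from GloVe 6B.300d (the file name and
    # the `word_index` mapping are assumptions, not confirmed details of the notebook).
    import numpy as np

    def load_glove_matrix(glove_path, word_index, dim=300):
        # Each line of the GloVe file is: word v1 v2 ... v300
        vectors = {}
        with open(glove_path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip().split(" ")
                vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")

        matrix = np.zeros((len(word_index) + 1, dim))    # row 0 reserved for padding
        for word, idx in word_index.items():
            vec = vectors.get(word)
            if vec is not None:
                matrix[idx] = vec                        # out-of-vocabulary words stay zero
        return matrix

    # embedding_matrix = load_glove_matrix("notebooks/glove.6B.300d.txt", tokenizer.word_index)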

Project Structure

    ├── data
    │   ├── flickr30k_images                <- Flickr30k Dataset.
    │   └── captions.csv                    <- Captions for the Flickr30k Dataset. Downloaded by default.
    │
    ├── notebooks
    │   ├── glove.xB.xxxd.txt               <- GloVe Embeddings.
    │   └── neural_image_captioning.ipynb   <- Main Notebook to run.
    │
    ├── models
    │   └── checkpoint                      <- Model checkpoints for reuse.
    │
    └── docs
        ├── system_flow.png                 <- System Flow Diagram.
        ├── attention_explain.png           <- Attention Mechanism Example.
        ├── sample_input.png                <- Sample input from the dataset.
        ├── sample_output.png               <- Predicted caption output.
        └── belu.png                        <- BLEU Performance Metric Visualization.
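For orientation, the data folder above can be loaded roughly as follows. This is a minimal sketch assuming pandas; the column name image_name is hypothetical and may differ from the actual captions.csv schema.

    # Minimal sketch of pairing images with captions (the column name is hypothetical).
    import os
    import pandas as pd

    captions = pd.read_csv("data/captions.csv")
    captions["image_path"] = captions["image_name"].apply(
        lambda name: os.path.join("data", "flickr30k_images", name))
    print(captions.head())    # one row per (image, caption) pair; 5 captions per image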

How to Run the Project

  1. Install all dependencies as listed in requirements.txt.
  2. Organize data and embeddings into the folder structure shown above.
  3. Run the neural_image_captioning.ipynb notebook in a Jupyter Notebook or JupyterLab session.
  4. Use the model checkpoints in the models/checkpoint folder to export or continue training the model (a minimal restore sketch follows).
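A minimal sketch of step 4, assuming the notebook saves weights with tf.train.Checkpoint and reusing the encoder/decoder objects from the sketch under Methodology; both are assumptions, not confirmed details of the notebook.

    # Minimal sketch of reusing saved weights (assumes tf.train.Checkpoint was used to save
    # them; `encoder` and `decoder` refer to the models sketched under Methodology).
    import tensorflow as tf

    ckpt = tf.train.Checkpoint(encoder=encoder, decoder=decoder)
    latest = tf.train.latest_checkpoint("models/checkpoint")   # newest checkpoint file
    if latest:
        ckpt.restore(latest).expect_partial()                  # ready for inference or further training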

Sample Input

The dataset includes 31,783 images and 158,915 captions, with 5 captions per image.

(see docs/sample_input.png)


Sample Output

The output contains:

  1. Predicted Caption
  2. Attention Map Over Image
  3. Original Image

(see docs/sample_output.png)


Performance Indicator

The BLEU (Bilingual Evaluation Understudy) score measures the similarity between predicted and reference captions. It is computed by comparing the n-grams of a candidate caption with those of the reference captions, producing a score between 0 and 1 (closer to 1 indicates higher similarity).
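A minimal sketch of the metric using NLTK's bleu_score module; whether the notebook uses NLTK or another implementation is an assumption, and the example captions are made up.

    # Minimal BLEU sketch with NLTK (an assumption; the notebook may compute BLEU differently).
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    references = [["a", "dog", "runs", "through", "the", "grass"]]    # tokenised reference caption(s)
    candidate = ["a", "dog", "is", "running", "on", "the", "grass"]   # tokenised predicted caption

    smooth = SmoothingFunction().method1     # avoids zero scores when a higher n-gram order has no match
    score = sentence_bleu(references, candidate, smoothing_function=smooth)
    print(f"BLEU: {score:.3f}")              # between 0 and 1; closer to 1 means closer to the reference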

BLEU Metric Example

(see docs/belu.png)


Future Improvements

  • Adding more Bidirectional LSTM layers.
  • Training the model for more epochs (e.g., 50 epochs as mentioned in the research paper).
  • Incorporating state-of-the-art NLP techniques such as transformers (e.g., BERT).
  • Using a larger dataset like MSCOCO or applying data augmentation.

References

Papers

  1. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
  2. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation
  3. Long-term Recurrent Convolutional Networks for Visual Recognition and Description
  4. Deep Visual-Semantic Alignments for Generating Image Descriptions

Implementations

Videos
