Movie Genre Detection Model

This project aims to detect the genre(s) of a movie from its YouTube trailer by applying NLP-based machine learning to the trailer's transcript. Each trailer may belong to multiple genres, making this a multi-label classification task.

📁 Dataset & Preprocessing

We used the MovieLens dataset to map each movieId to a YouTube trailer ID. The dataset preparation involves:

Selection of ~600 trailers (a subset of the more than 25,000 movie IDs)
Downloading trailers using yt-dlp
Audio/video separation and preprocessing
Speech-to-text transcription using Google's ASR model
Cleaning the text: punctuation removal, tokenisation, lemmatisation
Final CSV output with the columns:

file_name, movie_name, genre, transcript, audio_file
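
The text-cleaning step listed above (punctuation removal, tokenisation, lemmatisation) can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the function name `clean_transcript` and the pluggable `lemmatise` argument are assumptions made for this example. The default leaves tokens unchanged; a real lemmatiser such as NLTK's `WordNetLemmatizer().lemmatize` could be passed in instead.

```python
import string
from typing import Callable, List

# Translation table mapping every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_transcript(
    text: str,
    lemmatise: Callable[[str], str] = lambda word: word,  # hypothetical hook, identity by default
) -> List[str]:
    """Lowercase the text, strip punctuation, split on whitespace, lemmatise each token."""
    no_punct = text.lower().translate(_PUNCT_TO_SPACE)
    tokens = no_punct.split()
    return [lemmatise(token) for token in tokens]


print(clean_transcript("In a world... where heroes FALL, one man rises!"))
```

Replacing punctuation with spaces (rather than deleting it) also splits contractions, so "don't" becomes the two tokens "don" and "t"; whether that is desirable depends on the lemmatiser used.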

Genre Classification

We focus on the following seven primary genres:

Action
Comedy
Drama
Romance
Horror
Thriller
Sci-Fi

Each trailer may belong to one or more of these genres.

(Other genres are also included in the data, but the focus is on the seven listed above.)
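
Because a trailer can carry several genres, each label is naturally represented as a 0/1 vector over the seven primary genres. The sketch below shows one way to build that vector. It assumes the `genre` column holds a pipe-separated string (the convention MovieLens uses); the actual format of the project's CSV is not stated here, and `encode_genres` is a name made up for this example.

```python
from typing import List

GENRES = ["Action", "Comedy", "Drama", "Romance", "Horror", "Thriller", "Sci-Fi"]


def encode_genres(genre_field: str, sep: str = "|") -> List[int]:
    """Turn a delimited genre string into a 0/1 vector aligned with GENRES."""
    present = {genre.strip() for genre in genre_field.split(sep)}
    # Genres outside the seven primary ones are simply ignored in this sketch.
    return [1 if genre in present else 0 for genre in GENRES]


print(encode_genres("Action|Sci-Fi|Adventure"))  # [1, 0, 0, 0, 0, 0, 1]
```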

Modelling Pipeline

🔹 1. Feature Engineering

N-gram extraction (up to n=10)
Genre-wise frequency distribution
Filtering of overlapping/common words
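
The three feature-engineering steps above could look roughly like this in plain Python. It is a sketch under assumptions: the helper names are invented for illustration, and "filtering of overlapping/common words" is interpreted here as dropping any n-gram that appears under more than one genre, which may differ from the rule used in the notebook.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

NGram = Tuple[str, ...]


def ngrams(tokens: List[str], max_n: int = 10) -> Iterator[NGram]:
    """Yield every contiguous n-gram for n = 1..max_n."""
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            yield tuple(tokens[start:start + n])


def genre_ngram_counts(
    samples: Iterable[Tuple[List[str], List[str]]], max_n: int = 10
) -> Dict[str, Counter]:
    """Count n-grams per genre; each sample is (tokens, genres)."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for tokens, genres in samples:
        grams = list(ngrams(tokens, max_n))
        for genre in genres:
            counts[genre].update(grams)
    return counts


def drop_shared(counts: Dict[str, Counter]) -> Dict[str, Counter]:
    """Keep only n-grams that occur under exactly one genre (assumed filter rule)."""
    genres_per_gram = Counter(gram for counter in counts.values() for gram in counter)
    return {
        genre: Counter({g: n for g, n in counter.items() if genres_per_gram[g] == 1})
        for genre, counter in counts.items()
    }


# Made-up toy data for illustration only.
toy = [
    (["the", "killer", "returns"], ["Horror"]),
    (["the", "wedding", "returns"], ["Romance"]),
]
distinctive = drop_shared(genre_ngram_counts(toy, max_n=2))
print(sorted(distinctive["Horror"]))
```

In the toy data, "the" and "returns" occur under both genres and are filtered out, while "killer" and the bigrams containing it remain distinctive for Horror.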

🔹 2. Supervised Learning

Model: Logistic Regression or similar
Input: Tokenised transcripts
Output: Multi-label genre predictions
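
One common way to set up multi-label logistic regression in scikit-learn is a one-vs-rest wrapper over TF-IDF features, sketched below. The TF-IDF step, the one-vs-rest strategy, and the toy transcripts are all assumptions for illustration; the notebook may use different features or a different model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up transcripts and labels, purely for illustration.
transcripts = [
    "explosion chase gun fight",
    "gun fight explosion hero",
    "laugh joke funny wedding",
    "funny joke laugh party",
    "chase gun funny joke",
    "hero explosion laugh party",
]
labels = [
    ["Action"], ["Action"], ["Comedy"], ["Comedy"],
    ["Action", "Comedy"], ["Action", "Comedy"],
]

# Convert the label lists into a 0/1 indicator matrix, one column per genre.
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(labels)

# One independent logistic-regression classifier per genre column.
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(transcripts, Y)

# Per-genre probabilities for an unseen transcript.
probs = model.predict_proba(["gun fight chase"])
print(dict(zip(binarizer.classes_, probs[0].round(2))))
```

Training one classifier per genre lets a trailer receive several genres at once, since each probability is thresholded independently rather than forced to sum to one.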

🔹 3. Unsupervised Extension

Use clustering on the full MovieLens dataset (25k+ trailers)
Identify emergent genre clusters via transcript similarity
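
To make "clusters via transcript similarity" concrete, here is a small stand-in technique: bag-of-words cosine similarity with single-linkage grouping at a fixed threshold. This is not necessarily the clustering method planned for the project; the function names and the 0.3 threshold are assumptions chosen for the example.

```python
import math
from collections import Counter
from typing import Dict, List


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_by_similarity(docs: List[List[str]], threshold: float = 0.3) -> List[int]:
    """Give the same cluster id to documents linked by similarity >= threshold."""
    bags = [Counter(doc) for doc in docs]
    parent = list(range(len(docs)))  # union-find forest

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(bags)):
        for j in range(i + 1, len(bags)):
            if cosine(bags[i], bags[j]) >= threshold:
                parent[find(j)] = find(i)

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    ids: Dict[int, int] = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(len(docs))]


# Made-up tokenised transcripts for illustration.
toy_docs = [
    ["gun", "chase", "explosion"],
    ["gun", "explosion", "hero"],
    ["love", "wedding", "kiss"],
    ["kiss", "love", "letter"],
]
print(cluster_by_similarity(toy_docs))  # [0, 0, 1, 1]
```

The pairwise loop is quadratic in the number of transcripts, so for the full 25k+ set a library clusterer over sparse TF-IDF vectors would likely be more practical.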

Data Splits

Training: 70%
Testing: 30%
Stratified to ensure balanced genre representation
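
Stratifying a multi-label dataset exactly is not straightforward, because one trailer can count towards several genres. The sketch below uses a simplification that is an assumption of this example, not a description of the project's code: each sample is grouped by its first listed genre, and every group is split 70/30 with a seeded shuffle.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[str, List[str]]  # (file_name, genres); every sample needs at least one genre


def stratified_split(
    samples: Sequence[Sample], test_frac: float = 0.3, seed: int = 42
) -> Tuple[List[Sample], List[Sample]]:
    """Split samples into train/test so each primary-genre group keeps the same ratio."""
    groups: Dict[str, List[Sample]] = defaultdict(list)
    for sample in samples:
        groups[sample[1][0]].append(sample)  # stratify on the first listed genre

    rng = random.Random(seed)
    train: List[Sample] = []
    test: List[Sample] = []
    for genre in sorted(groups):  # sorted for a reproducible order
        items = list(groups[genre])
        rng.shuffle(items)
        n_test = round(len(items) * test_frac)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


# Made-up sample list: 10 Action trailers and 10 Comedy trailers.
toy_samples = [(f"a{i}", ["Action"]) for i in range(10)] + [(f"c{i}", ["Comedy", "Romance"]) for i in range(10)]
train_set, test_set = stratified_split(toy_samples)
print(len(train_set), len(test_set))  # 14 6
```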

📊 Evaluation Metrics

Accuracy@k (Top-k prediction accuracy)
Micro / Macro F1-score
Mean Average Precision (mAP)
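
The metrics above have more than one definition in the multi-label setting, so the sketch below fixes one reading of each, stated as an assumption: Accuracy@k counts a sample as correct if any true genre appears in its top-k predictions, and mAP averages per-sample average precision over the ranked genre list. Function names are illustrative.

```python
from typing import List, Sequence, Set, Tuple


def accuracy_at_k(truth: Sequence[Set[str]], ranked: Sequence[List[str]], k: int) -> float:
    """Fraction of samples with at least one true genre among the top-k predictions."""
    hits = sum(1 for t, preds in zip(truth, ranked) if t & set(preds[:k]))
    return hits / len(truth)


def _f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_scores(truth: Sequence[Set[str]], preds: Sequence[Set[str]], labels: List[str]) -> Tuple[float, float]:
    """Return (micro_f1, macro_f1) computed over the given labels."""
    per_label = []
    tp_sum = fp_sum = fn_sum = 0
    for label in labels:
        tp = sum(1 for t, p in zip(truth, preds) if label in t and label in p)
        fp = sum(1 for t, p in zip(truth, preds) if label not in t and label in p)
        fn = sum(1 for t, p in zip(truth, preds) if label in t and label not in p)
        per_label.append(_f1(tp, fp, fn))
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    # Micro pools the counts across labels; macro averages the per-label scores.
    return _f1(tp_sum, fp_sum, fn_sum), sum(per_label) / len(per_label)


def mean_average_precision(truth: Sequence[Set[str]], ranked: Sequence[List[str]]) -> float:
    """Mean over samples of the average precision of each ranked genre list."""
    ap_values = []
    for t, preds in zip(truth, ranked):
        hits, precisions = 0, []
        for rank, label in enumerate(preds, start=1):
            if label in t:
                hits += 1
                precisions.append(hits / rank)
        ap_values.append(sum(precisions) / len(t) if t else 0.0)
    return sum(ap_values) / len(ap_values)


# Made-up ground truth and ranked predictions for illustration.
y_true = [{"Action", "Thriller"}, {"Comedy"}]
y_ranked = [["Action", "Drama", "Thriller"], ["Drama", "Comedy", "Romance"]]
y_pred = [set(r[:2]) for r in y_ranked]  # treat the top 2 as the predicted set

print(accuracy_at_k(y_true, y_ranked, k=1))  # 0.5
print(f1_scores(y_true, y_pred, ["Action", "Comedy", "Drama", "Thriller"]))
print(mean_average_precision(y_true, y_ranked))
```

Micro F1 is dominated by frequent genres, while macro F1 weights every genre equally, which is why reporting both is useful when the genre distribution is imbalanced.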

🛠 Tools & Libraries

Python, scikit-learn, yt-dlp
NLP: NLTK, spaCy or similar
Transcription: Google Speech Recognition API

👨‍💻 Contributors

📚 References

Radford et al. (2022), Robust Speech Recognition via Large‑Scale Weak Supervision, Whisper ASR
Harper & Konstan (2015), The MovieLens Datasets: History and Context, ACM TiiS
Sulun et al. (2024), Movie Trailer Genre Classification Using Multimodal Pretrained Features, ESWA
Liu et al. (2021), Multi-label Text Classification with tALBERT-CNN, IJCI
