This project aims to detect the genre(s) of a movie from its YouTube trailer by applying NLP and machine learning to the trailer's transcript. Each trailer may belong to multiple genres, making this a multi-label classification task.
We used the MovieLens dataset to map each movieId to a YouTube trailer ID. The dataset preparation involves:
- Selection of ~600 trailers (a subset of the more than 25,000 available movie IDs)
- Downloading trailers using yt-dlp
- Audio/video separation and preprocessing
- Speech-to-text transcription using Google's ASR model
- Cleaning the text: punctuation removal, tokenisation, lemmatisation
- Final CSV output with columns: file_name, movie_name, genre, transcript, audio_file
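
The text-cleaning step can be sketched roughly as follows. This is a minimal, dependency-free illustration rather than the project's actual code: the function name `clean_transcript` is hypothetical, and lemmatisation (which would be done with NLTK or spaCy) is deliberately left out.

```python
import string

# Map every ASCII punctuation character to a space (assumed cleaning rule).
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean_transcript(text: str) -> list[str]:
    """Lowercase a transcript, remove punctuation, and split it into tokens."""
    text = text.lower().translate(_PUNCT_TO_SPACE)
    # Whitespace tokenisation. A fuller pipeline would lemmatise each token
    # here with NLTK or spaCy; that step is omitted in this sketch.
    return text.split()

tokens = clean_transcript("This summer, one man's fight... begins!")
print(tokens)  # ['this', 'summer', 'one', 'man', 's', 'fight', 'begins']
```

The resulting token list is what would be written to the transcript column of the CSV, for example joined with single spaces.
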
We focus on the following seven primary genres:
- Action
- Comedy
- Drama
- Romance
- Horror
- Thriller
- Sci-Fi
Each trailer may belong to one or more of these genres.
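
Because a trailer can carry several genres at once, each label is naturally represented as a 0/1 vector over the seven focus genres. The sketch below assumes the genre column holds a pipe-separated string, as in the MovieLens movies file; the helper name `encode_genres` is hypothetical.

```python
GENRES = ["Action", "Comedy", "Drama", "Romance", "Horror", "Thriller", "Sci-Fi"]

def encode_genres(genre_field: str, sep: str = "|") -> list[int]:
    """Turn a genre string into a 0/1 vector over the seven focus genres."""
    present = {g.strip() for g in genre_field.split(sep)}
    # Genres outside GENRES (e.g. Adventure) are ignored in this sketch.
    return [1 if g in present else 0 for g in GENRES]

print(encode_genres("Action|Sci-Fi|Adventure"))  # [1, 0, 0, 0, 0, 0, 1]
```
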
(Other genres are included as well, but the focus is on these seven.)

- N-gram extraction (up to n=10)
- Genre-wise frequency distribution
- Filtering of overlapping/common words
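
The three steps above could look roughly like this. It is a sketch under stated assumptions: the function names are hypothetical, and "overlapping/common" is interpreted here as "n-grams that occur in every genre", which may differ from the filter actually used.

```python
from collections import Counter

def ngrams(tokens, max_n=10):
    """Yield every n-gram of length 1..max_n as a tuple of tokens."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def genre_ngram_counts(samples, max_n=10):
    """Count n-grams per genre; each sample is (tokens, list_of_genres)."""
    counts = {}
    for tokens, genres in samples:
        grams = list(ngrams(tokens, max_n))
        for genre in genres:
            counts.setdefault(genre, Counter()).update(grams)
    return counts

def drop_shared(counts):
    """Remove n-grams found in every genre, since they do not discriminate."""
    if not counts:
        return {}
    shared = set.intersection(*(set(c) for c in counts.values()))
    return {g: Counter({k: v for k, v in c.items() if k not in shared})
            for g, c in counts.items()}

# Toy data, invented for illustration only.
samples = [
    (["the", "alien", "ship", "lands"], ["Sci-Fi"]),
    (["the", "wedding", "is", "off"], ["Comedy", "Romance"]),
]
filtered = drop_shared(genre_ngram_counts(samples, max_n=2))
print(filtered["Sci-Fi"].most_common(3))
```

In this toy example the unigram `("the",)` appears in all three genres and is therefore dropped, while `("alien", "ship")` survives as a Sci-Fi-specific bigram.
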
- Model: Logistic Regression or similar
- Input: Tokenised transcripts
- Output: Multi-label genre predictions
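
One possible way to set up such a model with scikit-learn is a one-vs-rest wrapper around logistic regression, fed with TF-IDF features. This is a sketch, not the project's confirmed configuration: the TF-IDF features, the `ngram_range`, and the toy transcripts and labels are all assumptions made for illustration, and tokenised transcripts are assumed to be re-joined with spaces.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data, invented for illustration only.
transcripts = [
    "alien ship attack earth laser",
    "alien robot space laser battle",
    "love wedding kiss heart",
    "love date kiss funny wedding",
    "funny joke laugh party",
    "joke party laugh prank",
]
labels = [["Sci-Fi"], ["Sci-Fi", "Action"], ["Romance"],
          ["Romance", "Comedy"], ["Comedy"], ["Comedy"]]

# One 0/1 column per genre, so several genres can be active per trailer.
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(labels)

# One independent logistic-regression classifier per genre column.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(transcripts, Y)

# One probability per genre for a new transcript.
scores = model.predict_proba(["alien laser battle in space"])
for genre, score in zip(binarizer.classes_, scores[0]):
    print(f"{genre}: {score:.2f}")
```

Thresholding each probability (for example at 0.5) or taking the top-k genres turns these scores into multi-label predictions.
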
- Use clustering on the full MovieLens dataset (25k+ trailers)
- Identify emergent genre clusters via transcript similarity
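
Clustering by transcript similarity needs a similarity measure first. The sketch below shows one common choice, cosine similarity over bag-of-words counts; it is an assumption that this measure would be used, and the clustering algorithm itself (for example k-means or hierarchical clustering) is not shown.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists using raw term counts."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    # Empty transcripts have zero norm; treat them as dissimilar to everything.
    return dot / norm if norm else 0.0

print(cosine_similarity(["alien", "ship"], ["alien", "love"]))
```

Transcripts with no words in common score 0, and identical transcripts score 1, so a clustering step can group trailers whose pairwise scores are high.
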
- Training: 70%
- Testing: 30%
- Stratified to ensure balanced genre representation
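
A 70/30 stratified split can be sketched as below. Stratifying multi-label data properly is not trivial, so this example makes a loud simplification: it stratifies on the first listed genre of each trailer only. The function name and the toy rows are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(items, key, train_frac=0.7, seed=0):
    """Split items into train/test, keeping train_frac of every key group."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for members in groups.values():
        rng.shuffle(members)
        cut = round(len(members) * train_frac)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test

# Toy rows of (file_name, genres), invented for illustration only.
data = ([(f"a{i}", ["Action"]) for i in range(10)]
        + [(f"c{i}", ["Comedy", "Romance"]) for i in range(10)])

# Simplification: stratify on the first listed genre only.
train, test = stratified_split(data, key=lambda row: row[1][0])
print(len(train), len(test))  # 14 6
```
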
- Accuracy@k (Top-k prediction accuracy)
- Micro / Macro F1-score
- Mean Average Precision (mAP)
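
The metrics above can be computed along the following lines. Definitions vary between papers, so this sketch fixes some assumptions: Accuracy@k counts a trailer as correct if any of its k highest-scored genres is a true genre, and mAP is averaged per trailer rather than per genre. All function names and the toy values are hypothetical.

```python
def accuracy_at_k(scores, truths, k=3):
    """Fraction of samples whose top-k scored genres include a true genre."""
    hits = 0
    for score_row, true_set in zip(scores, truths):
        top_k = sorted(score_row, key=score_row.get, reverse=True)[:k]
        hits += any(g in true_set for g in top_k)
    return hits / len(scores)

def f1_scores(predicted, truths, genres):
    """Return (micro_f1, macro_f1) given predicted and true genre sets."""
    def f1(tp, fp, fn):
        denom = 2 * tp + fp + fn
        return 2 * tp / denom if denom else 0.0
    total_tp = total_fp = total_fn = 0
    per_genre = []
    for g in genres:
        tp = sum(g in p and g in t for p, t in zip(predicted, truths))
        fp = sum(g in p and g not in t for p, t in zip(predicted, truths))
        fn = sum(g not in p and g in t for p, t in zip(predicted, truths))
        per_genre.append(f1(tp, fp, fn))
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
    # Micro pools the counts over genres; macro averages the per-genre F1.
    return f1(total_tp, total_fp, total_fn), sum(per_genre) / len(genres)

def mean_average_precision(scores, truths):
    """Per-sample average precision over the ranked genres, then the mean."""
    ap_values = []
    for score_row, true_set in zip(scores, truths):
        ranked = sorted(score_row, key=score_row.get, reverse=True)
        hits, precisions = 0, []
        for rank, g in enumerate(ranked, start=1):
            if g in true_set:
                hits += 1
                precisions.append(hits / rank)
        ap_values.append(sum(precisions) / len(true_set) if true_set else 0.0)
    return sum(ap_values) / len(ap_values)

# Toy values, invented for illustration only.
genres = ["Action", "Comedy", "Drama"]
truths = [{"Action"}, {"Comedy", "Drama"}]
predicted = [{"Action"}, {"Comedy"}]
scores = [{"Action": 0.9, "Comedy": 0.2, "Drama": 0.1},
          {"Action": 0.6, "Comedy": 0.3, "Drama": 0.1}]

print(accuracy_at_k(scores, truths, k=1))    # 0.5
print(f1_scores(predicted, truths, genres))
print(mean_average_precision(scores, truths))
```

In the toy example, micro F1 is higher than macro F1 because the single missed genre (Drama) drags down its own per-genre score to zero, which weighs more heavily in the macro average.
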
- Python, scikit-learn, yt-dlp
- NLP: NLTK, spaCy or similar
- Transcription: Google Speech Recognition API
- Radford et al. (2022), Robust Speech Recognition via Large‑Scale Weak Supervision, Whisper ASR
- Harper & Konstan (2015), The MovieLens Datasets: History and Context, ACM TiiS
- Sulun et al. (2024), Movie Trailer Genre Classification Using Multimodal Pretrained Features, ESWA
- Liu et al. (2021), Multi-label Text Classification with tALBERT-CNN, IJCI