This project builds a binary sentiment classifier (positive/negative) for IMDB movie reviews using TF-IDF and linear classifiers (Logistic Regression, Linear SVM, ComplementNB). It selects the best model via 5-fold cross-validation, saves it, and provides a desktop GUI with CustomTkinter.
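A minimal sketch of the model-selection step described above, assuming scikit-learn. The toy reviews, the default hyperparameters, the accuracy scoring, and the names `candidates`, `cv_scores`, and `best_name` are illustrative assumptions, not the settings used in IMDB_Sentiment_Analyser.py:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in data (hypothetical); the real project reads datasets/movie.csv.
texts = [
    "a great and moving film", "terrible plot and awful acting",
    "loved every minute of it", "boring and a waste of time",
    "wonderful performances", "the worst movie this year",
] * 5
labels = [1, 0, 1, 0, 1, 0] * 5

# The three linear classifiers named in this README, with default settings.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "linear_svm": LinearSVC(),
    "complement_nb": ComplementNB(),
}

# Mean 5-fold cross-validated accuracy for each TF-IDF + classifier pipeline.
cv_scores = {}
for name, clf in candidates.items():
    pipeline = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(pipeline, texts, labels, cv=5, scoring="accuracy")
    cv_scores[name] = scores.mean()

# Keep the candidate with the highest mean score and refit it on all the data.
best_name = max(cv_scores, key=cv_scores.get)
best_model = make_pipeline(TfidfVectorizer(), candidates[best_name]).fit(texts, labels)
print(best_name, round(cv_scores[best_name], 3))
```

Wrapping the vectorizer and classifier in one pipeline means TF-IDF is refit inside each fold, so no vocabulary statistics leak from the validation fold into training.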
- Python 3.10+
- Windows PowerShell (commands below assume Windows)
Setup (recommended: virtual environment):
cd C:\Users\Konyar\Desktop\Code\IMDB_Sentiment_Analysis_NLP
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install --upgrade pip
pip install -r requirements.txt
- Location: datasets/movie.csv
- Expected schema: text,label
  - text: review text
  - label: 0 (negative), 1 (positive)
- Source: Kaggle, IMDB Movie Ratings Sentiment Analysis
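Before training, the expected schema can be checked with a few lines of standard-library Python. This is a sketch only: the helper name `check_schema` is hypothetical and is not part of IMDB_Sentiment_Analyser.py.

```python
import csv
import io

EXPECTED_COLUMNS = ["text", "label"]

def check_schema(csv_file):
    """Return (row_count, label_counts); raise ValueError on a schema violation."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"expected header {EXPECTED_COLUMNS}, got {reader.fieldnames}")
    counts = {"0": 0, "1": 0}
    for row in reader:
        label = (row["label"] or "").strip()
        if label not in counts:
            # reader.line_num is the physical line the reader has reached so far.
            raise ValueError(f"line {reader.line_num}: label must be 0 or 1, got {label!r}")
        counts[label] += 1
    return sum(counts.values()), counts

# Inline sample (hypothetical rows) so the snippet runs without the dataset.
sample = io.StringIO('text,label\n"Great film, loved it",1\n"Dull and too long",0\n')
total, counts = check_schema(sample)
print(total, counts)
```

To check the real file, open it with `open("datasets/movie.csv", newline="", encoding="utf-8")` and pass the file object in; the UTF-8 encoding is an assumption about the dataset.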
The command below trains the candidate models, selects the best one by 5-fold cross-validation, evaluates it on a held-out test split, and saves it.
python IMDB_Sentiment_Analyser.py train --csv datasets\movie.csv --model models\sentiment_pipeline.joblib
- Test Accuracy: ... (accuracy on the test split)
- Classification Report: precision / recall / F1
- Confusion Matrix: printed in console
- Confusion matrix image saved to models/confusion_matrix.png
- Model saved to models/sentiment_pipeline.joblib
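The console output listed above can be reproduced in miniature with scikit-learn's metrics functions. This is a sketch under assumptions: the toy data, the 80/20 stratified split, and the choice of Logistic Regression are illustrative, and the plotting step that writes confusion_matrix.png is left out.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in data (hypothetical); the real project reads datasets/movie.csv.
texts = [
    "a great and moving film", "terrible plot and awful acting",
    "loved every minute of it", "boring and a waste of time",
    "wonderful performances", "the worst movie this year",
] * 5
labels = [1, 0, 1, 0, 1, 0] * 5

# Hold out 20% of the rows for testing (the split ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

test_accuracy = accuracy_score(y_test, y_pred)
report = classification_report(
    y_test, y_pred, labels=[0, 1], target_names=["negative", "positive"]
)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])  # rows: true, columns: predicted

print(f"Test Accuracy: {test_accuracy:.4f}")
print("Classification Report:")
print(report)
print("Confusion Matrix:")
print(cm)
```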
After training:
python IMDB_Sentiment_Analyser.py gui --model models\sentiment_pipeline.joblib
Enter a review and click "Predict". The GUI shows the prediction (Positive/Negative) and a confidence percentage.
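The same prediction can be made without the GUI by loading the saved pipeline directly. The sketch below assumes the model file is written with joblib (suggested by the .joblib extension, not confirmed here) and holds a fitted scikit-learn pipeline. It trains a tiny stand-in model and saves it to a temporary directory so that it runs on its own. The helper `predict_review` is hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def predict_review(model, review):
    """Return "Positive" or "Negative" for a single review string."""
    label = int(model.predict([review])[0])
    return "Positive" if label == 1 else "Negative"

# Tiny stand-in pipeline (hypothetical data) in place of the trained model.
texts = ["great film", "awful film", "loved it", "hated it"]
labels = [1, 0, 1, 0]
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, labels)

# Save and reload, mirroring what the train and gui commands do with the model file.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "sentiment_pipeline.joblib"
    joblib.dump(pipeline, model_path)
    restored = joblib.load(model_path)

result = predict_review(restored, "loved this great film")
print(result)
```

With a trained model in place, `joblib.load` can be pointed at models/sentiment_pipeline.joblib instead. Only load joblib files from sources you trust, since loading can execute arbitrary code.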
- Cross-validation over ~40k rows can take a few minutes.
- Confidence comes from predict_proba if available (e.g. Logistic Regression). For margin-based models (LinearSVC), a sigmoid mapping of the decision score is shown for readability. This is not a calibrated probability, but it is useful as a relative confidence.
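The sigmoid mapping mentioned above can be sketched as follows. The function name `margin_to_confidence` is hypothetical, and reporting the confidence of the predicted class (always at least 50%) is an assumption about how the GUI presents it.

```python
import math

def margin_to_confidence(score):
    """Map a signed decision score to (label, confidence between 0.5 and 1.0)."""
    # Numerically stable sigmoid: never exponentiate a large positive number.
    if score >= 0:
        p_positive = 1.0 / (1.0 + math.exp(-score))
    else:
        e = math.exp(score)
        p_positive = e / (1.0 + e)
    if p_positive >= 0.5:
        return "Positive", p_positive
    return "Negative", 1.0 - p_positive

for score in (2.0, -2.0, 0.1):
    label, confidence = margin_to_confidence(score)
    print(f"{score:+.1f} -> {label} ({confidence:.1%})")
```

Scores far from zero map to values near 100% and scores near zero map to about 50%, which is why the number works for ranking predictions but should not be read as a probability of being correct.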
IMDB_Sentiment_Analysis_NLP/
    IMDB_Sentiment_Analyser.py       # Training, evaluation (with CM plot), GUI
    datasets/
        movie.csv                    # Dataset (text,label)
    models/
        sentiment_pipeline.joblib    # (Created after training)
        confusion_matrix.png         # (Created after training)
    requirements.txt
    README.md
This project is for educational purposes. Usage terms for the dataset are set by its source.