Sentiment Analysis: A Self-Study in NLP Model Comparison

This is a self-study project focused on Natural Language Processing (NLP) and various machine learning models for sentiment analysis. The project explores different model architectures and techniques to classify customer feedback as positive, negative, or neutral sentiment.

A key challenge addressed in this project is the mismatch between training and testing data: models are trained on relatively short sentences (average 3-4 words) but must predict sentiment for much longer sentences (average 65 words).

This length discrepancy motivated the exploration of aggregated prediction: a long input is split into smaller chunks, each chunk is classified separately, and the per-chunk class probabilities are combined into a final prediction.
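
The chunk-and-combine idea can be sketched in a few lines of Python. This is a minimal illustration, not the code in scripts/: the function names, the chunk size of 4 words, and the keyword-based `toy_predict_proba` stand-in for a trained model are all assumptions made for the example. Probabilities are combined by simple averaging.

```python
from typing import Callable, List, Sequence, Tuple

LABELS = ["negative", "neutral", "positive"]


def split_into_chunks(text: str, chunk_size: int = 4) -> List[str]:
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def aggregated_predict(
    text: str,
    predict_proba: Callable[[str], Sequence[float]],
    labels: Sequence[str] = LABELS,
    chunk_size: int = 4,
) -> Tuple[str, List[float]]:
    """Predict each chunk separately, then average the class probabilities."""
    chunks = split_into_chunks(text, chunk_size)
    if not chunks:
        raise ValueError("text contains no words")
    per_chunk = [predict_proba(chunk) for chunk in chunks]
    mean_probs = [
        sum(probs[i] for probs in per_chunk) / len(per_chunk)
        for i in range(len(labels))
    ]
    best = max(range(len(labels)), key=mean_probs.__getitem__)
    return labels[best], mean_probs


def toy_predict_proba(chunk: str) -> List[float]:
    """Hypothetical stand-in for a trained model; returns [negative, neutral, positive]."""
    words = set(chunk.lower().split())
    if words & {"great", "love"}:
        return [0.1, 0.2, 0.7]
    if words & {"bad", "broken"}:
        return [0.7, 0.2, 0.1]
    return [0.2, 0.6, 0.2]


label, probs = aggregated_predict(
    "I love this phone it works great every single day", toy_predict_proba
)
print(label, probs)
```

In practice `predict_proba` would wrap a fitted model together with its text transformer, and the chunk size would be chosen to match the length of the training sentences.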

Customized Preprocessing

Several preprocessing techniques are used to improve the performance of NLP models and ensure the data is clean and suitable for training:

Removing unwanted characters and special symbols
Eliminating stop words (commonly used words like "and," "the," etc.)
Applying stemming and lemmatization to reduce words to their root forms
Encoding textual data
Padding sequences to ensure uniform input lengths (required only for sequence-based models; not needed for bag-of-words models)

The preprocessing pipeline is implemented in scripts/word_normalization.py for consistent text transformation across all models.
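
As a rough illustration of the cleaning, stop-word removal, encoding, and padding steps listed above, here is a standard-library-only sketch. It is not the contents of scripts/word_normalization.py: the function names, the tiny stop-word list, and the reserved ids (0 for padding, 1 for unknown words) are assumptions. Stemming and lemmatization are left out because they normally come from an NLP library.

```python
import re
from typing import Dict, List

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "is", "it", "the", "to"}


def clean_text(text: str) -> List[str]:
    """Lowercase, replace non-letter characters with spaces, drop stop words."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [word for word in text.split() if word not in STOP_WORDS]


def build_vocab(token_lists: List[List[str]]) -> Dict[str, int]:
    """Assign integer ids starting at 2 (0 = padding, 1 = unknown word)."""
    vocab: Dict[str, int] = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab) + 2)
    return vocab


def encode_and_pad(tokens: List[str], vocab: Dict[str, int], max_len: int) -> List[int]:
    """Map tokens to ids, truncate to max_len, and right-pad with zeros."""
    ids = [vocab.get(token, 1) for token in tokens][:max_len]
    return ids + [0] * (max_len - len(ids))


docs = ["The battery is GREAT!!!", "Screen broke after 2 days :("]
tokens = [clean_text(doc) for doc in docs]
vocab = build_vocab(tokens)
padded = [encode_and_pad(t, vocab, max_len=5) for t in tokens]
print(tokens)
print(padded)
```

Padding on the right is a choice made for this sketch; some libraries pad on the left by default, so the sequence models in this project may use a different convention.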

Project Structure

root/
├── datasets/           # Dataset files (raw and processed)
├── models/             # Trained model files (.joblib, .keras, PyTorch models)
├── notebooks/          # Jupyter notebooks for model development and analysis
├── scripts/            # Utility scripts (preprocessing, aggregated prediction)
├── text_transformers/  # Saved text vectorization objects (.pkl files)
└── tuner_results/      # Hyperparameter tuning results and trial data

Models Explored

The project compares six models across five model families (the RNN is trained in two variants):

Naive Bayes Classifier (NBC): Traditional probabilistic classifier using count vectorization
Recurrent Neural Network (RNN): Deep learning model with two variants:
- RNN with custom tokenizer
- RNN with TextVectorization layer
BERT Transformer: Pre-trained transformer model fine-tuned for sentiment analysis
Logistic Regression: Linear classifier with TF-IDF vectorization
Support Vector Machine (SVM): Linear kernel SVM with TF-IDF vectorization
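
For the three classical models, the pairing of vectorizer and classifier described above could look like the following sketch. It assumes scikit-learn, which the README does not name explicitly. The toy training data and the hyperparameters (library defaults apart from `max_iter`) are illustrative only and are not the project's tuned settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy data for illustration only; not taken from the project's datasets.
train_texts = [
    "great product", "love it",
    "terrible quality", "very bad",
    "it is okay", "average experience",
]
train_labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

# Each pipeline bundles the vectorizer with the classifier it feeds.
models = {
    "nbc": make_pipeline(CountVectorizer(), MultinomialNB()),
    "logreg": make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000)),
    "svm": make_pipeline(TfidfVectorizer(), SVC(kernel="linear")),
}

for name, model in models.items():
    model.fit(train_texts, train_labels)
    print(name, model.predict(["great quality", "bad product"]))
```

Keeping the vectorizer and classifier in one pipeline means the same text transformation is applied at training and prediction time.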

Project Workflow

Phase 1: Data Preparation

Combined multiple datasets to create a diverse training set
Applied customized preprocessing pipeline
Split data into training and testing sets
Saved processed datasets for consistent model evaluation

Phase 2: Model Development

Trained each model with the preprocessing appropriate to it
Optimized each model with hyperparameter tuning
Saved the trained models for later comparison and evaluation

Phase 3: Baseline Comparison

Compared all six models on standardized test datasets
Evaluated performance using confusion matrices and accuracy metrics
Identified BERT as the best-performing model
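
The evaluation step can be illustrated with a small, dependency-free sketch of a confusion matrix and the accuracy derived from it. The label order and the example predictions below are made up for illustration; they are not results from this project.

```python
from typing import List, Sequence

LABELS = ["negative", "neutral", "positive"]


def confusion_matrix(
    y_true: Sequence[str], y_pred: Sequence[str], labels: Sequence[str] = LABELS
) -> List[List[int]]:
    """Count predictions; rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[index[true_label]][index[predicted_label]] += 1
    return matrix


def accuracy(matrix: List[List[int]]) -> float:
    """Share of samples on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Made-up labels and predictions, for illustration only.
y_true = ["positive", "negative", "neutral", "positive", "negative"]
y_pred = ["positive", "negative", "positive", "positive", "neutral"]

matrix = confusion_matrix(y_true, y_pred)
for label, row in zip(LABELS, matrix):
    print(f"{label:>8}: {row}")
print("accuracy:", accuracy(matrix))
```

Reading the matrix row by row shows which classes a model confuses, which a single accuracy number hides.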

Phase 4: Aggregated Prediction Hypothesis

Hypothesized that splitting long sentences into chunks might improve performance
Implemented the aggregated prediction approach for the five other models (all except BERT)
Tested chunk-based prediction with probability averaging
Found that Logistic Regression benefited from aggregation, while others did not

Phase 5: Final Comparison

Compared BERT (normal prediction) against Logistic Regression (aggregated prediction)
Confirmed that BERT still outperforms the best aggregated approach

Dataset Sources

Notes

This project serves as a learning exercise in NLP and machine learning, exploring various approaches to sentiment analysis. The notebooks document the complete workflow from data preparation to model comparison and evaluation.

