This project performs sentiment analysis on tweets using the Twitter Sentiment Analysis Dataset from Kaggle. It cleans the text with standard NLP techniques, then classifies each tweet as positive or negative using an NLTK Naive Bayes classifier and scikit-learn classifiers trained on TF-IDF features.
The dataset used is `training.1600000.processed.noemoticon.csv` from Kaggle's Twitter Sentiment Analysis dataset:
- Sentiment: 0 = Negative, 4 = Positive (converted to 1 for binary classification)
- Text: Actual tweet text
You can download it from Kaggle: Twitter Sentiment Analysis Dataset
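
A minimal sketch of loading the file with pandas. It assumes the CSV has no header row, is Latin-1 encoded, and holds six columns in the order shown. The names `Sentiment` and `Text` follow this README; the other column names and the `load_tweets` helper are illustrative, not taken from the project.

```python
import pandas as pd

# Assumed column order of the raw file; only Sentiment and Text are named in this README.
COLUMNS = ["Sentiment", "Id", "Date", "Query", "User", "Text"]


def load_tweets(path):
    """Read the raw CSV (assumed headerless and Latin-1 encoded) into a DataFrame."""
    return pd.read_csv(path, encoding="latin-1", header=None, names=COLUMNS)
```

Calling `load_tweets("training.1600000.processed.noemoticon.csv")` should then return one row per tweet, with the label in `Sentiment` and the tweet body in `Text`.
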
- Python
- NLTK
- NumPy / Pandas
- Matplotlib
- Scikit-learn
- Drop irrelevant columns: Keep only `Sentiment` and `Text`.
- Sentiment Mapping: 4 → 1 (positive)
- Downsampling: Balance dataset to have equal positive and negative samples.
- Text Cleaning:
  - Lowercasing
  - Remove stopwords & punctuation
  - Remove digits, tags, special characters
  - Lemmatization using `WordNetLemmatizer`
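
The preprocessing steps above can be sketched roughly as follows. This is an illustration under assumptions, not the project's actual code: `prepare_labels` and `clean_text` are hypothetical helper names, "tags" is read as HTML tags, @mentions, and hashtags, and a tiny built-in stopword set stands in for NLTK's list so the sketch runs without corpus downloads.

```python
import re

import pandas as pd

# Tiny illustrative stopword set; an assumption standing in for NLTK's English stopwords.
DEFAULT_STOPWORDS = frozenset({"a", "an", "the", "is", "are", "to", "and", "i", "it", "of"})


def prepare_labels(df, seed=42):
    """Keep Sentiment/Text, map label 4 to 1, and downsample to balanced classes."""
    df = df[["Sentiment", "Text"]].copy()
    df["Sentiment"] = df["Sentiment"].replace(4, 1)
    n = df["Sentiment"].value_counts().min()  # size of the smaller class
    parts = [group.sample(n=n, random_state=seed) for _, group in df.groupby("Sentiment")]
    # Shuffle so the two classes are interleaved rather than stacked.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)


def clean_text(text, stopwords=DEFAULT_STOPWORDS, lemmatize=None):
    """Lowercase, strip URLs/tags/digits/punctuation, drop stopwords, optionally lemmatize."""
    text = text.lower()
    # URLs, HTML tags, @mentions, and #hashtags (one reading of "tags").
    text = re.sub(r"https?://\S+|<[^>]+>|[@#]\w+", " ", text)
    # Anything that is not a lowercase letter or whitespace: digits, punctuation, symbols.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [token for token in text.split() if token not in stopwords]
    if lemmatize is not None:
        tokens = [lemmatize(token) for token in tokens]
    return " ".join(tokens)
```

To follow the list above more closely, pass `set(nltk.corpus.stopwords.words("english"))` as `stopwords` and `nltk.stem.WordNetLemmatizer().lemmatize` as `lemmatize`, after fetching the `stopwords` and `wordnet` corpora with `nltk.download`.
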
- Distribution of sentiments before and after balancing
- Sample visualization using histograms
- Top features contributing to sentiment classification
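
One way to draw the before/after label distribution is sketched below. It uses Matplotlib bar charts in place of the histograms mentioned above, which is an assumption about the plot style, and `plot_sentiment_counts` is a hypothetical helper name.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd


def plot_sentiment_counts(before, after):
    """Draw side-by-side bar charts of label counts before and after balancing."""
    fig, axes = plt.subplots(1, 2, figsize=(8, 3), sharey=True)
    titles = ("Before balancing", "After balancing")
    for ax, labels, title in zip(axes, (before, after), titles):
        counts = pd.Series(labels).value_counts().sort_index()
        ax.bar(counts.index.astype(str), counts.values)
        ax.set_title(title)
        ax.set_xlabel("Sentiment")
    axes[0].set_ylabel("Tweets")
    fig.tight_layout()
    return fig
```

The returned figure can be written to disk with `fig.savefig("sentiment_counts.png")`; the file name is only an example.
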
- Token-based feature extraction
- Accuracy:
  - Training: ~86%
  - Testing: ~76%
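
A minimal sketch of token-based features with NLTK's Naive Bayes classifier. The toy sentences stand in for the cleaned tweets, and the helper names `token_features` and `train_nltk_nb` are illustrative; whitespace splitting is assumed as the tokenizer, which may differ from the project's.

```python
import nltk


def token_features(text):
    """Map a cleaned tweet to a bag-of-words feature dict, e.g. {"love": True}."""
    return {token: True for token in text.split()}


def train_nltk_nb(train_pairs):
    """Train NLTK's Naive Bayes classifier on (text, label) pairs."""
    featuresets = [(token_features(text), label) for text, label in train_pairs]
    return nltk.NaiveBayesClassifier.train(featuresets)


# Toy data standing in for the cleaned, balanced tweets.
train_pairs = [
    ("love great phone", 1),
    ("love happy day", 1),
    ("great happy mood", 1),
    ("hate awful phone", 0),
    ("hate sad day", 0),
    ("awful sad mood", 0),
]
classifier = train_nltk_nb(train_pairs)

test_set = [(token_features("love happy"), 1), (token_features("hate sad"), 0)]
print("accuracy:", nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)
```

`show_most_informative_features` is also a convenient way to inspect which tokens contribute most to the classification.
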
- Feature extraction using `TfidfVectorizer`
- Models:
  - Multinomial Naive Bayes
  - Bernoulli Naive Bayes
  - Linear SVM
- Model evaluation using accuracy and confusion matrix
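
The TF-IDF workflow above might look roughly like this in scikit-learn. It is a sketch on toy data with default hyperparameters; `evaluate_models` is a hypothetical helper, and `LinearSVC` is assumed as the Linear SVM implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def evaluate_models(train_texts, train_labels, test_texts, test_labels):
    """Fit each model on TF-IDF features; return {name: (accuracy, confusion matrix)}."""
    models = {
        "Multinomial NB": MultinomialNB(),
        "Bernoulli NB": BernoulliNB(),
        "Linear SVM": LinearSVC(),
    }
    results = {}
    for name, model in models.items():
        # The vectorizer is fitted on the training texts only, inside the pipeline.
        pipeline = make_pipeline(TfidfVectorizer(), model)
        pipeline.fit(train_texts, train_labels)
        predicted = pipeline.predict(test_texts)
        results[name] = (
            accuracy_score(test_labels, predicted),
            confusion_matrix(test_labels, predicted, labels=[0, 1]),
        )
    return results


# Toy data standing in for the cleaned, balanced tweets.
train_texts = ["love great phone", "love happy day", "hate awful phone", "hate sad day"]
train_labels = [1, 1, 0, 0]
test_texts = ["great happy mood", "awful sad mood"]
test_labels = [1, 0]

results = evaluate_models(train_texts, train_labels, test_texts, test_labels)
for name, (accuracy, matrix) in results.items():
    print(f"{name}: accuracy={accuracy:.2f}")
    print(matrix)
```

Wrapping the vectorizer and the classifier in one pipeline keeps test-set vocabulary out of the fitted TF-IDF statistics.
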
| Model | Accuracy (Test) |
|---|---|
| NLTK Naive Bayes | 76% |
| Multinomial NB (TF-IDF) | ~84–86% |
| Linear SVM (TF-IDF) | ~88–90% |
- Hyperparameter tuning using GridSearchCV
- Adding emoji/emoticon handling
- Applying deep learning models (e.g., LSTM, BERT)
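
As a starting point for the hyperparameter tuning item above, a `GridSearchCV` run over a TF-IDF plus Linear SVM pipeline could be sketched as follows. The parameter grid, the step names `tfidf` and `clf`, and the `tune_linear_svm` helper are assumptions chosen for illustration, not values used in this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def tune_linear_svm(texts, labels, cv=5):
    """Grid-search the n-gram range and SVM regularization strength C."""
    pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])
    param_grid = {
        "tfidf__ngram_range": [(1, 1), (1, 2)],
        "clf__C": [0.1, 1.0, 10.0],
    }
    search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
    return search.fit(texts, labels)


# Toy data; cv=2 only because the sample is tiny.
texts = [
    "love great phone", "love happy day", "great happy mood", "happy great love",
    "hate awful phone", "hate sad day", "awful sad mood", "sad awful hate",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
search = tune_linear_svm(texts, labels, cv=2)
print(search.best_params_, search.best_score_)
```
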
- Kaggle Twitter Dataset by Kaggle user kazanova
- NLTK and Scikit-learn documentation