This repository contains a Sentiment Analysis project built with Natural Language Processing (NLP) techniques. It uses the Sentiment140 dataset of 1.6 million labeled tweets to train a model that classifies each tweet's sentiment.
- Overview
- Dataset
- Installation
- Usage
- Preprocessing
- Model Training
- Evaluation
- Results
- Contributing
- License
Natural Language Processing (NLP) is a field of Artificial Intelligence that enables machines to understand and process human languages. In this project, we focus on text classification to perform Sentiment Analysis, categorizing tweets into Positive and Negative sentiments. Sentiment Analysis is widely used in various industries to gauge public opinion about products, services, or topics.
The project uses the Sentiment140 dataset, which contains 1.6 million labeled tweets. Each tweet is labeled as either Positive (4) or Negative (0).
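As a minimal sketch of how the labels described above might be read, the snippet below parses rows and remaps the 0/4 encoding to 0/1. The column order in `COLUMNS` and the function name `load_sentiment140` are assumptions for illustration, not taken from the notebook; check them against your copy of the file.

```python
import csv
import io

# Assumed column order for the headerless Sentiment140 CSV; verify locally.
COLUMNS = ["target", "ids", "date", "flag", "user", "text"]


def load_sentiment140(source):
    """Read (texts, labels) from an open file-like object.

    Labels are remapped from the dataset's 0/4 encoding to 0/1.
    """
    texts, labels = [], []
    for row in csv.reader(source):
        if len(row) < len(COLUMNS):
            continue  # skip blank or malformed rows
        record = dict(zip(COLUMNS, row))
        target = int(record["target"])
        if target not in (0, 4):
            continue  # keep only Negative (0) and Positive (4)
        texts.append(record["text"])
        labels.append(1 if target == 4 else 0)
    return texts, labels


# Tiny in-memory sample standing in for the real file.
sample = io.StringIO(
    '"0","1","Mon Apr 06","NO_QUERY","alice","so sad today"\n'
    '"4","2","Mon Apr 06","NO_QUERY","bob","loving this weather"\n'
)
texts, labels = load_sentiment140(sample)
print(labels)  # [0, 1]
```

For the real dataset, pass a file opened with `newline=""`. The encoding may need to be set explicitly (`latin-1` is a common choice for this file, but treat that as an assumption).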
To run this project, you need to have Python installed. Follow the steps below to set up the environment:
- Clone this repository:
git clone https://github.com/yourusername/sentiment-analysis-nlp.git
- Navigate to the project directory:
cd sentiment-analysis-nlp
- Install the required libraries:
pip install -r requirements.txt
- Download the Sentiment140 dataset and place it in the `data` directory.
- Run the Jupyter Notebook:
jupyter notebook sentiment_analysis.ipynb
The preprocessing steps include:
- Removing user mentions, hyperlinks, and non-alphanumeric characters.
- Converting text to lowercase.
- Removing stopwords using NLTK.
- Stemming words using SnowballStemmer.
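The steps above can be sketched roughly as follows. This is an illustrative, standard-library-only version: the regular expression, the short `STOPWORDS` set, and the `preprocess` function name are assumptions, and the small stopword set merely stands in for NLTK's English list.

```python
import re

# Small stand-in stopword set; the project itself uses NLTK's English stopwords.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "and", "of", "in", "it", "i", "my"}

# Matches user mentions, hyperlinks, or any run of non-alphanumeric characters.
CLEAN_RE = re.compile(r"@\S+|https?://\S+|www\.\S+|[^A-Za-z0-9]+")


def preprocess(text, stem=None):
    """Clean one tweet: strip noise, lowercase, drop stopwords, optionally stem."""
    text = CLEAN_RE.sub(" ", text).lower()
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    if stem is not None:
        tokens = [stem(tok) for tok in tokens]
    return " ".join(tokens)


print(preprocess("@bob I LOVE the new phone!!! https://t.co/xyz"))  # love new phone
```

To add stemming, a callable such as `SnowballStemmer("english").stem` from `nltk.stem` can be passed as `stem`. Likewise, `STOPWORDS` can be replaced with `set(nltk.corpus.stopwords.words("english"))` once the NLTK stopwords corpus has been downloaded.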
The project uses TensorFlow for training the sentiment classification model. Key steps include:
- Tokenization of text data.
- Padding sequences to ensure uniform input length.
- Using pre-trained GloVe embeddings for word representation.
- Training a sequence model (e.g., LSTM or GRU).
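To make the first three steps concrete, here is a framework-free sketch of tokenization, padding, and building an embedding matrix from GloVe-format lines. All function names, the post-padding choice, and the two-dimensional vectors are assumptions for illustration. The notebook's TensorFlow code may differ.

```python
from collections import Counter


def build_vocab(texts, max_words=None):
    """Map words to integer ids, most frequent first; id 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common(max_words))}


def texts_to_padded(texts, vocab, max_len):
    """Convert texts to id sequences, truncated or zero-padded at the end to max_len."""
    padded = []
    for text in texts:
        ids = [vocab[w] for w in text.split() if w in vocab][:max_len]
        padded.append(ids + [0] * (max_len - len(ids)))
    return padded


def embedding_matrix(vocab, glove_lines, dim):
    """Build one row per vocabulary id from GloVe-format lines ("word v1 v2 ...")."""
    matrix = [[0.0] * dim for _ in range(len(vocab) + 1)]
    for line in glove_lines:
        parts = line.split()
        if not parts:
            continue  # skip empty lines
        word, values = parts[0], parts[1:]
        if word in vocab and len(values) == dim:
            matrix[vocab[word]] = [float(v) for v in values]
    return matrix


texts = ["love new phone", "hate new update"]
vocab = build_vocab(texts)
sequences = texts_to_padded(texts, vocab, max_len=4)
glove = ["love 0.1 0.2", "hate -0.1 -0.2"]  # made-up 2-dimensional vectors
matrix = embedding_matrix(vocab, glove, dim=2)
print(sequences)  # [[2, 1, 3, 0], [4, 1, 5, 0]]
```

In a TensorFlow pipeline, a matrix like this would typically initialize an `Embedding` layer placed in front of the LSTM or GRU layer. Words missing from the GloVe file keep an all-zero row.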
The dataset is split into training (80%) and testing (20%) sets. The model's performance is evaluated using metrics such as accuracy, precision, recall, and F1-score.
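The split and the metrics named above can be sketched in plain Python as follows. The function names, the fixed seed, and the toy labels are assumptions for illustration. The notebook may rely on a library for both steps.

```python
import random


def train_test_split(items, labels, test_ratio=0.2, seed=42):
    """Shuffle indices with a fixed seed, then hold out test_ratio of them."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    cut = len(indices) - round(len(indices) * test_ratio)
    train, test = indices[:cut], indices[cut:]
    return (
        [items[i] for i in train],
        [items[i] for i in test],
        [labels[i] for i in train],
        [labels[i] for i in test],
    )


def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = correct / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


items = list(range(10))
item_labels = [i % 2 for i in items]
x_train, x_test, y_train, y_test = train_test_split(items, item_labels)
metrics = classification_metrics([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
print(len(x_train), len(x_test))  # 8 2
```

Fixing the seed keeps the split reproducible between runs, which makes reported metrics easier to compare.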
The model achieves an accuracy of approximately 79% on the test set. Visualizations of word clouds for Positive and Negative sentiments are also provided.
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License. See the LICENSE file for more details.
- Sentiment140 Dataset
- GloVe: Global Vectors for Word Representation
- NLTK: Natural Language Toolkit
- TensorFlow
- This project is part of my AI/ML training with Internz Learn. Thanks to them for their guidance and support.
Feel free to reach out if you have any questions or suggestions. Happy coding!