This project is a spam/ham email classification system developed in a Jupyter Notebook.
It trains a machine learning pipeline that labels emails or messages as spam or ham, using
CountVectorizer or TfidfVectorizer features and a Multinomial Naive Bayes (MNB) classifier.
The project also includes a simple Streamlit interface for interactive predictions.
- Python (Jupyter Notebook)
- pandas, numpy → data loading & preprocessing
- scikit-learn → CountVectorizer / TfidfVectorizer, MultinomialNB, Pipeline, evaluation
- joblib → saving/loading trained model
- nltk → optional stopword handling
- matplotlib, seaborn → plots & confusion matrix
- Streamlit → simple demo app
- Dataset → Load `spam_ham_database.csv` (label + text columns).
- Preprocessing → lowercase, remove URLs/emails, normalize numbers, clean special characters (lemmatization optional).
- Train/Test Split → 80% training, 20% testing with stratification to keep class balance.
- Vectorization → Convert text into numeric features using CountVectorizer / TF-IDF with unigrams + bigrams.
- Model Training → Multinomial Naive Bayes classifier trained on the vectorized data.
- Evaluation → Accuracy, Precision, Recall, F1-score, and Confusion Matrix.
- Artifacts → Save the trained pipeline with `joblib` for reuse.
- Streamlit Demo → Input text → cleaned → classified as Spam/Ham with probability.
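The preprocessing step listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `clean_text`, the regular expressions, and the placeholder tokens (`url`, `email`, `number`) are all assumptions, and lemmatization is left out.

```python
import re

# Hypothetical patterns; the notebook's real cleaning rules may differ.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")
NON_LETTER_RE = re.compile(r"[^a-z\s]")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Lowercase a message and replace noisy spans with placeholder tokens."""
    text = text.lower()
    text = URL_RE.sub(" url ", text)        # URLs first, so their digits are not split out
    text = EMAIL_RE.sub(" email ", text)    # email addresses
    text = NUMBER_RE.sub(" number ", text)  # normalize every number to one token
    text = NON_LETTER_RE.sub(" ", text)     # drop remaining special characters
    return WHITESPACE_RE.sub(" ", text).strip()


print(clean_text("WIN $1,000 now!!! Visit http://spam.example.com or mail win@spam.example.com"))
# prints: win number now visit url or mail email
```

Replacing URLs, addresses, and numbers with fixed tokens (rather than deleting them) keeps the signal that a message contained one, which is often useful for spam detection.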
- Achieved accuracy of roughly 90–95%, depending on the preprocessing used.
- Built a reproducible ML pipeline (preprocessing + vectorizer + model).
- Demonstrated results interactively with Streamlit.
- Clear separation of stages: preprocessing → training → evaluation → serving.
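The split, vectorization, training, evaluation, and artifact stages described above might look roughly like the sketch below. It is written under several assumptions: the column names `text` and `label`, the label values `ham` and `spam`, the artifact file name `spam_pipeline.joblib`, and the tiny inline dataset are all made up so the example runs on its own. Custom text cleaning is omitted here; only the vectorizer's built-in lowercasing is used.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def build_pipeline() -> Pipeline:
    """TF-IDF over unigrams + bigrams, followed by Multinomial Naive Bayes."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2))),
        ("nb", MultinomialNB()),
    ])


def train_and_evaluate(df: pd.DataFrame, text_col: str = "text", label_col: str = "label") -> Pipeline:
    """Stratified 80/20 split, fit, and print the evaluation metrics."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[text_col], df[label_col],
        test_size=0.2, stratify=df[label_col], random_state=42,
    )
    model = build_pipeline()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred, zero_division=0))
    print(confusion_matrix(y_test, y_pred, labels=["ham", "spam"]))
    return model


# Made-up toy data; in the project this would come from
# pd.read_csv("spam_ham_database.csv") instead.
ham = [
    "Are we still meeting for lunch tomorrow?",
    "Please review the attached project report",
    "Can you call me when you get home?",
    "The meeting is moved to 3pm on Friday",
    "Thanks for sending the notes from class",
    "Let's catch up over coffee next week",
]
spam = [
    "WIN a free prize now, click here",
    "Congratulations, you won a cash reward",
    "Cheap loans approved instantly, apply now",
    "Claim your free gift card today",
    "Urgent: your account won a lottery prize",
    "Free entry to win cash, text now",
]
df = pd.DataFrame({"text": ham + spam, "label": ["ham"] * 6 + ["spam"] * 6})

model = train_and_evaluate(df)

# Round-trip the fitted pipeline through joblib (temporary directory for the demo;
# the file name is a placeholder).
with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "spam_pipeline.joblib"
    joblib.dump(model, artifact)
    restored = joblib.load(artifact)
```

Saving the whole `Pipeline` rather than the classifier alone means the vectorizer's fitted vocabulary travels with the model, so the serving code can pass raw text straight to `predict`.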
The Streamlit app provides a simple way to test the model.
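As an illustration of what such an app could contain, here is a minimal sketch. It is not the project's actual `streamlit_app.py`: the artifact name `spam_pipeline.joblib`, the helper names, the `spam` label value, and the 0.5 threshold are assumptions, and it presumes the saved pipeline accepts raw text (any extra cleaning would be applied before `predict_proba`).

```python
from pathlib import Path

MODEL_PATH = Path("spam_pipeline.joblib")  # hypothetical artifact name


def spam_probability(model, text: str) -> float:
    """Return P(spam) for one message from a fitted scikit-learn pipeline."""
    classes = list(model.classes_)
    proba = model.predict_proba([text])[0]
    return float(proba[classes.index("spam")])  # assumes a "spam" label value


def format_verdict(p_spam: float, threshold: float = 0.5) -> str:
    """Turn a spam probability into the label and confidence shown to the user."""
    label = "Spam" if p_spam >= threshold else "Ham"
    confidence = p_spam if label == "Spam" else 1.0 - p_spam
    return f"{label} ({confidence:.1%} confidence)"


def main() -> None:
    # Imported lazily so the helpers above stay usable without the UI dependencies.
    try:
        import joblib
        import streamlit as st
    except ImportError as exc:
        print(f"Missing dependency: {exc.name}")
        return

    st.title("Spam/Ham classifier")
    if not MODEL_PATH.exists():
        st.error(f"Model file {MODEL_PATH} not found; train and save the pipeline first.")
        return

    model = joblib.load(MODEL_PATH)
    text = st.text_area("Message")
    if st.button("Classify") and text.strip():
        st.write(format_verdict(spam_probability(model, text)))


if __name__ == "__main__":
    main()
```

Keeping the probability lookup and the formatting in plain functions makes them easy to check outside Streamlit, for example from the notebook.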
# Install dependencies
pip install -r requirements.txt
# Train & save pipeline (run the notebook first if needed)
jupyter notebook notebook.ipynb
# Run Streamlit demo
streamlit run streamlit_app.py