Spam-Email-Detection

In this project I detect spam emails using natural language processing (NLP). The email dataset comes from Kaggle.

Dataset:

⋄ The dataset consists of 5572 email entries in 2 columns (Category and Message).

⋄ It does not include any missing values.
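The row count, column names, and missing-value check above can be reproduced with a few lines of pandas. This is a minimal sketch: the three rows are an illustrative stand-in for spam.csv, and `summarize` is a hypothetical helper name, not something taken from the notebook. For the real file, something like `pd.read_csv("spam.csv")` should work, assuming the file loads with default settings.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for spam.csv (illustrative rows, not real data).
SAMPLE_CSV = """Category,Message
ham,See you at lunch tomorrow
spam,WINNER!! Claim your free prize now
ham,Can you call me back later
"""


def summarize(df: pd.DataFrame) -> dict:
    """Return the row count, column names, and total number of missing cells."""
    return {
        "rows": len(df),
        "columns": list(df.columns),
        "missing": int(df.isnull().sum().sum()),
    }


df = pd.read_csv(io.StringIO(SAMPLE_CSV))
summary = summarize(df)
print(summary)
```

On the full dataset the same helper should report 5572 rows, the two columns, and zero missing cells if the description above holds.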

Main features:

⋄ Category (the class of the message: ham or spam)

⋄ Message (The text of the message)

Procedure:

The dataset has one categorical variable (Category), so I used label encoding to convert it to numbers, where 0 means the message is spam and 1 means it is ham.
I removed duplicate messages.
I cleaned the messages by removing punctuation and stopwords.
I converted the cleaned messages into a matrix of counts of the most frequent words using CountVectorizer.
I split the dataset into training and testing sets and trained a Naive Bayes model on the training set.
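The steps above can be sketched end to end as follows. This is only an illustration under several assumptions, not the notebook's actual code: the rows are made up, the stopword list is scikit-learn's built-in `ENGLISH_STOP_WORDS`, the model is `MultinomialNB`, and `max_features`, `test_size`, and `random_state` are arbitrary values. The label mapping is written out explicitly to match the description (spam = 0, ham = 1); scikit-learn's `LabelEncoder` sorts class names alphabetically and would assign ham = 0 and spam = 1 instead.

```python
import string

import pandas as pd
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Illustrative rows only; the real project reads spam.csv.
df = pd.DataFrame({
    "Category": ["ham", "ham", "ham", "ham",
                 "spam", "spam", "spam", "spam", "spam"],
    "Message": [
        "Are we still meeting for lunch today?",
        "Can you send me the report tonight",
        "Thanks for the birthday wishes",
        "Running late, see you at the station",
        "WINNER! Claim your free prize now",
        "Free entry: win cash today, text now!",
        "Urgent! Your account won a cash prize",
        "Claim free ringtones, reply now",
        "WINNER! Claim your free prize now",  # duplicate on purpose
    ],
})

# 1. Encode the label with the mapping described above (spam=0, ham=1).
df["label"] = df["Category"].map({"spam": 0, "ham": 1})

# 2. Drop duplicate messages, keeping the first occurrence.
df = df.drop_duplicates(subset="Message").reset_index(drop=True)


# 3. Remove punctuation and stopwords (assumed stopword list: sklearn's).
def clean_message(text: str) -> str:
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    words = [w for w in no_punct.lower().split() if w not in ENGLISH_STOP_WORDS]
    return " ".join(words)


df["clean"] = df["Message"].apply(clean_message)

# 4. Build the word-count matrix; max_features keeps the most frequent words.
vectorizer = CountVectorizer(max_features=1000)
X = vectorizer.fit_transform(df["clean"])

# 5. Split, then fit Naive Bayes on the training portion only.
X_train, X_test, y_train, y_test = train_test_split(
    X, df["label"], test_size=0.25, random_state=0, stratify=df["label"]
)
model = MultinomialNB()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

The order here follows the description: vectorize first, then split. A common variation is to fit the vectorizer on the training split only, so that the vocabulary is not influenced by test messages.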

Results:

The Naive Bayes model classified the email messages as spam or not spam with 86% accuracy on the test set.
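For reference, accuracy is the share of test messages whose predicted label matches the true label. The sketch below shows that calculation together with per-class counts, which can be useful because a single accuracy number does not show how spam and ham are handled separately. The labels are hypothetical and use the same encoding as above (spam = 0, ham = 1); `accuracy` and `confusion_counts` are illustrative helper names.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_counts(y_true, y_pred):
    """Count each (true label, predicted label) pair."""
    return Counter(zip(y_true, y_pred))


# Hypothetical labels for ten test messages (spam=0, ham=1).
y_true = [1, 1, 1, 1, 0, 0, 1, 1, 0, 1]
y_pred = [1, 1, 1, 0, 0, 1, 1, 1, 0, 1]

score = accuracy(y_true, y_pred)
counts = confusion_counts(y_true, y_pred)
print(f"accuracy: {score:.2f}")
print(f"spam caught: {counts[(0, 0)]}, spam missed: {counts[(0, 1)]}")
```

In this made-up example 8 of 10 predictions are correct, while one of the three spam messages is missed.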
