
Chapter 5


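Here is a minimal sketch of that preprocessing, assuming the dataset has already been loaded into the DataFrame raw_df that we use later in this chapter; the intermediate variable names are illustrative:

using DataFrames, TextAnalysis, Languages

all_words = names(raw_df)[2:end-1]
all_words_text = join(all_words, " ")
sd = StringDocument(all_words_text)
prepare!(sd, strip_articles | strip_pronouns)
vocabulary = intersect(all_words, unique(tokens(sd)))

# Keep only the columns for the filtered words, then transpose the matrix
# so that each email becomes a column
clean_words_df = raw_df[:, vocabulary]
data_matrix = Matrix(clean_words_df)'
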
In the first line, we create a variable to store a list of all of the words present in the emails. As our dataset has a column for each word, we do this by storing the names of every column with the names function, except for the first and last columns, which are for the email id and label, respectively.


Let’s move on to the second and third lines of the code. We would like to filter out some words that are very common in the English language, such as articles and pronouns, which will most likely add noise rather than information to our classification algorithm. For this we will use two Julia packages that are specially designed for working with texts of any type: Languages.jl and TextAnalysis.jl. In the third line, we create a StringDocument, a kind of struct provided by TextAnalysis.jl, and we use its ready-to-use methods to remove articles and pronouns from the list of words we created before.


Then we create another DataFrame containing only the filtered words, and we transform it into a Matrix, which we transpose to have each email as a separate column; this gives us the data_matrix we split below.


Next, we need to divide the data in two: a training set and a testing set. This is standard practice when working with models that learn from data, like the one we’re going to implement. We train the model with the training set, then evaluate the model’s accuracy by having it make predictions on the testing set. In Julia, the package MLDataUtils.jl has some nice functionalities for data manipulations like this.

labels = raw_df.Prediction
(x_train, y_train), (x_test, y_test) = splitobs(shuffleobs((data_matrix, labels)), at = 0.7)

The function splitobs splits our dataset into a training set and a testing set, and shuffleobs randomizes the order of the observations before the split. We pass the data and the labels together as a tuple so that both are split consistently, with 70 percent of the emails going to the training set. Now we can turn our attention to building the spam filter.


5.1 The Naive Bayes Approach


As we mentioned, what we are facing here is a classification problem, and we will code a supervised learning algorithm from scratch to find a solution with the help of Bayes’ theorem. We’re going to use a naive Bayes classifier to create our spam filter. This method treats each email as just a collection of words, with no regard for the order in which they appear. This means we won’t take into account semantic considerations like the particular relationship between words and their context.


Our strategy will be to estimate the probability of an incoming email being ham or spam and make a decision based on that. Our general approach can be summarized as:


\(P(spam|email) \propto P(email|spam)P(spam)\)


\(P(ham|email) \propto P(email|ham)P(ham)\)


Notice we use the \(\propto\) sign, meaning proportional to, instead of the = sign because the denominator from Bayes’ theorem is missing. In this case, we won’t need to calculate it, as it’s the same for both probabilities and all we’re going to care about is a comparison of these two probabilities.


In this naive approach, where semantics aren’t taken into account and each email is just a collection of words, the conditional probability \(P(email|spam)\) means the probability that a given email can be generated with the collection of words that appear in the spam category of our data. Let’s take a quick example. Imagine for a moment that our training set of emails consists of just these three emails, all labeled as spam:

  • Email 1: ‘Are you interested in buying my product?’
  • Email 2: ‘Congratulations! You’ve won $1000!’
  • Email 3: ‘Check out this product!’

Also imagine we receive a new, unclassified email and we want to discover \(P(email|spam)\). The new email looks like this:

  • New email: ‘Apply and win all these products!’

The new email contains the words win and product, which are rather common in our example’s training data. We would therefore expect \(P(email|spam)\), the probability of the new email being generated by the words encountered in the training spam email set, to be relatively high.


(The word win appears in the form won in the training set, but that’s OK. The standard linguistic technique of lemmatization groups together any related forms of a word and treats them as the same word.)
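
A related normalization that TextAnalysis.jl does provide out of the box is stemming, which chops related word forms down to a common stem; it wouldn’t catch an irregular pair like win/won, but it handles regular inflections. A quick sketch:

using TextAnalysis

sd = StringDocument("Check out this product! Many products here!")
stem!(sd)   # regular forms collapse: "products" becomes "product"
text(sd)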


Mathematically, the way to calculate \(P(email|spam)\) is to take each word in our target email, calculate the probability of it appearing in spam emails based on our training set, and multiply those probabilities together.


\(P(email|spam) = \prod_{i=1}^{n}P(word_i|spam)\)


We use a similar calculation to determine \(P(email|ham)\), the probability of the new email being generated by the words encountered in the training ham email set:


\(P(email|ham) = \prod_{i=1}^{n}P(word_i|ham)\)


The multiplication of each of the probabilities associated with a particular word here stems from the naive assumption that all the words in the email are statistically independent. In reality, this assumption isn’t necessarily true. In fact, it’s most likely false. Words in a language are never independent from one another, but this simple assumption seems to be enough for the level of complexity our problem requires.


The probability of a given word \(word_i\) being in a given category is calculated like so:


\[P(word_i|spam) = \frac{N_{word_i|spam} + \alpha}{N_{spam} + \alpha N_{vocabulary}}\]

\[P(word_i|ham) = \frac{N_{word_i|ham} + \alpha}{N_{ham} + \alpha N_{vocabulary}}\]


These formulas tell us exactly what we have to calculate from our data. We need the numbers \(N_{word_i|spam}\) and \(N_{word_i|ham}\) for each word, meaning the number of times that \(word_i\) is used in the spam and ham categories, respectively. \(N_{spam}\) and \(N_{ham}\) are the total number of words used in the spam and ham categories (including all word repetitions), and \(N_{vocabulary}\) is the total number of unique words in the dataset. The variable \(\alpha\) is a smoothing parameter that prevents the probability of a given word being in a given category from going down to zero. If a given word hasn’t appeared in the spam category in our training dataset, for example, we don’t want to assign it zero probability of appearing in new spam emails.
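
To make these formulas concrete, here’s a quick sketch that plugs in rough counts from the three toy spam emails above (the exact tokenization doesn’t matter for the illustration):

# Rough counts from the toy spam training set, for illustration only
N_word_spam = 2       # "product" appears twice in the spam emails
N_spam = 16           # total words in the spam category
N_vocabulary = 14     # unique words in the toy dataset
α = 1

# A word seen in the training spam:
(N_word_spam + α) / (N_spam + α * N_vocabulary)   # = 3/30 = 0.1

# A word never seen in the training spam still gets a small, nonzero probability:
(0 + α) / (N_spam + α * N_vocabulary)             # = 1/30 ≈ 0.033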


As all of this information will be specific to our dataset, a clever way to aggregate it is to use a Julia struct, with attributes for the pieces of data we’ll need to access over and over during the prediction process. Here’s the implementation:

mutable struct BayesSpamFilter
    words_count_ham::Dict{String, Int64}
    words_count_spam::Dict{String, Int64}
    N_ham::Int64
    N_spam::Int64
    vocabulary::Array{String}
    BayesSpamFilter() = new()
end

The relevant attributes of the struct are words_count_ham and words_count_spam, two dictionaries containing the frequency of appearance of each word in the ham and spam datasets; N_ham and N_spam, the total number of words appearing in each category; and vocabulary, an array of all the unique words in the dataset.


The line BayesSpamFilter() = new() is the constructor of this struct. Because the constructor is empty, all the attributes will be undefined when we instantiate the filter. We’ll have to define some functions to fill these variables with values that are relevant to our particular problem. First, here’s a function words_count that counts the occurrences of each word in the ham and spam categories.

function words_count(word_data, vocabulary, labels, spam=0)
    count_dict = Dict{String,Int64}()
    n_emails = size(word_data)[2]
    # Sum the counts of each word over all emails of the requested
    # category (spam=1 or ham=0)
    for (i, word) in enumerate(vocabulary)
        count_dict[word] = sum([word_data[i, j] for j in 1:n_emails if labels[j] == spam])
    end
    return count_dict
end

Next, we’ll define a fit! function for our spam filter struct. Notice we’re using the bang (!) convention here to indicate a function that modifies its arguments in place (in this case, the spam filter struct itself). This function fits our model to the data, a typical procedure in data science and machine learning.

function fit!(model::BayesSpamFilter, x_train, y_train, voc)
    model.vocabulary = voc
    model.words_count_ham = words_count(x_train, model.vocabulary, y_train, 0)
    model.words_count_spam = words_count(x_train, model.vocabulary, y_train, 1)
    model.N_ham = sum(values(model.words_count_ham))
    model.N_spam = sum(values(model.words_count_spam))
    return
end

## fit! (generic function with 1 method)

This function fills in all the undefined parameters in our struct. We mainly use the words_count function we defined earlier. Notice that we’re only fitting the model to the training portion of the data, since we’re reserving the testing portion to evaluate the model’s accuracy.


5.2 Training the Model


Now it’s time to instantiate our spam filter and fit the model to the training data. With the struct and helper functions we’ve defined, the process is quite straightforward.

spam_filter = BayesSpamFilter()
fit!(spam_filter, x_train, y_train, vocabulary)

We create an instance of our BayesSpamFilter struct and pass it to our fit! function along with the data. Notice that we’re only passing in the training portion of the dataset, since we want to reserve the testing portion to evaluate the model’s accuracy later.


5.3 Making Predictions


Now that we have our model, we can use it to make some spam vs. ham predictions and assess its performance. We’ll define a few more functions to help with this process. First, we need a function implementing the word-probability formulas that we discussed earlier.

function word_spam_probability(word, words_count_ham, words_count_spam, N_ham, N_spam, n_vocabulary, α)
    ham_prob = (words_count_ham[word] + α) / (N_ham + α * (n_vocabulary))
    spam_prob = (words_count_spam[word] + α) / (N_spam + α * (n_vocabulary))
    return ham_prob, spam_prob
end

## word_spam_probability (generic function with 1 method)

This function calculates \(P(word_i|spam)\) and \(P(word_i|ham)\) for a given word. We’ll call it for each word of an incoming email within another function, spam_predict, to calculate the probability of that email being spam or ham.

function spam_predict(email, model::BayesSpamFilter, α, tol=100)
    ngrams_email = ngrams(StringDocument(email))
    email_words = keys(ngrams_email)
    n_vocabulary = length(model.vocabulary)
    ham_prior = model.N_ham / (model.N_ham + model.N_spam)
    spam_prior = model.N_spam / (model.N_ham + model.N_spam)

    # For long emails, keep only the tol most frequent unique words
    if length(email_words) > tol
        word_freq = values(ngrams_email)
        sort_idx = sortperm(collect(word_freq), rev=true)
        email_words = collect(email_words)[sort_idx][1:tol]
    end

    email_ham_probability = BigFloat(1)
    email_spam_probability = BigFloat(1)

    for word in intersect(email_words, model.vocabulary)
        word_ham_prob, word_spam_prob = word_spam_probability(word, model.words_count_ham, model.words_count_spam, model.N_ham, model.N_spam, n_vocabulary, α)
        email_ham_probability *= word_ham_prob
        email_spam_probability *= word_spam_prob
    end
    return ham_prior * email_ham_probability, spam_prior * email_spam_probability
end

This function takes as input a new email that we want to classify as spam or ham, our fitted model, an \(\alpha\) value (which we’ve already discussed), and a tolerance value tol. The latter sets the maximum number of unique words in an email that we’ll look at. We saw that the calculations for \(P(email|spam)\) and \(P(email|ham)\) require the multiplication of each \(P(word_i|spam)\) and \(P(word_i|ham)\) term. When emails consist of a large number of words, this multiplication may lead to very small probabilities, to the point that the computer interprets those probabilities as zero. This isn’t desirable; we need values of \(P(email|spam)\) and \(P(email|ham)\) that are larger than zero in order to multiply them by \(P(spam)\) and \(P(ham)\), respectively, and compare these values to make a prediction. That’s why we accumulate the products in BigFloat variables, which offer extended precision, and only consider up to the tol most frequently used words in the email.
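
For instance, once the model has been fitted, a call on a single (made-up) email string would look like this:

# Hypothetical incoming email
example_email = "Congratulations, you won a free product! Apply now!"
pham, pspam = spam_predict(example_email, spam_filter, 1)
pspam > pham   # true would mean the filter leans toward spam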


Finally, we arrive at the point of actually testing our model. We create another function to manage the process. This function classifies each email as ham (represented by the number 0) or spam (represented by the number 1).

function get_predictions(x_test, y_test, model::BayesSpamFilter, α, tol=200)
    N = length(y_test)
    predictions = Array{Int64,1}(undef, N)
    for i in 1:N
        # Reconstruct the email text by repeating each vocabulary word
        # as many times as it appears in the email
        email = string([repeat(string(word, " "), n) for (word, n) in zip(model.vocabulary, x_test[:, i])]...)
        pham, pspam = spam_predict(email, model, α, tol)
        pred = argmax([pham, pspam]) - 1
        predictions[i] = pred
    end

    predictions
end

This function takes in the testing portion of the data and our trained model. We call our spam_predict function for each email in the testing data and use the index of the larger of the two returned probability values (argmax) to predict (pred) whether the email is spam or ham. We return the predictions as an array that will contain zeros for ham emails and ones for spam emails. Here we call the function to make predictions about the test data:

predictions = get_predictions(x_test, y_test, spam_filter, 1)

Let’s take a look at the predicted classifications of just the first five emails in the test data.

predictions[1:5]

## 5-element Vector{Int64}:
##  1
##  1
##  0
##  0
##  0

Of the first five emails, the first two were classified as spam, and the rest were classified as ham.


5.4 Evaluating the Accuracy


Looking at the predictions themselves is pretty meaningless; what we really want to know is the model’s accuracy. We’ll define another function to calculate this.

function spam_filter_accuracy(predictions, actual)
    N = length(predictions)
    correct = sum(predictions .== actual)
    accuracy = correct / N
    accuracy
end

This function compares the predicted classifications with the actual classifications of the test data, counts the number of correct predictions, and divides this number by the total number of test emails, giving us an accuracy measurement. Here we call the function:

spam_filter_accuracy(predictions, y_test)

## 0.946520618556701

The output indicates our model is about 95 percent accurate. It appears our model is performing very well! Such a high accuracy rate is quite astonishing for a model so naive and simple. In fact, it may be a little too good to be true, because we have to take into account one more thing. Our model classifies emails into spam or ham, but the number of ham emails in our dataset is considerably higher than the number of spam ones. Let’s see the percentages:

sum(y_train) / length(y_train)

## 0.3600111425785784
+

This tells us that only 36 percent of the emails in the training set are spam. This type of classification problem, where there’s an unequal distribution of classes in the dataset, is called imbalanced. With imbalanced data, a better way to see how the model is performing is to construct a confusion matrix, an \(N \times N\) matrix, where \(N\) is the number of target classes (in our case, 2, for spam and ham). The matrix compares the actual values for each class with those predicted by the model. Here’s a function that builds a confusion matrix for our spam filter:

function spam_filter_confusion_matrix(y_test, predictions)
    # 2x2 matrix is instantiated with zeros
    confusion_matrix = zeros((2, 2))

    confusion_matrix[1, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 0) for i in 1:length(y_test))
    confusion_matrix[1, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 0) for i in 1:length(y_test))
    confusion_matrix[2, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 1) for i in 1:length(y_test))
    confusion_matrix[2, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 1) for i in 1:length(y_test))

    # Now we convert the confusion matrix into a DataFrame
    confusion_df = DataFrame(prediction=String[], ham_mail=Int64[], spam_mail=Int64[])
    confusion_df = vcat(confusion_df, DataFrame(prediction="Model predicted Ham", ham_mail=confusion_matrix[1, 1], spam_mail=confusion_matrix[1, 2]))
    confusion_df = vcat(confusion_df, DataFrame(prediction="Model predicted Spam", ham_mail=confusion_matrix[2, 1], spam_mail=confusion_matrix[2, 2]))

    return confusion_df
end

Now let’s call our function to build the confusion matrix for our model.

confusion_matrix = spam_filter_confusion_matrix(y_test[:], predictions)

## 2×3 DataFrame
##  Row │ prediction            ham_mail  spam_mail
##      │ String                Float64   Float64
## ─────┼───────────────────────────────────────────
##    1 │ Model predicted Ham     1050.0       41.0
##    2 │ Model predicted Spam      42.0      419.0

Row 1 of the confusion matrix shows us all the times our model classified emails as ham: 1,050 of those classifications were correct and 41 were incorrect. Similarly, the spam_mail column shows us the classifications for all the spam emails: 41 were misidentified as ham, and 419 were correctly identified as spam.


Now that we have the confusion matrix, we can calculate the accuracy of the model segmented by category.

ham_accuracy = confusion_matrix[1, :ham_mail] / (confusion_matrix[1, :ham_mail] + confusion_matrix[2, :ham_mail])

## 0.9615384615384616

spam_accuracy = confusion_matrix[2, :spam_mail] / (confusion_matrix[1, :spam_mail] + confusion_matrix[2, :spam_mail])

## 0.9108695652173913
+

5.5 Summary


In this chapter, we’ve used a naive Bayes approach to build a simple email spam filter. We walked through the whole process of training, testing, and evaluating a learning model. First, we obtained a dataset of emails already classified as spam or ham and preprocessed the data. Then we considered the theoretical framework for our naive analysis. Using Bayes’ theorem on the data available, we assigned to each word in the email dataset a probability of belonging to a spam or a ham email. The probability of a new email being classified as spam is then proportional to the product of the probabilities of each of its constituent words. We defined a Julia struct for the spam filter object and created functions to fit the spam filter object to the data. Finally, we made predictions on new data and evaluated our model’s performance by calculating the accuracy and making a confusion matrix.


5.6 Appendix: A Little More About Alpha


As we have seen, to calculate the probability of an email being a spam email, we should use


\(P(email|spam) = \prod_{i=1}^{n}P(word_i|spam) = P(word_1|spam)P(word_2|spam) \cdots P(word_{np}|spam)\)


where \(P(word_{np}|spam)\) stands for the probability of a word that is not present in our dataset. What probability should be assigned to this word? One way to handle this could be to simply ignore that term in the multiplication; in other words, to assign \(P(word_{np}|spam) = 1\). Without thinking about it too much, we can conclude that this doesn’t make any sense, since it would mean that the probability of finding that word in a spam (or ham, too) email is equal to 1. A more logically consistent approach would be to assign zero probability to that word. But there is a problem: with \(P(word_{np}|spam) = 0\),


we can quickly see that


\(P(word_1|spam)P(word_2|spam) \cdots P(word_{np}|spam) = 0\)


This is the motivation for introducing the smoothing parameter \(\alpha\) into our equation. In a real-world scenario, we should expect that words not present in our training set will appear, and although it makes sense that they don’t have a high probability, it can’t be 0. When such a word appears, the probability assigned to it will be simply


\[P(word_{np}|spam) = \frac{N_{word_{np}|spam} + \alpha}{N_{spam} + \alpha N_{vocabulary}} = \frac{0 + \alpha}{N_{spam} + \alpha N_{vocabulary}} = \frac{\alpha}{N_{spam} + \alpha N_{vocabulary}}\]


In summary, \(\alpha\) is just a smoothing parameter that keeps the probability of finding a word that is not in our dataset from going down to 0. Since we want to keep the probability for these words low enough, it makes sense to use \(\alpha = 1\).
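
To get a feel for how \(\alpha\) influences an unseen word’s probability, here’s a small sketch with made-up totals:

# Made-up totals, for illustration only
N_spam = 1000         # total words in the spam category
N_vocabulary = 500    # unique words in the dataset

# Probability assigned to a word never seen in spam, for a few values of α
for α in (0.1, 1.0, 10.0)
    println("α = ", α, " => ", α / (N_spam + α * N_vocabulary))
end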
