Using Natural Language Processing for Spam Detection in Emails

Article by Ramya Vidiyala | August 21, 2020

Have you ever wondered how a machine translates language? Or how voice assistants respond to questions? Or how mail gets automatically classified into spam or not spam?

All these tasks are done through Natural Language Processing (NLP), which turns text into useful insights that can be applied to future data. In the field of artificial intelligence, NLP is one of the most complex areas of research because text data is contextual. Text has to be modified to make it machine-interpretable, and it requires multiple stages of processing for feature extraction.

Classification problems can be broadly split into two categories: binary classification problems, and multi-class classification problems. Binary classification means there are only two possible label classes, e.g. a patient’s condition is cancerous or it isn’t, or a financial transaction is fraudulent or it is not. Multi-class classification refers to cases where there are more than two label classes. An example of this is classifying the sentiment of a movie review into positive, negative, or neutral.

There are many types of NLP problems, and one of the most common types is the classification of strings. Examples of this include the classification of movies/news articles into different genres, and the automated classification of emails into spam or not spam. I’ll be looking into this last example in more detail for this article.

 

Problem Description

Understanding the problem is a crucial first step in solving any machine learning problem. In this article, we will explore and understand the process of classifying emails as spam or not spam. This is called Spam Detection, and it is a binary classification problem.

The reason to do this is simple: by detecting unsolicited and unwanted emails, we can prevent spam messages from creeping into the user’s inbox, thereby improving user experience.

Emails are sent through a spam detector. If an email is detected as spam, it is sent to the spam folder, else to the inbox.

 

Dataset

Let’s start with our spam detection data. We’ll be using the open-source Spambase dataset from the UCI machine learning repository, a dataset that contains 5569 emails, of which 745 are spam.

The target variable for this dataset is ‘spam’ in which a spam email is mapped to 1 and anything else is mapped to 0. The target variable can be thought of as what you are trying to predict. In machine learning problems, the value of this variable will be modeled and predicted by other variables.

A snapshot of the data is presented in figure 1.

Figure 1: The text column contains the email, spam column contains the target variable

 

Task: To classify an email into spam or not spam.

To get to our solution we need to understand the four processing concepts below. Please note that the concepts discussed here can also be applied to other text classification problems.

  • Text Processing
  • Text Sequencing
  • Model Selection
  • Implementation

 

1. Text Processing

Data usually comes from a variety of sources and often in different formats. For this reason, transforming your raw data is essential. However, this transformation is not a simple process, as text data often contains redundant and repetitive words. This means that processing the text data is the first step in our solution.

The fundamental steps involved in text preprocessing are:

  1. Cleaning the raw data
  2. Tokenizing the cleaned data

 

a. Cleaning the Raw Data

This phase involves the deletion of words or characters that do not add value to the meaning of the text. Some of the standard cleaning steps are listed below:

  • Lowering case
  • Removal of special characters
  • Removal of stopwords
  • Removal of hyperlinks
  • Removal of numbers
  • Removal of whitespaces
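
The last two steps in this list, removing numbers and removing extra whitespace, can be sketched with regular expressions. This is a minimal illustration only; the function names `remove_numbers` and `remove_whitespace` and the sample string are assumptions made for this example.

```python
import re

def remove_numbers(text):
    # Drop every run of digits.
    return re.sub(r"\d+", "", text)

def remove_whitespace(text):
    # Collapse runs of whitespace into one space and trim both ends.
    return re.sub(r"\s+", " ", text).strip()

sample = "Win  1000 dollars\n now "
cleaned = remove_whitespace(remove_numbers(sample))
print(cleaned)
```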

 

Lowering Case

Lowering the case of text is essential for the following reasons:

  • The words, ‘TEXT’, ‘Text’, ‘text’ all add the same value to a sentence
  • Lowering the case of all the words is very helpful for reducing the dimensions by decreasing the size of the vocabulary

def to_lower(word):
    result = word.lower()
    return result

 

Removal of special characters

This is another text processing technique that will help to treat words like ‘hurray’ and ‘hurray!’ in the same way.

import string

def remove_special_characters(word):
    # Map every ASCII punctuation character to None, which deletes it
    result = word.translate(str.maketrans(dict.fromkeys(string.punctuation)))
    return result

 

Removal of stop words

Stopwords are commonly occurring words in a language like ‘the’, ‘a’, and so on. Most of the time they can be removed from the text because they don’t provide valuable information.

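# Assuming the stop word list comes from scikit-learn
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def remove_stop_words(words):
    # words is a list of tokens; keep only those that are not stop words
    result = [i for i in words if i not in ENGLISH_STOP_WORDS]
    return result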

 

Removal of hyperlinks

Next we remove any URLs in the data. There is a good chance that email will have some URLs in it. We don’t need them for our further analysis as they do not add any value to the results.

import re

def remove_hyperlink(word):
    # Delete anything that starts with "http" up to the next whitespace
    return re.sub(r"http\S+", "", word)
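
The cleaning steps above can be chained into one function. The sketch below is one possible ordering, not necessarily the one used in the original project: hyperlinks are removed before punctuation so that the URL pattern still matches. The name `clean_email`, the sample message, and the tiny `STOP_WORDS` set are assumptions made so the example runs without extra libraries; a real run would use a full stop word list.

```python
import re
import string

# Tiny illustrative stop word set (an assumption); real lists are much larger.
STOP_WORDS = {"the", "a", "an", "to", "is", "at"}

def clean_email(text):
    text = text.lower()                                        # lower the case
    text = re.sub(r"http\S+", "", text)                        # remove hyperlinks
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove special characters
    text = re.sub(r"\d+", "", text)                            # remove numbers
    words = [w for w in text.split() if w not in STOP_WORDS]   # remove stop words
    return " ".join(words)                                     # also normalizes whitespace

cleaned = clean_email("Claim the PRIZE at http://example.com now!!! Call 555")
print(cleaned)
```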

 

b. Tokenizing the Cleaned Data

Tokenization is the process of splitting text into smaller chunks, called tokens. Each token is an input to the machine learning algorithm as a feature.

keras.preprocessing.text.Tokenizer is a utility class that splits text into tokens while keeping only the words that occur most often in the text corpus. When we tokenize the text, we end up with a massive dictionary of words, and they won’t all be essential. We can set the ‘num_words’ argument (‘max_feature’ in the code below) to select the most frequent words that we want to consider.

max_feature = 50000  # number of unique words to consider

from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=max_feature)
tokenizer.fit_on_texts(x_train)
# texts_to_sequences returns lists of different lengths, so keep them as
# plain lists; wrapping ragged lists in np.array can fail before padding
x_train_features = tokenizer.texts_to_sequences(x_train)
x_test_features = tokenizer.texts_to_sequences(x_test)
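
To make the tokenizer less of a black box, here is a simplified pure-Python approximation of what it does: rank words by frequency, give the most frequent word index 1, and keep only words whose index is below `num_words`. It skips the lowercasing and punctuation filtering that the Keras class also performs, and the helper names and toy corpus are assumptions for this sketch.

```python
from collections import Counter

def fit_word_index(texts):
    # Count words over the whole corpus; the most frequent word gets index 1.
    counts = Counter(word for text in texts for word in text.split())
    ranked = [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index, num_words):
    # Replace each word with its index; drop unknown and infrequent words.
    return [
        [word_index[w] for w in text.split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

corpus = ["free prize free", "meeting at noon", "free meeting"]
word_index = fit_word_index(corpus)
sequences = texts_to_sequences(corpus, word_index, num_words=3)
print(word_index)
print(sequences)
```

With `num_words=3`, only the two most frequent words survive, so each email shrinks to the indices of those words.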

 

Figure 2: Data Cleaning and Tokenizing phases of text processing

 

 

2. Text Sequencing

a. Padding

Making the token sequences for all emails the same length is called padding.

We send input to the model in batches of data points, and every sequence in a batch must have the same length so that the batch can be stored as a single array. So, we make them the same size using padding, and that eases batch updates.

The length of all tokenized emails post-padding is set using ‘max_len’.

 

Figure 3: All Tokenized emails are converted to the same size in the ‘Padding’ stage.

 

Code snippet for padding:

from keras.preprocessing.sequence import pad_sequences
# max_len (the padded sequence length) must be defined before this point
x_train_features = pad_sequences(x_train_features, maxlen=max_len)
x_test_features = pad_sequences(x_test_features, maxlen=max_len)
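
As a rough illustration of what padding does, the sketch below mimics the default behaviour of pad_sequences in plain Python: shorter sequences get zeros added at the front, and longer ones are truncated from the front. The function name `pad_pre` and the sample sequences are assumptions for this example.

```python
def pad_pre(sequences, max_len, value=0):
    padded = []
    for seq in sequences:
        seq = seq[-max_len:]                           # keep the last max_len tokens
        padding = [value] * (max_len - len(seq))       # zeros to fill the gap
        padded.append(padding + seq)                   # pad at the front
    return padded

padded = pad_pre([[5, 2], [7, 1, 4, 9, 3]], max_len=4)
print(padded)
```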

 

b. Label encoding the target variable

The model will expect the target variable as a number and not a string. We can use LabelEncoder from sklearn to convert our target variable as below.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
train_y = le.fit_transform(target_train.values)
test_y = le.transform(target_test.values)
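
Conceptually, label encoding sorts the distinct class names and replaces each one with its position in that sorted list. The sketch below shows the idea in plain Python; the labels "ham" and "spam" and the helper name `fit_label_mapping` are hypothetical and only serve the illustration.

```python
def fit_label_mapping(labels):
    # Sort the distinct labels and number them from 0.
    classes = sorted(set(labels))
    return {label: i for i, label in enumerate(classes)}

labels = ["spam", "ham", "ham", "spam", "ham"]
mapping = fit_label_mapping(labels)
encoded = [mapping[label] for label in labels]
print(mapping)
print(encoded)
```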

 

3. Model Selection

A movie consists of a sequence of scenes. When we watch a particular scene, we don’t try to understand it in isolation, but rather in connection with previous scenes. In a similar fashion, a machine learning model has to interpret each word in the context of the words that came before it, much as a human reader does.

Traditional machine learning models treat each input independently and keep no memory of earlier inputs. However, Recurrent Neural Networks (commonly called RNNs) can do this for us. Let’s take a closer look at RNNs below.

 

Figure 4: Working of a basic RNN

 

An RNN has a repeating module that takes input from the previous stage and gives its output as input to the next stage. In practice, however, a basic RNN tends to retain information only from the most recent stages. To learn long-term dependencies, our network needs memorization power. Here’s where Long Short-Term Memory networks (LSTMs) come to the rescue.
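
The repeating module can be sketched as a single recurrence: the new hidden state is computed from the current input and the previous hidden state. The toy example below uses scalar weights chosen arbitrarily for illustration (`w_x`, `w_h`, and `b` are assumptions, not learned values) and shows how the influence of the first input shrinks at every later step.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    h = 0.0                  # initial hidden state
    states = []
    for x in inputs:
        # The same module is applied at every step, fed with the previous state.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# Only the first step carries a signal; later steps see zeros.
states = rnn_forward([1.0, 0.0, 0.0, 0.0])
print([round(s, 3) for s in states])
```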

LSTMs are a special kind of RNN. They have the same chain-like structure as RNNs, but with a different repeating module structure.
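
For readers who want the details of that repeating module, one commonly used formulation of the LSTM cell is given below. The symbols are assumptions introduced here rather than notation from this article: x_t is the input at step t, h_t the hidden state, c_t the cell state, σ the sigmoid function, and ⊙ element-wise multiplication.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The cell state c_t is what gives the network its longer memory: the forget gate decides how much of the old state to keep, and the input gate decides how much new information to add.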

 

Figure 5: Working of a basic LSTM

 

To process each email in both forward and reverse order, we’ll use a bi-directional LSTM.
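
The idea of reading a sequence in both directions can be sketched as follows. For brevity, a plain scalar recurrence stands in for the LSTM cell, and the weights are arbitrary assumptions. One pass reads the sequence left to right, the other right to left, and the two final states are concatenated, which is also how the Keras Bidirectional wrapper merges its two passes by default.

```python
import math

def final_state(inputs, w_x=0.5, w_h=0.8):
    # Simplified recurrent pass; returns only the last hidden state.
    h = 0.0
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
    return h

def bidirectional_summary(inputs):
    forward = final_state(inputs)                     # left to right
    backward = final_state(list(reversed(inputs)))    # right to left
    return [forward, backward]                        # concatenate both views

summary = bidirectional_summary([1.0, 0.0, 0.0])
print([round(s, 3) for s in summary])
```

Here the signal sits at the start of the sequence, so the backward pass, which sees it last, ends with the stronger state.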

 

4. Implementation

Embedding

Text data can be easily interpreted by humans. But for machines, reading and analyzing is a very complex task. To accomplish this task, we need to convert our text into a machine-understandable format.

Embedding is the process of converting formatted text data into numerical values/vectors which a machine can interpret.
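
At its core, an embedding layer is a lookup table with one row of numbers per token index. The sketch below illustrates the lookup with a small random table; the sizes and values are assumptions for this example, whereas in the real model the table has one row per vocabulary entry and its values are learned during training.

```python
import random

random.seed(0)
vocab_size = 6        # number of distinct token indices (toy value)
embedding_dim = 4     # length of each embedding vector (toy value)

# One row of embedding_dim numbers for every token index.
embedding_matrix = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

def embed(sequence):
    # Replace every token index with its row from the table.
    return [embedding_matrix[token_id] for token_id in sequence]

vectors = embed([0, 0, 3, 5])   # a padded, tokenized email
print(len(vectors), len(vectors[0]))
```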

 

Figure 6: All tokenized emails are converted into vectors in the embedding phase

 

import tensorflow as tf
# Import the layers from tf.keras so they match the tf.keras.Sequential model below
from tensorflow.keras.layers import Bidirectional, Dense, Dropout, Embedding

# size of the vector the embedding layer outputs for each token
embedding_vector_length = 32

 

#Creating a sequential model
model = tf.keras.Sequential()

#Creating an embedding layer to vectorize
model.add(Embedding(max_feature, embedding_vector_length, input_length=max_len))

#Adding Bi-directional LSTM
model.add(Bidirectional(tf.keras.layers.LSTM(64)))

 

#ReLU is cheap to compute and helps the network converge quickly
model.add(Dense(16, activation='relu'))

#Deep learning models can overfit easily; to reduce this, we randomly drop units using dropout
model.add(Dropout(0.1))

#Adding a sigmoid activation to squash the output into a probability between 0 and 1
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

#summary() prints the table itself, so it does not need to be wrapped in print()
model.summary()

 

The summary of the Bi-directional LSTM model

 

history = model.fit(x_train_features, train_y, batch_size=512, epochs=20, validation_data=(x_test_features, test_y))

# Threshold the predicted probabilities at 0.5 and flatten to a 1-D array of 0/1 labels
y_predict = (model.predict(x_test_features) > 0.5).astype(int).ravel()

 

The summary of loss, accuracy, validation loss, validation accuracy

 

Through the above, we have successfully fit a bi-directional LSTM model on our email data, and detected 125 of 1114 emails as spam.

Since the percentage of spam in data is often low, measuring the model’s performance by accuracy alone is not recommended. We need to evaluate it using other performance metrics as well, which we’ll look at below.
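
A quick calculation shows why accuracy can mislead here. Using the dataset figures quoted earlier (5569 emails, 745 of them spam), consider a hypothetical model that labels every email as not spam. The variable names are assumptions for this sketch.

```python
total_emails = 5569
spam_emails = 745

# A "model" that always predicts not spam gets every non-spam email right.
correct = total_emails - spam_emails
accuracy = correct / total_emails

# It never catches a single spam email, so its recall on spam is zero.
spam_recall = 0 / spam_emails

print(f"Accuracy: {accuracy:.1%}, spam recall: {spam_recall:.1%}")
```

The accuracy looks respectable even though the model is useless as a spam filter, which is why the metrics below are needed.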

 

Performance Metrics

Precision and recall are two of the most widely used performance metrics for classification problems. Precision is the fraction of retrieved instances that are relevant. It helps us understand how useful the results are. Recall is the fraction of all relevant instances that were retrieved. It helps us understand how complete the results are.

The F1 Score is the harmonic mean of precision and recall.

For example, consider that a search query results in 30 pages, of which 20 are relevant, but the results fail to display 40 other relevant results. In this case, the precision is 20/30, and recall is 20/60. Therefore, our F1 Score is 4/9.
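
The arithmetic in this example can be checked with exact fractions. The variable names below are assumptions chosen for this sketch.

```python
from fractions import Fraction

retrieved = 30            # pages returned by the search
relevant_retrieved = 20   # returned pages that are relevant
relevant_missed = 40      # relevant pages that were not returned

precision = Fraction(relevant_retrieved, retrieved)
recall = Fraction(relevant_retrieved, relevant_retrieved + relevant_missed)

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
```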

Using F1-score as a performance metric for spam detection problems is a good choice.

from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

cf_matrix = confusion_matrix(test_y, y_predict)

# For binary labels, ravel() returns the counts in this order
tn, fp, fn, tp = cf_matrix.ravel()

print("Precision: {:.2f}%".format(100 * precision_score(test_y, y_predict)))
print("Recall: {:.2f}%".format(100 * recall_score(test_y, y_predict)))
print("F1 Score: {:.2f}%".format(100 * f1_score(test_y, y_predict)))

 

Results of Precision, Recall, F1 Score

 

import seaborn as sns
import matplotlib.pyplot as plt

ax = plt.subplot()
# annot=True to annotate cells
sns.heatmap(cf_matrix, annot=True, ax=ax, cmap='Blues', fmt='')

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(['Not Spam', 'Spam'])
ax.yaxis.set_ticklabels(['Not Spam', 'Spam'])

 

Heatmap of confusion matrix

 

A model with an F1 score of 94% is a good-to-go model. Keep in mind, however, that these results are based on the one dataset we used for training and testing. When applying a model like this to real world data, we still need to actively monitor the model’s performance over time. We can also continue to improve the model by responding to results and feedback, for example by adding features and removing misspelled words.

 

Summary

In this article, we created a spam detection model by converting text data into vectors, creating a BiLSTM model, and fitting the model with the vectors. We also explored a variety of text processing techniques, text sequencing techniques, and deep learning models, namely RNN, LSTM, and BiLSTM. You can find all the code for the project on my GitHub.

The concepts and techniques learnt in this article can be applied to a variety of natural language processing problems, such as building chatbots, text summarization, and language translation models. We hope to have more articles about such NLP problems in the future.


The Author
Ramya Vidiyala

Ramya is a data nerd and a passionate writer who loves exploring and finding meaningful insights from data. She writes articles on her Medium blog about ML and data science where she shares her experiences to help readers understand concepts and solve problems. Reach out to her on Twitter (@ramya_vidiyala) to start a conversation!
