Feature Engineering, Logistic Regression, Cross Validation
In this project, I attempted to create a classifier that distinguishes spam (junk, commercial, or bulk) emails from ham (non-spam) emails with an accuracy of at least 90%.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(style="whitegrid", color_codes=True, font_scale=1.5)
The dataset consists of email messages and their labels (0 for ham, 1 for spam). The labeled training set contains 8348 examples, and the unlabeled test set contains 1000 examples.
The train DataFrame contains the labeled data I will use to train my model. It has four columns, among them the subject line, the email text, and the spam label.
from utils import fetch_and_cache_gdrive
fetch_and_cache_gdrive('1SCASpLZFKCp2zek-toR3xeKX3DZnBSyp', 'train.csv')
fetch_and_cache_gdrive('1ZDFo9OTF96B5GP2Nzn8P8-AL7CTQXmC0', 'test.csv')
original_training_data = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')
# Convert the emails to lower case as a first step to processing the text
original_training_data['email'] = original_training_data['email'].str.lower()
test['email'] = test['email'].str.lower()
original_training_data.head()
After some basic data cleaning (replacing NaN values in the subject column with empty strings), the original training data is divided along a 90/10 train-validation split.
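That cleaning step amounts to a fillna on the subject column; roughly:

# Replace missing subject lines with empty strings so later string operations don't hit NaN
original_training_data['subject'] = original_training_data['subject'].fillna('')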
# This creates a 90/10 train-validation split on our labeled data
from sklearn.model_selection import train_test_split
train, val = train_test_split(original_training_data, test_size=0.1, random_state=42)
Below is a function called words_in_texts that takes in a list of words and a pandas Series of email texts. It outputs a 2-dimensional NumPy array containing one row for each email text; each row contains a 0 or a 1 for each word in the list: 0 if the word doesn't appear in the text and 1 if it does. This is particularly useful for comparing spam vs. ham distributions of specific words found in the 'email' column of the train dataset.
def words_in_texts(words, texts):
    '''
    Args:
        words (list): words to find
        texts (Series): strings to search in

    Returns:
        NumPy array of 0s and 1s with shape (n, p) where n is the
        number of texts and p is the number of words.
    '''
    indicator_array = np.zeros((len(texts), len(words)))
    for t in range(len(texts)):
        for w in range(len(words)):
            if words[w] in texts[t]:
                indicator_array[t][w] = 1
    return indicator_array.astype(int)
As an example, the bar plot below was created from the 'email' column of the train dataset, in conjunction with the words_in_texts function:
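The figure itself isn't reproduced here, but a plot of that kind can be built from the words_in_texts output roughly as follows (the example words below are placeholders, not necessarily the ones in the original figure):

# Illustrative words only; the word list behind the original figure isn't reproduced here
example_words = ['html', 'body', 'money', 'offer', 'please', 'thanks']
indicators = words_in_texts(example_words, train['email'].values)
plot_df = pd.DataFrame(indicators, columns=example_words)
plot_df['type'] = np.where(train['spam'].values == 1, 'Spam', 'Ham')
# Long format lets seaborn average the 0/1 indicators into per-class proportions
melted = plot_df.melt('type', var_name='word', value_name='appears')
sns.barplot(x='word', y='appears', hue='type', data=melted)
plt.ylabel('Proportion of emails containing the word')
plt.show()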
At this point, I ventured to create a data frame of the 200 most commonly used words in the 'email' column of the train dataset, along with each word's overall distribution, its distribution over strictly spam vs. strictly ham emails, and an added column of the spam-to-ham ratio for clarity.
val = val.reset_index(drop=True)
Y_train = train['spam']
Y_val = val['spam']
# Finding the most used words in our training email set (taking the top 200)
all_words = []
words_lower = train['email'].str.lower()
words_split = words_lower.str.split()
for word_list in words_split:
    all_words += word_list
top_words = pd.Series(all_words).value_counts().head(200).index.values
# Creating a data frame of the top 200 words, along with their distribution between spam/ham emails
top_words_df = pd.DataFrame()
top_words_df['top_words'] = top_words
top_words_df['distribution spam'] = words_in_texts(top_words_df['top_words'], train[train['spam'] == 1]['email'].values).mean(axis=0)
top_words_df['distribution ham'] = words_in_texts(top_words_df['top_words'], train[train['spam'] == 0]['email'].values).mean(axis=0)
top_words_df['ratio spam/ham'] = top_words_df['distribution spam'] / top_words_df['distribution ham']
top_words_df['distribution'] = words_in_texts(top_words_df['top_words'], train['email'].values).mean(axis=0)
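Sorting this table by the ratio column gives a quick feel for which words lean spam versus ham, for example:

# Words most skewed toward spam, and toward ham, by the ratio column
display(top_words_df.sort_values('ratio spam/ham', ascending=False).head(10))
display(top_words_df.sort_values('ratio spam/ham').head(10))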
In order to decide which words would have a greater effect on my model's predictive ability, I chose the words from top_words_df whose distribution over spam emails was at least twice that over ham emails (or vice-versa), and filtered out the others. I also dropped words that were not particularly prevalent across the overall email data set, or were too specific to certain emails, according to the words_in_texts function.
x2_effect = top_words_df[(top_words_df['ratio spam/ham'] >= 2) | (top_words_df['ratio spam/ham'] <= 0.5)]
x2_effect_filtered = x2_effect[x2_effect['distribution']>=0.1]
x2_effect_filtered = x2_effect_filtered.reset_index(drop=True)
The resulting data frame x2_effect_filtered consists of the 33 "top words" that meet my desired ratio/distribution thresholds. For visualization:
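The figure isn't reproduced here; roughly, it can be recreated as a grouped bar plot of each selected word's spam and ham appearance rates:

# Spam vs. ham appearance rates for each of the 33 selected words
viz = x2_effect_filtered.melt(id_vars='top_words',
                              value_vars=['distribution spam', 'distribution ham'],
                              var_name='class', value_name='proportion')
plt.figure(figsize=(14, 6))
sns.barplot(x='top_words', y='proportion', hue='class', data=viz)
plt.xticks(rotation=90)
plt.xlabel('word')
plt.show()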
Furthermore, I implemented a process called "tf-idf", or "term frequency–inverse document frequency" (https://en.wikipedia.org/wiki/Tf–idf), which allowed me to weight my chosen words by how often they appear in the email dataset, instead of relying on the words_in_texts function, which only indicates whether a word was used in an email or not.
# Term frequency–inverse document frequency:
def tf_idf(t, D):
    # N documents; tf is a smoothed, length-normalized count of t per email,
    # and idf down-weights terms that appear in many documents
    N = len(D)
    totalwords = D['email'].str.split().apply(len)
    frequency = D['email'].str.count(t)
    tf = np.log((1 + frequency) / totalwords)
    idf = np.log(N / (sum(frequency > 0) + 1))
    return idf * tf

# The features are the 33 filtered "top words" selected above
test_words = x2_effect_filtered['top_words'].values

train_tf_idf = pd.DataFrame()
val_tf_idf = pd.DataFrame()
for k in test_words:
    train_tf_idf[k] = tf_idf(k, train)
    val_tf_idf[k] = tf_idf(k, val)
from sklearn.linear_model import LogisticRegression

mymodel = LogisticRegression()
mymodel.fit(train_tf_idf, Y_train)
result_train = mymodel.predict(train_tf_idf)
result_val = mymodel.predict(val_tf_idf)
# Training and validation accuracy
display(sum(result_train == Y_train) / len(Y_train))
display(sum(result_val == Y_val) / len(Y_val))
As evidenced above, my predictive model, based on the tf-idf output for the 33 chosen "top words" features, achieved training and validation accuracies of ~91.3% and ~92.1%, respectively.
Below is a triangular heat map of the features used in my predictive model: the most common/effective words for predicting spam vs. ham, based on overall word frequency as well as distribution across all emails in the data set. Notice that most of the features have a relatively high correlation value, and that all of the features are positively correlated. This implies there is a fair amount of overlap in the words found in certain emails. To home in on this, I created two additional heat maps below showing the correlation of the features with two specific features: one that appears to be part of an html tag, and another that is plain text. Clearly, the html tag feature correlates more strongly with other likely html tag features, and the plain text feature correlates more strongly with other plain text features.
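The heat maps themselves aren't reproduced here; the triangular one can be drawn from the tf-idf feature matrix roughly as follows (the mask hides the redundant upper triangle):

# Correlation matrix of the tf-idf features, with the redundant upper triangle masked
corr = train_tf_idf.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
plt.figure(figsize=(14, 12))
sns.heatmap(corr, mask=mask, square=True, cmap='viridis')
plt.show()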
Finally, my predictive model was used on the test dataset of 1000 unlabeled emails:
final_test = pd.DataFrame()
for k in test_words:
    final_test[k] = tf_idf(k, test)
test_predictions = mymodel.predict(final_test)
This successfully resulted in an overall test accuracy of 90.7%.
For this project, I found the most commonly used words in my email training data set, then visualized the distribution of their usage between ham and spam emails using the words_in_texts function and a bar chart. From there, I wanted to choose words from my list of most commonly used words that I felt would have a greater effect on my model's predictive ability. I kept the words whose distribution over spam emails was at least twice that over ham emails (or vice-versa), and filtered out the others. I noticed some of the words on my list seemed much too specific to only a handful of emails, so I further filtered out the words with low overall email distribution using the same words_in_texts function (so as to avoid overfitting on my training set). I then used a process called "tf-idf", or "term frequency–inverse document frequency" (https://en.wikipedia.org/wiki/Tf–idf), which allowed me to weight my chosen words by the number of times they appeared in the email dataset, instead of using the words_in_texts function, which only indicates whether a word was used in an email or not.
Initially, I tried to use a regex/string function to extract the number of html tags in each email in the dataset, but this approach was not as successful as I had hoped. The difference in the distribution of html tags over spam vs. ham emails is not as large as I had initially suspected, and there are a number of outliers in the set of "ham" emails with an extremely large number of html tags (for example, the fourth email in the train data frame has >1000 html tags and is NOT spam!). I tried to combine this approach with some other features, such as whether or not an email was a response, and the length of the emails. However, I was only able to achieve an accuracy of ~84% with this approach; a rough sketch of what it looked like is below.
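For reference, that abandoned feature set looked roughly like the sketch below; the regex and the reply check are approximations of what I tried, not the exact code:

# Rough per-email counts used in the abandoned feature set
html_tags = train['email'].str.count(r'<[^>]+>')    # approximate html tag count
email_length = train['email'].str.len()             # raw character length
is_reply = train['subject'].str.contains(r'^re:', case=False, na=False).astype(int)
extra_features = pd.DataFrame({'html_tags': html_tags,
                               'email_length': email_length,
                               'is_reply': is_reply,
                               'spam': train['spam']})
# Compare average values across spam and ham emails
display(extra_features.groupby('spam').mean())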
There were a number of surprising aspects that arose in my search for good features. Before manually creating a list of the most commonly used words in my dataset, I looked up a number of websites with lists of "common spam words." However, after inputting these words into the words_in_texts function, it turned out that most of them were not particularly prevalent in the dataset. I was also surprised by some of the words that made it onto my list of the most common words with the greatest effect on prediction. For example, the word "your", to my eye, does not seem like a word one would expect to find more often in spam than in ham emails, and yet its distribution ratio (spam/ham) is actually quite large.
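Checking any individual word is just a row lookup in top_words_df:

# Inspect the spam/ham split for the word 'your'
display(top_words_df[top_words_df['top_words'] == 'your'])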
Were I to do this project again, I would likely use some version of k-fold cross-validation, rather than a single train/validation split, for assessing validation error. I would also take another look at the distribution of word counts within the emails, as well as character counts (specifically special characters), to see how the distributions compare over spam vs. ham emails, and then add features accordingly. As seen in the heat map above, a number of the word features that I chose were highly correlated with others, so I would want to eliminate those with correlations deemed too great, so as to avoid redundant features and narrow down the number of features used in the overall model. Moreover, I would create more than one model, each with a different number of the chosen features (i.e. a 1-feature model, a 2-feature model, a 3-feature model, etc.), and compare validation error across them so as to choose the best number of features for the final predictive model.
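As a rough sketch of what that cross-validation would look like with scikit-learn's cross_val_score (5 folds chosen arbitrarily, reusing the tf-idf features and labels from above):

from sklearn.model_selection import cross_val_score

# Mean accuracy over 5 folds gives a more stable estimate than a single 90/10 split
cv_scores = cross_val_score(LogisticRegression(), train_tf_idf, Y_train, cv=5, scoring='accuracy')
display(cv_scores.mean())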