Avoiding The Blind Spots Of Missing Data With Machine Learning
There are several traditional ways to fight SPAM:
content checking
links checking
block lists
gibberish detection, e.g. auto-generated signup emails
etc.
They all work and are constantly improving, but at times I can’t help feeling like I’m back in primary school being taught that 2 + 2 = 4.
There is so much more to fighting spam! Recently I came to a frightening realization: HAM, i.e. legitimate email, is SPAM you signed up for.
Think about legitimate online stores sending special offers and spammers doing basically the same thing. The content will look very much alike and the links will be classified accordingly.
The only difference is that you signed up for that online store but you didn’t sign up for spam. Unfortunately, that’s something we can’t verify.
So what can we do? While twins look alike, they behave differently…
The research is specifically about how fast spammers send X messages. But obviously, there are many other behavioral patterns to consider.
For the research, we manually classified over 1,000 accounts and collected the following (a sketch of how such features might be derived appears after the list):
time passed before account starts sending
time to send X messages
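The post doesn’t show how these features were extracted, but here’s a minimal sketch, assuming a hypothetical events log with one row per sent message and numeric timestamps (the account, createdat, and sentat columns and the threshold X are all assumptions):

## hypothetical raw log: one row per sent message, numeric timestamps
events <- data.frame(
  account   = c("a1", "a1", "a1", "a2", "a2", "a2"),
  createdat = c(100, 100, 100, 500, 500, 500),
  sentat    = c(150, 160, 170, 5000, 9000, 12000)
)
X <- 3  ## hypothetical message-count threshold; assumes every account sent at least X messages

features <- do.call(rbind, lapply(split(events, events$account), function(e) {
  e <- e[order(e$sentat), ]
  data.frame(
    timepassed = e$sentat[1] - e$createdat[1],  ## time from signup to first send
    time2send  = e$sentat[X] - e$sentat[1]      ## time to send the first X messages
  )
}))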
Here’s what our dataset looked like:
timepassed,time2send,class
252202,961501,legitimate
391006,11291,spam
...
It’s just a CSV file.
For the analysis, I was using R but depending on your task and personal preferences you might use something else – scikit-learn, Weka, MOA, etc.
Two-thirds of the dataset were reserved for training and one third for validation:
x <- read.csv(file="firstXmessages.csv", sep=",", header=TRUE)

## pick rows classified as spam
spam <- x[x$class %in% c("spam"), ]

## shuffle the data points
totalspam <- nrow(spam)
spam <- spam[sample(totalspam), ]

## keep 1 / 3 for validation and 2 / 3 for training
validatespamrows <- floor(totalspam / 3)
validatespam <- spam[sequence(validatespamrows), ]
trainspam <- spam[validatespamrows + sequence(totalspam - validatespamrows), ]

## repeat for legitimate accounts
legit <- x[x$class == "legitimate", ]
totallegit <- nrow(legit)
legit <- legit[sample(totallegit), ]
validatelegitrows <- floor(totallegit / 3)
validatelegit <- legit[sequence(validatelegitrows), ]
trainlegit <- legit[validatelegitrows + sequence(totallegit - validatelegitrows), ]

## merge legitimate and spam data points together and shuffle
train <- rbind(trainspam, trainlegit)
train <- train[sample(nrow(train)), ]

validate <- rbind(validatespam, validatelegit)
validate <- validate[sample(nrow(validate)), ]
validatex <- subset(validate, select=-class)
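After splitting, a quick sanity check that both classes are well represented in both sets doesn’t hurt, e.g.:

table(train$class)
table(validate$class)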
The first model we tried for classification was SVM (Support Vector Machines). Its visualization plot gives you a good understanding of how the data points are distributed.
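The linear-kernel fit itself isn’t shown in the post; presumably it mirrored the calls used for the other kernels further down (the cost value here is an assumption):

library('e1071')
fit <- svm(class ~ ., train, kernel="linear", cost=0.01)
plot(fit, train)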
The space is separated into two classes: spam and legitimate. The red dots and crosses are spam data points. The black dots and crosses are legitimate data points. The crosses correspond to support vectors used to build the hyper-plane dividing the two classes.
The X-axis is “time to send” and the Y-axis is “time passed before sending”. The chart has a very reasonable interpretation: spammers start sending sooner and send faster.
We tried SVM with different kernels, and only the linear one had such a clear and reasonable explanation. For example, here’s how the results for the sigmoid and polynomial kernels looked:
library('e1071')
fit <- svm(class ~ ., train, kernel="sigmoid", cost=0.01)
plot(fit, train)
fit <- svm(class ~ ., train, kernel="polynomial", cost=0.01)
plot(fit, train)
These actually demonstrate pretty well what overfitting is: the model adjusts to the training data but behaves poorly in real life. Look at the polynomial chart – you can literally see how the algorithm reaches out for a red cross far away from its classmates.
At this point we went back to the linear model to proceed with validation. To assess the model we used Precision and Recall.
Here’s a short explanation of those metrics, paraphrasing Wikipedia: Precision is the fraction of instances classified as X that really are X, while Recall is the fraction of actual X instances that get classified as X.
Since in our case it’s OK to let an occasional spammer through (there are other checks in place) but not OK to disable a legitimate account, we focused on:
spam Precision
legitimate Recall
spam Recall
High spam Precision means that the majority of accounts classified as spam are indeed spam. High legitimate Recall means that we don’t misclassify legitimate accounts as spam. High spam Recall means that we catch the majority of spam accounts.
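The Evaluate helper used below isn’t part of base R; a minimal sketch of what it might look like, assuming the confusion matrix has actual classes in rows and predicted classes in columns:

## hypothetical helper: per-class Precision and Recall from a confusion matrix
## assumes rows of cm are actual classes and columns are predicted classes
Evaluate <- function(cm) {
  precision <- diag(cm) / colSums(cm)  ## correct predictions / everything predicted as the class
  recall    <- diag(cm) / rowSums(cm)  ## correct predictions / all actual members of the class
  data.frame(Class=rownames(cm), Precision=precision, Recall=recall, row.names=NULL)
}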
Here’s how the metrics looked for linear SVM:
y <- predict(fit, validatex, type="class")
confusionmatrix <- table(validate$class, y)
print(Evaluate(cm=confusionmatrix))
Class Precision Recall
legitimate 0.7031250 0.6000000
spam 0.8814229 0.9214876
60% legitimate Recall is unacceptably low. It means that 40% of legitimate accounts were misclassified as spam.
The next model we tried was CART (Classification And Regression Tree):
library(rpart)
## grow tree
fit <- rpart(class ~ ., method="class", data=train)
## plot tree
plot(fit, uniform=TRUE,
main="Classification Tree for how fast spammers send")
text(fit, use.n=TRUE, all=TRUE, cex=.8)
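Validation presumably mirrored the SVM case; a sketch reusing the same confusion-matrix flow and the Evaluate helper from above:

y <- predict(fit, validatex, type="class")
confusionmatrix <- table(validate$class, y)
print(Evaluate(cm=confusionmatrix))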
The metrics look significantly better, though legitimate Recall is still fairly low:
Class Precision Recall
legitimate 0.8333333 0.6666667
spam 0.9027237 0.9586777
By shuffling the dataset differently we were able to improve it, but at the price of a more complicated decision tree (see the reproducibility note after the table):
Class Precision Recall
legitimate 0.7215190 0.7600000
spam 0.9243697 0.9090909
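A side note on that shuffling: since the split relies on sample(), every run produces a different shuffle. Pinning the RNG seed before sampling makes a particular split reproducible:

set.seed(42)  ## any fixed value; makes the sample() calls above deterministic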
When your model gets too complex, that’s a pretty good indicator of overfitting. So we decided to stick with the first, simpler CART and see how it fits into the SVM visualization chart.
The lines in the bottom-left corner correspond to the decision tree’s “magic” numbers. I also added another vertical line to make the “spam” area even smaller and more in line with common-sense expectations.
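You can read those split thresholds straight off the fitted rpart object:

print(fit)   ## the text form of the tree shows each split with its threshold
fit$splits   ## the same thresholds in matrix form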
The data points in the resulting quadrants are pretty much all spam! And that’s exactly what we needed: a simple model that makes sense and catches only spammers!
The data points outside of the quadrants are well mixed, i.e. we can’t reliably tell who is a spammer and who is a legitimate sender there. So if we tried to improve the metrics further, we would most likely overfit.
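In other words, the resulting check reduces to two threshold comparisons. A sketch with placeholder values (the real thresholds come from the tree splits and aren’t disclosed here):

## placeholder thresholds; the real values come from the CART splits above
T1 <- 1e5  ## hypothetical max timepassed for the spam quadrant
T2 <- 1e4  ## hypothetical max time2send for the spam quadrant
is_spam <- function(timepassed, time2send) {
  timepassed < T1 & time2send < T2
}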
The next step is to add more features, i.e. more columns to our CSV file. But that’s a topic for another blog post.
To sum up:
Always use common sense to verify your model
Constantly check for overfitting
Know what metrics to use and why
Hardcode / bruteforce your model if it makes sense
Useful links:
Great explanation of a train-validate-test workflow
Quick guide on CART
Happy machine learning and no spam!
Last updated on May 17, 2021