Avoiding The Blind Spots Of Missing Data With Machine Learning
Machine learning is often thought to be too complicated for everyday development tasks. We often associate it with things like big data, data mining, data science, and artificial intelligence. Sometimes it feels like this:
Machine Learning is hard
I have always felt like we can benefit from using machine learning for simple tasks that we do regularly.
At Mailgun, we work with email, and as part of our offering we parse HTML quotations. This allows a user to grab the latest reply instead of the entire conversation, which is returned as part of our webhook response. You can read more about how we handle inbound message processing in our documentation.
For those of you who don’t know, here’s what parsing HTML from the public Internet looks like:
Parsing HTML from public internet
It’s messy and sometimes processes get stuck.
Changing the parsing library can help, but it won’t solve the issue completely because every library has its limitations. You have to restrict the parsing to something reasonable.
But what should the criteria and threshold be? Should we limit by HTML length or tag count? Maybe both? Maybe by something else? The objective, obviously, is to process as many messages as possible without shooting ourselves in the foot, but the path there isn't obvious.
That's where cluster analysis and statistical classification come in handy. For this research I used R, but depending on your task and personal preferences you might use something else – scikit-learn, Weka, MOA, etc.
First, we logged HTML length, tag count, and message processing time, and put them into a CSV file:

```
htmllen,tagscount,took
22893762,85527,34.300139904
331378,518,0.0368919372559
419105,413,0.0545339584351
...
```
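The instrumentation for collecting such a dataset can be sketched in a few lines of Python. This is a hypothetical example (the `parse` argument stands in for whatever HTML-parsing routine is being measured); the three columns match the dataset above:

```python
import csv
import time

def log_message_stats(html, writer, parse):
    """Time one parse and record the three features we studied:
    HTML length, a rough tag count, and processing time in seconds."""
    start = time.time()
    parse(html)
    took = time.time() - start
    writer.writerow([len(html), html.count('<'), took])

with open("messages.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["htmllen", "tagscount", "took"])
    for html in ["<p>hi</p>", "<div><p>a reply</p></div>"]:
        log_message_stats(html, writer, parse=lambda s: None)  # stub parser
```

Counting `<` characters is only an approximation of the tag count, but it is cheap and, as we'll see later, good enough.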
The vast majority of messages take fractions of a second to process. So, when collecting the dataset, we had to make sure that we have enough “slow” messages.
We ended up with two csv files, collected on different days. One had 13831 lines and was reserved for analysis and model-training (the train dataset). Another had 12149 lines and was reserved for model validation (the validation dataset).
Generally, you want at least two datasets – one for training and one for validation. Otherwise you might run into an overfitting problem, where your model is well adjusted to the training data but fails in the real world.
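A minimal sketch of such a split in Python (shuffling first so the two sets aren't biased by collection order; the 70/30 ratio is just an illustrative choice):

```python
import random

def split_dataset(rows, train_fraction=0.7, seed=42):
    """Shuffle and split rows so the model is evaluated on data it
    has never seen during training (guards against overfitting)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed for reproducibility
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

train, validation = split_dataset(range(100))
print(len(train), len(validation))  # 70 30
```

In our case the two CSV files were simply collected on different days, which additionally checks that the model holds up over time.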
To visualize the data and look for patterns, we first tried k-means clustering:
```r
messages <- read.csv(file="messages080816.csv", sep=",", head=TRUE)
mydata <- matrix(messages$took, ncol=1)

## Determine the number of clusters
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")
```
As you can see, the within-groups sum of squares drops sharply up to 4 clusters. After that there is no real improvement.
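The "elbow" logic above can be illustrated without R. The sketch below uses a deterministic stand-in for k-means (sorting 1-D values and cutting at the widest gaps – not the real algorithm, just enough to show how the within-groups sum of squares falls as the cluster count grows):

```python
def wss(groups):
    """Within-groups sum of squares: how tight the clusters are."""
    total = 0.0
    for g in groups:
        mean = sum(g) / len(g)
        total += sum((v - mean) ** 2 for v in g)
    return total

def split_at_largest_gaps(values, k):
    """Toy 1-D 'clustering': sort, then cut at the k-1 widest gaps
    between consecutive values."""
    vs = sorted(values)
    by_gap = sorted(range(1, len(vs)), key=lambda i: vs[i] - vs[i - 1])
    cuts = sorted(by_gap[-(k - 1):]) if k > 1 else []
    groups, start = [], 0
    for c in cuts + [len(vs)]:
        groups.append(vs[start:c])
        start = c
    return groups

# toy processing times: a fast group, a medium group, and a slow tail
times = [0.03, 0.05, 0.04, 0.06, 6.5, 6.8, 7.0, 30.0, 32.0, 34.0]
for k in (1, 2, 3, 4):
    print(k, round(wss(split_at_largest_gaps(times, k)), 3))
```

Each extra cluster reduces the WSS, but the reductions shrink quickly – the point where they flatten out is the elbow.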
The next step was to figure out how the data points get distributed between the clusters:
```r
## K-Means Clustering with 4 clusters
fit <- kmeans(mydata, 4)

## Cluster Plot against 1st 2 principal components
## vary parameters for most readable graph
library(cluster)
clusplot(mydata, fit$cluster, color=TRUE,
         shade=TRUE, labels=2, lines=0)
```
And you can somewhat anticipate the problem already: the clusters form by nipping off the data points that are far away, while the interval we're interested in (1-20 sec) sits right in the middle. The issue persists as the number of clusters increases.
Moreover, there is a significant overlap between the clusters in that interval:
```r
## get cluster mean, min, max
mean <- aggregate(mydata, by=list(fit$cluster), FUN=mean)
min <- aggregate(mydata, by=list(fit$cluster), FUN=min)
max <- aggregate(mydata, by=list(fit$cluster), FUN=max)
```
Compare clusters 1 and 4.
At this point, we decided to try a different approach and look at the percentiles for message processing time:
```r
percentiles <- quantile(messages$took, seq(0.5, 0.99, 0.01))
plot(seq(0.5, 0.99, 0.01), percentiles)
```
As you can see, after the 78th percentile the processing time quickly bubbles up. Here was our first threshold – the 78th percentile, which corresponded to 6.5 seconds.
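The same percentile lookup can be done in pure Python with the standard library. A sketch on made-up data (the mostly-fast distribution with a slow tail mimics the shape described above):

```python
import statistics

def percentile_threshold(times, pct):
    """Value below which `pct` percent of processing times fall.
    statistics.quantiles with n=100 returns the 1st..99th percentiles."""
    qs = statistics.quantiles(times, n=100)
    return qs[pct - 1]

# toy data: 90% fast messages plus a slow tail
times = [0.05] * 90 + [8.0, 12.0, 20.0, 35.0, 60.0] * 2
print(percentile_threshold(times, 78))
print(percentile_threshold(times, 95))
```

With real data you would look for the percentile where the curve starts bending upward, as in the plot above.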
All data points that took less than 6.5 sec were marked as "fast" and the rest as "slow".
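The labeling step itself is trivial; a sketch in Python:

```python
FAST_THRESHOLD_SECS = 6.5  # the 78th-percentile threshold from the analysis

def label(took):
    """Binary class used to train the classifier."""
    return "fast" if took < FAST_THRESHOLD_SECS else "slow"

print([label(t) for t in (0.03, 6.4, 6.5, 34.3)])
# ['fast', 'fast', 'slow', 'slow']
```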
For classification, we tried SVM (Support Vector Machines), Random Forests, and CART (Classification And Regression Trees).
CART showed slightly better results for this task, but its main advantage is that it gives you a decision tree that is easy to understand, explain, and implement, whereas SVM and Random Forests work like black boxes and require heavy ML libraries in production.
Here’s how you classify using CART:
```r
# messages$cls holds the fast/slow label from the previous step
x <- subset(messages, select=-took)
library(rpart)

# grow tree
fit <- rpart(cls ~ ., method="class", data=x)

# display the results
printcp(fit)

# detailed summary of splits
summary(fit)

# plot tree
plot(fit, uniform=TRUE,
     main="Classification Tree for message processing time")
text(fit, use.n=TRUE, all=TRUE, cex=.8)
```
Here’s the decision tree:
And here's what the implementation looks like:

```python
def html_too_big(s):
    return s.count('<') > _MAX_TAGS_COUNT
```
Isn’t it beautiful? All the research complexity in a single line!
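A sketch of how such a guard might sit in front of the parser (the names and the cutoff value here are illustrative, not the real production code – the actual `_MAX_TAGS_COUNT` comes from the decision tree):

```python
_MAX_TAGS_COUNT = 419  # illustrative cutoff, not the real production value

def html_too_big(s):
    return s.count('<') > _MAX_TAGS_COUNT

def parse_quotation(html, parse=lambda s: s):
    """Hypothetical wrapper: messages the model predicts to be slow are
    skipped entirely instead of risking a stuck worker process."""
    if html_too_big(html):
        return None  # caller falls back to the full, unparsed message
    return parse(html)  # `parse` is a stub for the real quotation parser

print(parse_quotation("<p>reply</p>"))  # parsed normally
print(parse_quotation("<br>" * 1000))   # skipped: predicted slow
```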
For evaluation we took the validation dataset and tried to predict the classification using our model:
```r
validate <- read.csv(file="messages080916.csv", sep=",", head=TRUE)
validate <- subset(validate, select=-took)
xv <- subset(validate, select=-cls)
y <- predict(fit, xv, type="class")
table(validate$cls, y)
```
Here’s the confusion matrix and some common classification metrics:
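The metrics themselves are simple arithmetic over the confusion-matrix counts. A sketch with made-up numbers (treating "slow" as the positive class):

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of predicted slow, how many really were
    recall = tp / (tp + fn)      # of truly slow, how many we caught
    return accuracy, precision, recall

# made-up counts, just to show the arithmetic
acc, prec, rec = classification_metrics(tp=90, fp=10, fn=20, tn=880)
print(round(acc, 3), round(prec, 3), round(rec, 3))  # 0.97 0.9 0.818
```

For this task, recall on the "slow" class matters most: a missed slow message is a worker that may get stuck.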
Machine learning is not just for data scientists – even simple decisions powered by ML can benefit you. Know your data, and do not rely blindly on algorithms and models.
Some useful links:

- Great explanation of a train-validate-test workflow
- Quick guide on CART and Random Forests in R
- Quick guide on Cluster Analysis in R
- How to plot in R
Happy machine learning and data mining!
Last updated on August 27, 2020