How we built a Lucene-inspired parser in Go
Back in 2011, several customers asked us for a high-level message parsing API that they could use to strip signatures and quotes from an email like the one you see below:
While simple for humans, this is quite a challenging task for machines. One of the main reasons is that there is no standard format for an email message. Different email clients compose replies in different ways, and even within the same email client, the sender can change the format to whatever they choose. For example, users can place their reply after quoting the original message (bottom-posting):
At 10.01am Wednesday, Danny wrote:
> By the way, which systems will be updated? I had some network
> problems after last week's update. Will I have to reboot?
No, you won't have to reboot.
before the quoted original message (top-posting):
No, you won't have to reboot.
-------- Original Message --------
From: Danny <danny@example.com>
Sent: Tuesday, October 16, 2007 10:01 AM
To: Jim <jim@example.com>
Subject: RE: Job
By the way, which systems will be updated? I had some network
problems after last week's update. Will I have to reboot?
or even interleave their reply:
> Can you present your report an hour later?
Yes I can. The summary will be sent no later than 5pm.
Jim
At 10.01am Wednesday, Danny wrote:
>> 2.00pm: Present report
> Jim, I have a meeting at that time. Can you present your report an hour later?
In fact, there are so many different ways to reply that there is even a Wikipedia article about it! All of this makes parsing the body of an email a challenging task.
Even with machine learning, we had to constantly adjust things. Email formatting keeps changing, mobile email clients introduce new signatures like “Sent from your XXX phone”, new edge cases are discovered, and so on.
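To make the difficulty concrete, here is a minimal sketch of the most obvious heuristic: treat lines starting with `>` and "... wrote:" attribution lines as quoted text and drop them. The function name `strip_quoted_lines` and both patterns are assumptions made for illustration, not part of any library. The heuristic copes with the interleaved example above but leaves the top-posted example untouched, because that format marks nothing with `>`.

```python
import re

# Hypothetical patterns, for illustration only.
QUOTE_LINE = re.compile(r"^\s*>")
ATTRIBUTION = re.compile(r"^(On|At)\b.*\bwrote:\s*$")


def strip_quoted_lines(body):
    """Return body without '>'-quoted lines and '... wrote:' attribution lines."""
    kept = [
        line for line in body.splitlines()
        if not QUOTE_LINE.match(line) and not ATTRIBUTION.match(line)
    ]
    return "\n".join(kept).strip()


interleaved = """\
> Can you present your report an hour later?
Yes I can. The summary will be sent no later than 5pm.
Jim
At 10.01am Wednesday, Danny wrote:
>> 2.00pm: Present report
> Jim, I have a meeting at that time. Can you present your report an hour later?"""

top_posted = """\
No, you won't have to reboot.
-------- Original Message --------
From: Danny <danny@example.com>
By the way, which systems will be updated?"""

# Only the reply and the sender's name survive.
print(strip_quoted_lines(interleaved))

# Nothing is removed: the quoted original is not marked with '>'.
print(strip_quoted_lines(top_posted) == top_posted)  # True
```

Every new reply format needs another rule of this kind, which is why a fixed set of patterns never quite keeps up.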
Here is a simple example. Back in the day, all email signatures were separated with dashes:
So the first thing that comes to mind is to write a regex to detect dashes as a signature splitter and extract lines after it as a signature:
>>> import re
>>> signature = re.search(r"(^[\s]*--*[\s]*[a-z \.]*$).*", message, re.I | re.M | re.S)
But the next thing you know you get an email like this:
And your parser strips off the most important part of the email. It’s a very simple example and you could easily work around it. But in real life, things get much more complicated and tricky.
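A small, self-contained sketch of that failure mode, using Python's standard `re` module. The pattern is a simplified dash-separator rule and both sample messages are made up for illustration; neither is taken from talon.

```python
import re

# Naive rule: a line made only of two or more dashes starts the signature,
# and everything after it is treated as the signature.
DASH_SEPARATOR = re.compile(r"^[ \t]*--+[ \t]*$", re.MULTILINE)


def naive_split(message):
    """Split message into (body, signature) at the first dash-only line."""
    match = DASH_SEPARATOR.search(message)
    if match is None:
        return message, ""
    return message[:match.start()].rstrip(), message[match.end():].strip()


classic = "See you tomorrow.\n--\nJohn Doe\nAcme Inc."
tricky = "Here is the agenda:\n-----\n1. Budget\n2. Hiring\n-----\nThanks,\nJohn"

# Works as intended: ('See you tomorrow.', 'John Doe\nAcme Inc.')
print(naive_split(classic))

# The whole agenda is cut off as if it were a signature.
print(naive_split(tricky))
```

In the second message the dashes frame the content rather than the signature, so the rule throws away exactly the part the reader cares about.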
We did a lot of research, looked at all the variations of email that pass through Mailgun, and came up with a solution based on some machine learning techniques. The solution has been in production for several years now, undergoing bug fixes and enhancements. Overall we have received positive feedback from customers, though naturally, developers tend to point out where you could improve.
So now you all have the chance to help improve the solution. Because of the constantly changing and distributed landscape of email, we’ve decided to tackle this problem with a distributed solution: we’re open sourcing our library so we can hack on this together!
We’re calling our new library talon after a multipurpose robot designed to perform missions ranging from reconnaissance to combat and operate in a number of hostile environments.
In case you want to start testing it right away, we’ve prepared a simple Demo app and a QuickStart Guide for you. Otherwise read on for a more general overview, approaches we took, and assessment results.
Here’s what the most common workflows look like:
Currently, we use machine learning only to classify signature lines. The rest of the library consists of various heuristics and sanity checks we came up with while working on support tickets and analyzing message formatting patterns and trends.
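As an illustration of what classifying signature lines can involve, the sketch below turns one line of text into a small binary feature vector of the kind a classifier could consume. The feature set, the regular expressions, and the name `line_features` are all hypothetical and chosen for illustration; they are not the features talon actually uses.

```python
import re

# Hypothetical patterns, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
URL = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
PHONE = re.compile(r"\+?\d[\d\s().-]{6,}\d")
DASHES = re.compile(r"^\s*[-_=*]{2,}\s*$")


def line_features(line):
    """Map one email line to a list of 0/1 features (hypothetical feature set)."""
    words = line.split()
    return [
        int(bool(EMAIL.search(line))),   # contains an email address
        int(bool(URL.search(line))),     # contains a URL
        int(bool(PHONE.search(line))),   # contains a phone-like number
        int(bool(DASHES.match(line))),   # is a separator line
        int(0 < len(words) <= 4),        # is a short line
        int(bool(words) and all(w[:1].isupper() for w in words)),  # every word capitalized
    ]


print(line_features("John Doe"))                                    # [0, 0, 0, 0, 1, 1]
print(line_features("jdoe@example.com | +1 (555) 010-9999"))        # [1, 0, 1, 0, 0, 0]
print(line_features("By the way, which systems will be updated?"))  # [0, 0, 0, 0, 0, 0]
```

Vectors like these, labeled as signature or non-signature, are the kind of input a line classifier is trained on.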
The machine learning part of the library is inspired by the following research papers:
http://www.cs.cmu.edu/~vitor/papers/sigFilePaper_finalversion.pdf
http://www.cs.cornell.edu/people/tj/publications/joachims_01a.pdf
To classify signature lines we used an SVM with a linear kernel. To assess our classifiers we used 5-fold cross-validation:
The dataset consisted of 2912 email lines. Out of 1030 signature lines, 954 were classified correctly. Out of 1882 non-signature lines, 147 were mistaken for signature lines. Overall this gives us a 92% success rate and a 78% area under the ROC curve, which could be regarded as excellent and fair, respectively.
When we modified the library for open sourcing, we tried to provide a sturdy skeleton while making it easy to add more meat. From experience, the parts that could use the most focus are the regexps for quotations and signature separators, and the extraction of HTML quotations by HTML tags. However, you are certainly welcome to contribute to any part of the library you like.
We hope that you’ll find the library useful and that it makes your life easier.
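The figures above can be checked with a few lines of arithmetic. The sketch below derives the confusion-matrix counts from the numbers reported in this post and computes the overall accuracy, plus precision and recall for the signature class. The variable names are illustrative, and precision and recall are derived here rather than reported in the original assessment.

```python
# Counts reported in the assessment.
signature_total = 1030
signature_correct = 954        # true positives
non_signature_total = 1882
false_positives = 147          # non-signature lines labeled as signature

# Derived counts.
false_negatives = signature_total - signature_correct     # 76
true_negatives = non_signature_total - false_positives    # 1735
total = signature_total + non_signature_total             # 2912

accuracy = (signature_correct + true_negatives) / total
recall = signature_correct / signature_total
precision = signature_correct / (signature_correct + false_positives)

print(f"accuracy:  {accuracy:.1%}")   # 92.3%
print(f"recall:    {recall:.1%}")     # 92.6%
print(f"precision: {precision:.1%}")  # 86.6%
```

The accuracy matches the 92% success rate quoted above. The area under the ROC curve cannot be recomputed from these counts alone, since it depends on the classifier's scores for each line.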
Happy Sending!
Last updated on May 17, 2021