At Mailgun, we have numerous systems generating a ton of events every hour of the day. The volume is so large that no team of people could sift through the Elasticsearch results by hand and expect consistent answers, or keep their sanity while trying.
Like many companies, we still need to find a way to deal with all the data. This makes for a substantial task: create a system that allows our internal customers to react to events in real-time.
Traditionally, options for solving such a problem include:
Feeding the data into a database and then running database queries, often on a fixed time interval
Map-reduce solutions like Hadoop
Mechanical Turk
Real-time filters
We pride ourselves on doing a lot with very little, so we set out to see how we could make real-time filters work for us. We focused on building a system that allows essentially anyone in the engineering and support teams to write rules that react and perform actions on their behalf.
When designing this system, we tried to build something that would give everyone the tools to rapidly iterate and respond to threats to our systems. It also had to conform to the following rules:
The users should be able to add their own rules at will
The domain-specific language (henceforth referred to as the DSL) should be as simple as possible, and ideally look like something they already know.
People should be able to test their rules on historical data before applying them.
We use Kibana internally, so something that could be tested against it to satisfy point 3 was ideal. This leads us to a straightforward conclusion: we should write a Lucene-inspired DSL parser!
The first iteration of the system relied on regular expressions and lots of character-by-character parsing. It got the job done, but subsequent iterations rapidly grew in complexity. If you've ever done any large string-parsing project, you know what we mean.
Moving forward, we knew we had to build something that could be iterated upon quickly, and ideally, conformed to a known standard.
Those of us who have taken Automata Theory courses in college probably remember grammars in the context of programming languages. Furthermore, it's a given that some readers will have far more expertise on the subject than your humble author, so any in-depth explanations are deferred to your friendly neighborhood search engine.
It's still useful to talk briefly about what makes a grammar and why they're useful. To quote Wikipedia, a Formal Grammar is "...a set of production rules for strings in a formal language. The rules describe how to form strings from the language's alphabet that are valid according to the language's syntax. A grammar does not describe the meaning of the strings or what can be done with them in whatever context—only their form."
At the risk of oversimplifying, it lets us take a string like "You've got mail!" and break it down into tokens that we can semantically analyze. In other words, if our grammar is well defined, we can then write additional code to determine the meaning of that string.
The history and types of grammars can fill tons of books and are thus well beyond the scope of a blog post. Instead, we'll zero in on context-free grammars and more specifically, parsing expression grammars.
Quoting Wikipedia again, a Context-Free Grammar is "a certain type of formal grammar: a set of production rules that describe all possible strings in a given formal language. Production rules are simple replacements."
To put it another way, a context-free grammar defines a graph which in turn defines how to parse a language. As mentioned above, it does not tell us how to understand the language. It's also worth mentioning, as we'll see below, that the graph may contain cycles, but parsing still eventually terminates.
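To make that concrete, here's a toy grammar of our own invention (not part of anything that follows), written informally in the same arrow notation we'll use later. Each line is a production rule; starting from Greeting and repeatedly replacing rule names with their definitions yields exactly the strings "You've got mail!" and "You've got spam!", and nothing else.

Greeting <- "You've got " Thing "!"
Thing    <- "mail"
Thing    <- "spam"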
Context-free grammars have a cousin of sorts in the Parsing Expression Grammar, or PEG. They look a lot like a context-free grammar, or CFG, but come with a very useful distinction: when confronted with an ambiguous choice, the PEG will always choose the first production rule defined. The CFG instead leaves the choice ambiguous, and why bother with ambiguity when we don’t have to?
What this really means is that it's easier to write tools to define PEGs. Fortunately, many people have already done that for us.
After some research, we settled on Pigeon as the basis of our PEG implementation. Pigeon follows the same paradigm as a lot of Go tools by generating code that compiles alongside your program.
Using it is easy enough: You define both the grammar and the Go code to handle each of the rules in the same package, and simply call Parse() on the generated parser. The challenge is to write handlers for each of those rules. Like grammars, the topic of compilation could fill a library with books. Fortunately for us, writing an interpreter is less complex and the resulting code is sufficiently fast for our needs.
Furthermore, Pigeon's syntax aligns closely with Go's, which makes things easier when writing your PEG alongside the Go program that uses the grammar.
Say you have an event generated by sending an email, encoded in JSON, that looks like the following:
{
    "event": "sent",
    "subject": "A special offer just for you!",
    "account": "12345"
}
There are a couple of ways to look at this event: you could decide this is an innocuous email, or you could say "This might be spam, but there's not enough information to decide". This is a very common problem in our system, and always requires lots of aggregated data to come to a conclusion.
So, let's write a rule that will match the above event:
1event:"sent" AND subject:"A special offer just for you!"
This is fairly easy to parse, and better yet, it can also be checked in Kibana!
Now let's walk through constructing a very simple PEG that would match this rule. We'll implement each production rule one at a time and explain what it does. Rule names are typically PascalCase, though numbers and underscores are also allowed. Each rule takes the form:
RuleName <- <rule definition>
Note that there are actually several legal symbols for the rule definition operator, but we like <- as it's easy to type and looks most like grammars as they're defined academically.
The first rule in the grammar is treated as the entrypoint:
Input <- Term !.
Our Input rule says "match the entire input string", with !. meaning "match the end of the input". This passes the entire input string to a rule named Term, which looks like the following:
Term <- Variable AndVar*
This passes control-flow to the Variable production rule:
Variable <- FieldChars+ _ ":" _ Value
Now we're getting a little more complex. First, we greedily match characters for as long as the FieldChars rule is satisfied. Note the + at the end: PEG syntax shares a lot in common with regular expressions, so this says "match the rule FieldChars one or more times".
FieldChars <- [a-z]
Again, if you're familiar with regex syntax, this should be straightforward: simply match any single character between “a” and “z”.
Now back to Variable. The next part, _, is a rule saying "match any whitespace characters". Note that _ is a completely valid identifier for a production rule, and less typing is usually a good thing.
1_ "whitespace" <- [ \n\t\r]*
There are a couple of things to note here:
The string "whitespace" is the "friendly name" and exists for documentation purposes and your future sanity.
As with regular expression character classes, this will match any mix of spaces, newlines, tabs, or carriage returns.
The next part of the Variable rule matches the string literal ":". It doesn't get much easier than that.
Next, we have another whitespace rule, and then finally the Value rule.
1Value <- '"' ValueChars* '"'
This rule matches the string literal '"', then zero or more ValueChars rules, and lastly another quote. ValueChars looks like the following:
ValueChars <- [a-zA-Z0-9 !]
These ValueChars will match the lowercase and uppercase alphabet, the digits, spaces, and the exclamation point.
Why did we define Value this way? Because the rule can strip the double quotes off for us so we don't have to, plus we're lazy. However, it's merely a convenience: we could have omitted it and inlined Value's definition directly into the Variable rule.
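For illustration, that inlined alternative would look something like this (a sketch, not the grammar we actually use); the Variable action would then have to strip the quotes itself:

Variable <- FieldChars+ _ ":" _ '"' ValueChars* '"'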
Finally, AndVar should look fairly straightforward at this point. Note that it refers back to Term and implements the aforementioned cycle.
AndVar <- _ "AND" _ Term
The complete definition looks like the following.
Input <- Term !.
Term <- Variable AndVar*
AndVar <- _ "AND" _ Term
Variable <- FieldChars+ _ ":" _ Value
_ "whitespace" <- [ \n\t\r]*
Value <- '"' ValueChars* '"'
FieldChars <- [a-z]
ValueChars <- [a-zA-Z0-9 !]
The real value of defining your own grammar comes when you provide implementations for the actions of the rules. Let's look at the real PEG definition:
{
package main

import (
    "fmt"
    "strings"
)

type Node interface {
    Evaluate(input Event) (bool, error)
}

type Term struct {
    node Node
}

func (t *Term) Evaluate(input Event) (bool, error) {
    return t.node.Evaluate(input)
}

type AndNode struct {
    Nodes []Node
}

func (n *AndNode) Evaluate(input Event) (bool, error) {
    for _, node := range n.Nodes {
        matched, err := node.Evaluate(input)
        if err != nil {
            return false, err
        }
        if !matched {
            return false, nil
        }
    }
    return true, nil
}

type Variable struct {
    Field string
    Value string
}

func (v *Variable) Evaluate(input Event) (bool, error) {
    fieldVal, ok := input[v.Field]
    if !ok {
        return false, fmt.Errorf("Field '%s' not present in the event", v.Field)
    }

    return fieldVal == v.Value, nil
}

func toString(label interface{}) string {
    var sb strings.Builder
    value := label.([]interface{})
    for _, i := range value {
        if i == nil {
            continue
        }
        switch b := i.(type) {
        case []byte:
            sb.WriteByte(b[0])
        case []interface{}:
            s := toString(i)
            sb.WriteString(s)
        default:
            fmt.Printf("She's dead, Jim %T %+v\n", i, i)
        }
    }
    return sb.String()
}
}

Input <- Term !.

Term <- variable:Variable rest:AndVar* {
    andVars := rest.([]interface{})
    variables := make([]Node, 0, len(andVars))
    variables = append(variables, variable.(Node))
    for _, r := range andVars {
        variables = append(variables, r.(Node))
    }
    return &AndNode{Nodes: variables}, nil
}

AndVar <- _ "AND" _ rightSide:Term {
    return rightSide, nil
}

Variable <- field:FieldChars+ _ ":" _ value:Value {
    return &Variable{Field: toString(field), Value: toString(value)}, nil
}

Value <- '"' value:ValueChars* '"' {
    return value, nil
}

_ "whitespace" <- [ \n\t\r]*
FieldChars <- [a-z]
ValueChars <- [a-zA-Z0-9 !]
There's a lot to unpack here so let's step through it.
At the top, we have a section of code surrounded by curly braces. This is a Pigeon convention: the enclosed code is injected verbatim into the final generated output. You can define types relevant to your grammar here for convenience, or put them in another file, since Pigeon will use the package name you declare here. Everything after that is the definition of the grammar itself.
At this point, you've probably noticed that the rules differ a bit from how we defined them above. Let's take a look at Term.
Term <- variable:Variable rest:AndVar*
This is nearly our Term definition from before, but now we've made it useful. Prefixing a rule in the definition with name: assigns the return value of that rule to name, which is then passed to the function defined by your rule action when the code is generated. All of the code between the curly braces becomes a function associated with that rule and is called when the rule is matched:
func (c *current) onTerm1(variable, rest interface{}) (interface{}, error) {
    andVars := rest.([]interface{})
    variables := make([]Node, 0, len(andVars))
    variables = append(variables, variable.(Node))
    for _, r := range andVars {
        variables = append(variables, r.(Node))
    }
    return &AndNode{Nodes: variables}, nil
}
Here we can see that Pigeon deals entirely in empty interface types. It's great for flexibility but less great for writing the implementation as we have to account for them.
andVars := rest.([]interface{})
variables := make([]Node, 0, len(andVars))
variables = append(variables, variable.(Node))
for _, r := range andVars {
    variables = append(variables, r.(Node))
}
return &AndNode{Nodes: variables}, nil
As mentioned previously, we can have cycles in rules. Given that, we know it's possible to receive an arbitrary number of Terms mapped to the rest variable. Pigeon deals with this by handing us a slice of empty interfaces, so we've created a Node interface that everything has to be type-asserted to before we can use it.
After walking the entire list of AND’d variables, we create an AndNode containing each of the Terms. The definition of AndNode can be found in the code-literal section at the top of the PEG. As an implementer of the Node interface, it defines an Evaluate method which says the expression evaluates to true only if all of the individual terms also evaluate to true.
At this point the rest of the definition should hopefully make sense, so let's skip to running the thing. The following code listing shows parsing our rule, using it, and a couple of extra examples to show failures and error handling:
package main

import (
    "fmt"
)

type Event map[string]string

func Match(rule string, event Event) (bool, error) {
    // Parse is generated by pigeon in this package
    tree, err := Parse("parsing", []byte(rule))
    if err != nil {
        return false, err
    }

    // Yes, this is a bit ugly. It's a consequence of pigeon dealing in interface slices
    iface, ok := tree.([]interface{})
    if !ok {
        return false, fmt.Errorf("Internal Error")
    }

    ast, ok := iface[0].(Node)
    if !ok {
        return false, fmt.Errorf("Internal Error")
    }

    // ast is the tree generated by our supplementary Go defined by the production rules
    return ast.Evaluate(event)
}

func main() {
    rules := []string{
        "event:\"sent\" AND subject:\"A special offer just for you!\"",
        "event:\"found\" AND subject:\"A special offer just for you!\"",
        "event:\"sent\" AND badfield:\"A special offer just for you!\"",
    }

    event := map[string]string{
        "event":   "sent",
        "subject": "A special offer just for you!",
        "account": "12345",
    }

    for _, rule := range rules {
        matched, err := Match(rule, Event(event))
        if err != nil {
            fmt.Printf("Rule '%s' failed with error '%s'\n", rule, err.Error())
        } else {
            fmt.Printf("Rule '%s' matched: %t\n", rule, matched)
        }
    }
}
All you have to do is run:
~> go get github.com/mna/pigeon
~> pigeon rules.peg > rules.go
~> go run .
Rule 'event:"sent" AND subject:"A special offer just for you!"' matched: true
Rule 'event:"found" AND subject:"A special offer just for you!"' matched: false
Rule 'event:"sent" AND badfield:"A special offer just for you!"' failed with error 'Field 'badfield' not present in the event'
Congratulations! You've now written your own language of extremely limited utility!
At this point you may be thinking:
Don't a lot of senders use subjects like that?
Is a rules engine that only matches string literals all that useful?
Couldn't this post have been half as long?
To which we would respond:
We would want to implement some kind of handling for the account field as defined in the original event. Mailgun solves this problem by including business logic in the codebase to support rate-limiting based on a defined field. In other words, we can say "If this rule matches, save the account and current time, and don't perform this action against this account for the next N minutes" (sketched just after these answers). Obviously, if we didn't have this handling and the sender in question sent one million emails, support would be notified one million times. That'd be a great way to drown your support team in noise, and also to get jumped in the parking lot after work.
One could implement rules that match regular expressions and substrings, which would be far more useful than an explicit string match. In fact, we've done exactly that (the sketch below includes a taste of it).
Sorry.
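To make the first response concrete, here's a minimal sketch of that per-account cooldown, along with the regex idea from the second response. The Throttle and RegexVariable types, and everything else here, are our illustration rather than Mailgun's actual implementation; the nice part is that a regex match slots into the same Node interface from the PEG, so the evaluator doesn't change at all.

package main

import (
    "fmt"
    "regexp"
    "sync"
    "time"
)

type Event map[string]string

// Node is the same interface defined in the PEG's code-literal section.
type Node interface {
    Evaluate(input Event) (bool, error)
}

// Throttle remembers when an action last fired for a key (e.g. an account ID)
// and suppresses repeat firings inside the cooldown window.
type Throttle struct {
    mu       sync.Mutex
    lastSeen map[string]time.Time
    cooldown time.Duration
}

func NewThrottle(cooldown time.Duration) *Throttle {
    return &Throttle{lastSeen: make(map[string]time.Time), cooldown: cooldown}
}

// Allow reports whether the action may fire for key, recording the current
// time when it does. One million matching emails now produce one notification
// per cooldown window instead of one million.
func (t *Throttle) Allow(key string) bool {
    t.mu.Lock()
    defer t.mu.Unlock()
    if last, ok := t.lastSeen[key]; ok && time.Since(last) < t.cooldown {
        return false
    }
    t.lastSeen[key] = time.Now()
    return true
}

// RegexVariable evaluates to true when the event field matches a regular
// expression instead of a string literal.
type RegexVariable struct {
    Field   string
    Pattern *regexp.Regexp
}

var _ Node = (*RegexVariable)(nil) // compile-time check that the interface is satisfied

func (v *RegexVariable) Evaluate(input Event) (bool, error) {
    fieldVal, ok := input[v.Field]
    if !ok {
        return false, fmt.Errorf("Field '%s' not present in the event", v.Field)
    }
    return v.Pattern.MatchString(fieldVal), nil
}

func main() {
    throttle := NewThrottle(10 * time.Minute)
    fmt.Println(throttle.Allow("12345")) // true: first match fires the action
    fmt.Println(throttle.Allow("12345")) // false: still inside the cooldown

    rule := &RegexVariable{Field: "subject", Pattern: regexp.MustCompile(`(?i)special offer`)}
    matched, _ := rule.Evaluate(Event{"subject": "A special offer just for you!"})
    fmt.Println(matched) // true
}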
Lastly, you likely noticed that we never defined what the rule actually does when it matches. One possible solution is to have the rule post to a Slack channel when it's triggered. This is a common solution here at Mailgun, and looks like the following:
slack:#channel-name
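Here's a sketch of how dispatching that action might look, using Slack's incoming-webhook API. The action-string parsing, the function name, and the webhook URL are our illustration, not Mailgun's actual code:

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
    "strings"
)

// notifySlack handles an action of the form "slack:#channel-name" by posting
// the message to a Slack incoming webhook. Note that newer Slack webhooks are
// tied to a single channel, so the channel override below only matters for
// legacy webhooks.
func notifySlack(action, webhookURL, message string) error {
    channel, ok := strings.CutPrefix(action, "slack:")
    if !ok {
        return fmt.Errorf("not a slack action: %s", action)
    }
    payload, err := json.Marshal(map[string]string{
        "channel": channel,
        "text":    message,
    })
    if err != nil {
        return err
    }
    resp, err := http.Post(webhookURL, "application/json", bytes.NewReader(payload))
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("slack webhook returned %s", resp.Status)
    }
    return nil
}

func main() {
    // The webhook URL is a placeholder; use one from your own Slack app settings.
    err := notifySlack("slack:#rule-alerts", "https://hooks.slack.com/services/XXX", "rule matched!")
    if err != nil {
        fmt.Println(err)
    }
}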
Hopefully, at this point, your mind is spinning with possibilities. Here are some thoughts we’ve had:
There’s much more of Lucene syntax to emulate, like NOT and OR operators and parentheses (a grammar sketch for OR follows below).
Provide a separate grammar for actions.
Implement templatization to inject values from the event into the action.
Layer on syntactic sugar to make writing some filters easier.
Implement a system to count and react to a specific number of event occurrences over a time period.
Implement a cache, because parsing the rules is somewhat expensive (see the sketch right after this list).
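On that last point, here's a minimal sketch of such a cache, assuming a parseRule helper that holds the Parse-and-assert logic factored out of the Match function above; the RuleCache type is ours for illustration:

// RuleCache memoizes parsed rule ASTs so each rule string only pays the
// parsing cost once. It needs "sync" imported and is safe for concurrent use.
type RuleCache struct {
    mu    sync.RWMutex
    rules map[string]Node
}

func NewRuleCache() *RuleCache {
    return &RuleCache{rules: make(map[string]Node)}
}

func (c *RuleCache) Get(rule string) (Node, error) {
    c.mu.RLock()
    ast, ok := c.rules[rule]
    c.mu.RUnlock()
    if ok {
        return ast, nil
    }

    // Cache miss: parse once and remember the AST for next time.
    ast, err := parseRule(rule)
    if err != nil {
        return nil, err
    }
    c.mu.Lock()
    c.rules[rule] = ast
    c.mu.Unlock()
    return ast, nil
}

Match would then call Get instead of Parse and evaluate the returned AST against each incoming event.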
There are loads more you can do, especially when it comes to fleshing out the action parser. You could call other services, notify customers, turn the lights on and off, and more. When it comes to inventing your own tools based on this, the sky’s the limit.
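As a taste of the first idea in the list above, OR support might be layered on with a couple of extra production rules, something like the sketch below (not our production grammar). Placing OR above Term gives AND the tighter binding, which is the precedence Lucene users expect, and the OrVar action would mirror AndNode with an OrNode that matches if any child matches:

Input  <- OrTerm !.
OrTerm <- Term OrVar*
OrVar  <- _ "OR" _ OrTerm
Term   <- Variable AndVar*
AndVar <- _ "AND" _ Term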
So at this point, you might be asking yourself: "why didn't they use $SOME_OTHER_TOOL?"
The answer is that we don't yet need the power of any of those tools, nor the complexity that frequently comes with operating them. Our current stream-processing tool is a single, simple binary deployed in a container that Just Works™. There's very little to manage, it readily keeps up with our very high-volume event feed, and on top of that:
It's easy to develop
It's still using off-the-shelf, FOSS tools
It can be tailored to exactly our needs
The syntax for our DSL is tiny, easy to articulate and reason about, and it forces everyone to write rules in roughly the same way. It also makes it harder to create unintended side-effects. That said, there are trade-offs:
Language features beyond what we've already implemented represent a substantial leap in complexity.
Updating the source PEG will generate new code, and it really mucks up your code diffs.
Lucene’s deliberate simplicity can make it difficult to write more complex rules.
We may eventually decide that $SOME_OTHER_TOOL is more suitable for our needs, but for the foreseeable future, the power our Mailgun DSL brings to the table is more than sufficient.
At this point, we hope we've demonstrated how writing your own DSL could be useful. Our implementation processes an enormous number of events per day while making the lives of many of our team members much easier (at least that’s what we tell them).
As for the features it supports, we're just getting started. If this has inspired you to look into implementing your own stream processing grammar, please let us know. We'd love to hear about it!