How Google Keeps Your Spam Out of Your Inbox

You’re probably not surprised to find that there’s some interesting math behind all of Google’s information crunching

October 3, 2012

Behind all of Google’s information crunching—from figuring out which search results are the most important, to reading and keeping tabs on your email—there’s some interesting math. And recently Javier Tordable, a software engineer, did a presentation on it, opening a window into the geeky Google world just a crack.

Let’s start with Gmail. Sometimes you get spam mail, but Gmail is pretty good at figuring out that, when a correspondant is trying to get you to invest in a Nigerian prince, you probably don’t want that piece of mail in your inbox. How does it know? Step one: train the machine. Step two: put it to work.

It’s called machine learning, and Google is doing a ton of it. In step one, you have to do what computer scientists call “characterize an instance.” In math-speak that means:

In general, the characteristics of an instance can be considered as elements in a vector of an ndimensional euclidean space for a large n (100-1000 dimensions is normal, 1M-10M is not unheard of)

But here’s how to think about it if you stopped math after Calc 1. Gmail can pull a few key pieces of information from any particular email. How long is it? How many capital letters are there? Is this from someone you’ve gotten an email from before? You don’t want the information required to make the decision to be too hard to get or deal with, because that will slow down and decrease the accuracy of your machine. So Google draws a line, based on what it knows about spam. The emails that get through fall on one side of the line, and the spammy ones, on the other.

More math speak:

A simple classification model is a hyperplane in the space of characteristics. Data instances on one side of the hyperplane are classified as valid emails and instances on the other side are classified as spam.

What about voice searching—also called automated speech recognition, or ASR? Like machine learning, ASR happens in two parts: processing the sound coming in and figuring out what you’re saying. The first part involves Fourier transforms, which isolate the important bits that the computer can translate. The second part is modeling speech using what’s called a “hidden Markov model.” Tordable explains:

In this model the states are the letters of the message and the sequence of events is the sound signal. The Viterbi algorithm can be used to obtain the sequence of states of maximum likelihood.

Google would love to make voice recognition better and easier. In this case study, a group of Google whizzes write:

A goal at Google is to make spoken access ubiquitously available. We would like to let the user choose – they should be able to take it for granted that spoken interaction is always an option. Achieving ubiquity requires two things: availability (i.e., built into every possible interaction where speech input or output can make sense), and performance (i.e., works so well that the modality adds no friction to the interaction).

Another area where Google uses math is in their maps—in the spotlight recently after Apple debuted their mapping system to considerable criticism. At the heart of Google Maps is basic graph theory—the math of getting from one place to another while traveling the shortest distance. But, of course, it’s more complex than that. Tordable writes, “One unique problem is that the graphs used in Google Maps contain millions of nodes, but the algorithms have to run in milliseconds.”

Google won’t tell us how they do that. Otherwise Apple wouldn’t have run into its problem, but the basics involve shucking Dijsktra’s algorithm (probably the most commonly used graph search algorithm). A few years back, computer scientists at the University of Karlsruhe described a new way to rank path queries to get much faster results. They wrote:

Our algorithm preprocesses the eight digit number of nodes needed for maps of the USA or Western Europe in a few hours using linear space. Shortest (i.e. fastest) path queries then take around eight milliseconds to produce exact shortest paths. This is about 2,000 times faster than using Dijkstra’s algorithm.

Tordable goes through a number of other mathematical tools used by Google, including the ones involved in Google Books, Image Searches, Analytics, YouTube, Google Translate, Google Earth, and Picasa. You can see the whole set of slides here.

You Might Also Like

Get the latest stories in your inbox every weekday.