For a long time, the universal translator has been a pop culture obsession. Star Trek presents it as a handheld machine, resembling a microphone, that can translate most languages immediately. The Hitchhiker’s Guide to the Galaxy boasts a Babel fish, a creature that, when stuck inside one’s ear, offers instant translation of any language in the galaxy.
So it should come as no surprise that present-day humans are trying to create a device that works just as well. Dozens of smartphone translation apps exist, but most translate words on a simple, one-to-one basis: a user types or speaks a word and the app bounces back with a translation. Now the goal for engineers and entrepreneurs, and the real monetary windfall, is to make it possible for two people to converse in different languages while a small device spits out translations in real time.
Existing Translation Apps
On a trip to Greece, English-speaking Andrew Lauder fell ill.
“I went to the pharmacy, and they couldn’t understand any English, so I got no meds,” says Lauder, CEO of Vocre Translate. The drug labels were quite literally Greek to him. Language barriers are common for world travelers. In a foreign country, small transactions like buying medicine or getting directions (another difficulty Lauder faced) become herculean tasks.
When he returned stateside, Lauder created Vocre Translate, a voice and text translation app. It began as a text-to-text app (called MyLanguage), then transformed into a speech-to-text model that initially, like other apps including SayHi Translate, used a traditional model in which a word translates directly to another word. Say “Hello,” and the smartphone or tablet chirps back an automated “Hola.” “Goodbye” becomes “Sayonara.” And so on, much like a text translator.
To create the simple audio translation, the creators of these apps needed data. Vocre pulled its information from public domain recordings and documents, such as old films or public hearings. “We basically begged a voicemail transcription service to let us use their cloud for speech recognition,” says SayHi CEO Lee Bossier.
Once the engineers had audio and text data, they paired the audio and text, word for word. Voice recognition software recognizes “cheese” and converts it into text. That’s converted into French, and the app finds the French pronunciation for “fromage.”
That said, if a user cheekily calls something “cheesy,” the translator doesn’t work as well, because spoken language isn’t nearly as static as written language. Cadence, slang, inflection, pronunciation, dialect and conversational flow can all change meaning.
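The word-for-word lookup described above, and the way it stumbles on “cheesy,” can be sketched in a few lines of Python. This is only an illustration: the tiny `LEXICON` table and the `translate_word_for_word` function are hypothetical names invented here, not code from Vocre or SayHi, and the sketch starts from text that speech recognition has already produced.

```python
# Hypothetical English-to-French lexicon, invented for illustration.
# A real app would draw on far more data than three entries.
LEXICON = {
    "cheese": "fromage",
    "hello": "bonjour",
    "goodbye": "au revoir",
}


def translate_word_for_word(text, lexicon=LEXICON):
    """Translate each word on its own, with no regard for context.

    Words missing from the lexicon are returned in angle brackets
    with a question mark so the failure is visible.
    """
    translated = []
    for raw in text.lower().split():
        word = raw.strip(".,!?")  # drop simple punctuation
        if not word:
            continue
        translated.append(lexicon.get(word, f"<{word}?>"))
    return " ".join(translated)


print(translate_word_for_word("Hello, cheese!"))
print(translate_word_for_word("cheesy"))
```

The first call finds both words in the table. The second does not, because “cheesy” is not “cheese” as far as a lookup table is concerned, and nothing in this approach can guess at slang or tone.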
Over time, though, Lauder wanted a more conversational device. In an email, he says, “Based on our usage data, we've found that people speak very differently from the way they write. Spoken word is much more spontaneous and much less formal and literal.” So he employed statistical machine translation, an approach also used by Google that uses data to find common word usage, forgoing the traditional word-to-word translation model. Basically, Vocre learns as it’s used. “It learns based on every conversation, every phrase that goes through it. It’s something that gets smarter over time,” Lauder says.
At present, both apps take a few seconds to translate, but they are no doubt effective, especially in concert with body language, for transactional conversations like ordering a meal. After all, humans have been ordering food in non-native languages for years and still manage to eat. But they haven’t been able to have in-depth, complex conversations.
With Vocre and SayHi, conversations can stiffly stagger along, but it isn’t the same as chatting in your native language. Google intends to change this entirely.
Google’s Approach (Statistical Machine Translation)
When learning a new language in school, we begin with individual vocabulary terms. But language is more fluid—words need context.
“The approach [Google] takes is a more general approach,” says Josh Estelle, a software engineer for Google Translate. “Instead of trying to hardcode all these rules, we try to learn the rules by looking at data.”
The tech company avoids the one-to-one, word-for-word method and instead employs statistical machine translation, looking not at what words mean but how language is modeled, which it learns through data. So, it aims for the forest, not the trees. An English example: we know the definitions of the words “break” and “up.” But the phrase “break up” is not the literal combination of the two words.
Statistical machine translation requires data. Mountains of it. For the method to work, it needs not just the fact that “fromage” is French for “cheese” but 100 examples of both “fromage” and “cheese” being used in actual sentences.
Estelle says if an English speaker has two menus, identical save for the fact that one’s printed in English and one in Chinese, “you can probably figure out what the Chinese character is for ‘soup.’” Context is king. But to create that context, you need access to millions of menus, and every other document imaginable.
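Estelle’s menu example can be imitated with a toy experiment. The sketch below is not Google’s method; it is a much simpler stand-in that counts how often words appear on matching menu lines and scores each candidate with a Dice coefficient. The six English and French menu lines and the names `PARALLEL_MENUS` and `best_translation` are made up for illustration.

```python
from collections import Counter

# Hypothetical matching menu lines (English, French), invented for this sketch.
PARALLEL_MENUS = [
    ("tomato soup", "soupe tomate"),
    ("onion soup", "soupe oignon"),
    ("tomato salad", "salade tomate"),
    ("cheese plate", "assiette fromage"),
    ("onion tart", "tarte oignon"),
    ("cheese tart", "tarte fromage"),
]


def best_translation(word, pairs):
    """Guess a translation for `word` from co-occurrence alone.

    Each target word is scored with the Dice coefficient:
    2 * (lines containing both) / (lines with source + lines with target).
    Returns None if the word never appears in the data.
    """
    source_count = 0
    target_counts = Counter()   # lines containing each target word
    together = Counter()        # lines containing both source and target word
    for source_line, target_line in pairs:
        target_words = set(target_line.split())
        target_counts.update(target_words)
        if word in source_line.split():
            source_count += 1
            together.update(target_words)
    if source_count == 0:
        return None

    def dice(target):
        return 2 * together[target] / (source_count + target_counts[target])

    return max(together, key=dice)


print(best_translation("soup", PARALLEL_MENUS))
print(best_translation("cheese", PARALLEL_MENUS))
```

No dictionary is consulted anywhere. The program settles on a match for “soup” only because one French word shows up on exactly the lines where “soup” does, which is the sense in which context does the work. With so few lines the guesses are fragile, which is why the real thing needs mountains of data.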
Which is exactly what Google has. Without the web giant to gather heaps of data, a real-world Babel fish couldn’t exist. It crawls the web and collects everything—text and audio. Then, it feeds this data into algorithms that compare everything to everything else. These comparisons help get to the root of how language naturally works.
“One thing that surprises people when we talk about Translate is our team doesn’t have any linguists on it,” Estelle says. “We’ve launched 71 languages, and I would say our team doesn’t know how to speak the vast majority of them. A human translator is not going to be able to learn all these terms and things as fast as our [data] can learn from the web.”
What’s the Point?
Like Google, Facebook sees benefits. Consider the social media site’s own foray into translation.
“The mission of Facebook has been connecting the entire world, and one of the barriers of connecting the world is not everyone speaks the same language,” says Tom Stocky, a director of engineering at Facebook. “On the translation side, I think the really ambitious vision for the future is if you could use Facebook in your native language and interact with any other language.”
This past August, Facebook acquired Jibbigo, a speech-to-speech translation app that’s available for Android and iOS devices.
Keen Facebook users will note that the social site already employs some translation. If you’ve ever had a Spanish post on your English-based page, you’ve immediately been given the opportunity to translate it into your native tongue.
But Stocky sees the voice component as a potential game changer. The rise of smartphones and tablets welcomes a perpetually interconnected world, and the rise of speech recognition software is inviting new means of web interaction. Stocky envisions a future in which users can just speak a command to their smartphones and interact with other users, language differences aside.
“There’s no question that will happen eventually, because the only limitations there are the power of the language engine and of course processing time and processing power,” he says.
Laura Murphy, a professor in the department of global health systems and development at Tulane University and an admitted technology skeptic, questions the value of a universal translator and of no longer needing to know more than one language.
She thinks the device could be somewhat useful with travel, business and international relations but not groundbreaking. At a certain level, we already have translators (people) in place, and most who work in foreign relations know the appropriate languages. A device, Murphy believes, could have negative consequences.
“I think it can make people lazy,” Murphy says. Translating between languages can be mentally challenging because it forces the brain (especially one that knows more than two languages) to work in a different way, but the exercise is rewarding nonetheless. The brain pulls from a place of linguistic empathy that even the finest voice translator could never reach.
While this universal communication could be a positive, Murphy acknowledges, “it might lead to people thinking they’re communicating when they’re not.” Culture is not always completely embodied in language (take sarcasm, for example), and communication is not always about the information being passed.
When Can We Expect to See This Technology?
“In 2005, it took us 40 hours to translate 1,000 sentences,” Estelle says, of Google. “Today, we translate the equivalent of 1,000 sentences every 10 milliseconds.”
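To put those two figures side by side, here is a quick back-of-the-envelope calculation. It takes the quoted numbers at face value and assumes they measure the same amount of work; the variable names are ours.

```python
# Figures as quoted by Estelle, both for 1,000 sentences.
HOURS_IN_2005 = 40
MILLISECONDS_TODAY = 10

# Convert 40 hours to milliseconds: hours -> minutes -> seconds -> ms.
milliseconds_in_2005 = HOURS_IN_2005 * 60 * 60 * 1000

# How many times faster the newer figure is.
speedup = milliseconds_in_2005 // MILLISECONDS_TODAY

print(f"{milliseconds_in_2005:,} ms then vs. {MILLISECONDS_TODAY} ms now")
print(f"Roughly {speedup:,} times faster")
```

Forty hours is 144 million milliseconds, so on these numbers the same batch of sentences is handled about 14.4 million times faster.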
As Richard Anderson famously says in the 1970s TV series The Six Million Dollar Man, “We have the technology.” Now it’s just about waiting for the collecting and analyzing of data. How long that will take remains unknown, according to Estelle. But cautious estimates put such a device in our hands within a decade.
While app creators like Bossier and giant companies like Google and Facebook don’t want to build their own versions of the Biblical Tower of Babel, they do want to put an end to babbling. They envision a world where we all communicate about medicine, about politics, about ideas.
And that world might not be far off.
Editor's Note: We updated this story on April 4, 2014, to accurately describe the Vocre Translate technology.