The Turing Test Measures Something, But It’s Not “Intelligence”

A computer program mimicked human conversation so well that it was mistaken for a real live human, but “machine intelligence” still has a long way to go

A slate sculpture of Alan Turing by artist Stephen Kettle sits at the Bletchley Park National Codes Centre in Great Britain. Photo: Courtesy of Flickr user LEOL30

Alan Turing, one of the fathers of the computer age, was an extraordinarily clever man. So clever, in fact, that he understood that the term “machine intelligence” was just about meaningless. Better, he reasoned, to talk about what a machine can actually do: Can it talk? Can it hold down a conversation? At least that is something we can attempt to study. Turing eventually proposed what has come to be known as the “Turing test”: If a judge can’t tell which of two hidden entities is a human and which is a machine, the machine has “passed” the test – which is exactly what is said to have happened this past Saturday in London.

“We are… proud to declare that Alan Turing’s test was passed for the first time,” one of the organizers, Kevin Warwick of the University of Reading, said as the results were announced. The winning chatbot goes by the name of “Eugene Goostman,” a computer program that emulates the personality of a 13-year-old Ukrainian boy. “Eugene” managed to convince 33 percent of the judges that it was human at Saturday’s event, held at the Royal Society’s offices in London on the 60th anniversary of Turing’s death. (Turing, who was gay, was convicted of gross indecency in 1952 and accepted hormonal “treatment” as an alternative to imprisonment. Two years later he died from cyanide poisoning in an apparent suicide.)

But a word of caution is in order. “Intelligence” has always been a slippery subject, and the Turing test in particular has long been fraught with controversy. Turing described how it would work in a 1950 paper titled “Computing Machinery and Intelligence.” He took the idea from a traditional Victorian parlor game, in which you try to figure out whether the person hidden behind a curtain is a man or a woman, just by asking questions. (The answers had to be written down, because the voice would be a giveaway.) Here’s how Turing’s version works: A judge sits in front of two curtains, with no way of knowing what’s behind them. Behind one curtain is a human; behind the other is a computer. The judge can ask questions of either of the two hidden entities. Based on the responses, the judge tries to figure out which hidden entity is the human and which is the machine. (Turing envisioned the conversation as being mediated by teletype machines; today, we can use any kind of electronic, text-based interface, like the kind used in Internet chat rooms or instant messaging.)
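Stripped to its essentials, the protocol is simple enough to sketch in a few lines of code. What follows is a toy illustration, not anyone’s actual test harness; the machine_reply and human_reply callables are hypothetical stand-ins for the hidden entities, and the person at the keyboard plays the judge:

```python
import random

def imitation_game(machine_reply, human_reply, rounds=5):
    """Toy sketch of Turing's imitation game over a text channel.

    machine_reply and human_reply map a question string to an answer
    string; the judge at the keyboard must decide which hidden
    entity, A or B, is the machine.
    """
    # Shuffle the assignment so the judge can't rely on ordering.
    slots = {"A": machine_reply, "B": human_reply}
    if random.random() < 0.5:
        slots = {"A": human_reply, "B": machine_reply}

    for _ in range(rounds):
        for label in ("A", "B"):
            question = input(f"Your question for {label}: ")
            print(f"  {label} replies: {slots[label](question)}")

    guess = input("Which entity is the machine, A or B? ").strip().upper()
    return slots.get(guess) is machine_reply  # True = correct identification
```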

Turing speculated that by the year 2000 “an average interrogator will not have more than 70 per cent chance of making the right identification” – that is, computer programs would stymie the judges at least 30 percent of the time – after five minutes of questioning. The “five minutes” is important. Turing didn’t present a time limit as an inherent part of the test, and one could argue that for a machine to really pass, it ought to be able to handle any amount of questioning. Presumably the five-minute criterion was an arbitrary but practical limit. The year 2000 came and went, with chatbots making only halting progress. (In a more sober moment, responding to a question from a BBC interviewer in 1952, Turing said it would be 100 years before a machine passed the test.)
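By that yardstick, the arithmetic behind Saturday’s headline is straightforward. Here is a toy calculation; the figure of 10 judges fooled out of 30 is an assumption chosen only to reproduce the reported 33 percent, since I am working from the announced percentage rather than the raw tallies:

```python
def deception_rate(fooled, judges, benchmark=0.30):
    """Compare a chatbot's deception rate with Turing's 30 percent figure.

    Note that Turing's number was a prediction about the year 2000,
    not a pass/fail bar built into the test itself.
    """
    rate = fooled / judges
    return rate, rate >= benchmark

# Illustrative numbers only: a bot that fools 10 of 30 judges.
rate, clears = deception_rate(fooled=10, judges=30)
print(f"deception rate: {rate:.0%}; clears 30% benchmark: {clears}")
# prints: deception rate: 33%; clears 30% benchmark: True
```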

Back in 2012, I was a judge in a “Turing test marathon,” the largest set of Turing tests ever conducted at one time; it was held at Bletchley Park, in England, the site of Turing’s vital code-breaking work during the Second World War. (It was organized by the same team that ran Saturday’s event, and an earlier version of Eugene was the winner that time, too.) The setup for Saturday’s event was the same as in 2012: The judges typed their questions at a computer, then waited for the replies to appear on their screens; the chatbots, along with the “hidden humans,” were in another room, out of sight.

The first thing I became hyper-conscious of was that when you’re a judge in a Turing test, five minutes goes by pretty fast. And the shorter the conversation, the greater the computer’s advantage; the longer the interrogation, the higher the probability that the computer will give itself away. I like to call this the mannequin effect: Have you ever apologized to a department store mannequin, assuming you had just bumped into a live human being? If the encounter lasts only a fraction of a second, with you facing the other way, you may imagine you just brushed up against a human. The longer the encounter, the more obvious the mannequin-ness of the mannequin.

It’s the same with chatbots. An exchange of hellos reveals nothing – but the further you get into it, the more problems arise. Chatbots, I found, seem prone to changing the subject for no reason. Often, they can’t answer simple questions. At the risk of sounding vague, they just don’t sound human. In one of my conversations in 2012, I typed in a simple joke – and the entity I was conversing with instantly changed the subject to hamburgers. (Computer scientist Scott Aaronson recently had a similar experience when he chatted with Eugene via the bot’s website. Aaronson asked Eugene how many legs a camel has; it replied, “Something between 2 and 4. Maybe, three? :-)))” Later, when Aaronson asked how many legs an ant has, Eugene coughed up the exact same reply, triple-smiley and all.)

Note also that Eugene doesn’t emulate a native-English-speaking adult; it pretends to be a young and somewhat flippant Ukrainian teen, conversing in reasonably good (but far from perfect) English. As Vladimir Veselov, one of the program’s developers, told Mashable.com: “We spent a lot of time developing a character with a believable personality.” Although Eugene will engage anyone on any topic, his age “makes it perfectly reasonable that he doesn’t know everything.” Eugene doesn’t come right out and announce his age and nationality, but he’ll reveal them if asked – and the end result may be a certain amount of leniency from the judges, especially regarding English grammar and word use. (I’m assuming most of the judges on Saturday were native English speakers, though I don’t know this for certain.) The tables would likely be turned if Eugene were ever to encounter a native Ukrainian speaker as a judge.

The struggle to build a talking machine highlights just how complex language is. It’s not just a question of talking – you have to talk about something, what you say has to make sense, and it has to make sense in the context of what the other person has just said. For us, it’s easy; for computers, not so much. And so chatbots rely on an assortment of tricks: memorizing megabytes of canned responses, or scouring the Internet for dialogue that approximates the conversation they’re currently in the midst of. In other words, what a machine lacks in intelligence it may be able to make up for in raw computing power. This is why Google or Siri (the iPhone’s personal assistant) can seem so smart to us: Siri may not have a “mind,” but it has access to such a vast database of information that it can act as though it does. It was the same kind of brute-force approach that allowed IBM’s “Watson” to win at Jeopardy! in 2011.
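The oldest of those tricks – keyword-matched canned responses – dates back to ELIZA in the 1960s, and it is easy to caricature. The sketch below is a toy in that tradition, not Eugene’s actual code, but it shows how a bot with no understanding at all can still produce Eugene-like behavior, right down to repeating the same stock answer to Aaronson’s camel and ant questions:

```python
import re

# Canned responses keyed to patterns: no understanding, just matching.
CANNED = [
    (re.compile(r"how many legs", re.I),
     "Something between 2 and 4. Maybe, three? :-)))"),
    (re.compile(r"\bwhere\b.*\bfrom\b", re.I),
     "I live in Ukraine. Ever been there?"),
    (re.compile(r"\bhello\b|\bhi\b", re.I),
     "Hi there! Nice to meet you."),
]

# When nothing matches, change the subject rather than admit confusion.
FALLBACKS = [
    "Anyway... what do you think about hamburgers?",
    "My guinea pig says hello, by the way.",
]

def reply(question, turn=0):
    for pattern, answer in CANNED:
        if pattern.search(question):
            return answer  # same stock reply every time the pattern fires
    return FALLBACKS[turn % len(FALLBACKS)]

print(reply("How many legs does a camel have?"))
print(reply("How many legs does an ant have?"))  # identical canned answer
```

Scale that lookup table up to megabytes of patterns and add a web-scale retrieval step, and you have the brute-force recipe described above.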

All of this raises a crucial question: What is it, exactly, that the Turing test is measuring? Some critics have suggested that it rewards trickery rather than intelligence. NYU psychologist Gary Marcus, writing at NewYorker.com, says Eugene succeeds “by executing a series of ‘ploys’ designed to mask the program’s limitations.” Stevan Harnad, a psychologist and computer scientist at the University of Quebec in Montreal, was even more skeptical, telling The Guardian that it was “complete nonsense” to claim that Eugene had passed the Turing test. (To his credit, Turing was well aware of this issue; he called his idea “the imitation game,” and spoke of intelligence only sparingly.) Even more awkwardly, the computer, unlike the human, is compelled to deceive. “The Turing Test is really a test of being a successful liar,” Pat Hayes, a computer scientist at the Institute for Human and Machine Cognition in Pensacola, Florida, told me following the 2012 Turing test marathon. “If you had something that really could pass Turing’s imitation game, it would be a very successful ‘human mimic.’”

And “human” is the other key point: Isn’t it possible that there are other kinds of intelligence in the world, beyond the kind displayed by our species? A truly intelligent machine would have countless practical applications, but why focus on creating more “people”? After all, we have plenty of people already. As the linguist Noam Chomsky has pointed out, when we strive to build a machine that moves underwater, we don’t require it to “swim” – and a submarine is no less of an achievement for its inability to do the backstroke.

Yes, Eugene is impressive, at least in small bursts. And yet, even the best chatbots stumble on questions that a child half Eugene’s pretend-age could handle breezily. Perhaps not surprisingly, most AI researchers spend little time obsessing over the Turing test. Machine intelligence is, in fact, moving forward, and rather swiftly. Voice-to-text software, which was fairly pathetic just a few years ago, is rapidly improving, as are language-translation programs. Amazon often has a pretty good idea of what you want to buy even before you do. And Google’s self-driving car would have been mere fantasy a decade ago. But conversation, as we keep re-discovering, is really hard, and it is not likely to be the frontier on which AI shines most brightly. For now, if you’re looking for someone to chat with, I recommend a real human.

Dan Falk is a science journalist based in Toronto.
