Now the Turing Test Goes Visual

A proposed test would have computer programs not only pick out what is in a photo but also what is happening in it

[Image: a street scene in Pushkar, India] A computer that passes the new test would be able to say which people in this scene from Pushkar, India, are carrying objects and which are riding bikes. (Atlantide Phototravel/Corbis)

Facebook’s algorithms can pick your face out of a crowd (or try to, at least), but they still can’t tell whether you are posing in a family portrait or drinking with buddies—they can’t tell how you are interacting with others. In the future, though, computers may be able to do just that. Now researchers have proposed a way to figure out just how smart computers are at visual identification. They call their test a visual Turing test, after the computer scientist Alan Turing’s test of whether a computer can display human-like intelligence.

The popular perception of the test is that it's used to distinguish humans from computers—and one version is used to that effect, when you complete a CAPTCHA to sign up for a new email account. But artificial intelligence researchers really think of the test as a way to measure how far computer intelligence has advanced.

“There have been some impressive advances in computer vision in recent years,” Stuart Geman, a mathematics professor at Brown University and one of the researchers proposing the new evaluation, says in a press statement. “We felt that it might be time to raise the bar in terms of how these systems are evaluated and benchmarked.”

Instead of simply recognizing that an image shows two people, the test sees if computers can figure out that the two people are having a conversation or even an argument. Currently, researchers use publicly available data sets to test their programs—MIT has LabelMe, for example, which uses crowdsourcing to identify the "car," "tree," and "building" in images. To improve on this and offer a greater challenge, researchers based at Brown came up with a framework for a standardized visual Turing test.

Lee Gomes reports for IEEE Spectrum:

Their proposed method calls for human test-designers to develop a list of certain attributes that a picture might have, like whether a street scene has people in it, or whether the people are carrying anything or talking with each other. Photographs would first be hand-scored by humans on these criteria; a computer vision system would then be shown the same picture, without the “answers,” to determine if it was able to pick out what the humans had spotted.

Initially, the questions would be rudimentary, asking if there is a person in a designated region of the picture, for example. But the questions would grow in complexity as programs became more sophisticated; a more complicated question might involve the nature of an interaction between different people in the picture.
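The evaluation loop Gomes describes can be sketched in a few lines. This is a minimal illustration, not the paper's actual framework: the names (`Question`, `human_answers`, the toy system) are hypothetical, and the real test generates questions adaptively rather than from a fixed list.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Question:
    text: str    # e.g. "Is there a person in region A?"
    region: str  # the designated region of the picture


def score(questions: List[Question],
          human_answers: Dict[str, bool],
          ask_system: Callable[[Question], bool]) -> float:
    """Fraction of questions where the vision system matches
    the human-scored ground truth (the 'answers' it never sees)."""
    correct = sum(1 for q in questions
                  if ask_system(q) == human_answers[q.text])
    return correct / len(questions)


# Toy usage: a rudimentary question and a more complex one about
# an interaction, answered by a "system" that always says yes.
questions = [
    Question("Is there a person in region A?", "A"),
    Question("Are the people in region A talking to each other?", "A"),
]
human_answers = {
    "Is there a person in region A?": True,
    "Are the people in region A talking to each other?": False,
}
print(score(questions, human_answers, lambda q: True))  # 0.5
```

The key property, as described in the excerpt, is that the system is scored against human annotations it never sees, and the question list can grow in complexity as programs improve.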

The team described the test in Proceedings of the National Academy of Sciences. As of now, Geman says, no computer system could pass even the simple versions of the new test. But they will in the future. Since any photo has many possible attributes, researchers would have to come up with innovative ways for their computers to learn to assess photos.

“As researchers, we tend to ‘teach to the test,’” Geman says in the statement. “If there are certain contests that everybody’s entering and those are the measures of success, then that’s what we focus on. So it might be wise to change the test, to put it just out of reach of current vision systems.”