Can Computers Decipher a 5,000-Year-Old Language?

A computer scientist is helping to uncover the secrets of the inscribed symbols of the Indus

Over the decades, archaeologists have turned up a great many artifacts from the Indus civilization, including stamp sealings, amulets and small tablets. (Robert Harding / Photo Library)

The Indus civilization, which flourished throughout much of the third millennium B.C., was the most extensive society of its time. At its height, it encompassed an area of more than half a million square miles centered on what is today the India-Pakistan border. Remnants of the Indus have been found as far north as the Himalayas and as far south as Mumbai. It was the earliest known urban culture of the subcontinent and it boasted two large cities, one at Harappa and one at Mohenjo-daro. Yet despite its size and longevity, and despite nearly a century of archaeological investigations, much about the Indus remains shrouded in mystery.

What little we do know has come from archaeological digs that began in the 1920s and continue today. Over the decades, archaeologists have turned up a great many artifacts, including stamp sealings, amulets and small tablets. Many of these artifacts bear what appear to be specimens of writing—engraved figures resembling, among other things, winged horseshoes, spoked wheels, and upright fish. What exactly those symbols might mean, though, remains one of the most famous unsolved riddles in the scholarship of ancient civilizations.

There have been other tough codes to crack in history. Stumped Egyptologists caught a lucky break with the discovery of the famed Rosetta stone in 1799, which contained text in both Egyptian and Greek. The study of Mayan hieroglyphics languished until a Russian linguist named Yury Knorozov made clever use of contemporary spoken Mayan in the 1950s. But there is no Rosetta stone of the Indus, and scholars don’t know which, if any, languages may have descended from that spoken by the Indus people.

About 22 years ago, in Hyderabad, India, an eighth-grade student named Rajesh Rao turned the page of a history textbook and first learned about this fascinating civilization and its mysterious script. In the years that followed, Rao’s schooling and profession took him in a different direction—he wound up pursuing computer science, which he teaches today at the University of Washington in Seattle—but he monitored Indus scholarship carefully, keeping tabs on the dozens of failed attempts at making sense of the script. Even as he studied artificial intelligence and robotics, Rao amassed a small library of books and monographs on the Indus script, about 30 of them. On a nearby bookshelf, he also kept the cherished eighth-grade history textbook that introduced him to the Indus.

“It was just amazing to see the number of different ideas people suggested,” he says. Some scholars claimed the writing was a sort of Sumerian script; others situated it in the Dravidian family; still others thought it was related to a language of Easter Island. Rao came to appreciate that this was “probably one of the most challenging problems in terms of ancient history.”

As attempt after attempt failed at deciphering the script, some experts began to lose hope that it could be decoded. In 2004, three scholars argued in a controversial paper that the Indus symbols didn’t have linguistic content at all. Instead, the symbols may have been little more than pictograms representing political or religious figures. The authors went so far as to suggest that the Indus was not a literate civilization at all. For some in the field, the whole quest of trying to find language behind those Indus etchings began to resemble an exercise in futility.

A few years later, Rao entered the fray. Until then, people studying the script were archaeologists, historians, linguists or cryptologists. But Rao decided to coax out the secrets of the Indus script using the tool he knew best—computer science.

On a summer day in Seattle, Rao welcomed me into his office to show me how he and his colleagues approached the problem. He set out a collection of replicas of clay seal impressions that archaeologists have turned up from Indus sites. They are small—like little square chocolates—and most of them feature an image of an animal beneath a series of Indus symbols. Most samples of the Indus script are miniatures like these, bearing only a few characters; no grand monoliths have been discovered. Scholars are uncertain of the function of the small seals, Rao told me, but one theory is that they may have been used to certify the quality of traded goods. Another suggests that the seals might have been a way of ensuring that traders paid taxes upon entering or leaving a city—many seals have been found among the ruins of gate houses, which might have functioned like ancient toll booths.

Rao and his colleagues didn’t seek to work miracles; they knew that they didn’t have enough information to decipher the ancient script. But they hypothesized that by using computational methods, they could at least begin to establish what sort of writing the Indus script was: did it encode language, or not? They did this using a concept called “conditional entropy.”

Despite the imposing name, conditional entropy is a fairly simple concept: it is a measure of how much randomness remains in a sequence once you know what came before, or, put another way, how hard it is to guess the next symbol given the current one. Consider our alphabet. If you were to take Scrabble tiles and toss them in the air, you might find any old letter turning up after any other. But in actual English words, certain letters are more likely to occur after others. A q in English is almost always followed by a u. A t may be followed by an r or an e, but is less likely to be followed by an n or a b.
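
For readers who want to see the idea in concrete terms, here is a minimal sketch in Python. It is an illustration only, not the code Rao's team used: the function name `conditional_entropy` and the toy strings are assumptions made for this example. It uses the standard textbook definition, counting how often each symbol follows each other symbol and reporting the average uncertainty about the next symbol, in bits.

```python
import math
from collections import Counter


def conditional_entropy(sequence):
    """Estimate H(next symbol | current symbol) in bits.

    Hypothetical helper for illustration; it looks only at
    adjacent pairs of symbols in `sequence`.
    """
    pairs = list(zip(sequence, sequence[1:]))
    if not pairs:
        return 0.0

    pair_counts = Counter(pairs)                    # how often each (current, next) pair occurs
    current_counts = Counter(cur for cur, _ in pairs)  # how often each symbol leads a pair
    total = len(pairs)

    entropy = 0.0
    for (cur, _nxt), count in pair_counts.items():
        p_pair = count / total                      # P(current, next)
        p_next_given_cur = count / current_counts[cur]  # P(next | current)
        entropy -= p_pair * math.log2(p_next_given_cur)
    return entropy


# A rigid pattern: every q is followed by u and every u by q,
# so knowing the current symbol removes all uncertainty.
rigid = conditional_entropy("ququququ")

# A looser pattern: after an a comes either a or b, equally often,
# and likewise after a b, so one bit of uncertainty remains.
loose = conditional_entropy("aabba")

print(rigid, loose)
```

A sequence in which each symbol dictates the next scores zero, while one in which anything can follow anything scores high. Counts taken from a handful of symbols are only rough estimates, so a real comparison would need far more text than these toy strings.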

