Can Computers Decipher a 5,000-Year-Old Language?

A computer scientist is helping to uncover the secrets of the inscribed symbols of the Indus

Indus script
Over the decades, archaeologists have turned up a great many artifacts from the Indus civilization, including stamp sealings, amulets and small tablets. Robert Harding / Photo Library

The Indus civilization, which flourished throughout much of the third millennium B.C., was the most extensive society of its time. At its height, it encompassed an area of more than half a million square miles centered on what is today the India-Pakistan border. Remnants of the Indus have been found as far north as the Himalayas and as far south as Mumbai. It was the earliest known urban culture of the subcontinent and it boasted two large cities, one at Harappa and one at Mohenjo-daro. Yet despite its size and longevity, and despite nearly a century of archaeological investigations, much about the Indus remains shrouded in mystery.

What little we do know has come from archaeological digs that began in the 1920s and continue today. Over the decades, archaeologists have turned up a great many artifacts, including stamp sealings, amulets and small tablets. Many of these artifacts bear what appear to be specimens of writing—engraved figures resembling, among other things, winged horseshoes, spoked wheels, and upright fish. What exactly those symbols might mean, though, remains one of the most famous unsolved riddles in the scholarship of ancient civilizations.

There have been other tough codes to crack in history. Stumped Egyptologists caught a lucky break with the discovery of the famed Rosetta stone in 1799, which contained text in both Egyptian and Greek. The study of Mayan hieroglyphics languished until a Russian linguist named Yury Knorozov made clever use of contemporary spoken Mayan in the 1950s. But there is no Rosetta stone of the Indus, and scholars don’t know which, if any, languages may have descended from that spoken by the Indus people.

About 22 years ago, in Hyderabad, India, an eighth-grade student named Rajesh Rao turned the page of a history textbook and first learned about this fascinating civilization and its mysterious script. In the years that followed, Rao’s schooling and profession took him in a different direction—he wound up pursuing computer science, which he teaches today at the University of Washington in Seattle—but he monitored Indus scholarship carefully, keeping tabs on the dozens of failed attempts at making sense of the script. Even as he studied artificial intelligence and robotics, Rao amassed a small library of books and monographs on the Indus script, about 30 of them. On a nearby bookshelf, he also kept the cherished eighth-grade history textbook that introduced him to the Indus.

“It was just amazing to see the number of different ideas people suggested,” he says. Some scholars claimed the writing was a sort of Sumerian script; others situated it in the Dravidian family; still others thought it was related to a language of Easter Island. Rao came to appreciate that this was “probably one of the most challenging problems in terms of ancient history.”

As attempt after attempt failed at deciphering the script, some experts began to lose hope that it could be decoded. In 2004, three scholars argued in a controversial paper that the Indus symbols didn’t have linguistic content at all. Instead, the symbols may have been little more than pictograms representing political or religious figures. The authors went so far as to suggest that the Indus was not a literate civilization at all. For some in the field, the whole quest of trying to find language behind those Indus etchings began to resemble an exercise in futility.

A few years later, Rao entered the fray. Until then, people studying the script were archaeologists, historians, linguists or cryptologists. But Rao decided to coax out the secrets of the Indus script using the tool he knew best—computer science.

Fascinated by the Indus civilization since the eighth grade, Rajesh Rao is using computer science and a concept called "conditional entropy" to help decode the Indus script. Courtesy of David Zax
Over the decades, archaeologists have turned up a great many artifacts from the Indus civilization, including stamp sealings, amulets and small tablets. Robert Harding / Robert Harding World Imagery / Corbis
Rao and his collaborators published their findings in the journal Science in May. They didn't decipher the language but their findings sharpened the understanding of it. Robert Harding / Robert Harding World Imagery / Corbis
Rao and his colleagues are now looking at longer strings of characters than they analyzed in the Science paper. Finding patterns would in turn help determine which language families the script might belong to. Courtesy of David Zax

On a summer day in Seattle, Rao welcomed me into his office to show me how he and his colleagues approached the problem. He set out a collection of replicas of clay seal impressions that archaeologists have turned up from Indus sites. They are small—like little square chocolates—and most of them feature an image of an animal beneath a series of Indus symbols. Most samples of the Indus script are miniatures like these, bearing only a few characters; no grand monoliths have been discovered. Scholars are uncertain of the function of the small seals, Rao told me, but one theory is that they may have been used to certify the quality of traded goods. Another suggests that the seals might have been a way of ensuring that traders paid taxes upon entering or leaving a city—many seals have been found among the ruins of gate houses, which might have functioned like ancient toll booths.

Rao and his colleagues didn’t seek to work miracles—they knew that they didn't have enough information to decipher the ancient script—but they hypothesized that by using computational methods, they could at least begin to establish what sort of writing the Indus script was: did it encode language, or not? They did this using a concept called “conditional entropy.”

Despite the imposing name, conditional entropy is a fairly simple concept: it is a measure of the amount of randomness in a sequence. Consider our alphabet. If you were to take Scrabble tiles and toss them in the air, you might find any old letter turning up after any other. But in actual English words, certain letters are more likely to occur after others. A q in English is almost always followed by a u. A t may be followed by an r or e, but is less likely to be followed by an n or a b.

Rao and his collaborators—an international group including computer scientists, astrophysicists and a mathematician—used a computer program to measure the conditional entropy of the Indus script. Then they measured the conditional entropy of other types of systems—natural languages (Sumerian, Tamil, Sanskrit, and English), an artificial language (the computer programming language Fortran) and non-linguistic systems (human DNA sequences, bacterial protein sequences, and two artificial datasets representing high and low extremes of conditional entropy). When they compared the amount of randomness in the Indus script with that of the other systems, they found that it most closely resembled the rates found in the natural languages. They published their findings in May in the journal Science.

If it looks like a language, and it acts like a language, then it probably is a language, their paper suggests. The findings don’t decipher the script, of course, but they do sharpen our understanding of it, and have lent reassurance to those archaeologists who had been working under the assumption that the Indus script encodes language.

After publishing the paper, Rao got a surprise. The question of which language family the script belongs to, it turns out, is a sensitive one: because of the Indus civilization’s age and significance, many contemporary groups in India would like to claim it as a direct ancestor. For instance, the Tamil-speaking Indians of the south would prefer to learn that the Indus script was a kind of proto-Dravidian, since Tamil is descended from proto-Dravidian. Hindi speakers in the north would rather it be an old form of Sanskrit, an ancestor of Hindi. Rao’s paper doesn’t conclude which language family the script belongs to, though it does note that the conditional entropy is similar to Old Tamil—causing some critics to summarily “accuse us of being Dravidian nationalists,” says Rao. “The ferocity of the accusations and attacks was completely unexpected."

Rao sometimes takes relief in returning to the less ferociously contested world of neuroscience and robotics. But the call of the Indus script remains alluring, and “what used to be a hobby is now monopolizing more than a third of my time,” he says. Rao and his colleagues are now looking at longer strings of characters than they analyzed in the Science paper. “If there are patterns,” says Rao, “we could come up with grammatical rules. That would in turn give constraints to what kinds of language families” the script might belong to.

He hopes that his future findings will speak for themselves, inciting less rancor from opponents rooting for one region of India versus another. For his part, when Rao talks about what the Indus script means to him, he tends to speak in terms of India as a whole. “The heritage of India would be considerably enriched if we were able to understand the Indus civilization,” he says. Rao and his collaborators are working on it, one line of source code at a time.

Get the latest History stories in your inbox?

Click to visit our Privacy Statement.