In 1984, the National Biomedical Research Foundation launched a free online database containing over 283,000 protein sequences. Today the Protein Information Resource allows scientists all over the world to take an unknown protein, compare it to the thousands of known proteins in the database, and determine the ways in which it is alike and different. From that data they can quickly and accurately deduce a protein’s evolutionary history and its relationship to various forms of life.
The humble origins of this massive online database start long before the internet. It all began with the Atlas of Protein Sequence and Structure, a 1965 printed book containing the 65 then-known protein sequences, compiled by a woman named Margaret Dayhoff. To create her Atlas, Dayhoff applied cutting-edge computer technologies to find solutions to biological questions, helping usher in the birth of a new field we now call bioinformatics. Originally a chemist, Dayhoff harnessed new and evolving technologies of the post-World War II computing era to pioneer tools that chemists, biologists and astronomers alike could use in the cross-disciplinary study of the origins of life on Earth.
Dayhoff (then Margaret Oakley) was born in Philadelphia on March 11, 1925 to Ruth Clark, a high school math teacher, and Kenneth Oakley, a small business owner. When she was ten, her family moved to New York City. There, she attended public schools, eventually becoming the valedictorian of Bayside High in 1942. She attended Washington Square College of New York University on a scholarship, graduating magna cum laude in mathematics just three years later in 1945.
That same year, Dayhoff entered Columbia University to get her PhD in quantum chemistry under the mentorship of prominent chemist and World War II operations researcher George Kimball. Her acceptance was a rarity for the time. After WWII, more men entered the sciences, and chemistry became even more male-dominated than in the previous decade, with only five percent of chemistry PhDs going to women, down from eight percent.
During Dayhoff’s time at the university, Columbia was a hotbed for computing technology. It boasted some of the first computing laboratories in the U.S., and in 1945 became home to the IBM Watson Scientific Laboratory led by astronomer W. J. Eckert. The Watson lab had first served as a computing center for the Allies in the final months of WWII. After the war, it became a site for developing some of the first supercomputers, including the Selective Sequence Electronic Calculator (SSEC), which Eckert later used to calculate lunar orbits for the Apollo missions.
With this technology at her fingertips, Dayhoff combined her interest in chemistry with computing by way of punched-card machines—essentially early digital computers. The machines allowed Dayhoff to automate her calculations, storing an algorithm on one set of cards and data on another. Using the machine, she was able to process calculations far more quickly and accurately than by hand.
Dayhoff’s particular subject of interest was polycyclic organic compounds, which are molecules that consist of three or more atoms joined in a closed ring. She used the punched-card machines to perform a large number of calculations on the molecules’ resonance energies (the difference between a molecule’s potential energy of a specific state and average state) to determine the probability of molecular bonding and bond distances.
Dayhoff graduated with her doctoral degree in quantum chemistry in just three years. The research that she undertook as a graduate student was published, with Kimball as coauthor, in 1949 in the Journal of Chemical Physics under the simple title “Punched Card Calculation of Resonance Energies.”
In 1948, the year she completed her doctorate, Dayhoff married Edward Dayhoff, a student in experimental physics whom she had met at Columbia. In 1952, the pair moved to Washington, D.C., where Edward took up a post at the National Bureau of Standards and Dayhoff gave birth to the first of her two daughters, Ruth. Dayhoff soon dropped out of research to become a stay-at-home mom to Ruth and her younger daughter Judith, save for a two-year postdoctoral position at the University of Maryland.
When she returned to research and began applying for grants to fund her work in 1962, she was met with a shock. The National Institutes of Health rejected a grant application that listed Dayhoff as principal investigator, with the explanation that “[Dayhoff] has been out of really intimate touch for some time … with this complicated and rapidly advancing area,” as historian Bruno Strasser writes in his upcoming book Collecting Experiments: Making Big Data Biology. This kind of uphill climb for women who have taken time off to raise children is just one of the ways that scientific institutions hindered—and continue to hinder—women’s advancement.
Despite the NIH’s lack of support, Dayhoff was about to enter the most consequential decade of her career. In 1960, she accepted a fateful invitation from Robert Ledley, a pioneering biophysicist whom she met through her husband, to join him at the National Biomedical Research Foundation in Silver Spring, Maryland. Ledley knew Dayhoff’s computer skills would be crucial to the foundation’s goal of combining the fields of computing, biology and medicine. She would serve as his associate director for 21 years.
Once in Maryland, Dayhoff had free rein to use Georgetown University’s brand-new IBM 7090 mainframe. The IBM system was designed for handling complex applications, with computing speeds six times faster than previous models. This speed had been achieved by replacing slower, bulkier vacuum tube technology with faster, more efficient transistors (the components that produce the 1s and 0s of computers). Using the mainframe, Dayhoff and Ledley started searching for and comparing peptide sequences with FORTRAN programs that they had written themselves in an attempt to assemble partial sequences into a complete protein.
Dayhoff and Ledley’s commitment to applying computer analysis to biology and chemistry was unusual. “The culture of statistical analysis, let alone of digital computing, were completely foreign to most [biochemists],” explains Strasser in an interview with Smithsonian.com. “Some even prided themselves in not being ‘theorists,’ which is how they understood data analysis using mathematical models.”
One scientific discipline where Dayhoff’s computer savvy was more appreciated, however, was astronomy. This interest in computing was thanks in part to W. J. Eckert, who in 1940 had used IBM punched-card machines to predict planetary orbits. And in the 1960s, American interest in space exploration was in full swing, which meant funding for NASA. At the University of Maryland, Dayhoff met spectroscopist Ellis Lippincott, who brought her into a six-year collaboration with Carl Sagan at Harvard in 1961. The three of them developed thermodynamic models of the chemical makeup of matter, and Dayhoff devised a computer program that could calculate equilibrium concentrations of gases in planetary atmospheres.
With Dayhoff’s program, she, Lippincott and Sagan were able to choose an element to analyze, allowing them to investigate many different atmospheric compositions. Ultimately, they developed atmospheric models for Venus, Jupiter, Mars and even a primordial atmosphere of Earth.
While exploring the skies, Dayhoff also took up a question that researchers had been exploring since at least the 1950s: what is the function of proteins? Sequencing proteins was a means of getting at the answer, but sequencing individual proteins was highly inefficient. Dayhoff and Ledley took a different approach. Instead of analyzing proteins in isolation, they compared proteins derived from different plant and animal species. “By comparing the sequences of the same protein in different species, one could observe which parts of the sequence were always identical in all species, a good indication that this part of the sequence was crucial for the good of the protein,” Strasser says.
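The comparison Strasser describes can be illustrated with a minimal Python sketch. It is not a reconstruction of Dayhoff and Ledley's FORTRAN programs; it assumes the sequences have already been aligned to the same length, and the function name `conserved_positions` and the three toy sequences are invented for illustration rather than taken from any real protein.

```python
def conserved_positions(aligned):
    """Return the indices where every sequence carries the same residue.

    `aligned` maps a species name to an amino-acid string; all strings
    are assumed to be pre-aligned and therefore equal in length.
    """
    seqs = list(aligned.values())
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    # A position is conserved when the set of residues seen there has size 1.
    return [i for i in range(length) if len({s[i] for s in seqs}) == 1]


# Hypothetical toy data: made-up eight-residue fragments, not real proteins.
toy = {
    "species_a": "MKVLAGHC",
    "species_b": "MKILAGHC",
    "species_c": "MRVLSGHC",
}

print(conserved_positions(toy))
```

In this toy case the positions that never vary are the ones a researcher would flag as likely to matter for the protein's function.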
Dayhoff probed deeper, looking to proteins’ shared history. She analyzed not only the parts that were the same across species, but also their variations. “They took these differences as a measure of evolutionary distances between species, which allowed them to reconstruct phylogenetic trees,” Strasser explains.
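The idea of treating differences as distances can be sketched in a few lines of Python. This is a simplified stand-in, not Dayhoff's method: it assumes pre-aligned sequences of equal length and uses the plain fraction of mismatched positions as the distance. The names `mismatch_fraction` and `distance_table` and the toy sequences are hypothetical.

```python
from itertools import combinations


def mismatch_fraction(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and equally long")
    return sum(1 for x, y in zip(a, b) if x != y) / len(a)


def distance_table(aligned):
    """Distance for every unordered pair of species, keyed by sorted names."""
    return {
        (n1, n2): mismatch_fraction(aligned[n1], aligned[n2])
        for n1, n2 in combinations(sorted(aligned), 2)
    }


# Hypothetical toy data: made-up eight-residue fragments, not real proteins.
toy = {
    "species_a": "MKVLAGHC",
    "species_b": "MKILAGHC",
    "species_c": "MRVLSGHC",
}

table = distance_table(toy)
closest = min(table, key=table.get)  # the pair with the fewest differences
print(table)
print(closest)
```

A tree-building procedure would start by joining the closest pair and then repeat on the remaining groups; that step is omitted here to keep the sketch short.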
Dayhoff, always ready to harness the power of new technology, developed computerized methods to determine protein sequences. She ran a computer analysis of proteins in a wide variety of species, from the candida fungus to the whale. Then she used their differences to determine their ancestral relationships. In 1966, with the help of Richard Eck, Dayhoff created the first reconstruction of a phylogenetic tree.
In a 1969 Scientific American article, “Computer Analysis of Protein Evolution,” Dayhoff presented to the public one of these trees along with her research using computers for sequencing proteins. “Each protein sequence that is established, each evolutionary mechanism that is illuminated, each major innovation in phylogenetic history that is revealed will improve our understanding of the history of life,” she wrote. She was trying to show the life sciences community the potential of computerized models.
Her next goal was to collect all known proteins in one place where researchers could find sequences and compare them to others. Unlike today, when it’s easy to call up sources on an electronic database with merely a keyword, Dayhoff had to scour physical journals to find the proteins she was looking for. In many instances, that meant checking fellow researchers’ work for errors. Even with the aid of a computer, the work of collecting and cataloguing the sequences required copious amounts of time and a discerning scientific eye.
Not everyone saw value in what she was doing. To other researchers, Dayhoff’s work resembled the collection and cataloguing work of 19th-century natural history rather than the experimental work of the 20th-century scientist. “Collecting, comparing and classifying things of nature seemed old-fashioned to many experimental biologists in the second half of the 20th century,” Strasser says. He refers to Dayhoff as an “outsider.” “She contributed to a field that did not exist and thus had no professional recognition,” he says.
In 1965, Dayhoff first published her collection of the 65 known proteins in the Atlas of Protein Sequence and Structure, a printed version of her database. Eventually the data moved to magnetic tape, and now it lives online where researchers continue to use her data to find thousands more proteins. Other biomedical databases have joined the fray, including the Protein Data Bank, a collaborative collection of proteins and nucleic acids launched in 1971, and GenBank, the genetic sequence database launched in 1982. Dayhoff started a scientific revolution.
“Today, every single publication in experimental biology contains a combination of new experimental data and inferences drawn from comparisons with other data made available in a public database, an approach that Dayhoff started half a century ago,” Strasser says.
As bioinformatics grew, the tasks of collecting and computation largely fell to women. Dayhoff’s collaborators on the Atlas were all women except for Ledley. Like the women “computers” of NASA in the 1960s and the female codebreakers of World War II, these women were soon pushed to the margins of scientific practice. Referring to the “ENIAC girls” who programmed the first digital, general-purpose computer, historian of computing Jennifer Light writes that “it is within the confines of precisely such low-status occupational classifications that women were engaged in unprecedented work.”
In her biographical sketch of Dayhoff, Lois T. Hunt, who worked on the Atlas with her, wrote that Dayhoff believed her investigation into Earth’s primordial atmosphere might give her “the compounds necessary for the formation of life.” This, perhaps even more than computing, is what ties the disparate parts of Dayhoff’s scientific research together. From the tiny protein to the vast atmosphere, Dayhoff was searching for the secrets of life’s emergence on this planet. Though she didn’t unlock them all, she gave modern science the tools and methods to continue the search.