The rise of DNA testing through services like 23andMe shows that there’s a big market out there for family history.
Now, scientists have built on that interest by publishing what they believe is the largest genealogy database in the world, with a family tree that links 13 million people and stretches back more than five centuries.
As Jocelyn Kaiser reports for Science magazine, Yaniv Erlich, a computational geneticist at Columbia University, thought up the project seven years ago after receiving an email from a distant cousin through Geni.com, one of the many sites where people search for family ties.
With the support of Geni.com’s chief technology officer, Erlich downloaded the site’s public profiles, tens of millions of them. Though the site didn’t offer DNA data, the information included a person’s name, sex, date and place of birth, date of death and immediate relatives.
Nature wrote about Erlich’s project in its early stages back in 2013, and last year, the Atlantic’s Sarah Zhang reported that the researchers had released a preprint of the massive tree. Now, Kaiser writes, Erlich's team has published a study on their work in the journal Science. Using the data, they ended up with 5.3 million trees, the largest of which connects some 13 million relatives, mostly of European descent.
Since starting the project, Erlich has become the chief science officer of MyHeritage, a genealogy and DNA testing company that owns Geni.com. He did a Reddit Ask Me Anything last Friday on his findings, correcting misconceptions and explaining the methodology behind the project. He also noted that the most interesting part of the experience for him was figuring out how to translate all of the available data into something personal.
In an interview with National Geographic’s Nicole Wetsman, Erlich says that figuring out how to work with that data was also the most challenging part of the project. “Genomic datasets have specific tools, data structures, methods, but we didn’t have any of that for this. We were inventing the wheel as we went,” he says.
Ultimately, the researchers used mathematical graph theory to organize and verify the information, reports Laura Geggel for Live Science. They also compared the profiles with about 80,000 publicly available death certificates from Vermont over a 25-year period to check that the profiles uploaded to Geni.com did not represent only wealthy people.
The team then decided what information they wanted to look for to test the database, Wetsman writes.
They started looking at patterns and found fluctuations in life span, something they had anticipated. For example, they saw a drop in life span among young men during the Civil War and World Wars I and II, and a rise in childhood survival in the 1900s. They were also able to track migration, like the arrival of the Mayflower in 1620 in what is now Massachusetts, followed by an increase in births in that area.
Researchers also found that longevity has more to do with environment and behavior than genetics; in fact, the data revealed that genes account for only about 16 percent of the variation in life span. In an interview with Wetsman, however, Paola Sebastiani, professor of biostatistics at the Boston University School of Public Health, cautions against drawing conclusions from this data. “There’s a lot of confusion about the definitions of longevity,” she says.
Geneticist Peter Visscher of the University of Queensland in Brisbane, Australia, tells Kaiser that the data Erlich's team compiled does have the potential to provide insight into the role of genetics in diseases if the data is linked to health information.
The research team has already begun to combine the tree with information from DNA.Land, a site that crowdsources DNA data, so an even larger tree may be coming soon. Researchers predict that if the database could go back 65 generations, they would be able to complete the tree.