The Work Is Only Beginning on Understanding the Human Genome

Ten years ago, scientists released a map of our genetic blueprint. But, as Eric D. Green explains, there are many more mysteries left to unravel

Elizabeth Quill

June 2, 2013

Eric D. Green is the director of the National Human Genome Research Institute. Photo: Maggie Bartlett, NHGRI

A decade ago, an international research team completed an ambitious effort to read the 3 billion letters of genetic information found in every human cell. The program, known as the Human Genome Project, provided the blueprint for human life, an achievement that has been compared to landing a man on the moon.

Dr. Eric D. Green was involved from the very beginning, refining some of the key technologies used in the project. At that time, he was a postdoctoral fellow and a resident in pathology at Washington University in St. Louis. He carved out his 5 percent of the genome, focusing on the mapping of the DNA of chromosome 7. Today, Green is the director of the National Human Genome Research Institute, which advances the understanding of the human genome through genomics research.

Let’s go back to the mid- to late 1980s, when the idea for the Human Genome Project was first conceived. What was the motivation at the time?

It depends who you ask. Different people had different motivations. Keep in mind that the ’70s and early ’80s were the molecular biology revolution era. There were significant advances in methods that allowed us to isolate and study DNA in the laboratory.

In the U.S., for example, the Department of Energy got very interested in the notion of studying the genome because of interest in mutation, and the mutation process associated with some forms of energy, such as nuclear energy.

If you go to places like the National Institutes of Health, or you look at biomedical researchers and health-related researchers, they were very interested in being able to elucidate the genetic basis of disease. Among the many genetic diseases that were being considered, of course, was cancer.

A lot of other people across the biomedical research spectrum—even those working on model organisms, like flies and worms and yeast—recognized that if we could figure out how to comprehensively look at complex genomes, starting with flies and worms and yeast but then working our way up to humans, it would provide foundational information for understanding how the genome worked.

There was a coalescence of lots of different ideas that, with a backdrop of having incremental but important technological advances, made it seem that, while daunting, the problem of sequencing the human genome and determining the order of 3 billion letters was feasible.

Where did the material for the genome project come from? Whose genome was it?

When the genome project started, it was still pretty piecemeal. Different people were making different collections of DNA fragments called libraries, which are just cloned pieces of DNA. They would do it from anybody: Sometimes it would be the lab head, sometimes it would be the postdoctoral fellow or the grad student. They would just grab DNA, back when there were really no implications of that.

But then, when it finally came time to make the libraries that were going to be used for sequencing the human genome by the Human Genome Project, the best person for making those libraries was a scientist who worked at Roswell Park Cancer Institute in Buffalo, New York. [The team] got informed consent from about 10 or 20 anonymous blood donors, and then picked one of those at random, and that was the person. About 60 percent of the human genome sequence generated by the Human Genome Project was from one blood donor in Buffalo, New York.

But, you know what, it doesn’t matter. If you go across the human genome sequence generated by the Human Genome Project, it is like a mosaic. You may go for a hundred thousand letters and it may be that one person, from Buffalo. It might end up being that you’ll go the next hundred thousand and it will be somebody else. And the next hundred thousand, somebody else. All it served as was a reference. And since all humans are 99.9 percent identical at the sequence level, that first sequence doesn’t have to be a real person. It can just be a hypothetical reference of a person.
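To put the 99.9 percent figure in concrete terms, here is a rough back-of-the-envelope sketch. It uses only the two numbers quoted in this article (3 billion letters, 99.9 percent identical); the variable names are illustrative, and the result is an order-of-magnitude estimate, not a measured count.

```python
# Rough arithmetic using only figures quoted in the article.
GENOME_LETTERS = 3_000_000_000   # "3 billion letters"
DIFFERENT_PER_THOUSAND = 1       # 99.9 percent identical => about 0.1 percent different

# Integer arithmetic avoids floating-point rounding on 0.001.
differing_letters = GENOME_LETTERS * DIFFERENT_PER_THOUSAND // 1000

print(f"{differing_letters:,} letters")  # prints "3,000,000 letters"
```

In other words, a 0.1 percent difference still leaves on the order of 3 million letters that can vary between any one person and the reference.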

Of all that information, why did you choose to focus on chromosome 7 [the human genome has 23 pairs of chromosomes]?

It was somewhat arbitrary. We wanted to pick a chromosome that wasn’t too big. We didn’t want to pick one that was too small. We knew there was going to be a lot of work, so we picked a middle-sized chromosome.

We didn’t want to pick one that had a lot of people working on it already. At that point, the most famous gene on chromosome 7 was the cystic fibrosis gene, and that was discovered in 1989. And we had actually isolated some of that region and were doing some studies in a pilot fashion.

The truth is, we picked it because it wasn’t too big, wasn’t too small and wasn’t too crowded. That was an arbitrary way to start; by the time the genome project ended, most of the studies were being done genome-wide.

How did the work change over the project’s lifetime?

The whole story of genomics is one of technology development. If you trace where the huge advances were made, every one of them was associated with surges in technology. Early in the genome project, the surge was that we had better ways of isolating big pieces of DNA.

When we were sequencing smaller organism genomes—like Drosophila fruit flies—we basically industrialized the process of doing sequencing, making it more and more and more automated.

When the genome project began, the idea was, “Let’s sequence the genomes of flies and worms and yeast, all these smaller organisms, using the method of the day,” which was this method developed by Fred Sanger in 1977. The idea was they wouldn’t push the accelerator to start sequencing the human genome until a revolutionary new sequencing method became available. So there were a lot of efforts to develop new crazy ways of sequencing DNA.

When it came time, around 1997 or 1998, to actually think about starting to sequence the human genome, everybody said, “Maybe we don’t need to wait for a revolutionary method, maybe we have incrementally improved the old-fashioned method well enough that it can be used,” and indeed that is what was decided.

That said, since the genome project, the thing that has changed the face of genomics has been revolutionary new sequencing technologies that finally came on the scene by about 2005.

How have those improvements changed the cost and the time it takes for sequencing?

The Human Genome Project took six to eight years of active sequencing and, in terms of active sequencing, they spent about a billion dollars to produce the first human genome sequence. The day the genome project ended, we asked our sequencing groups, “All right, if you were going to go sequence a second human genome, hypothetically, how long would it take and how much would it cost?” With a back-of-the-envelope calculation, they said, “Wow, if you gave us another 10 to 50 million dollars, we could probably do it in three to four months.”

But now, if you go to where we are today, you can sequence a human genome in about a day or two. By the end of this year, it will be about a day. And it will only cost about $3,000 to $5,000.

What were the major findings from the first genome and the ones that followed?

There are new findings that come every day. In the first 10 years of having the human genome sequence before us, I think we have accumulated, on a day-by-day basis, more and more information about how the human genome works. But we should recognize that even 10 years in, we are only at the early stages of interpreting that sequence. Decades from now we will still be interpreting, and reinterpreting, it.

Some of the earliest things that we learned, for example: We have many fewer genes than some people had predicted. When the genome project began, many people predicted that humans probably had 100,000 genes, and that they would have substantially more genes than other organisms, especially simpler organisms. It turns out that is not true. It turns out that we have a much lower gene number. In fact, we probably have more like 20,000 genes. And that is only a few thousand more than flies and worms. So our complexity is not in our gene number. Our complexity is elsewhere.

The other surprise came as we started sequencing other mammals, in particular the mouse genome, rat genome, dog genome and so forth, and by now we have sequenced 50, 60, 70 such genomes. You line up those genome sequences in a computer and you look to see where the sequences are very conserved, in other words, where across tens of millions of years of evolutionary time the sequences have not changed at all. Highly, highly evolutionarily conserved sequences almost for sure point to functional sequences. These are things that life doesn’t want to change, and so it keeps them the same because they are doing some vital fundamental function necessary for biology. Going into the genome project, we thought the majority of those most conserved regions that were functionally important were going to be in the genes, the parts of the genome that directly code for proteins. It turns out, the majority of the most highly conserved and inevitably functional sequences are not in protein-coding regions; they are outside of genes.

So what are they doing? We don’t know what all of them do. But we know a lot of them are basically circuit switches, like dimmer switches for a light, that determine where and when and how much a gene gets turned on. It is much more complicated in humans than it is in lower organisms like flies and worms. So our biological complexity is not so much in our gene number. It is in the complex switches, like dimmer switches, that regulate where, when, and how much genes get turned on.
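The comparison Green describes, lining up genome sequences and looking for stretches that have not changed, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the three short sequences are invented toy data (not real genomes), they are assumed to be already aligned to equal length, and the function names `conserved_columns` and `conserved_runs` are hypothetical. Real comparative genomics relies on alignment software and statistical conservation scores rather than exact matching.

```python
def conserved_columns(aligned):
    """Return indices of columns where every aligned sequence has the same base."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must already be aligned to equal length")
    first = aligned[0]
    return [
        i for i in range(length)
        if first[i] != "-" and all(seq[i] == first[i] for seq in aligned)
    ]


def conserved_runs(aligned, min_len=5):
    """Group conserved columns into half-open (start, end) runs of at least min_len."""
    runs = []
    start = prev = None
    for i in conserved_columns(aligned):
        if start is None:
            start = prev = i
        elif i == prev + 1:
            prev = i
        else:
            if prev - start + 1 >= min_len:
                runs.append((start, prev + 1))
            start = prev = i
    if start is not None and prev - start + 1 >= min_len:
        runs.append((start, prev + 1))
    return runs


# Invented toy "genomes"; species labels are for readability only.
toy = {
    "human": "ACGTTGCAAGGCTA",
    "mouse": "ACGTTGCATGGCTT",
    "dog":   "ACGTTGCACGGCAA",
}

print(conserved_runs(list(toy.values()), min_len=5))  # prints [(0, 8)]
```

In this toy example the first eight letters are identical in all three sequences, so they are flagged as one conserved stretch. Whether such a stretch falls inside a gene or outside it is a separate question that requires gene annotations, which this sketch does not model.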

What do we have left to figure out?

When you think about how the genome works, that is thinking about how it works commonly for all of us. But the other big emphasis in genomics, especially in the last 10 years, is to understand how our genomes are different. So there you can emphasize the 0.1 percent of our genomes that are different compared to one another and how those differences lead to different biological processes. So there, understanding variation is very, very important, and then correlating that variation to different consequences, of which disease is a major part.

There have been remarkable, just truly remarkable advances. We now know the genomic basis for almost 5,000 rare genetic diseases. When the genome project began, there were only a few dozen diseases for which we understood which mutation was causing that disease. That is a huge difference. We now know many, many hundreds and hundreds of regions of the human genome that contain variants (we don’t know which variants yet) that are conferring risk for more complicated genetic diseases, like hypertension and diabetes and asthma, cardiovascular disease and so forth.

We have gone from having a complete lack of knowledge of where to look in the genome for those variants to now having very discrete regions to look in. So a big emphasis now in genomics is trying to understand which variants are relevant to disease and what to do about them.