Introduction
Twenty years ago, sequencing the human genome was one of the most ambitious science projects ever attempted. Today, compared to the collected genomes of the microorganisms living in our bodies, the ocean, the soil and elsewhere, each human genome, which easily fits on a DVD, looks simple. Its 3 billion DNA base pairs and about 20,000 genes seem paltry next to the roughly 100 billion bases and millions of genes that make up the microbes found in the human body.
And a host of other variables accompanies that microbial DNA, including the age and health status of the microbial host, when and where the sample was collected, and how it was collected and processed. Take the mouth, populated by hundreds of species of microbes, with as many as tens of thousands of organisms living on each tooth. Beyond the challenges of analyzing all of these, scientists need to figure out how to reliably and reproducibly characterize the environment where they collect the data.
“There are the clinical measurements that periodontists use to describe the gum pocket, chemical measurements, the composition of fluid in the pocket, immunological measures,” said David Relman, a physician and microbiologist at Stanford University who studies the human microbiome. “It gets complex really fast.”
Ambitious attempts to study complex systems like the human microbiome mark biology’s arrival in the world of big data. The life sciences have long been considered descriptive disciplines — as recently as 10 years ago, the field was relatively data-poor, and scientists could easily keep up with the data they generated. But with advances in genomics, imaging and other technologies, biologists are now generating data at crushing speeds.
One culprit is DNA sequencing, whose costs began to plunge about five years ago, falling even more quickly than the cost of computer chips. Since then, thousands of human genomes, along with those of thousands of other organisms, including plants, animals and microbes, have been deciphered. Public genome repositories, such as the one maintained by the National Center for Biotechnology Information, or NCBI, already house petabytes — millions of gigabytes — of data, and biologists around the world are churning out 15 petabases (a base is a letter of DNA) of sequence per year. If these were stored on regular DVDs, the resulting stack would be 2.2 miles tall.
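As a rough back-of-the-envelope check of that stack (a minimal sketch only — the one-byte-per-base plain-text encoding, 4.7-gigabyte single-layer DVD capacity and 1.2-millimeter disc thickness are assumptions, not figures from the article), the arithmetic lands in the same ballpark as the quoted height:

```python
# Back-of-the-envelope check of the "2.2-mile stack of DVDs" figure.
# Assumptions (not stated in the article): 1 byte per base in plain text,
# 4.7 GB per single-layer DVD, 1.2 mm per disc.

BASES_PER_YEAR = 15e15        # 15 petabases of sequence per year
BYTES_PER_BASE = 1            # plain-text encoding, one character per base
DVD_CAPACITY_BYTES = 4.7e9    # single-layer DVD
DVD_THICKNESS_MM = 1.2        # standard disc thickness

total_bytes = BASES_PER_YEAR * BYTES_PER_BASE
dvds = total_bytes / DVD_CAPACITY_BYTES
stack_miles = dvds * DVD_THICKNESS_MM / 1e3 / 1609.34  # mm -> m -> miles

print(f"{dvds:,.0f} DVDs, a stack ~{stack_miles:.1f} miles tall")
# -> roughly 3.2 million DVDs and a stack on the order of two miles,
#    consistent with the article's 2.2-mile figure
```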
“The life sciences are becoming a big data enterprise,” said Eric Green, director of the National Human Genome Research Institute in Bethesda, Md. In a short period of time, he said, biologists are finding themselves unable to extract full value from the large amounts of data becoming available.
Solving that bottleneck has enormous implications for human health and the environment. A deeper understanding of the microbial menagerie inhabiting our bodies and how those populations change with disease could provide new insight into Crohn’s disease, allergies, obesity and other disorders, and suggest new avenues for treatment. Soil microbes are a rich source of natural products like antibiotics and could play a role in developing crops that are hardier and more efficient.
Life scientists are embarking on countless other big data projects, including efforts to analyze the genomes of many cancers, to map the human brain, and to develop better biofuels and other crops. (The wheat genome is more than five times larger than the human genome, and it has six copies of every chromosome, compared with our two.)
However, these efforts are encountering some of the same criticisms that surrounded the Human Genome Project. Some have questioned whether massive projects, which necessarily take some funding away from smaller, individual grants, are worth the trade-off. Big data efforts have almost invariably generated data that is more complicated than scientists had expected, leading some to question the wisdom of funding projects to create more data before the data that already exists is properly understood. “It’s easier to keep doing what we are doing on a larger and larger scale than to try and think critically and ask deeper questions,” said Kenneth Weiss, a biologist at Pennsylvania State University.
Fields like physics, astronomy and computer science have been dealing with the challenges of massive datasets for decades, but the big data revolution in biology has been sudden, leaving the field little time to adapt.
“The revolution that happened in next-generation sequencing and biotechnology is unprecedented,” said Jaroslaw Zola, a computer engineer at Rutgers University in New Jersey, who specializes in computational biology.
Biologists must overcome a number of hurdles, from storing and moving data to integrating and analyzing it, which will require a substantial cultural shift. “Most people who know the disciplines don’t necessarily know how to handle big data,” Green said. If they are to make efficient use of the avalanche of data, that will have to change.
Big Complexity
When scientists first set out to sequence the human genome, the bulk of the work was carried out by a handful of large-scale sequencing centers. But the plummeting cost of genome sequencing helped democratize the field. Many labs can now afford to buy a genome sequencer, adding to the mountain of genomic information available for analysis. The distributed nature of genomic data has created its own challenges, including a patchwork of data that is difficult to aggregate and analyze. “In physics, a lot of effort is organized around a few big colliders,” said Michael Schatz, a computational biologist at Cold Spring Harbor Laboratory in New York. “In biology, there are something like 1,000 sequencing centers around the world. Some have one instrument, some have hundreds.”
As an example of the scope of the problem, scientists around the world have now sequenced thousands of human genomes. But someone who wanted to analyze all of them would first have to collect and organize the data. “It’s not organized in any coherent way to compute across it, and tools aren’t available to study it,” said Green.
Big Data in Biology
A selection of big data projects in the life sciences exploring health, the environment and beyond.
Cancer Genome Atlas: This effort to map the genomes of more than 25 types of cancer has generated 1 petabyte of data to date, representing 7,000 cases of cancer. Scientists expect 2.5 petabytes by completion.
Encyclopedia of DNA Elements (ENCODE): This map of the functional elements in the human genome — regions that turn genes on and off — contains more than 15 terabytes of raw data.
Human Microbiome Project: One of a number of projects characterizing the microbiome at different parts of the body, this effort has generated 18 terabytes of data — about 5,000 times more data than the original Human Genome Project.
Earth Microbiome Project: A plan to characterize microbial communities across the globe, which has created 340 gigabytes of sequence data to date, representing 1.7 billion sequences from more than 20,000 samples and 42 biomes. Scientists expect 15 terabytes of sequence and other data by completion.
Genome 10K: The total raw data for this effort to sequence and assemble the DNA of 10,000 vertebrate species and analyze their evolutionary relationships will exceed 1 petabyte.
Researchers need more computing power and more efficient ways to move their data around. Hard drives, often sent by postal mail, are still frequently the easiest way to transport data, and some argue that it’s cheaper to store biological samples than to sequence them and store the resulting data. Though the cost of sequencing has fallen fast enough for individual labs to own their own machines, the cost of the computing power and storage needed to make sense of the output has not dropped as quickly. “The cost of computing is threatening to become a limiting factor in biological research,” said Folker Meyer, a computational biologist at Argonne National Laboratory in Illinois, who estimates that computing costs ten times more than research. “That’s a complete reversal of what it used to be.”
Biologists say that the complexity of biological data sets it apart from big data in physics and other fields. “In high-energy physics, the data is well-structured and annotated, and the infrastructure has been perfected for years through well-designed and funded collaborations,” said Zola. Biological data is technically smaller, he said, but much more difficult to organize. Beyond simple genome sequencing, biologists can track a host of other cellular and molecular components, many of them poorly understood. Technologies are also available to measure the status of genes — whether they are turned on or off — as well as what RNAs and proteins they are producing. Add in data on clinical symptoms, chemical or other exposures, and demographics, and you have a very complicated analysis problem.
“The real power in some of these studies could be integrating different data types,” said Green. But software tools capable of cutting across fields need to improve. The rise of electronic medical records, for example, means more and more patient information is available for analysis, but scientists don’t yet have an efficient way of marrying it with genomic data, he said.
To make things worse, scientists don’t have a good understanding of how many of these different variables interact. Researchers studying social media networks, by contrast, know exactly what the data they are collecting means; each node in the network represents a Facebook account, for example, with links delineating friends. A gene regulatory network, which attempts to map how different genes control the expression of other genes, is smaller than a social network, with thousands rather than millions of nodes. But the data is harder to define. “The data from which we construct networks is noisy and imprecise,” said Zola. “When we look at biological data, we don’t know exactly what we are looking at yet.”
Despite the need for new analytical tools, a number of biologists said that the computational infrastructure continues to be underfunded. “Often in biology, a lot of money goes into generating data but a much smaller amount goes to analyzing it,” said Nathan Price, associate director of the Institute for Systems Biology in Seattle. While physicists have free access to university-sponsored supercomputers, most biologists don’t have the right training to use them. Even if they did, the existing computers aren’t optimized for biological problems. “Very frequently, national-scale supercomputers, especially those set up for physics workflows, are not useful for life sciences,” said Rob Knight, a microbiologist at the University of Colorado Boulder and the Howard Hughes Medical Institute involved in both the Earth Microbiome Project and the Human Microbiome Project. “Increased funding for infrastructure would be a huge benefit to the field.”
In an effort to deal with some of these challenges, in 2012 the National Institutes of Health launched the Big Data to Knowledge Initiative (BD2K), which aims, in part, to create data sharing standards and develop data analysis tools that can be easily distributed. The specifics of the program are still under discussion, but one of the aims will be to train biologists in data science.
“Everyone getting a Ph.D. in America needs more competency in data than they have now,” said Green. Bioinformatics experts are currently playing a major role in the cancer genome project and other big data efforts, but Green and others want to democratize the process. “The kinds of questions to be asked and answered by super-experts today, we want a routine investigator to ask 10 years from now,” said Green. “This is not a transient issue. It’s the new reality.”
Not everyone agrees that this is the path that biology should follow. Some scientists say that focusing so much funding on big data projects at the expense of more traditional, hypothesis-driven approaches could be detrimental to science. “Massive data collection has many weaknesses,” said Weiss. “It may not be powerful in understanding causation.” Weiss points to the example of genome-wide association studies, a popular genetic approach in which scientists try to find genes responsible for different diseases, such as diabetes, by measuring the frequency of relatively common genetic variants in people with and without the disease. The variants identified by these studies so far raise the risk of disease only slightly, but larger and more expensive versions of these studies are still being proposed and funded.
“Most of the time it finds trivial effects that don’t explain disease,” said Weiss. “Shouldn’t we take what we have discovered and divert resources to understand how it works and do something about it?” Scientists have already identified a number of genes that are definitely linked to diabetes, so why not try to better understand their role in the disorder, he said, rather than spend limited funds to uncover additional genes with a murkier role?
Many scientists think that the complexities of life science research require both large and small science projects, with large-scale data efforts providing new fodder for more traditional experiments. “The role of the big data projects is to sketch the outlines of the map, which then enables researchers on smaller-scale projects to go where they need to go,” said Knight.
Small and Diverse
Efforts to characterize the microbes living on our bodies and in other habitats epitomize the promise and the challenges of big data. Because the vast majority of microbes can’t be grown in the lab, the two major microbiome projects — the Earth Microbiome Project and the Human Microbiome Project — have been greatly enabled by DNA sequencing. Scientists study these microbes mainly through their genes: by analyzing the DNA of the collection of microbes living in soil, on skin or in any other environment, they can start to answer basic questions, such as what types of microbes are present and how they respond to changes in their environment.
The goal of the Human Microbiome Project, one of a number of projects to map human microbes, is to characterize microbiomes from different parts of the body using samples taken from 300 healthy people. Relman likens it to understanding a forgotten organ system. “It’s a somewhat foreign organ, because it’s so distant from human biology,” he said. Scientists generate DNA sequences from thousands of species of microbes, many of whose genomes must be painstakingly pieced back together. It’s like recreating a collection of books from fragments that are shorter than individual sentences.
“We are now faced with the daunting challenge of trying to understand the system from the perspective of all this big data, with not nearly as much biology with which to interpret it,” said Relman. “We don’t have the same physiology that goes along with understanding the heart or the kidney.”
One of the most exciting discoveries of the project to date is the highly individualized nature of the human microbiome. Indeed, one study of about 200 people showed that just by sequencing microbial residue left on a keyboard by an individual’s fingertips, scientists can match that individual with the correct keyboard with 95 percent accuracy. “Until recently, we had no idea how diverse the microbiome was, or how stable within a person,” said Knight.
Researchers now want to figure out how different environmental factors, such as diet, travel or ethnicity, influence an individual’s microbiome. Recent studies have revealed that simply transferring gut microbes from one animal to another can have a dramatic impact on health — clearing infections or triggering weight loss, for example. With more data on the microbiome, researchers hope to discover which microbes are responsible for those changes and perhaps design medical treatments around them.
Relman said that some of the major challenges will be determining which of the almost unmanageable number of variables involved are important, and figuring out how to define some of the microbiome’s most important functions. For example, scientists know that our microbes play an integral role in shaping the immune system, and that some people’s microbial community is more resilient than others — the same course of antibiotics can have little long-term impact on one individual’s microbial profile and throw another’s completely out of whack. “We just don’t have a big sense of how to go about measuring these services,” said Relman, referring to the microbes’ role in shaping the immune system and other functions.
The Earth Microbiome Project presents an even larger data analysis challenge. Scientists have sequenced about 50 percent of the microbial species living in our guts, which makes new gut data much easier to interpret. But only about one percent of the soil microbiome has been sequenced, leaving researchers with genomic fragments that are often impossible to assemble into whole genomes.
Data in the Brain
If genomics was the early adopter of big data analysis in the life sciences, neuroscience is quickly gaining ground. New imaging methods and techniques for recording the activity and the structure of many neurons are allowing scientists to capture large volumes of data.
Jeff Lichtman, a neuroscientist at Harvard, is collaborating on a project to build neural wiring maps from an unprecedented amount of data by taking snapshots of thin slices of the brain, one after another, and then computationally stitching them together. Lichtman said his team, which uses a technique called scanning electron microscopy, is currently generating about a terabyte of image data per day from a single sample. “In a year or so, we hope to be doing multiple terabytes per hour,” he said. “That’s a lot of still-raw data that has to be processed by computer algorithms.” A cubic millimeter of brain tissue generates about 2,000 terabytes of data. As in other areas of the life sciences, storing and managing the data is proving to be a problem. While cloud computing works for some aspects of genomics, it may be less useful for neuroscience. Indeed, Lichtman said his team has too much data for the cloud — too much even to pass around on hard drives.
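Those rates hint at why faster imaging matters so much. A quick calculation (a sketch only — the 5-terabytes-per-hour figure below is a hypothetical stand-in for Lichtman’s “multiple terabytes per hour”) shows how long a single cubic millimeter would take to image at each pace:

```python
# Rough timing for imaging one cubic millimeter of brain tissue (~2,000 TB),
# using the rates quoted in the article. The 5 TB/hour value is an assumed
# stand-in for "multiple terabytes per hour".

CUBIC_MM_TB = 2000                 # ~2,000 terabytes per cubic millimeter

current_rate_tb_per_day = 1        # "about a terabyte of image data per day"
hoped_rate_tb_per_hour = 5         # hypothetical illustration

days_now = CUBIC_MM_TB / current_rate_tb_per_day
days_hoped = CUBIC_MM_TB / (hoped_rate_tb_per_hour * 24)

print(f"At 1 TB/day:  ~{days_now / 365:.1f} years per cubic millimeter")
print(f"At 5 TB/hour: ~{days_hoped:.0f} days per cubic millimeter")
# -> about 5.5 years at today's rate versus roughly 17 days at the faster one
```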
Lichtman believes the challenges neuroscientists face will be even greater than those of genomics. “The nervous system is a far more complicated entity than the genome,” he said. “The whole genome can fit on a CD, but the brain is comparable to the digital content of the world.”
Lichtman’s study is just one of a growing number of efforts to chart the brain. In January, the European Union launched an effort to model the entire human brain. And the U.S. is now working on its own large-scale project — the details are still under discussion, but the focus will likely be on mapping brain activity rather than the neural wiring itself.
As in genomics, Lichtman said, neuroscientists will need to get used to the concept of sharing their data. “It’s essential that this data become freely and easily accessible to anyone, which is its own challenge. We don’t know the answer yet to problems like this.”
Questions remain about funding and necessary advances in hardware, software and analytical methods. “Ideas like this are almost certainly going to cost a lot, and they have not produced fundamental findings yet,” said Lichtman. “Will you just end up with a meaningless mass of connectional data? This is always a challenge for big data.”
Still, Lichtman is convinced that the major findings will come with time. “I feel confident that you don’t have to know beforehand what questions to ask,” he said. “Once the data is there, anyone who has an idea has a dataset they can use to mine it for an answer.
“Big data,” he said, “is the future of neuroscience but not the present of neuroscience.”