A New Approach to Building the Tree of Life
When the British morphologist St. George Jackson Mivart published one of the first evolutionary trees in 1865, he had very little to go on. He built the tree — a delicately branching map of different primate species — using detailed analysis of the animals’ spinal columns. But a second tree, generated by comparing the animals’ limbs, predicted different relationships among the primates, highlighting a challenge in evolutionary biology that continues to this day.
Now, nearly 150 years later, scientists have vast amounts of data with which to build so-called phylogenetic trees, the modern version of Mivart’s structure. Advances in DNA sequencing technology and bioinformatics enable them to compare the sequences of hundreds of genes, sometimes entire genomes, among many different species, creating a tree of life more detailed than ever before.
But while the abundance of data has helped resolve some of the conflict surrounding parts of the evolutionary tree, it also presents new challenges. The current version of the tree of life is more like a contentious wiki page than a published book, with certain branches subject to frequent debate. Indeed, just as the spinal column and limbs created contrasting maps of primate evolution, scientists now know that different genes in the same organism can tell different stories.
According to a new study partly focused on yeast, the conflicting picture from individual genes is even broader than scientists suspected. “They report that every single one of the 1,070 genes conflicts somewhat,” said Michael Donoghue, an evolutionary biologist at Yale who was not involved in the study. “We are trying to figure out the phylogenetic relationships of 1.8 million species and can’t even sort out 20 [types of] yeast,” he said.
To resolve this paradox, the researchers developed an algorithm, based on information theory, to gauge the level of certainty in specific parts of the tree. They hope the new approach will help to clarify periods of evolution that are potentially the most illuminating but also the most conflicted, such as the Cambrian explosion — the rapid diversification of animal life that occurred about 540 million years ago.
“Historically, the areas of the tree of life that have attracted a lot of attention and disagreement usually have to do with the most interesting episodes,” such as the origins of animals, vertebrates and flowering plants, said Antonis Rokas, a biologist at Vanderbilt University who led the new study.
Based on the results of the new algorithm, scientists can select only the most informative genes to build phylogenetic trees, an approach that could make the process more accurate and efficient. “I think it will help us quite a bit in speeding up the reconstruction of the tree of life,” said Khidir Hilu, a biologist at Virginia Tech in Blacksburg.
Building Blocks
At the most basic level, scientists create phylogenetic trees by grouping species according to their degree of relatedness. Lining up the DNA of humans, chimpanzees and fish, for example, makes it readily apparent that humans and chimps are more closely related to each other than they are to fish.
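As a toy illustration of that kind of comparison, the sketch below counts the fraction of positions at which aligned sequences differ and picks the closest pair. The sequences are invented for illustration and are not real genomic data; `p_distance` is a hypothetical helper name, and real analyses rely on far longer alignments and explicit models of sequence evolution.

```python
from itertools import combinations

# Made-up, pre-aligned sequences (illustrative only, not real genomic data).
aligned = {
    "human": "ACGTACGTAC",
    "chimp": "ACGTACGTAT",
    "fish":  "ACTTAGGCAT",
}

def p_distance(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

# Distance for every pair of species.
distances = {
    pair: p_distance(aligned[pair[0]], aligned[pair[1]])
    for pair in combinations(aligned, 2)
}

# The pair with the smallest distance is grouped together first.
closest_pair = min(distances, key=distances.get)
print(closest_pair, distances[closest_pair])
```

With these toy sequences, the human and chimp strings differ at the fewest positions, so they form the first grouping.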
Researchers once used just one gene or a handful to compare organisms. But the last decade has seen an explosion in phylogenetic data, rapidly inflating the data pool for generating these trees. These analyses filled in some of the sparse spots on the tree of life, but considerable disagreement remains.
For example, it’s not clear whether snails are most closely related to clams and other bivalves or to another mollusk group known as tusk shells, said Rokas. And we have no idea how some of the earliest animals to branch off the tree, such as jellyfish and sponges, are related to each other. Scientists can rattle off examples of conflicting trees published in the same scientific journal within weeks, or even in the same issue.
“That poses a question: Why do you have this lack of agreement?” said Rokas.
Rokas and his graduate student Leonidas Salichos explored that question by evaluating each gene independently and using only the most useful genes — those that carry the greatest amount of information with respect to evolutionary history — to construct their tree.
They started with 23 species of yeast, focusing on 1,070 genes. They first created a phylogenetic tree using the standard method, called concatenation. This involves stringing together all the sequence data from individual species into one mega-gene and then comparing that long sequence among the different species and creating a tree that best explains the differences.
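The stringing-together step can be sketched in a few lines. This is a minimal illustration, assuming each gene has already been aligned and that every species appears in every gene; the gene names, species labels and sequences are hypothetical, and the tree-building step that follows concatenation is left to dedicated phylogenetic software.

```python
# Hypothetical per-gene alignments: gene -> {species: aligned sequence}.
gene_alignments = {
    "geneA": {"sp1": "ACGT", "sp2": "ACGA", "sp3": "TCGA"},
    "geneB": {"sp1": "GGC", "sp2": "GGC", "sp3": "GAC"},
}

def concatenate(alignments):
    """Join each species' aligned genes, in a fixed gene order, into one sequence.

    Assumes every species is present in every gene alignment.
    """
    genes = sorted(alignments)            # fixed order so columns line up
    species = sorted(alignments[genes[0]])
    return {
        sp: "".join(alignments[gene][sp] for gene in genes)
        for sp in species
    }

supermatrix = concatenate(gene_alignments)
for sp, seq in supermatrix.items():
    print(sp, seq)
```

Because every species' genes are joined in the same order, each column of the resulting "mega-gene" still compares the same position across species.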
The resulting tree was accurate according to standard statistical analysis. But given that similar methods have produced trees of life that are rife with contradiction, Rokas and Salichos decided to delve deeper. They built a series of phylogenetic trees using data from individual yeast genes and employed an algorithm derived from information theory to find the areas of greatest agreement among the trees. The result, published in Nature in May, was unexpected. Every gene they studied appeared to tell a slightly different story of evolution.
“Just about all the trees from individual genes were in conflict with the tree based on a concatenated data set,” said Hilu. “It’s a bit shocking.”
They concluded that if a number of genes support a specific architecture, it is probably accurate. But if different sets of genes support two different architectures equally, it is much less likely that either structure is accurate. Rokas and Salichos used a statistical method called bootstrap analysis to select the most informative genes.
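One way to put that reasoning into numbers is a score built on Shannon entropy, the central quantity of information theory. The sketch below is an assumption about the general form such a measure could take, not a reproduction of the study's algorithm: the function name `certainty` and the gene counts are hypothetical. It takes the number of genes supporting each of two competing architectures and returns a value near 1 when one dominates and 0 when the two are equally supported.

```python
import math

def certainty(support_a, support_b):
    """Entropy-based agreement score for two competing tree architectures.

    Returns 1.0 when all supporting genes back one architecture and 0.0
    when support is split evenly. Illustrative only.
    """
    total = support_a + support_b
    if total == 0:
        raise ValueError("need at least one supporting gene")
    score = 1.0
    for count in (support_a, support_b):
        p = count / total
        if p > 0:                      # treat 0 * log2(0) as 0
            score += p * math.log2(p)  # log2(p) <= 0, so this lowers the score
    return score

# Hypothetical gene counts for three scenarios.
print(certainty(1000, 0))    # unanimous support
print(certainty(900, 100))   # one architecture clearly favored
print(certainty(500, 500))   # evenly split
```

A score like this rewards lopsided support rather than the sheer number of genes, which matches the intuition that an evenly split vote says little about which architecture is right.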
In essence, “if you take just the strongly supported genes, then you recover the correct tree,” said Donoghue.
The revised tree was consistent with one constructed using an alternative source of evolutionary information — large-scale alterations in chunks of DNA that are passed down from generation to generation — validating their approach.
The findings weren’t limited to yeast. When the researchers applied the same analysis to larger and more complex life forms, including genetic data from vertebrates and from animals more broadly, they found extensive conflict among individual genes as well.
For some researchers, the idea of selectively excluding data from analysis could take some getting used to. “For many years, the biggest problem for people trying to understand relationships between organisms was getting enough data,” said Jeffrey Townsend, an evolutionary biologist at Yale University, who was not involved in the study. “The community has always been taught to get the data, so it’s reasonable that’s how they thought about the problem.”
While evolutionary biologists have been grappling with these issues for years, the new study is the largest-scale effort to date to explore the level of conflict among individual genes. “People will have two reactions: There is a lot more conflict than I thought, and we need to do a better job of analyzing it,” said Donoghue, who is interested in applying the new method in his own work. However, he also points out that it’s difficult to confirm the accuracy of the new approach. Even though the revised tree matched one built using alternative genetic information, the latter may harbor its own inaccuracies. “I am not so sure we know what the true relationships are,” he said. “If we aren’t sure what the truth is, we can’t tell if we have the right tree.”
A Changing Picture
Researchers need to apply the new technique more broadly to see how it might alter our picture of evolution. However, Rokas and Salichos have already shown that the most difficult parts of the tree to reconstruct are short branches, or “bushy” parts, which indicate periods of rapid speciation, especially those at the base of the tree, deep in evolutionary history.
“Theoretical work predicted such behavior, but our study is the first to show experimental data that this is the case,” Rokas said.
Rokas also argues that the new findings should alter how researchers interpret fuzzy parts of the tree. “Evolutionary biologists tend to assume that lack of resolution means failure to infer the right tree; thus, if only one had more data and better algorithms, then we would be able to infer the right tree,” he said. But conflicted parts of the tree that persist despite reams of data and the application of this new analysis may indicate bushy parts, he said. “I think in some cases the algorithm will actually resolve the conflict, and in others it highlights areas of conflict that are unlikely to ever be resolved.”
Studying these bushy parts may reveal new insight into particularly interesting epochs in evolution, such as the Cambrian explosion, when life transformed from mostly simple organisms into a diverse array of animal species.
Other scientists agree that the findings could have a significant impact on how the field deals with clashing pictures of evolution.
“I think it’s a harbinger of a paradigm shift,” said Townsend. “The point is that if we use the right methods, we will have the potential to learn about questions that have plagued us for a long time.”
Townsend, who has developed his own method for selecting the most informative genes based on the speed at which they are evolving, notes that not everyone in the scientific community agrees on the need for these new approaches. “I hope this paper will help bring that to the forefront,” he said.
Selecting the appropriate number of genes to use in drafting phylogenetic trees isn’t the only question plaguing evolutionary biologists. They must also settle on the number of species to include — the more species used in the tree, the greater the complexity of the analysis. The results can also be biased by differences in the quality of data for different species. “If we are interested in getting the true evolutionary history of how everything is related, is the best chance to sample more genes or more species?” said Donoghue. “I think both are good things to do.”
New approaches that allow researchers to get accurate results with fewer genes may make it possible to flesh out the evolutionary tree. Being able to select only the most informative genes could make the process more efficient, enabling scientists to create accurate trees with less data and at a lower cost. “If we could select a few genes that give us a tree as good as the whole genome,” said Hilu, “we would be able to build the tree of life with much more detail — at the genus level or maybe even the species level — instead of just the backbone of major lineages.”
This article was reprinted on Wired.com.