
Decoding Flu Viruses Before an Outbreak

Machine learning techniques are helping scientists pinpoint the mutations that allow bird and pig viruses to infect humans.
Human infections of a new variant of H7N9 influenza were first reported in China in April, mainly in people who had close contact with poultry.
CDC

Every few decades, a pandemic flu variant emerges that not only infects humans but also passes rapidly from person to person. The H7N9 avian flu virus that infected more than 130 people in China this spring, primarily from close contact with poultry, hasn’t yet become highly contagious among people. But given that humans lack the antibodies to combat the virus, its high lethality rate (44 of the infected died), and the possibility that it could resurface this fall or winter, scientists and public health officials are racing to unravel its mysteries.

Recent studies of H7N9 show that it can pass among ferrets, which are often used to model human flu transmission. If the virus gains the ability to spread easily among people, it has the potential to be deadlier than the 2009 H1N1 swine flu pandemic, which may have been responsible for more than 200,000 deaths worldwide.

Researchers like Raul Rabadan, a theoretical physicist working in biology at Columbia University, want to understand how viruses that ordinarily infect birds or pigs suddenly jump to humans and then become easily transmissible: “What are the specific mutations that contribute to a virus becoming a human pathogen?” he explained.

Traditionally, answering this question would have required a painstaking comparison of the DNA or protein sequences of different viruses. But armed with rapidly growing databases of virus sequences, scientists are now using sophisticated machine learning techniques (a branch of artificial intelligence in which computers develop algorithms based on the data they have been given) to identify key properties in viruses like H7N9. Knowing these properties will help researchers identify the most dangerous new flu strains and could lead to more effective vaccines. Most importantly, scientists can now look at hundreds or thousands of flu strains simultaneously, which could reveal common mechanisms across different viruses or a broad diversity of transformations that enable human transmission.

“It’s changing the field radically,” said Nir Ben-Tal, a computational biologist at Tel Aviv University in Israel.

Researchers are also using these approaches to investigate a broad range of viral mysteries, including what makes some viruses more harmful than others and the factors that influence a virus’s ability to trigger an immune response. The latter could ultimately aid the development of flu vaccines. A study published in July analyzed differences in the human immune system’s response to flu, identifying for the first time genetic variants that seem to influence an individual’s ability to fight off H1N1. Machine learning techniques might even accelerate future efforts to identify the animal source of mystery viruses.

Swine Flu Computations

To identify mutations that transform a pig or bird virus into a human one, scientists traditionally compared the protein sequences from virus strains before and after they developed the ability to infect people. Researchers had to manually assess the sequences for differences, focusing on regions known to be important for transmission such as hemagglutinin, the protein that binds the virus to the host cell. The laborious nature of this method meant researchers could look at only tens of viruses at a time.
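The position-by-position comparison described above can be sketched in a few lines of Python. This is only an illustration: the two sequences are made-up fragments, not real hemagglutinin sequences, and the function name `differing_positions` is a hypothetical helper that assumes the sequences are already aligned to the same length.

```python
def differing_positions(before, after):
    """List (position, old_residue, new_residue) for each mismatch.

    Positions are 1-based, as is conventional for protein residues.
    Assumes the two sequences are already aligned and equally long.
    """
    if len(before) != len(after):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (index + 1, old, new)
        for index, (old, new) in enumerate(zip(before, after))
        if old != new
    ]


# Made-up fragments for illustration only; not real viral sequences.
toy_before = "MKTIIALSY"
toy_after = "MKAIIALSH"

changes = differing_positions(toy_before, toy_after)
for position, old, new in changes:
    print(f"position {position}: {old} -> {new}")
```

Doing this by eye for every pair of strains is what limited the traditional method to tens of viruses at a time.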

“It’s a fairly crude approach,” said Richard Webby, a virologist at St. Jude Children’s Research Hospital in Memphis.

Once researchers have identified candidate mutations, they can genetically engineer those candidates into a nonhuman virus to examine the effect. Although this approach has identified some mutations that probably contributed to the pandemic swine flu and other flus, there are some drawbacks. For example, this method can involve creating a potentially dangerous virus. Two papers published in 2012 on the H5N1 bird flu, which typically does not spread among people, caused a furor because researchers introduced mutations that made the virus better able to spread among ferrets. Critics were concerned that the findings could be co-opted by bioterrorists, but the research was eventually made public.

This approach also requires prior knowledge of the mechanisms of human transmissibility, limiting its ability to discover new mechanisms. “They had to kick-start the process with some mutations that people knew were important,” Webby said. “The computational approach doesn’t rely on biology in the initial phases, so it may work where the biological approach may not.”

In 2011, Ben-Tal, Webby and their collaborators became the first to use machine learning to compare protein sequences of the 2009 H1N1 pandemic swine flu with hundreds of other swine viruses. They were searching for changes that might have been responsible for its jump from pigs to people. “Before then, we had seen lots of infection of humans but not human-to-human spread,” said Webby, who also directs a World Health Organization center that studies influenza in animals.

Machine learning algorithms have been used to study DNA and protein sequences for more than 20 years, but only in the past few years have scientists applied them to viruses. Inspired by the growing amount of viral sequence data available for analysis, Ben-Tal’s team used an approach called supervised learning. With this method, each piece of input data used to train the algorithm is tagged with a category, in this case whether the virus sequence was derived from swine or humans. The resulting algorithm defines a decision tree capable of accurately sorting the viruses into the proper group — human or swine. The nodes of the tree point to the specific amino acids, or building blocks of proteins, that reliably differentiate the groups. Ben-Tal said this specific approach, called an alternating decision tree algorithm, is standard in machine learning but had rarely been applied to biological data before his team’s study.
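To make the idea of supervised learning on labeled sequences concrete, here is a minimal sketch in Python. It is not the alternating decision tree algorithm used in the study. It substitutes a simpler, related technique: ranking each alignment position by information gain, the criterion an ordinary decision tree uses to choose which position to split on. The sequences and labels are invented, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(sequences, labels, position):
    """Reduction in label entropy from splitting on the residue at one position."""
    groups = {}
    for sequence, label in zip(sequences, labels):
        groups.setdefault(sequence[position], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_positions(sequences, labels):
    """Return (position, gain) pairs, most discriminating position first."""
    length = len(sequences[0])
    gains = [(pos, information_gain(sequences, labels, pos)) for pos in range(length)]
    return sorted(gains, key=lambda item: item[1], reverse=True)


# Made-up aligned fragments with made-up host labels; not real influenza data.
toy_sequences = ["MKTAL", "MKTAV", "MRTAL", "MRSAL", "MRSAV", "MRSAL"]
toy_labels = ["swine", "swine", "swine", "human", "human", "human"]

ranking = rank_positions(toy_sequences, toy_labels)
best_position, best_gain = ranking[0]
print(f"most discriminating position (0-based): {best_position}, gain {best_gain:.2f}")
```

In this toy data, the residue at 0-based position 2 separates the two groups perfectly, so it ranks first. A real analysis would work on hundreds of full-length sequences and would need to guard against positions that separate the groups only by chance.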

In the study, the researchers identified 13 amino acid changes that appeared to distinguish human viruses from viruses that remained in swine and an additional 10 amino acid changes that distinguished the pandemic strain from standard seasonal flus. One or more of these candidates, which the scientists have since analyzed in more detail, could explain the virus’s dangerous transformation. (Ben-Tal, Webby and their collaborators will soon publish a paper characterizing a mutation that they say helped H1N1 become a pandemic virus.)

One of the key benefits of the computational approach was that researchers were able to look beyond the standard targets, regions of the genome known to be involved in traits such as transmissibility. For example, some of the candidate mutations lie nearby but outside the specific site where hemagglutinin binds to the host cell.

“The residue that was important wasn’t in a part of the protein that we had ever predicted,” Webby said. “Had I gone in using the old ways of looking at changes, I wouldn’t have been looking at this particular part of the protein.”

Novel Viruses

Rabadan and collaborators at Columbia are now using a similar approach to explore these viral properties more broadly, analyzing a database of more than 60,000 virus genomes. The large numbers enable a more robust statistical analysis and help to pinpoint the most important genetic changes, Rabadan said. While the H1N1 study looked for specific changes that made the virus jump to humans, Rabadan’s team hopes to identify mechanisms common to different viruses. For example, if two kinds of viruses develop the same genetic change, a phenomenon known as convergent evolution, it suggests that the change may actually be involved in the function under study, such as human transmission, rather than occurring by chance. “We have the power of not looking at one outbreak at a time,” Rabadan said.
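The convergent-evolution reasoning above can be illustrated with a small sketch: count how many independent lineages acquired each mutation and keep the ones that recur. The lineage names and mutation labels below are invented, and `recurrent_mutations` is a hypothetical helper, not the Columbia team's method.

```python
from collections import Counter


def recurrent_mutations(mutations_by_lineage, min_lineages=2):
    """Return mutations seen in at least `min_lineages` separate lineages."""
    counts = Counter()
    for mutations in mutations_by_lineage.values():
        # Use a set so a mutation counts once per lineage.
        counts.update(set(mutations))
    return sorted(m for m, n in counts.items() if n >= min_lineages)


# Invented lineages and mutation labels for illustration only.
toy_lineages = {
    "lineage_1": ["A10T", "G55S"],
    "lineage_2": ["A10T", "K90R"],
    "lineage_3": ["L7F", "A10T", "K90R"],
}

shared = recurrent_mutations(toy_lineages)
print(shared)
```

A mutation that turns up in several unrelated lineages is a stronger candidate than one seen in a single outbreak, although a real analysis would also have to account for how the lineages are related.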

The researchers have compiled a list of candidate mutations, some of which have already been identified as playing a role in human transmission. For the novel candidates, researchers are now studying viruses with these mutations in human cell lines to see how efficiently they can infect human cells.

It’s not yet clear how successful the approach will be in analyzing viruses that haven’t yet become highly contagious. Ben-Tal said his group had initially tried its approach on the H5N1 virus, which, like H7N9, lacks the ability to spread easily among people, but found it was much more difficult to identify amino acid changes unique to human versions. (Unlike H1N1 cases, those involving H5N1 and H7N9 have mainly been limited to people who had close contact with birds.)

Database Housekeeping

Raul Rabadan and collaborators are analyzing more than 60,000 viral sequences, searching for specific factors that make some viruses able to easily infect people. But the genome sequences in these databases can be of mixed quality, so researchers first had to come up with a way to glean only the most accurate and useful information, identifying likely sequencing errors or sources of contamination. “Some are sequences not correctly annotated, some are low-quality or incomplete, so we use only the ones we trust,” said Rabadan. “We have our own algorithms for trimming down to what we trust.” Without cleaning and curating the data, the signals picked up by machine learning algorithms could be associated with “weird stuff instead of biologically interesting things,” he said.
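The article does not describe Rabadan's curation algorithms, so the following is only a rough sketch of what a first-pass quality filter might look like. The record fields, the thresholds and the function name `is_trustworthy` are all assumptions made for illustration.

```python
def is_trustworthy(record, min_length, max_ambiguous_fraction=0.01):
    """Apply three simple, assumed checks to one sequence record.

    A record is rejected if its host annotation is missing, if the
    sequence is shorter than `min_length`, or if too large a fraction
    of its bases are ambiguous (anything other than A, C, G or T).
    """
    sequence = record.get("sequence", "").upper()
    if not record.get("host") or not sequence:
        return False
    if len(sequence) < min_length:
        return False
    ambiguous = sum(1 for base in sequence if base not in "ACGT")
    return ambiguous / len(sequence) <= max_ambiguous_fraction


# Invented records for illustration only.
toy_records = [
    {"id": "toy1", "host": "avian", "sequence": "ACGTACGTACGT"},
    {"id": "toy2", "host": "", "sequence": "ACGTACGTACGT"},       # missing annotation
    {"id": "toy3", "host": "human", "sequence": "ACGT"},          # incomplete
    {"id": "toy4", "host": "swine", "sequence": "ACNNNNGTACGT"},  # low quality
]

kept_ids = [
    record["id"]
    for record in toy_records
    if is_trustworthy(record, min_length=10, max_ambiguous_fraction=0.1)
]
print(kept_ids)
```

Filtering like this happens before any learning step, so that the algorithm is not rewarded for picking up sequencing errors or annotation mistakes.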

“The data represent a collection of different avian strains that mutated in different amino acid positions and infected humans,” Ben-Tal said. “This is interesting because it may mean that there are many mutations that make it easier to cross the avian-to-human species barrier.” But that diversity makes it challenging to identify specific culprits. He said H7N9 may be difficult to analyze for the same reason.

Meanwhile, Tomer Hertz, a researcher at the Fred Hutchinson Cancer Research Center in Seattle, and collaborators at St. Jude Children’s Research Hospital and elsewhere are using machine learning to study the human immune system’s response to these viral strains. In July, they published new research suggesting for the first time that an individual’s immune profile influences how well that person can fight off novel influenza strains.

The main reason that new viruses such as H1N1, H7N9 and H5N1 are so dangerous is that humans have little or no existing ability to attack and neutralize them. Lacking antibodies from vaccines or earlier infections, the human immune system relies on T cells, which attack foreign proteins, and human leukocyte antigens (HLAs), which bind to short pieces of viral proteins. The chunks of viral proteins are then presented to the T cells, which triggers different processes to eliminate them.

HLA genes are highly diverse: The human population carries thousands of varieties, although each individual possesses only 12. Hertz and collaborators had previously shown that some HLA proteins are better at targeting the conserved regions of viruses, parts of the sequence that remain stable over time and are similar across different strains. That ensures that the immune system is able to target the broadest possible collection of viruses. Indeed, an individual’s HLA profile can influence how he or she responds to certain viruses, including HIV. People who don’t develop AIDS after HIV infection are more likely to have certain HLA variants. Hertz’s team has shown that those variants are more effective at targeting the conserved parts of the virus.
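One simple way to make "conserved regions" concrete is to score each column of an alignment by how often its most common residue appears, then look for short stretches where every column is fully conserved. The sketch below does only that. It does not model HLA binding, the sequences are invented, and the function names are hypothetical.

```python
from collections import Counter


def column_conservation(aligned):
    """Fraction of sequences sharing the most common residue, per column."""
    scores = []
    for column in zip(*aligned):
        most_common_count = Counter(column).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores


def conserved_windows(aligned, width, threshold=1.0):
    """0-based start positions of windows where every column meets the threshold."""
    scores = column_conservation(aligned)
    return [
        start
        for start in range(len(scores) - width + 1)
        if min(scores[start:start + width]) >= threshold
    ]


# Invented aligned fragments from four imaginary strains.
toy_alignment = ["MKTLLVAG", "MKSLLVAG", "MKTLLIAG", "MRTLLVAG"]

scores = column_conservation(toy_alignment)
windows = conserved_windows(toy_alignment, width=2)
print(scores)
print(windows)
```

Short peptides drawn from such windows would look the same across strains, which is why an immune system that targets them can recognize a broader collection of viruses.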

For the new study, published in the Proceedings of the National Academy of Sciences, researchers used computational approaches to reveal that people whose HLA variants are less effective at targeting conserved viral regions demonstrate a weaker immune response. Computational analysis revealed that a certain class of HLA variants, known as HLA-A*24, is particularly ineffective.

The study also found that countries with higher mortality rates from H1N1 tended to have a higher frequency of HLA-A*24 alleles. Additionally, within the United States and Australia, Native American and aboriginal populations tend to have higher rates of this family of variations and both suffered higher death rates than surrounding populations. Hertz said further work is needed to confirm a causal link between HLA-A*24 alleles and greater risk of death from H1N1.

Hertz, who collaborated with Ben-Tal and Webby on the H1N1 study, said this work is enabled by rapidly growing sources of both virus and immune system data. “We had this idea a few years ago, but we didn’t think it was as feasible until now,” he said.

Beyond influenza, Rabadan and Chris Wiggins, a mathematician at Columbia, have developed a method for identifying a virus’s original animal host, which is sometimes difficult to determine from the virus itself. The coronavirus behind the recent outbreak of Middle East respiratory syndrome, for example, has killed nearly half of the more than 100 people with confirmed infections. Last week, an exhaustive, 15-month search pointed to bats as the source. But machine learning approaches paired with large databases housing information on viruses from many different animal hosts might speed up this type of search.

The machine learning approach is likely to expand as even more genomic information becomes available. “Databases will get much richer, and computational approaches will get much more powerful,” Webby said. That in turn will help scientists better monitor emerging flu strains and predict their impact, ideally forecasting when a virus is likely to jump to people and how dangerous it is likely to become.

This article was reprinted on Wired.com.
