Wanted: More Data, the Dirtier the Better
Introduction
To distill a clear message from growing piles of unruly genomics data, researchers often turn to meta-analysis — a tried-and-true statistical procedure for combining data from multiple studies. But the studies that a meta-analysis might mine for answers can diverge endlessly. Some enroll only men, others only children. Some are done in one country, others across a region like Europe. Some focus on milder forms of a disease, others on more advanced cases. Even if statistical methods can compensate for these kinds of variations, studies rarely use the same protocols and instruments to collect the data, or the same software to analyze it. Researchers performing meta-analyses go to untold lengths trying to clean up the hodgepodge of data to control for these confounding factors.
Purvesh Khatri, a computational immunologist at Stanford University, thinks they’re going about it all wrong. His approach to genomic discovery calls for scouring public repositories for data collected at different hospitals on different populations with different methods — the messier the data, the better. “We start with dirty data,” he says. “If a signal sticks around despite the heterogeneity of the samples, you can bet you’ve actually found something.”
This strategy seems too easy, but in Khatri’s hands, it works. Analyzing troves of public data, Khatri and colleagues have uncovered signature genes that could allow clinicians to detect life-threatening infections that cause sepsis, classify infections as bacterial or viral, and tell if someone has a specific disease such as tuberculosis, dengue or malaria. Last year Khatri and two other scientists launched a company to develop a device for measuring these gene signatures at a patient’s bedside. In short, they’re deciphering the host immune response and turning key genes into diagnostics.
Over the past year Khatri discussed his ideas with Quanta Magazine over the phone, by email and from his whiteboard-lined Stanford office. An edited and condensed version of the conversations follows.
What turned you on to biology?
I left India and came to the U.S. in the “fix the Y2K bug” rush with plans to get a master’s in computer science and become a software engineer. Months after arriving at Wayne State University in Detroit I realized that writing software for the rest of my life was going to be really boring. I joined a lab working on neural networks.
But then my adviser switched to bioinformatics and said he’d pay my tuition if I switched with him. I was a poor Indian grad student. I thought, “You’re going to pay my salary? I’ll do whatever you are doing.” That’s how I moved into biology.
You made a splash pretty quickly. How did that happen?
While my adviser was away on sabbatical in 2000-2001, I worked in the lab doing bioinformatics analyses with a postdoc in the lab of our collaborator, a gynecologist studying genes involved in male fertility. Microarrays for running assays on large numbers of genes at once were brand-new. From a recent experiment, the postdoc had gotten a list of some 3,000 genes of interest, and he was trying to figure out what they were doing.
One day I saw him going from one website to another, copying and pasting text into Excel spreadsheets. I said to him, “You know, I can write software for you that will do all of that automatically. Just tell me what you are doing.” So I wrote a script for him — it took me three days — and with the results we wrote a Lancet paper.
We put the software on the web. There was huge interest. They presented it at some conference, and Pfizer wanted to buy it. I thought, wow, this is such low-hanging fruit. I can be a millionaire soon.
What does the software do?
It takes the set of genes you specify and searches annotation databases to tell you what biological processes and molecular pathways those genes are involved in. If you have a list of 100 genes, it could tell you that 15 are involved in immune response, another 15 are involved in angiogenesis and 50 play a role in glucose metabolism. Let’s say you’re studying Type 1 diabetes. You could look at these results and say, “I’m on the right path.”
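[Editor's note: a minimal sketch, in Python, of the kind of over-representation test such annotation tools perform; it is not Onto-Tools code, and the pathway annotations and gene names below are placeholders.]

```python
# A minimal sketch of an over-representation test of the kind such tools run.
# The "pathways" dictionary and gene lists are hypothetical placeholders.
from scipy.stats import hypergeom

def enrichment(gene_list, pathway_genes, background_size):
    """Return the overlap size and the hypergeometric P(overlap >= observed)."""
    overlap = len(set(gene_list) & set(pathway_genes))
    p = hypergeom.sf(overlap - 1, background_size,
                     len(pathway_genes), len(gene_list))
    return overlap, p

# Hypothetical annotation database: biological process -> member genes
pathways = {
    "immune response": {"IL6", "TNF", "CXCL10", "STAT1", "IFNG"},
    "angiogenesis":    {"VEGFA", "KDR", "ANGPT1", "TEK"},
}

my_genes = ["IL6", "STAT1", "CXCL10", "GAPDH", "ACTB", "VEGFA"]
for process, members in pathways.items():
    k, p = enrichment(my_genes, members, background_size=20000)
    print(f"{process}: {k} of {len(my_genes)} genes, p = {p:.2g}")
```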
This was 15 years ago, when I was getting my master’s degree. I developed more tools and expanded the work into a Ph.D. It’s now an open-access, web-based suite of tools called Onto-Tools. Last I checked a few years ago, it had 15,000 users from many countries, analyzing an average of 100 data sets a day.
Although the tools became very popular, they weren’t telling me how the results get used, how they help people. I wanted to see how research progresses from bioinformatics analyses to lab experiments and ultimately to something that could help patients.
How did you make that switch?
When I came to Stanford as a postdoc in 2008, one of my conditions was that somebody with a wet lab — someone running experiments on samples from mice or actual patients, not just analyzing data in silico — would pay half my salary, because I wanted their skin in the game. I wanted to make predictions using methods I’d develop in one lab, and then work with another lab to validate those predictions and tell me what’s clinically important. That’s how I ended up working with Atul Butte, a bioinformatician, and Minnie Sarwal, a renal transplant physician. [Editor’s note: Butte and Sarwal have both since moved from Stanford to the University of California, San Francisco.]
What shifted your attention to immunology?
Reading papers to learn the basic biology of organ transplant rejection, I had an “Aha!” moment. I realized that heart transplant surgeons, kidney transplant surgeons and lung transplant surgeons don’t really talk to each other!
No matter which organ I was reading about, I saw a common theme: The B cells and T cells of the graft recipient’s immune system were attacking the transplant. Yet diagnostic criteria for rejection were different — kidney people follow Banff criteria for renal graft rejection, heart-and-lung people follow ISHLT [International Society for Heart and Lung Transplantation] criteria. If the biological mechanism is common, why are there different diagnostic criteria? That didn’t make sense to me as a computer scientist.
I was starting to form a hypothesis that there must be a common mechanism — some common trigger that tells the recipient’s immune cells that something is “not self.” While thinking about this, I came across a fantastic paper titled “The Immunologic Constant of Rejection.” The authors basically laid out my hypothesis. They proposed that while the triggers for organ rejection may differ, they share a common pathway. And they were saying someone should test this.
What did you do at that point?
I started asking my colleagues, “Why don’t we start collecting samples from various organ transplant cohorts and do the analysis to find out what common genes are involved?” They said you can’t do it because you’d have to account for all the heterogeneity — different organs, different microarray technologies, different treatment protocols. It would be expensive to control for all of that.
Plus, it would take years to get everyone to contribute all those samples. I was in a hurry. So Atul suggested getting ahold of existing public data instead. But these data are “dirty,” as they are confounded by a number of biological and technical factors.
I wondered if we really had to control for heterogeneity. If all this “dirty” data exists, maybe we could just combine it somehow. And if we found a signal, despite the heterogeneity, wouldn’t you then say, oh, that’s what I should be looking at?
I started working on it.
What happened on that first try?
I went to the Gene Expression Omnibus website and downloaded data from several organ transplant studies — heart, kidney, lung, liver. The data came from five hospitals and used at least two different diagnostic criteria. Because we weren’t throwing out “incompatible” data, we set our [allowable] false discovery rate higher than usual (20 percent instead of the usual 5 percent). We were willing to get more false positives if we could find a common mechanism across all the solid-organ transplant rejections. We checked some other things, like making sure one data set wasn’t driving all the results, and took some additional steps to make sure we weren’t just getting a bunch of genes changing. And it worked.
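[Editor's note: for readers who want the flavor of the computation, here is an illustrative Python sketch, not the published pipeline, of pooling one gene's effect across heterogeneous cohorts and checking that no single data set drives the result; the expression values are simulated placeholders.]

```python
# Illustrative sketch (not the published pipeline): pool one gene's effect
# across heterogeneous cohorts, then check that no single data set drives it.
import numpy as np

def standardized_effect(cases, controls):
    """Cohen's d for cases vs. controls and its approximate variance."""
    n1, n2 = len(cases), len(controls)
    pooled_sd = np.sqrt(((n1 - 1) * np.var(cases, ddof=1) +
                         (n2 - 1) * np.var(controls, ddof=1)) / (n1 + n2 - 2))
    d = (np.mean(cases) - np.mean(controls)) / pooled_sd
    var = (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))
    return d, var

def pooled_effect(effects, variances):
    """Inverse-variance weighted average of per-data-set effects."""
    w = 1.0 / np.asarray(variances)
    return float(np.sum(w * np.asarray(effects)) / np.sum(w))

# Simulated expression of one gene in three cohorts (rejectors vs. stable patients)
rng = np.random.default_rng(0)
datasets = [(rng.normal(1.0, 1, 30), rng.normal(0, 1, 30)),
            (rng.normal(0.8, 1, 25), rng.normal(0, 1, 25)),
            (rng.normal(1.2, 1, 40), rng.normal(0, 1, 40))]
stats = [standardized_effect(cases, ctrl) for cases, ctrl in datasets]
effects, variances = zip(*stats)
print("pooled effect:", pooled_effect(effects, variances))

# Leave-one-data-set-out: is any single cohort responsible for the signal?
for i in range(len(datasets)):
    kept = [s for j, s in enumerate(stats) if j != i]
    e, v = zip(*kept)
    print(f"without cohort {i}:", pooled_effect(e, v))

# With per-gene p-values for thousands of genes, the 20 percent FDR cutoff could
# then be applied, e.g. statsmodels' multipletests(pvals, alpha=0.2, method='fdr_bh').
```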
What do you mean by “worked”?
Using a lot of heterogeneous data, we found a set of 11 genes that were overexpressed in patients who rejected their transplants, and we showed that we could validate that gene signature in other cohorts from different hospitals in different countries. Plus, using this gene set, we could predict — from a biopsy six months after the graft surgery — which patients would experience significant subclinical graft injury (a harder condition to detect than acute rejection) 18 months later. So it was also a prognostic marker.
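[Editor's note: one common way to collapse a gene set into a per-sample score, sketched below in Python on toy data, is to average the z-scored expression of the signature genes; the exact scoring used in the paper may differ, and the gene and sample names are placeholders.]

```python
# Sketch of a simple per-sample signature score: average the z-scored
# expression of the signature genes. Toy data; not the paper's exact score.
import numpy as np
import pandas as pd

def signature_score(expr: pd.DataFrame, signature_genes) -> pd.Series:
    """expr: genes x samples expression matrix; returns one score per sample."""
    sub = expr.loc[expr.index.intersection(signature_genes)]
    z = sub.sub(sub.mean(axis=1), axis=0).div(sub.std(axis=1), axis=0)
    return z.mean(axis=0)

# Toy data: 4 signature genes x 6 samples (3 rejectors, 3 stable patients)
rng = np.random.default_rng(0)
expr = pd.DataFrame(rng.normal(size=(4, 6)),
                    index=["GENE1", "GENE2", "GENE3", "GENE4"],
                    columns=["R1", "R2", "R3", "S1", "S2", "S3"])
expr.loc[:, ["R1", "R2", "R3"]] += 1.0   # signature genes elevated in rejectors
scores = signature_score(expr, ["GENE1", "GENE2", "GENE3", "GENE4"])
print(scores.sort_values(ascending=False))
```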
We confirmed these results in mice. We took a heart from one mouse, put it into another animal, and asked: Do these genes change when we see transplant rejection? The answer was yes.
We then did a Google search to find drugs whose mechanisms suggest they regulate the biological processes of the genes we had found. We chose two FDA-approved drugs to try on our mice. Lo and behold, they worked. Both drugs reduced graft-infiltrating immune cells [a marker for rejection]. They looked as good as a drug we currently give to transplant patients.
One of those two drugs is a statin, a drug widely prescribed to prevent heart disease. I sought help from a former colleague who now works in Belgium and has access to electronic medical records dating back to 1989. I asked him to search the database for patients who got renal transplants and see what drugs they took, when their grafts failed, all of that. He ran the analysis and a week later said to me, “Guess what? If the patients received statins, their graft failure rate was reduced 30 percent.”
Diagnosis, prognosis, therapy and validation of the findings against electronic medical records — all in one paper.
I don’t quite get how your approach differs from traditional meta-analysis. What’s fundamentally different?
The biggest difference is that our group ignores heterogeneity across data sets, whereas in traditional meta-analysis we are taught to reduce heterogeneity.
People say, for example, “I’m not going to use this sample because that patient had a different drug treatment. Or maybe these patients were early post-transplant whereas this other data set is late, five years after transplant, so I’m not going to use that data.” In bioinformatics, we have learned to take data sets and select samples, making sure there is no noise and no confounding factors.
But when we do this, it does not capture the heterogeneity of the disease. We know that. That’s why we have to replicate the findings in other cohorts.
What I’m saying is, don’t worry about the heterogeneity. Using dirty data allows you to account for clinical heterogeneity.
But to be sure that heterogeneity wasn’t going to screw up my results, I set stringent criteria for validating that the statistical associations we found between genes and medical conditions were not flukes. Validation had to be done in an independent cohort that was not part of the discovery set. In other words, if a lab had more than one data set published, I made each data set either a discovery or a validation cohort a priori. [Editor’s note: Traditionally, researchers often divide a group of participants into two subgroups: a “discovery” group to be mined for genes associated with a certain condition, and a “validation” group, which they analyze separately to validate the genes identified in the discovery group.]
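[Editor's note: a minimal Python sketch of assigning whole data sets a priori to discovery or validation roles; the cohort names are hypothetical and the discover/validate functions are placeholder stubs for the meta-analysis and the held-out test.]

```python
# Sketch of an a priori discovery/validation split at the level of whole data
# sets. Cohort names are hypothetical; the two functions are placeholder stubs.
import random

def discover_signature(cohorts):
    """Placeholder: run the meta-analysis across the discovery cohorts only."""
    return ["GENE1", "GENE2", "GENE3"]

def validate_signature(genes, cohort):
    """Placeholder: score the signature in one held-out cohort (e.g., ROC AUC)."""
    return 0.80

dataset_ids = ["CohortA", "CohortB", "CohortC", "CohortD", "CohortE", "CohortF"]
random.seed(0)               # fix the assignment before any results are seen
random.shuffle(dataset_ids)
discovery, validation = dataset_ids[:4], dataset_ids[4:]

signature = discover_signature(discovery)
for cohort in validation:    # validation cohorts were never used in discovery
    print(cohort, validate_signature(signature, cohort))
```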
This approach has worked. The genes we have identified using a lot of dirty data — where we just took all the biological and technological heterogeneity we could find — we have been able to validate in cohorts that came from different groups at different hospitals in different countries.
Last fall we published a set of guidelines so anyone can do this. It compares several methods and is quite technical, but here is the punchline: Reproducibility is good (greater than 85 percent) when you use three to five data sets with a total of 200-250 samples. Which meta-analysis method you choose is not important. What really matters is not having a large, homogeneous data set but rather multiple heterogeneous data sets.
Our method, MetaIntegrator, is available on CRAN, an open-access repository for packages written in the R programming language.
Recently, we did an analysis showing that using dirty data is not just good, but required, because of research biases in the literature. We just released the preprint on biorxiv.org. The gist is that forming hypotheses based on what’s been published is akin to looking for your keys under a random street lamp because that’s where the light is better.
Does your approach work in other scenarios besides transplant rejection?
We have applied this framework to cancer and to both infectious and autoimmune diseases. For example, a friend of mine works on cancers driven by mutations in a gene called KRAS. He came to me and asked, “I have these five genes I’m interested in. Can you run your analysis and tell me which ones I should focus on?”
I ran the method across 13 data sets: six for pancreatic cancer, seven for lung cancer. No matter what I did, one gene always showed up as changing the most. He ran with that result and figured out a mechanism, and it became a Nature paper.
That was in 2014, just before a local 10th-grader arrived to do a summer research project. What did you propose to him?
Thinking more about the 11 genes from the organ transplant work, I started to wonder: How specific is that set of genes? Do those same 11 genes increase when you have an infection? What about cancer? Autoimmune disease?
I said to the student, who was spending a summer working with me, let’s start collecting data for all these different diseases. Just download the data, run our pipeline and show me the gene signatures — the list of genes whose expression changes for each condition. He used 173 microarray data sets, with more than 8,000 human samples coming from 42 diseases. Bacterial infections, viral infections, autoimmune and neurodegenerative disorders, psychiatric conditions, cancers.
He spent the summer downloading data, putting it into our database and annotating it — whether it was case or control, what disease, what tissue. For each disease, he identified a gene signature. Based on those signatures, he correlated every disease with every other disease. Simple correlation: If one gene is up in this disease, is it also up in this other disease? Then he did hierarchical clustering. Simplest possible thing you could imagine.
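[Editor's note: a short Python sketch of that workflow, correlating per-disease signatures and clustering the correlation matrix; the diseases and values below are simulated placeholders, not the student's real data.]

```python
# Sketch of the analysis as described: correlate per-disease gene signatures,
# then hierarchically cluster the correlation matrix. Simulated placeholders.
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform

rng = np.random.default_rng(1)
diseases = ["sepsis", "influenza", "tuberculosis", "lupus", "lung cancer"]
# Rows: genes; columns: diseases; entries: per-disease effect (e.g., log fold change)
signatures = pd.DataFrame(rng.normal(size=(500, len(diseases))), columns=diseases)

corr = signatures.corr()                          # disease-by-disease correlation
dist = squareform(1 - corr.values, checks=False)  # condensed dissimilarity matrix
tree = linkage(dist, method="average")            # hierarchical clustering
order = leaves_list(tree)
print(corr.round(2))
print("cluster order:", [diseases[i] for i in order])
```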
He came to me with a figure — a matrix summarizing all these results — and I’ve been using it as a source of new questions to address. A high school student’s summer project set the core research direction in my lab!
What are some recent findings?
Tim Sweeney, a Stanford surgery resident doing a master’s degree in biomedical informatics in my lab, used this approach a few years ago to systematically figure out what’s causing the immune response — like a flowchart. He first used it to find a gene signature to distinguish sepsis from noninfectious inflammation, then to distinguish whether it’s a bacterial or viral infection. If it’s viral, is it influenza or something else? If it’s bacterial, is it TB? And besides bacteria and viruses, infections can also be caused by parasites. Recently we identified a gene signature for a person’s response to malaria. We can now answer all these questions by measuring gene expression in the host immune response.
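[Editor's note: the flowchart idea, rendered as a toy Python cascade; the signature names and thresholds below are hypothetical, not the actual published tests.]

```python
# Toy sketch of the "flowchart" idea: a cascade of signature scores with
# hypothetical names and cutoffs (the real tests and thresholds differ).
def classify(scores: dict) -> str:
    """scores: per-signature scores, each scaled to 0-1, for one patient sample."""
    if scores["infection"] < 0.5:
        return "noninfectious inflammation"
    if scores["bacterial_vs_viral"] > 0.5:
        return "tuberculosis" if scores["tb"] > 0.5 else "other bacterial infection"
    return "influenza" if scores["flu"] > 0.5 else "other viral infection"

print(classify({"infection": 0.9, "bacterial_vs_viral": 0.2, "tb": 0.1, "flu": 0.8}))
```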
Last May, Tim and I helped found a company, Inflammatix, to commercialize our “dirty data”-based diagnostics. The company has licensed these signatures from Stanford and will develop methods to leverage dirty data to their true potential. I think we have not even scratched the surface of what we can do with available data.
One more thing. In our 2014 Cancer Research paper, we showed that the receptor PTK7 plays an important role in lung cancer. If you knock down levels of it, tumors start shrinking. At the time PTK7 was what’s called an “orphan receptor tyrosine kinase,” meaning its activating ligand wasn’t known. But earlier this year Pfizer published a report about a drug that targets PTK7 for non-small-cell lung cancer.
All of this seems like it would convince other researchers to take up your approach. Have they?
My worry was that the moment we published this, there would be so many people competing with us. Yet now it’s in the public domain and barely anyone uses it!
When I present about this approach, I get converts. But until then I get grant reviews such as this one I posted on Twitter the other day: Primary investigator “seems to like bright objects and flits from one shiny project to another without focus.”
So that’s my challenge. How do we convince them?