A New Approach to Understanding How Machines Think
Introduction
If a doctor told you that you needed surgery, you would want to know why, and you’d expect the explanation to make sense to you, even if you’d never gone to medical school. Been Kim, a research scientist at Google Brain, believes that we should expect nothing less from artificial intelligence. As a specialist in “interpretable” machine learning, she wants to build AI software that can explain itself to anyone.
Since its ascendance roughly a decade ago, the neural-network technology behind artificial intelligence has transformed everything from email to drug discovery with its increasingly powerful ability to learn from and identify patterns in data. But that power has come with an uncanny caveat: The very complexity that lets modern deep-learning networks successfully teach themselves how to drive cars and spot insurance fraud also makes their inner workings nearly impossible to make sense of, even for AI experts. If a neural network is trained to identify patients at risk for conditions like liver cancer and schizophrenia (as a system called “Deep Patient” was in 2015, at Mount Sinai Hospital in New York), there’s no way to discern exactly which features in the data the network is paying attention to. That “knowledge” is smeared across many layers of artificial neurons, each with hundreds or thousands of connections.
As ever more industries attempt to automate or enhance their decision-making with AI, this so-called black box problem seems less like a technological quirk than a fundamental flaw. DARPA’s “XAI” project (for “explainable AI”) is actively researching the problem, and interpretability has moved from the fringes of machine-learning research to its center. “AI is in this critical moment where humankind is trying to decide whether this technology is good for us or not,” Kim says. “If we don’t solve this problem of interpretability, I don’t think we’re going to move forward with this technology. We might just drop it.”
Kim and her colleagues at Google Brain recently developed a system called “Testing with Concept Activation Vectors” (TCAV), which she describes as a “translator for humans” that allows a user to ask a black box AI how much a specific, high-level concept has played into its reasoning. For example, if a machine-learning system has been trained to identify zebras in images, a person could use TCAV to determine how much weight the system gives to the concept of “stripes” when making a decision.
TCAV was originally tested on machine-learning models trained to recognize images, but it also works with models trained on text and certain kinds of data visualizations, like EEG waveforms. “It’s generic and simple — you can plug it into many different models,” Kim says.
Quanta Magazine spoke with Kim about what interpretability means, who it’s for, and why it matters. An edited and condensed version of the interview follows.
You’ve focused your career on “interpretability” for machine learning. But what does that term mean, exactly?
There are two branches of interpretability. One branch is interpretability for science: If you consider a neural network as an object of study, then you can conduct scientific experiments to really understand the gory details about the model, how it reacts, and that sort of thing.
The second branch of interpretability, which I’ve been mostly focused on, is interpretability for responsible AI. You don’t have to understand every single thing about the model. But as long as you can understand just enough to safely use the tool, then that’s our goal.
But how can you have confidence in a system that you don’t fully understand the workings of?
I’ll give you an analogy. Let’s say I have a tree in my backyard that I want to cut down. I might have a chainsaw to do the job. Now, I don’t fully understand how the chainsaw works. But the manual says, “These are the things you need to be careful of, so as to not cut your finger.” So, given this manual, I’d much rather use the chainsaw than a handsaw, which is easier to understand, but would make me spend five hours cutting down the tree.
You understand what “cutting” is, even if you don’t exactly know everything about how the mechanism accomplishes that.
Yes. The goal of the second branch of interpretability is: Can we understand a tool enough so that we can safely use it? And we can create that understanding by confirming that useful human knowledge is reflected in the tool.
How does “reflecting human knowledge” make something like a black box AI more understandable?
Here’s another example. If a doctor is using a machine-learning model to make a cancer diagnosis, the doctor will want to know that the model isn’t picking up on some random correlation in the data that we don’t want to pick up. One way to make sure of that is to confirm that the machine-learning model is doing something that the doctor would have done. In other words, to show that the doctor’s own diagnostic knowledge is reflected in the model.
So if doctors were looking at a cell specimen to diagnose cancer, they might look for something called “fused glands” in the specimen. They might also consider the age of the patient, as well as whether the patient has had chemotherapy in the past. These are factors or concepts that the doctors trying to diagnose cancer would care about. If we can show that the machine-learning model is also paying attention to these factors, the model is more understandable, because it reflects the human knowledge of the doctors.
Is this what TCAV does — reveal which high-level concepts a machine-learning model is using to make its decisions?
Yes. Prior to this, interpretability methods only explained what neural networks were doing in terms of “input features.” What do I mean by that? If you have an image, every single pixel is an input feature. In fact, Yann LeCun [an early pioneer in deep learning and currently the director of AI research at Facebook] has said that he believes these models are already superinterpretable because you can look at every single node in the neural network and see numerical values for each of these input features. That’s fine for computers, but humans don’t think that way. I don’t tell you, “Oh, look at pixels 100 to 200, the RGB values are 0.2 and 0.3.” I say, “There’s a picture of a dog with really puffy hair.” That’s how humans communicate — with concepts.
How does TCAV perform this translation between input features and concepts?
Let’s return to the example of a doctor using a machine-learning model that has already been trained to classify images of cell specimens as potentially cancerous. You, as the doctor, may want to know how much the concept of “fused glands” mattered to the model in making positive predictions of cancer. First you collect some images — say, 20 — that have examples of fused glands. Now you plug those labeled examples into the model.
Then what TCAV does internally is called “sensitivity testing.” When we add in these labeled pictures of fused glands, how much does the probability of a positive prediction for cancer increase? You can output that as a number between zero and one. And that’s it. That’s your TCAV score. If the probability increased, it was an important concept to the model. If it didn’t, it’s not an important concept.
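The sensitivity test Kim describes can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the TCAV implementation: the three-unit "layer," the activation values, and the `zebra_logit` function are all made up, and a simple difference of means stands in for the trained linear classifier that the published method uses to find a concept direction. The score here is the fraction of examples whose class logit rises when their activations are nudged toward the concept.

```python
# Toy sketch of a TCAV-style score. All names and numbers are hypothetical.

def mean_vector(rows):
    """Component-wise mean of a list of equal-length activation vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def concept_direction(concept_acts, random_acts):
    """Unit vector from the mean random activation toward the mean concept
    activation (assumption: a simple stand-in for a learned concept vector)."""
    mu_c = mean_vector(concept_acts)
    mu_r = mean_vector(random_acts)
    diff = [c - r for c, r in zip(mu_c, mu_r)]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]

def directional_derivative(logit_fn, act, direction, eps=1e-4):
    """Finite-difference estimate of how the class logit changes when the
    activation is nudged a small step along the concept direction."""
    shifted = [a + eps * d for a, d in zip(act, direction)]
    return (logit_fn(shifted) - logit_fn(act)) / eps

def tcav_score(logit_fn, class_acts, direction):
    """Fraction of class examples whose logit rises along the direction."""
    positive = sum(
        1 for act in class_acts
        if directional_derivative(logit_fn, act, direction) > 0
    )
    return positive / len(class_acts)

# Hypothetical 3-unit layer; unit 0 loosely plays the role of "stripes".
striped_acts = [[3.0, 0.1, 0.0], [2.5, -0.1, 0.2], [3.5, 0.0, -0.2]]
random_acts = [[0.0, 0.1, 0.1], [0.2, -0.1, -0.1], [-0.2, 0.0, 0.0]]
zebra_acts = [[1.0, 0.0, 0.0], [2.0, 1.0, 0.0], [-1.0, 0.0, 0.0], [0.5, 0.0, 1.0]]

def zebra_logit(act):
    """Made-up 'zebra' logit: responds to unit 0 only when it is positive."""
    return 2.0 * max(0.0, act[0]) - 0.5 * act[1]

stripes = concept_direction(striped_acts, random_acts)
score = tcav_score(zebra_logit, zebra_acts, stripes)
print(f"TCAV-style score for 'stripes': {score:.2f}")
```

In this toy setup, three of the four "zebra" examples respond positively to the stripes direction, so the score is 0.75. A real model would supply the activations and gradients from one of its internal layers rather than from hand-written lists.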
“Concept” is a fuzzy term. Are there any that won’t work with TCAV?
If you can’t express your concept using some subset of your [dataset’s] medium, then it won’t work. If your machine-learning model is trained on images, then the concept has to be visually expressible. Let’s say I want to visually express the concept of “love.” That’s really hard.
We also carefully validate the concept. We have a statistical testing procedure that rejects the concept vector if it has the same effect on the model as a random vector. If your concept doesn’t pass this test, then the TCAV will say, “I don’t know. This concept doesn’t look like something that was important to the model.”
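One way to picture that validation step is to compare scores from several runs with the real concept against scores from runs that use random directions. The sketch below assumes an exact permutation test as the comparison, which is a swapped-in technique chosen for illustration; the interview does not say which statistical test is used. The function name and the score lists are hypothetical.

```python
# Toy sketch of checking a concept against random directions.
# The permutation test and all numbers are assumptions for illustration.
from itertools import combinations

def permutation_p_value(concept_scores, random_scores):
    """Exact two-sided permutation test on the gap between group means.
    Returns the share of relabelings whose gap is at least as large as
    the gap actually observed."""
    pooled = list(concept_scores) + list(random_scores)
    k = len(concept_scores)
    total = sum(pooled)

    def gap(group_sum):
        rest_mean = (total - group_sum) / (len(pooled) - k)
        return abs(group_sum / k - rest_mean)

    observed = gap(sum(concept_scores))
    splits = 0
    extreme = 0
    for idx in combinations(range(len(pooled)), k):
        splits += 1
        # Small tolerance guards against floating-point rounding.
        if gap(sum(pooled[i] for i in idx)) >= observed - 1e-12:
            extreme += 1
    return extreme / splits

# Hypothetical scores from five repeated runs each.
stripes_scores = [0.90, 0.85, 0.95, 0.88, 0.92]
noise_scores = [0.50, 0.47, 0.53, 0.49, 0.51]
random_scores = [0.50, 0.45, 0.55, 0.52, 0.48]

p_stripes = permutation_p_value(stripes_scores, random_scores)
p_noise = permutation_p_value(noise_scores, random_scores)
print("stripes distinguishable from random:", p_stripes < 0.05)
print("noise distinguishable from random:", p_noise < 0.05)
```

A concept whose scores cannot be told apart from the random runs would be rejected, which matches the "I don't know" outcome Kim describes.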
Is TCAV essentially about creating trust in AI, rather than a genuine understanding of it?
It is not — and I’ll explain why, because it’s a fine distinction to make.
We know from repeated studies in cognitive science and psychology that humans are very gullible. What that means is that it’s actually pretty easy to fool a person into trusting something. The goal of interpretability for machine learning is the opposite of this. It is to tell you if a system is not safe to use. It’s about revealing the truth. So “trust” isn’t the right word.
So the point of interpretability is to reveal potential flaws in an AI’s reasoning?
Yes, exactly.
How can it expose flaws?
You can use TCAV to ask a trained model about irrelevant concepts. To return to the example of doctors using AI to make cancer predictions, the doctors might suddenly think, “It looks like the machine is giving positive predictions of cancer for a lot of images that have a kind of bluish color artifact. We don’t think that factor should be taken into account.” So if they get a high TCAV score for “blue,” they’ve just identified a problem in their machine-learning model.
TCAV is designed to bolt on to existing AI systems that aren’t interpretable. Why not make the systems interpretable from the beginning, rather than black boxes?
There is a branch of interpretability research that focuses on building inherently interpretable models that reflect how humans reason. But my take is this: Right now you have AI models everywhere that are already built, and are already being used for important purposes, without having considered interpretability from the beginning. It’s just the truth. We have a lot of them at Google! You could say, “Interpretability is so useful, let me build you another model to replace the one you already have.” Well, good luck with that.
So then what do you do? We still need to get through this critical moment of deciding whether this technology is good for us or not. That’s why I work on “post-training” interpretability methods. If you have a model that someone gave to you and that you can’t change, how do you go about generating explanations for its behavior so that you can use it safely? That’s what the TCAV work is about.
TCAV lets humans ask an AI if certain concepts matter to it. But what if we don’t know what to ask — what if we want the AI system to explain itself?
We have work that we’re writing up right now that can automatically discover concepts for you. We call it DTCAV — discovery TCAV. But I actually think that having humans in the loop, and enabling the conversation between machines and humans, is the crux of interpretability.
A lot of times in high-stakes applications, domain experts already have a list of concepts that they care about. We see this repeat over and over again in our medical applications at Google Brain. They don’t want to be given a set of concepts — they want to tell the model the concepts that they are interested in. We worked with a doctor who treats diabetic retinopathy, which is an eye disease, and when we told her about TCAV, she was so excited because she already had many, many hypotheses about what this model might be doing, and now she can test those exact questions. It’s actually a huge plus, and a very user-centric way of doing collaborative machine learning.
You believe that without interpretability, humankind might just give up on AI technology. Given how powerful it is, do you really think that’s a realistic possibility?
Yes, I do. That’s what happened with expert systems. [In the 1980s] we established that they were cheaper than human operators to conduct certain tasks. But who is using expert systems now? Nobody. And after that we entered an AI winter.
Right now it doesn’t seem likely, because of all the hype and money in AI. But in the long run, I think that humankind might decide — perhaps out of fear, perhaps out of lack of evidence — that this technology is not for us. It’s possible.