Introduction
You probably can’t remember what it feels like to play Super Mario Bros. for the very first time, but try to picture it. An 8-bit game world blinks into being: baby blue sky, tessellated stone ground, and in between, a squat, red-suited man standing still — waiting. He’s facing rightward; you nudge him farther in that direction. A few more steps reveal a row of bricks hovering overhead and what looks like an angry, ambulatory mushroom. Another twitch of the game controls makes the man spring up, his four-pixel fist pointed skyward. What now? Maybe try combining nudge-rightward and spring-skyward? Done. Then, a surprise: The little man bumps his head against one of the hovering bricks, which flexes upward and then snaps back down as if spring-loaded, propelling the man earthward onto the approaching angry mushroom and flattening it instantly. Mario bounces off the squished remains with a gentle hop. Above, copper-colored boxes with glowing “?” symbols seem to ask: What now?
This scene will sound familiar to anyone who grew up in the 1980s, but you can watch a much younger player on Pulkit Agrawal’s YouTube channel. Agrawal, a computer science researcher at the University of California, Berkeley, is studying how innate curiosity can make learning an unfamiliar task — like playing Super Mario Bros. for the very first time — more efficient. The catch is that the novice player in Agrawal’s video isn’t human, or even alive. Like Mario, it’s just software. But this software comes equipped with experimental machine-learning algorithms designed by Agrawal and his colleagues Deepak Pathak, Alexei A. Efros and Trevor Darrell at the Berkeley Artificial Intelligence Research Lab for a surprising purpose: to make a machine curious.
“You can think of curiosity as a kind of reward which the agent generates internally on its own, so that it can go explore more about its world,” Agrawal said. This internally generated reward signal is known in cognitive psychology as “intrinsic motivation.” The feeling you may have vicariously experienced while reading the game-play description above — an urge to reveal more of whatever’s waiting just out of sight, or just beyond your reach, just to see what happens — that’s intrinsic motivation.
Humans also respond to extrinsic motivations, which originate in the environment. Examples of these include everything from the salary you receive at work to a demand delivered at gunpoint. Computer scientists apply a similar approach called reinforcement learning to train their algorithms: The software gets “points” when it performs a desired task, while penalties follow unwanted behavior.
But this carrot-and-stick approach to machine learning has its limits, and artificial intelligence researchers are starting to view intrinsic motivation as an important component of software agents that can learn efficiently and flexibly — that is, less like brittle machines and more like humans and animals. Approaches to using intrinsic motivation in AI have taken inspiration from psychology and neurobiology — not to mention decades-old AI research itself, now newly relevant. (“Nothing is really new in machine learning,” said Rein Houthooft, a research scientist at OpenAI, an independent artificial intelligence research organization.)
Such agents may be trained on video games now, but the impact of developing meaningfully “curious” AI would transcend any novelty appeal. “Pick your favorite application area and I’ll give you an example,” said Darrell, co-director of the Berkeley Artificial Intelligence lab. “At home, we want to automate cleaning up and organizing objects. In logistics, we want inventory to be moved around and manipulated. We want vehicles that can navigate complicated environments and rescue robots that can explore a building and find people who need rescuing. In all of these cases, we are trying to figure out this really hard problem: How do you make a machine that can figure its own task out?”
The Problem With Points
Reinforcement learning is a big part of what helped Google’s AlphaGo software beat the world’s best human player at Go, an ancient and intuitive game long considered invulnerable to machine learning. The details of successfully using reinforcement learning in a particular domain are complex, but the general idea is simple: Give a learning algorithm, or “agent,” a reward function, a mathematically defined signal to seek out and maximize. Then set it loose in an environment, which could be any real or virtual world. As the agent operates in the environment, actions that increase the value of the reward function get reinforced. With enough repetition — and if there’s anything that computers are better at than people, it’s repetition — the agent learns patterns of action, or policies, that maximize its reward function. Ideally, these policies will result in the agent reaching some desirable end state (like “win at Go”), without a programmer or engineer having to hand-code every step the agent needs to take along the way.
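To make that loop concrete, here is a minimal, purely illustrative sketch in Python: a tabular Q-learning agent in a made-up “corridor” environment whose only extrinsic reward is a point for reaching the far end. The environment, the reward values and the learning constants are all invented for the example; they are not drawn from any system described in this article.

```python
import random

class Corridor:
    """Toy environment: a 10-cell hallway; reaching the last cell earns one 'point.'"""
    LENGTH = 10

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):                       # action: 0 = step left, 1 = step right
        self.pos = min(self.LENGTH - 1, max(0, self.pos + (1 if action == 1 else -1)))
        done = self.pos == self.LENGTH - 1
        return self.pos, (1.0 if done else 0.0), done   # extrinsic reward: the "points"

env = Corridor()
q = [[0.0, 0.0] for _ in range(Corridor.LENGTH)]  # learned value of each action in each cell

for _ in range(300):                              # repetition: the part computers excel at
    state, done, steps = env.reset(), False, 0
    while not done and steps < 200:
        # Mostly exploit the best-looking action; on ties, or occasionally, explore at random.
        if random.random() < 0.2 or q[state][0] == q[state][1]:
            action = random.randrange(2)
        else:
            action = 0 if q[state][0] > q[state][1] else 1
        next_state, reward, done = env.step(action)
        # Reinforce: nudge this action's value toward reward plus discounted future value.
        q[state][action] += 0.1 * (reward + 0.9 * max(q[next_state]) - q[state][action])
        state, steps = next_state, steps + 1

print(["right" if q[s][1] >= q[s][0] else "left" for s in range(Corridor.LENGTH)])
```

After enough episodes, the values stored in q amount to a policy (“step right, everywhere”) that no programmer ever had to spell out step by step.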
In other words, a reward function is the guidance system that keeps a reinforcement-learning-powered agent locked on target. The more clearly that target is defined, the better the agent performs — that is why so many agents are currently tested on old video games, which often provide simple extrinsic reward schemes based on points. (The blocky, two-dimensional graphics are useful, too: Researchers can run and repeat their experiments quickly because the games are relatively simple to emulate.)
Yet “in the real world, there are no points,” said Agrawal. Computer scientists want to have their creations explore novel environments that don’t come preloaded with quantifiable objectives.
In addition, if the environment doesn’t supply extrinsic rewards quickly and regularly enough, the agent “has no clue whether it’s doing something right or wrong,” Houthooft said. Like a heat-seeking missile unable to lock onto a target, “it doesn’t have any way of [guiding itself through] its environment, so it just goes haywire.”
Moreover, even painstakingly defined extrinsic reward functions that can guide an agent to display impressively intelligent behavior — like AlphaGo’s ability to best the world’s top human Go player — won’t easily transfer or generalize to any other context without extensive modification. And that work must be done by hand, which is precisely the kind of labor that machine learning is supposed to help us sidestep in the first place.
Instead of a battery of pseudo-intelligent agents that can reliably hit specified targets like those missiles, what we really want from AI is more like an internal piloting ability. “You make your own rewards, right?” Agrawal said. “There’s no god constantly telling you ‘plus one’ for doing this or ‘minus one’ for doing that.”
Curiosity as Co-Pilot
Deepak Pathak never set out to model anything as airily psychological as curiosity in code. “The word ‘curiosity’ is nothing but saying, ‘a model which leads an agent to efficiently explore its environment in the presence of noise,’” said Pathak, a researcher in Darrell’s lab at Berkeley and the lead author of the recent work.
But in 2016, Pathak was interested in the sparse-rewards problem for reinforcement learning. Deep-learning software, powered by reinforcement learning techniques, had recently made significant gains in playing simple score-driven Atari games like Space Invaders and Breakout. But even slightly more complex games like Super Mario Bros. — which require navigating toward a goal distant in time and space without constant rewards, not to mention an ability to learn and successfully execute composite moves like running and jumping at the same time — were still beyond an AI’s grasp.
Pathak and Agrawal, working with Darrell and Efros, equipped their learning agent with what they call an intrinsic curiosity module (ICM) designed to pull it forward through the game without going haywire (to borrow Houthooft’s term). The agent, after all, has absolutely no prior understanding of how to play Super Mario Bros. — in fact, it’s less like a novice player and more like a newborn infant.
Indeed, Agrawal and Pathak took inspiration from the work of Alison Gopnik and Laura Schulz, developmental psychologists at Berkeley and at the Massachusetts Institute of Technology, respectively, who showed that babies and toddlers are naturally drawn to play with objects that surprise them the most, rather than with objects that are useful to achieving some extrinsic goal. “One way to [explain] this kind of curiosity in children is that they build a model of what they know about the world, and then they conduct experiments to learn more about what they don’t know,” Agrawal said. These “experiments” can be anything that generates an outcome which the agent (in this case, an infant) finds unusual or unexpected. The child might start with random limb movements that cause new sensations (known as “motor babbling”), then progress up to more coordinated behaviors like chewing on a toy or knocking over a pile of blocks to see what happens.
In Pathak and Agrawal’s machine-learning version of this surprise-driven curiosity, the AI first mathematically represents what the current video frame of Super Mario Bros. looks like. Then it predicts what the game will look like several frames hence. Such a feat is well within the powers of current deep-learning systems. But then Pathak and Agrawal’s ICM does something more: It generates an intrinsic reward signal defined by how wrong that prediction turns out to be. The larger the prediction error — that is, the more surprised the agent is — the higher the value of its intrinsic reward function. In other words, since a surprise is just a prediction that didn’t pan out, Pathak and Agrawal’s system is, in effect, rewarded for being wrong.
This internally generated signal draws the agent toward unexplored states in the game: informally speaking, it gets curious about what it doesn’t yet know. And as the agent learns — that is, as its prediction model becomes less and less wrong — its reward signal from the ICM decreases, freeing the agent up to maximize the reward signal by exploring other, more surprising situations. “It’s a way to make exploration go faster,” Pathak said.
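In code, the heart of that reward is just a prediction error. The sketch below is a deliberately tiny stand-in, assuming a two-number “state,” a single action channel and a hand-rolled linear forward model in place of the deep networks the Berkeley group actually trains; the dynamics and constants are invented for illustration.

```python
import numpy as np

class ForwardModelCuriosity:
    """Intrinsic reward = how badly a learned forward model predicts the next state."""
    def __init__(self, state_dim, action_dim, lr=0.05):
        self.W = np.zeros((state_dim, state_dim + action_dim))  # tiny linear forward model
        self.lr = lr

    def reward_and_learn(self, state, action, next_state):
        x = np.concatenate([state, action])       # model input: current state plus action taken
        error = next_state - self.W @ x           # surprise: actual outcome minus prediction
        self.W += self.lr * np.outer(error, x)    # train the model to be less wrong next time
        return float(error @ error)               # the intrinsic reward handed to the agent

# Made-up deterministic dynamics standing in for "what the game looks like a few frames later."
A = np.array([[0.95, -0.1], [0.1, 0.95]])
b = np.array([0.3, 0.0])

curiosity = ForwardModelCuriosity(state_dim=2, action_dim=1)
state = np.array([1.0, 0.0])
for step in range(201):
    next_state = A @ state + b                    # predictable, and therefore eventually boring
    r = curiosity.reward_and_learn(state, np.array([1.0]), next_state)
    if step % 100 == 0:
        print(step, round(r, 4))                  # the curiosity reward shrinks as the model improves
    state = next_state
```

Because the toy transition is deterministic, the model eventually nails it, and the printed reward falls toward zero: the same decay that frees a curious agent to go hunting for new surprises.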
This feedback loop also allows the AI to quickly bootstrap itself out of a nearly blank-slate state of ignorance. At first, the agent is curious about any basic movement available to its onscreen body: Pressing right nudges Mario to the right, and then he stops; pressing right several times in a row makes Mario move without immediately stopping; pressing up makes him spring into the air, and then come down again; pressing down has no effect. This simulated motor babbling quickly converges on useful actions that move the agent forward into the game, even though the agent has no idea that making progress is the point.
For example, since pressing down always has the same effect — nothing — the agent quickly learns to perfectly predict the effect of that action, which cancels the curiosity-supplied reward signal associated with it. Pressing up, however, has all kinds of unpredictable effects: Sometimes Mario goes straight up, sometimes in an arc; sometimes he takes a short hop, other times a long jump; sometimes he doesn’t come down again (if, say, he happens to land on top of an obstacle). All of these outcomes register as errors in the agent’s prediction model, resulting in a reward signal from the ICM, which makes the agent keep experimenting with that action. Moving to the right (which almost always reveals more game world) has similar curiosity-engaging effects. The impulse to move up and to the right can clearly be seen in Agrawal’s demo video: Within seconds, the AI-controlled Mario starts hopping rightward like a hyperactive toddler, causing ever-more-unpredictable effects (like bumping against a hovering brick, or accidentally squishing a mushroom), all of which drive further exploration.
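The asymmetry between pressing down and pressing up is easy to reproduce with an even simpler toy. Everything below (the outcome numbers, the running-average predictor) is invented for illustration rather than taken from the Berkeley system.

```python
import random

random.seed(0)
predicted = {"down": 3.0, "up": 3.0}     # the agent's current guess at each action's outcome
recent_reward = {"down": [], "up": []}

for step in range(1000):
    for action in ("down", "up"):
        if action == "down":
            outcome = 0.0                              # pressing down never does anything
        else:
            outcome = random.choice([1.0, 2.0, 5.0])   # jumps vary: short hop, arc, long leap
        surprise = abs(outcome - predicted[action])    # intrinsic reward = prediction error
        recent_reward[action].append(surprise)
        predicted[action] += 0.1 * (outcome - predicted[action])   # revise the guess

print("leftover curiosity about 'down':", round(sum(recent_reward["down"][-100:]) / 100, 2))  # ~0
print("leftover curiosity about 'up':  ", round(sum(recent_reward["up"][-100:]) / 100, 2))    # stays well above zero
```

The surprise attached to the perfectly predictable action melts away, while the erratic one keeps paying out, so a curiosity-driven agent keeps pressing up.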
“By using this curiosity, the agent learns how to do all the things it needs to explore the world, like jump and kill enemies,” explained Agrawal. “It doesn’t even get penalized for dying. But it learns to avoid dying, because not-dying maximizes its exploration. It’s reinforcing itself, not getting reinforcement from the game.”
Avoiding the Novelty Trap
Artificial curiosity has been a subject of AI research since at least the early 1990s. One way of formalizing curiosity in software centers on novelty-seeking: The agent is programmed to explore unfamiliar states in its environment. This broad definition seems to capture an intuitive understanding of the experience of curiosity, but in practice it can cause the agent to become trapped in states that satisfy its built-in incentive but prevent any further exploration.
For example, imagine a television displaying nothing but static on its screen. Such a thing would quickly engage the curiosity of a purely novelty-seeking agent, because a square of randomly flickering visual noise is, by definition, totally unpredictable from one moment to the next. Since every pattern of static appears entirely novel to the agent, its intrinsic reward function will ensure that it can never cease paying attention to this single, useless feature of the environment — and it becomes trapped.
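A few lines of toy code show why. Against pure noise, no amount of model updating makes the prediction error (and hence a raw novelty-based reward) go away; the 8-by-8 “screen” and the running-average predictor here are, again, invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def late_surprise(next_frame):
    """Average prediction error over the last 100 of 2,000 frames, for a given frame source."""
    prediction, errors = np.zeros(64), []
    for _ in range(2000):
        frame = next_frame()
        errors.append(float(np.mean((frame - prediction) ** 2)))
        prediction += 0.1 * (frame - prediction)          # keep updating the guess
    return round(float(np.mean(errors[-100:])), 3)

wall = rng.random(64)                                     # an ordinary, unchanging scene
print("fixed image:", late_surprise(lambda: wall))        # ~0.0: curiosity dries up
print("TV static:  ", late_surprise(lambda: rng.random(64)))  # never goes away: the trap
```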
It turns out that this type of pointless novelty is ubiquitous in the kind of richly featured environments — virtual or physical — that AI must learn to cope with to become truly useful. For example, a self-driving delivery vehicle equipped with a novelty-seeking intrinsic reward function might never make it past the end of the block. “Say you’re moving along a street and the wind is blowing and the leaves of a tree are moving,” Agrawal said. “It’s very, very hard to predict where every leaf is going to go. If you’re predicting pixels, these kinds of interactions will cause you to have high prediction errors, and make you very curious. We want to avoid that.”
Agrawal and Pathak had to come up with a way to keep their agent curious, but not too curious. Predicting pixels — that is, using deep learning and computer vision to model an agent’s visual field in its entirety from moment to moment — makes it hard to filter out potential distractions. It’s computationally expensive, too.
So instead, the Berkeley researchers engineered their Mario-playing agent to translate its visual input from raw pixels into an abstracted version of reality. This abstraction incorporates only features of the environment that have the potential to affect the agent (or that the agent can influence). In essence, if the agent can’t interact with a thing, it won’t even be perceived in the first place.
Using this stripped-down “feature space” (versus the unprocessed “pixel space”) not only simplifies the agent’s learning process, it also neatly sidesteps the novelty trap. “The agent can’t get any benefit out of modeling, say, clouds moving overhead, to predict the effects of its actions,” explained Darrell. “So it’s just not going to pay attention to the clouds when it’s being curious. The previous versions of curiosity — at least some of them — were really only considering pixel-level prediction. Which is great, except for when you suddenly pass a very unpredictable but very boring thing.”
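One way to learn that kind of filter, and roughly the route the Berkeley work takes (though with deep networks rather than anything this simple), is to train a model to guess which action the agent took from how its observation changed, then keep only the features that help with the guess. In the invented sketch below, the observation has one coordinate the agent controls and three “fluttering leaf” noise channels; a bare-bones logistic-regression inverse model ends up putting nearly all of its weight on the controllable part.

```python
import numpy as np

rng = np.random.default_rng(0)
w, b, lr = np.zeros(4), 0.0, 0.1      # inverse model: guess the action from the change in observation

agent_x = 0.0
for step in range(5000):
    action = int(rng.integers(0, 2))              # 0 = press left, 1 = press right, chosen at random
    leaves_before = rng.random(3)                 # scenery the agent cannot influence
    leaves_after = rng.random(3)
    new_x = agent_x + (0.5 if action == 1 else -0.5)
    obs_change = np.concatenate([[new_x - agent_x], leaves_after - leaves_before])
    # Logistic-regression update: push the model toward recovering the action from obs_change.
    p = 1.0 / (1.0 + np.exp(-(w @ obs_change + b)))
    w -= lr * (p - action) * obs_change
    b -= lr * (p - action)
    agent_x = new_x

print("weight on the agent's own motion:", round(float(w[0]), 2))   # large: this feature reveals the action
print("weights on the 'leaves':         ", np.round(w[1:], 2))      # far smaller: pure noise gets ignored
```

A curiosity module that measures surprise only in those action-relevant features has nothing to gain from drifting clouds or rustling leaves, which is exactly Darrell’s point.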
The Limits of Artificial Curiosity
Darrell conceded that this model of curiosity isn’t perfect. “The system learns what’s relevant, but there’s no guarantee it’ll always get it right,” he said. Indeed, the agent makes it only about halfway through the first level of Super Mario Bros. before getting trapped in its own peculiar local optimum. “There’s this big gap which the agent has to jump across, which requires executing 15 or 16 continuous actions in a very, very specific order,” Agrawal said. “Because it is never able to jump this gap, it dies every time by going there. And when it learns to perfectly predict this outcome, it stops becoming curious about going any further in the game.” (In the agent’s defense, Agrawal notes that this flaw emerges because the AI can press its simulated directional controls only in discrete intervals, which makes certain moves impossible.)
Ultimately, the problem with artificial curiosity is that even researchers who have studied intrinsic motivation for years still can’t precisely define what curiosity is. Paul Schrater, a neuroscientist who leads the Computational Perception and Action Lab at the University of Minnesota, said that the Berkeley model “is the most intelligent thing to do in the short term to get an agent to automatically learn a novel environment,” but he thinks it has less to do with “the intuitive concept of curiosity” than with motor learning and control. “It’s controlling things that are beneath cognition, and more in the details of what the body does,” he said.
To Schrater, the Berkeley team’s novel idea comes in attaching their intrinsic curiosity module to an agent that perceives Super Mario Bros. as a feature space rather than as sequential frames of pixels. He argues that this approach may roughly approximate the way our own brains “extract visual features that are relevant for a particular kind of task.”
Curiosity may also require an agent to be at least somewhat embodied (virtually or physically) within an environment to have any real meaning, said Pierre-Yves Oudeyer, a research director at Inria in Bordeaux, France. Oudeyer has been creating computational models of curiosity for over a decade. He pointed out that the world is so large and rich that an agent can find surprises everywhere. But that alone isn’t enough. “If you’ve got a disembodied agent using curiosity to explore a large feature space, its behavior is going to just end up looking like random exploration because it doesn’t have any constraints on its actions,” Oudeyer said. “The constraints of, for example, a body enable a simplification of the world.” They focus the attention and help to guide exploration.
But not all embodied agents need intrinsic motivation, either — as the history of industrial robotics makes clear. For tasks that are simpler to specify — say, shuttling cargo from place to place using a robot that follows a yellow line painted on the floor — adding curiosity to the mix would be machine-learning overkill.
“You could just give that kind of agent a perfect reward function — everything it needs to know in advance,” Darrell explained. “We could solve that problem 10 years ago. But if you’re putting a robot in a situation that can’t be modeled in advance, like disaster search-and-rescue, it has to go out and learn to explore on its own. That’s more than just mapping — it has to learn the effects of its own actions in the environment. You definitely want an agent to be curious when it’s learning how to do its job.”
AI is often informally defined as “whatever computers can’t do yet.” If intrinsic motivation and artificial curiosity are methods for getting agents to figure out tasks that we don’t already know how to automate, then “that’s something I’m pretty sure we’d want any AI to have,” said Houthooft, the OpenAI researcher. “The difficulty is in tuning it.” Agrawal and Pathak’s Mario-playing agent may not be able to get past World 1-1 on its own. But that’s probably what tuning curiosity — artificial or otherwise — will look like: a series of baby steps.