
How Do Machines ‘Grok’ Data?

By apparently overtraining them, researchers have seen neural networks discover novel solutions to problems.


Introduction

For all their brilliance, artificial neural networks remain as inscrutable as ever. As these networks get bigger, their abilities explode, but deciphering their inner workings has always been nearly impossible. Researchers are constantly looking for any insight they can get into how these models work.

A few years ago, they discovered a new one.

In January 2022, researchers at OpenAI, the company behind ChatGPT, reported that these systems, when accidentally allowed to munch on data for much longer than usual, developed unique ways of solving problems. Typically, when engineers build machine learning models out of neural networks — composed of units of computation called artificial neurons — they tend to stop the training at a certain point, when the model enters what’s called the overfitting regime. This is when the network basically begins memorizing its training data and often won’t generalize to new, unseen information. But when the OpenAI team accidentally trained a small network way beyond this point, it seemed to develop an understanding of the problem that went beyond simply memorizing — it could suddenly ace any test data.

The researchers named the phenomenon “grokking,” a term coined by science-fiction author Robert A. Heinlein to mean understanding something “so thoroughly that the observer becomes a part of the process being observed.” The overtrained neural network, designed to perform certain mathematical operations, had learned the general structure of the numbers and internalized the result. It had grokked and become the solution.

“This [was] very exciting and thought provoking,” said Mikhail Belkin of the University of California, San Diego, who studies the theoretical and empirical properties of neural networks. “It spurred a lot of follow-up work.”

Indeed, others have replicated the results and even reverse-engineered them. The most recent papers not only clarified what these neural networks are doing when they grok but also provided a new lens through which to examine their innards. “The grokking setup is like a good model organism for understanding lots of different aspects of deep learning,” said Eric Michaud of the Massachusetts Institute of Technology.

Peering inside this organism is at times quite revealing. “Not only can you find beautiful structure, but that beautiful structure is important for understanding what’s going on internally,” said Neel Nanda, now at Google DeepMind in London.

Beyond Limits

Fundamentally, the job of a machine learning model seems simple: Transform a given input into a desired output. It’s the learning algorithm’s job to look for the best possible function that can do that. Any given model can only access a limited set of functions, and that set is often dictated by the number of parameters in the model, which in the case of neural networks is roughly equivalent to the number of connections between artificial neurons.

Neel Nanda helped show how neural networks that had grokked modular arithmetic transformed the numbers using complicated mathematics.

As a network trains, it tends to learn more complex functions, and the discrepancy between the expected output and the actual one starts falling for training data. Even better, this discrepancy, known as loss, also starts going down for test data, which is new data not used in training. But at some point, the model starts to overfit, and while the loss on training data keeps falling, the test data’s loss starts to rise. So, typically, that’s when researchers stop training the network.
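To make that conventional picture concrete, here is a minimal sketch with stylized, synthetic loss curves (not data from any real training run): the training loss keeps falling, the test loss bottoms out and then climbs, and the usual practice is to stop at that turning point.

```python
import numpy as np

# Stylized, synthetic curves, not from any real model or dataset.
epochs = np.arange(1, 501)
train_loss = 1.0 / epochs                  # loss on seen data keeps falling
test_loss = 1.0 / epochs + 1e-3 * epochs   # loss on unseen data falls, then rises

# The conventional stopping point: where the test loss bottoms out,
# just before the overfitting regime begins.
stop_epoch = int(epochs[np.argmin(test_loss)])
print(f"stop training around epoch {stop_epoch}")
```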

That was the prevailing wisdom when the team at OpenAI began exploring how a neural network could do math. They were using a small transformer — a network architecture that’s recently revolutionized large language models — to do different kinds of modular arithmetic, in which you work with a limited set of numbers that loop back on themselves. Modulo 12, for example, can be done on a clock face: 11 + 2 = 1. The team showed the network examples of adding two numbers, a and b, to produce an output, c, in modulo 97 (equivalent to a clock face with 97 numbers). They then tested the transformer on unseen combinations of a and b to see if it could correctly predict c.
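As an illustration of the setup (a minimal sketch, not OpenAI’s actual training code, and the 30% split below is an arbitrary choice), the whole task can be written out as every modular-addition fact, with a portion held back as the unseen test combinations:

```python
import itertools
import random

p = 97  # the "clock face" has 97 numbers

# Every fact of the form (a + b) mod p, written as an input/output pair.
pairs = [((a, b), (a + b) % p) for a, b in itertools.product(range(p), repeat=2)]

random.seed(0)
random.shuffle(pairs)

# Show the network only a fraction of the facts; the held-out rest are the
# unseen combinations used to test whether it has generalized.
split = int(0.3 * len(pairs))
train_data, test_data = pairs[:split], pairs[split:]
```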

As expected, when the network entered the overfitting regime, the loss on the training data came close to zero (it had begun memorizing what it had seen), and the loss on the test data began climbing. It wasn’t generalizing. “And then one day, we got lucky,” said team leader Alethea Power, speaking in September 2022 at a conference in San Francisco. “And by lucky, I mean forgetful.”

The team member who was training the network went on vacation and forgot to stop the training. As this version of the network continued to train, it suddenly became accurate on unseen data. Automatic testing revealed this unexpected accuracy to the rest of the team, and they soon realized that the network had found clever ways of arranging the numbers a and b. Internally, the network represents the numbers in some high-dimensional space, but when the researchers projected these numbers down to 2D space and mapped them, the numbers formed a circle.

This was astonishing. The team never told the model it was doing modulo 97 math, or even what modulo meant — they just showed it examples of arithmetic. The model seemed to have stumbled upon some deeper, analytical solution — an equation that generalized to all combinations of a and b, even beyond the training data. The network had grokked, and the accuracy on test data shot up to 100%. “This is weird,” Power told her audience.

The team verified the results using different tasks and different networks. The discovery held up.

Of Clocks and Pizzas

But what was the equation the network had found? The OpenAI paper didn’t say, but the result caught Nanda’s attention. “One of the core mysteries and annoying things about neural networks is that they’re very good at what they do, but that by default, we have no idea how they work,” said Nanda, whose work focuses on reverse-engineering a trained network to figure out what algorithms it learned.

Nanda was fascinated by the OpenAI discovery, and he decided to pick apart a neural network that had grokked. He designed an even simpler version of the OpenAI neural network so that he could closely examine the model’s parameters as it learned to do modular arithmetic. He saw the same behavior: overfitting that gave way to generalization and an abrupt improvement in test accuracy. His network was also arranging numbers in a circle. It took some effort, but Nanda eventually figured out why.

While it was representing the numbers on a circle, the network wasn’t simply counting off digits like a kindergartner watching a clock: It was doing some sophisticated mathematical manipulations. By studying the values of the network’s parameters, Nanda and colleagues revealed that it was adding the clock numbers by performing “discrete Fourier transforms” on them — transforming the numbers using trigonometric functions such as sines and cosines and then manipulating these values using trigonometric identities to arrive at the solution. At least, this was what his particular network was doing.
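Here is a minimal sketch of that kind of trigonometric bookkeeping, written out for a single Fourier frequency; the trained networks spread the calculation across several frequencies and, of course, store it in weights rather than Python.

```python
import numpy as np

p = 97  # the modulus from the original experiments

def clock_sum(a, b, k=1):
    # Represent each number by its angle on a circle, at one Fourier frequency k.
    ca, sa = np.cos(2 * np.pi * k * a / p), np.sin(2 * np.pi * k * a / p)
    cb, sb = np.cos(2 * np.pi * k * b / p), np.sin(2 * np.pi * k * b / p)

    # Angle-addition identities: these products are the cosine and sine of
    # 2*pi*k*(a + b)/p. Rotating by a and then by b lands on a + b, and the
    # wrap-around of the circle performs the "mod" automatically.
    cos_ab = ca * cb - sa * sb
    sin_ab = sa * cb + ca * sb

    # Read off the answer: the candidate c whose angle best matches a + b's.
    scores = [cos_ab * np.cos(2 * np.pi * k * c / p) +
              sin_ab * np.sin(2 * np.pi * k * c / p) for c in range(p)]
    return int(np.argmax(scores))
```

For instance, clock_sum(95, 4) returns 2, matching (95 + 4) mod 97.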

When a team at MIT followed up on Nanda’s work, they showed that the grokking neural networks don’t always discover this “clock” algorithm. Sometimes, the networks instead find what the researchers call the “pizza” algorithm. This approach imagines a pizza divided into slices and numbered in order. To add two numbers, imagine drawing arrows from the center of the pizza to the numbers in question, then calculating the line that bisects the angle formed by the first two arrows. This line passes through the middle of some slice of the pizza: The number of the slice is the sum of the two numbers. These operations can also be written down in terms of trigonometric and algebraic manipulations of the sines and cosines of a and b, and they’re theoretically just as accurate as the clock approach.
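The geometry can be written down in the same spirit. In this rough sketch (one way to make the bisector idea concrete; the published pizza algorithm’s bookkeeping differs in its details), adding the two arrows gives the bisecting line, and doubling its angle removes the line’s two-ended ambiguity:

```python
import numpy as np

p = 97  # the modulus from the original experiments

def pizza_sum(a, b):
    # Arrows from the center of the pizza to the two numbers.
    va = np.array([np.cos(2 * np.pi * a / p), np.sin(2 * np.pi * a / p)])
    vb = np.array([np.cos(2 * np.pi * b / p), np.sin(2 * np.pi * b / p)])

    # The sum of the two arrows points along the line bisecting the angle
    # between them (possibly out the "wrong" end of that line).
    bisector = va + vb
    phi = np.arctan2(bisector[1], bisector[0])

    # Doubling the angle erases the two-ended ambiguity: 2*phi equals
    # 2*pi*(a + b)/p up to full turns, so it names the slice holding the sum.
    return round(2 * phi * p / (2 * np.pi)) % p
```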

Ziming Liu followed up on Nanda’s work and discovered a second way that grokking networks were processing data, along with several other approaches that didn’t seem to have interpretations.

“Both [the] clock and pizza algorithms have this circular representation,” said Ziming Liu, a member of the MIT team. “But … how they leverage these sines and cosines are different. That’s why we call them different algorithms.”

And that still wasn’t all. After training numerous networks to do modulo math, Liu and colleagues discovered that about 40% of algorithms discovered by these networks were varieties of the pizza or clock algorithms. The team hasn’t been able to decipher what the networks are doing the rest of the time. For the pizza and clock algorithms, “it just happens that it finds something we humans can interpret,” Liu said.

And whatever algorithm a network learns when it groks a problem, it’s even more powerful at generalization than researchers suspected. When a team at the University of Maryland fed a simple neural network training data with random errors, the network at first behaved as expected: It overfit the training data, errors and all, and performed poorly on uncorrupted test data. However, once the network grokked and began answering the test questions correctly, it produced correct answers even for the corrupted entries, forgetting the memorized wrong answers and generalizing even to its own training data. “The grokking task is actually quite robust to these kinds of corruptions,” said Darshil Doshi, one of the paper’s authors.

Battle for Control

Thanks to results like these, researchers are now beginning to understand the process that leads up to a network grokking its data. Nanda sees the apparent outward suddenness of grokking as the outcome of a gradual internal transition from memorization to generalization, which rely on two different algorithms inside the neural network. When a network begins learning, he said, it first figures out the memorizing algorithm, which is easier to learn but requires considerable resources, since the network needs to store each instance of the training data. Even as it is memorizing, though, parts of the neural network start forming circuits that implement the general solution. The two algorithms compete for resources during training, but generalization eventually wins out if the network is trained with an additional ingredient called regularization.

“Regularization slowly drifts the solution toward the generalization solution,” said Liu. This is a process that reduces the model’s functional capacity — the complexity of the function that the model can learn. As regularization prunes the model’s complexity, the generalizing algorithm, which is less complex, eventually triumphs. “Generalization is simpler for the same [level of] performance,” said Nanda. Finally, the neural network discards the memorizing algorithm.
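Weight decay is one common form of the regularization used in this line of work. Here is a minimal sketch of how it enters a plain gradient step (the actual studies use fancier optimizers, and the hyperparameter values are purely illustrative):

```python
import numpy as np

def sgd_step(weights, grads, lr=0.01, weight_decay=1.0):
    # An ordinary gradient step plus weight decay: the extra term keeps pulling
    # every parameter toward zero, steadily pruning away excess complexity and
    # favoring the simpler, generalizing circuit over rote memorization.
    return weights - lr * (grads + weight_decay * weights)

w = np.array([0.5, -2.0, 3.0])
g = np.zeros_like(w)   # even with zero gradient...
w = sgd_step(w, g)     # ...the weights shrink toward zero
```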

So, while the delayed ability to generalize seems to emerge suddenly, internally the network’s parameters are steadily learning the generalizing algorithm. It’s only when the network has both learned the generalizing algorithm and completely removed the memorizing algorithm that you get grokking. “It’s possible for things that seem sudden to actually be gradual under the surface,” Nanda said — an issue that has also come up in other machine learning research.

Despite these breakthroughs, it’s important to remember that grokking research is still in its infancy. So far, researchers have studied only extremely small networks, and it’s not clear if these findings will hold with bigger, more powerful networks. Belkin also cautions that modular arithmetic is “a drop in the ocean” compared with all the different tasks being done by today’s neural networks. Reverse-engineering a neural network’s solution for such math might not be enough to understand the general principles that drive these networks toward generalization. “It’s great to study the trees,” Belkin said. “But we also have to study the forest.”

Nonetheless, the ability to peer inside these networks and understand them analytically has huge implications. For most of us, Fourier transforms and bisecting arcs of circles are a very weird way to do modulo addition — human neurons just don’t think like that. “But if you’re built out of linear algebra, it actually makes a lot of sense to do it like this,” said Nanda.

“These weird [artificial] brains work differently from our own,” he said. “[They] have their own rules and structure. We need to learn to think how a neural network thinks.”
