Artificial Intelligence

To Make Language Models Work Better, Researchers Sidestep Language

We insist that large language models repeatedly translate their mathematical processes into words. There may be a better way.

Myriam Wares for Quanta Magazine

Introduction

Language isn’t always necessary. While it certainly helps in getting across certain ideas, some neuroscientists have argued that many forms of human thought and reasoning don’t require the medium of words and grammar. Sometimes, the argument goes, having to turn ideas into language actually slows down the thought process.

Now there’s intriguing evidence that certain artificial intelligence systems could also benefit from “thinking” independently of language.

When large language models (LLMs) process information, they do so in mathematical spaces, far from the world of words. That’s because LLMs are built using deep neural networks, which essentially transform one sequence of numbers into another — they’re effectively complicated math functions. Researchers call the numerical universe in which these calculations take place a latent space.

But these models must often leave the latent space for the much more constrained one of individual words. This can be expensive, since it requires extra computational resources to convert the neural network’s latent representations of various concepts into words. This reliance on filtering concepts through the sieve of language can also result in a loss of information, just as digitizing a photograph inevitably means losing some of the definition in the original. “A lot of researchers are curious,” said Mike Knoop, co-creator of one of the leading benchmarks for testing abstract reasoning in AI models. “Can you do reasoning purely in latent space?”

Two recent papers suggest that the answer may be yes. In them, researchers introduce deep neural networks that allow language models to continue thinking in mathematical spaces before producing any text. While still fairly basic, these models are more efficient and reason better than their standard alternatives.

“It’s an exciting new research direction,” said Luke Zettlemoyer, a computer scientist and natural language processing expert at the University of Washington who wasn’t involved in either paper.

Token Gesture

To understand why LLMs might be constrained by language, we first need to take a look inside them. Most modern models use a type of neural network known as a transformer, which processes a stream of text all at once rather than piece by piece. It has proved astonishingly adept at helping a language model predict the next likely word given some text, and generate surprisingly realistic writing as a result.

However, transformers don’t work with words directly. They use pieces of text, called tokens. These can be whole words, word fragments or even single characters.
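
To make that concrete, here is a hand-made illustration of how a short sentence might be split into tokens; the split and the numeric IDs are invented for the example, not taken from any real tokenizer.

```python
# A hand-made illustration of tokenization; the split and the IDs below are
# invented for this example, not produced by any real tokenizer.
text = "Transformers tokenize text."
tokens = ["Transform", "ers", " token", "ize", " text", "."]  # words, fragments, punctuation
token_ids = [3098, 364, 11241, 1096, 2420, 13]                # hypothetical numeric IDs
assert "".join(tokens) == text                                # the pieces reassemble the text
```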

Here’s how these models typically work. When a user queries an LLM, an algorithm breaks that input text into a sequence of tokens. The model then converts each token into a string of numbers called an embedding, fodder for the underlying mathematical machinery. An input of 10 tokens results in 10 embeddings, for example. The transformer then processes these embeddings through its various components, called layers. Each layer feeds its results into the next layer, gradually connecting each embedding to every other embedding. The final layer puts all this information together to generate one final set of embeddings. The last embedding in this sequence is called a hidden state — “hidden” because it’s not exposed to the outside world. This hidden state contains all the relevant information needed for the model to predict the most likely next token, or word, to follow the initial input sequence of tokens.
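
The sketch below walks through that pipeline in miniature: embeddings, a stack of layers, a hidden state, and a prediction. It is an illustration rather than any particular production model; the vocabulary size, embedding width and layer count are assumptions, and a real model would also apply a causal attention mask and positional information, which are omitted here.

```python
# A minimal, illustrative sketch of the pipeline described above (not any
# particular production model). Sizes are assumptions; a real LLM would also
# use a causal attention mask and positional encodings, omitted for brevity.
import torch
import torch.nn as nn

VOCAB_SIZE, EMBED_DIM, NUM_LAYERS = 50_000, 768, 12

embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)      # token ID -> embedding (a string of numbers)
layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=EMBED_DIM, nhead=12, batch_first=True)
    for _ in range(NUM_LAYERS)
)
unembed = nn.Linear(EMBED_DIM, VOCAB_SIZE)       # hidden state -> scores over every token

token_ids = torch.randint(0, VOCAB_SIZE, (1, 10))     # stand-in for 10 input tokens
x = embed(token_ids)                                  # 10 tokens become 10 embeddings
for layer in layers:                                  # each layer feeds its results to the next
    x = layer(x)
hidden_state = x[:, -1, :]                            # the last embedding is the hidden state
next_token_id = unembed(hidden_state).argmax(dim=-1)  # most likely next token
```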

This is only the start of the process. This predicted token is added to the end of the initial input sequence, and the new set of tokens is fed back into the network. The transformer then processes it as above and ultimately produces one more token — which is appended to the most recent input and sent back in again. This continues until the network produces an end-of-text token, a signal that the process is complete.
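
A loop like the hypothetical one below captures that cycle. Here `predict_next_token` is only a placeholder for a full pass through the network, and the token IDs are made up.

```python
# A sketch of the generation loop described above. The token IDs are made up,
# and predict_next_token is a placeholder for a full pass through the network.
END_OF_TEXT = 0  # the ID of the end-of-text token (model-specific; assumed here)

def predict_next_token(token_ids: list[int]) -> int:
    """Placeholder for embedding the tokens and running them through the transformer."""
    return END_OF_TEXT if len(token_ids) >= 6 else 100 + len(token_ids)  # dummy rule

token_ids = [3098, 364, 11241]               # the user's query, already tokenized
while True:
    next_id = predict_next_token(token_ids)  # one full pass through the network
    token_ids.append(next_id)                # append the prediction to the input...
    if next_id == END_OF_TEXT:               # ...and stop once it emits end-of-text
        break
```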

Crucially, today’s LLMs are trained to produce an extended sequence of tokens designed to mimic their thought process before producing the final answer. For example, given a math problem, an LLM can generate numerous tokens that show the steps it took to get the answer. Researchers call the tokens leading up to the answer the LLM’s “chain of thought.” Producing it not only helps researchers understand what the model is doing, but also makes the model much more accurate.
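
As a hand-written illustration (not actual model output), a chain of thought for a simple word problem might look like this, with the reasoning spelled out before the answer.

```python
# A hand-written illustration of a chain of thought (not actual model output):
# the tokens leading up to the answer spell out intermediate steps.
prompt = "If apples cost $2 each, how much do 3 apples cost?"
chain_of_thought = "Each apple costs $2. Three apples cost 3 * $2 = $6."
answer = "$6"
print(chain_of_thought, answer)
```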

The approach has proved tremendously effective, as evidenced by the power of modern LLMs. But it also means that an LLM must convert token embeddings into a hidden state and then back into token embeddings over and over. This back-and-forth creates a logjam, resulting in inefficiency and possibly a loss of information. “If we want to reason in a latent space, we want to skip this step,” said Shibo Hao, a graduate student at the University of California, San Diego. That’s just what he and his team did.
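
In code, that round trip might look roughly like the hypothetical fragment below: a rich, continuous hidden state is collapsed to a single token ID, then re-embedded before the next pass. The names and sizes are assumptions for illustration.

```python
# A rough, hypothetical sketch of the round trip described above. The names
# unembed and embed stand in for a model's output and input layers; sizes
# are illustrative.
import torch
import torch.nn as nn

VOCAB_SIZE, EMBED_DIM = 50_000, 768
unembed = nn.Linear(EMBED_DIM, VOCAB_SIZE)    # latent space -> scores over ~50,000 tokens
embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)   # chosen token -> back into latent space

hidden_state = torch.randn(1, EMBED_DIM)      # rich, continuous representation
token_id = unembed(hidden_state).argmax(-1)   # collapsed to one discrete token
next_input = embed(token_id)                  # re-embedded; the rest of the detail is gone
```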

Don’t Verbalize

During an internship at Meta last year, Hao and his colleagues wanted to see if they could build an LLM that reasons mostly in latent space. They started with a standard version of GPT-2, an early LLM that OpenAI had already made public. It was a relatively small model, with only 124 million parameters, the internal variables set during training that determine how the model behaves.

Shibo Hao helped build an LLM, called Coconut, that avoids having to constantly turn mathematical information into words.

Yi Gu

Hao’s team focused on the crucial point in the process where the hidden state, generated by the final transformer layer, gets converted into a token. The conversion causes the information to descend from the infinite possibilities of continuous numbers to the limited vocabulary of, in this case, GPT-2’s 50,000 or so tokens. The team altered the model to skip this step, feeding the hidden state directly back in as the next input embedding, which then passes through the transformer’s layers again.

Now the LLM could process all information within a continuous mathematical space, rather than a discrete space forced upon it by human language. The researchers called their model Coconut, for “chain of continuous thought,” and released it in December.
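
The fragment below sketches the idea in code. It is an illustration of the looping described above, not Meta’s actual Coconut implementation, and the sizes and the fixed number of latent steps are assumptions.

```python
# An illustrative sketch of the idea behind Coconut, not the team's actual
# code. Instead of converting the hidden state into a token and re-embedding
# it, the hidden state itself is appended to the inputs for the next pass.
import torch
import torch.nn as nn

EMBED_DIM, NUM_LAYERS, LATENT_STEPS = 768, 12, 4   # illustrative sizes

layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=EMBED_DIM, nhead=12, batch_first=True)
    for _ in range(NUM_LAYERS)
)

def hidden_state(embeddings: torch.Tensor) -> torch.Tensor:
    x = embeddings
    for layer in layers:                           # pass through every transformer layer
        x = layer(x)
    return x[:, -1:, :]                            # hidden state of the final position

x = torch.randn(1, 10, EMBED_DIM)                  # stand-in for 10 embedded prompt tokens
for _ in range(LATENT_STEPS):                      # "continuous thoughts"
    h = hidden_state(x)                            # never converted into a token...
    x = torch.cat([x, h], dim=1)                   # ...just fed back in as another embedding
# Only after these latent steps does the model start producing ordinary tokens again.
```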

Hao’s team tested their model against the best-performing version of GPT-2, one that had been trained to produce a chain of thought before answering. As they hoped, Coconut almost always came out ahead. On one test of logical reasoning, both models were 98.8% accurate, but Coconut used only about one-tenth as many tokens to achieve the same result, making it significantly more efficient. On another test that required choosing from a large set of options, Coconut used about one-third as many tokens and was also significantly more accurate, 97% compared to 77.5%.

“In continuous or latent reasoning, you don’t need to transform your thoughts into language. You can maintain these uncertainties in your thoughts, and then finally answer very confidently,” Hao said. “It’s a fundamentally different reasoning pattern.”

But on a task that required solving elementary math problems, Coconut faltered. It generated about one-third as many tokens but was only 34% accurate, compared to the 43% accuracy of its competitor. Even so, Hao suspects Coconut would have done better if it had been trained to reason in latent space from the start, instead of being adapted from a standard, pretrained model.

Hao also thinks something else might be holding Coconut back. Although it reasons in a latent space, it faces another, more subtle restriction: Hao’s team capped the number of times information could loop through the transformer’s layers in latent space before the model had to stop and produce tokens. “Ideally, the language model should decide itself when the reasoning is over,” Hao said.

Getting Loopy

A team led by Tom Goldstein, of the University of Maryland, had also been working toward the same goal. Last year, they designed and trained a transformer that not only learned to reason in latent space, but also figured out on its own when to stop and switch back to language. But this team came at the task from a different direction than Hao’s.

All modern LLMs have a fixed number of transformer layers. “It seems fundamentally limiting,” Goldstein said, since it means that problems that need extra computations — more passes through layers — don’t get them. This was particularly true for early LLMs, which had relatively few layers. Goldstein wanted to figure out a way to increase the number of layers in an LLM on demand.

Another LLM, built by Tom Goldstein and his team, reasons in latent space by repeatedly using the same layers in its architecture before turning to words.

University of Maryland

His team discovered they could do this by, in effect, letting the model use some of its layers more than once. To test their idea, they built an LLM with eight layers. The computation proceeds as usual through the first two layers (the “prelude”). The next four layers are effectively bundled together as a block, which the computation can reuse as many times as it needs to. Once it’s done, the output of this “recurrent block” is passed on to the final two layers (the “coda”), which predict the next token. With a single pass through the recurrent block, the model functions as an eight-layer LLM; with 25 passes, it effectively has 104 layers.

This means the model reasons almost entirely in latent space, since the output of the recurrent block is never converted into tokens. Instead, the embeddings it generates are fed directly back into the recurrent block and processed again.
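
The sketch below lays out that prelude, recurrent block and coda structure. The layer counts follow the description above, while everything else (the widths, the specific layer type) is an illustrative assumption rather than the Maryland team’s implementation.

```python
# A schematic sketch of the prelude / recurrent-block / coda layout described
# above. Layer counts follow the article; widths and layer types are
# illustrative assumptions, not the actual model.
import torch
import torch.nn as nn

EMBED_DIM = 768  # illustrative width

def make_layers(n: int) -> nn.ModuleList:
    return nn.ModuleList(
        nn.TransformerEncoderLayer(d_model=EMBED_DIM, nhead=12, batch_first=True)
        for _ in range(n)
    )

prelude, recurrent_block, coda = make_layers(2), make_layers(4), make_layers(2)

def run(x: torch.Tensor, passes: int) -> torch.Tensor:
    for layer in prelude:               # the first two layers run once
        x = layer(x)
    for _ in range(passes):             # the four-layer block is reused on demand,
        for layer in recurrent_block:   # its output staying in latent space
            x = layer(x)
    for layer in coda:                  # the last two layers run once and predict a token
        x = layer(x)
    return x

# Effective depth is 2 + 4 * passes + 2: one pass gives 8 layers, 25 passes give 104.
out = run(torch.randn(1, 10, EMBED_DIM), passes=25)
```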

And unlike Coconut, Goldstein’s recurrent model is trained from scratch, learning for itself the number of times it should use the recurrent block to reason through various problems. (It stops looping when the embeddings generated by the recurrent block stop changing significantly.) Goldstein’s team had access to significant computing power, thanks to a grant from the U.S. Department of Energy, so they could build a model that, at 3.5 billion parameters, was much larger than Coconut.
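
A stopping rule of that kind might look roughly like the sketch below, where looping ends once the embeddings change by less than some small tolerance. The tolerance value and the placeholder update are assumptions, not the paper’s.

```python
# A rough sketch of the stopping rule described above: keep looping through the
# recurrent block until the embeddings stop changing much. The tolerance and the
# placeholder update are illustrative assumptions, not the paper's values.
import torch

def recurrent_step(x: torch.Tensor) -> torch.Tensor:
    """Placeholder for one pass through the recurrent block."""
    return 0.9 * x + 0.1    # dummy update that settles toward a fixed point

def loop_until_stable(x: torch.Tensor, tol: float = 1e-3, max_passes: int = 100) -> torch.Tensor:
    for _ in range(max_passes):
        new_x = recurrent_step(x)
        change = torch.norm(new_x - x) / torch.norm(x)  # relative change in the embeddings
        x = new_x
        if change < tol:      # the embeddings have stopped changing significantly
            break
    return x

settled = loop_until_stable(torch.randn(1, 10, 768))
```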

This system allowed for surprisingly sophisticated behavior. The model learned to exit earlier on simpler tasks and only spend more time (and resources) on difficult ones. For example, on reasoning tasks involving moral scenarios, the model took about 3.5 more passes through the recurrent block than it did on tasks involving high school math. “It’s kind of exciting,” said co-author Jonas Geiping of the Max Planck Institute for Intelligent Systems in Tübingen, Germany. “We didn’t really train for that. This just emerged as a behavior. When it was an easier [task], the model seemed to know that.”

Goldstein’s team also tested their model on standard benchmarks involving coding tasks and mathematical reasoning. Their model fared significantly better than the largest first-generation OLMo models from the Allen Institute for AI, even though the OLMo models have twice as many parameters. On tasks involving elementary math problems, OLMo-7B was about 4% accurate, while the recurrent model achieved about 28% accuracy, despite OLMo’s longer and more sophisticated training run. “Our model still beats it by a wide margin,” Goldstein said.

Back to Basics

Despite these positive results, Hao believes it may take more time and research for latent reasoning models to become mainstream. Leading companies, such as OpenAI and Anthropic, are already heavily invested in existing LLM architectures. Redoing them to incorporate latent space reasoning would require heavy reengineering, so it’s unlikely they’ll adopt such techniques anytime soon.

Zettlemoyer also cautions that latent space reasoning may have its own shortcomings. Ultimately, the data that LLMs train on is based on text, and the traditional approach has been extremely successful at finding patterns in it. LLMs can learn any kind of reasoning pattern, as long as it exists in texts, which ensures that the models reason in ways that humans do. Letting LLMs reason without using words could mean they will work in ways that don’t map onto human thinking. “Moving into a continuous space could allow for all kinds of possibilities that aren’t actually going to be helpful,” Zettlemoyer said.

But even so, we now know it’s at least possible for models to work this way. Reasoning in latent space introduces a completely new mode of “thinking” for LLMs, Zettlemoyer said. Who knows what new patterns such an approach might find?

“Part of the goal of this kind of work is to really change the type of reasoning you’re doing,” Zettlemoyer said. “It has a chance to be a big game changer.”
