Chatbot Software Begins to Face Fundamental Limitations
Introduction
On December 17, 1962, Life International published a logic puzzle consisting of 15 sentences describing five houses on a street. Each sentence was a clue, such as “The Englishman lives in the red house” or “Milk is drunk in the middle house.” Each house was a different color, with inhabitants of different nationalities, who owned different pets, and so on. The story’s headline asked: “Who Owns the Zebra?” Problems like this one have proved to be a measure of the abilities — limitations, actually — of today’s machine learning models.
Also known as Einstein’s puzzle or riddle (likely an apocryphal attribution), the problem tests a certain kind of multistep reasoning. Nouha Dziri, a research scientist at the Allen Institute for AI, and her colleagues recently set transformer-based large language models (LLMs), such as ChatGPT, to work on such tasks — and largely found them wanting. “They might not be able to reason beyond what they have seen during the training data for hard tasks,” Dziri said. “Or at least they do an approximation, and that approximation can be wrong.”
Einstein’s riddle requires composing a larger solution from solutions to subproblems, which researchers call a compositional task. Dziri’s team showed that LLMs that have only been trained to predict the next word in a sequence — which is most of them — are fundamentally limited in their ability to solve compositional reasoning tasks. Other researchers have shown that transformers, the neural network architecture used by most LLMs, have hard mathematical bounds when it comes to solving such problems. Scientists have had some successes pushing transformers past these limits, but those increasingly look like short-term fixes. If so, it means there are fundamental computational caps on the abilities of these forms of artificial intelligence — which may mean it’s time to consider other approaches.
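To make the flavor of the task concrete, here is a minimal sketch of a scaled-down version of the puzzle, with three houses and three attributes each, solved by brute-force search over a constraint-satisfaction formulation in Python. The clues are invented for illustration; they are not the 15 from Life International.

```python
# A scaled-down "Einstein's riddle": 3 houses, 3 attributes each.
# The clues below are invented for illustration; the original puzzle
# has 5 houses, 5 attributes per house and 15 clues.
from itertools import permutations

COLORS = ("red", "green", "blue")
NATIONS = ("English", "Spanish", "Norwegian")
DRINKS = ("tea", "coffee", "milk")

def solutions():
    # Try every way of assigning each attribute to houses 0, 1, 2 (left to right).
    for color in permutations(COLORS):
        for nation in permutations(NATIONS):
            for drink in permutations(DRINKS):
                if (nation.index("English") == color.index("red")           # clue 1
                        and drink.index("milk") == 1                        # clue 2
                        and nation.index("Norwegian") == 0                  # clue 3
                        and color.index("green") == color.index("red") + 1  # clue 4
                        and drink.index("tea") == nation.index("Spanish")): # clue 5
                    yield list(zip(color, nation, drink))

for s in solutions():
    print(s)  # the single consistent assignment of (color, nationality, drink)
```

Each clue pins down very little on its own; the answer emerges only by composing the partial deductions, which is exactly the structure that makes the full five-house version a demanding test.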
“The work is really motivated to help the community make this decision about whether transformers are really the architecture we want to embrace for universal learning,” said Andrew Wilson, a machine learning expert at New York University who was not involved with this study.
Success Begets Scrutiny
Ironically, LLMs have only themselves to blame for this discovery of one of their limits. “The reason why we all got curious about whether they do real reasoning is because of their amazing capabilities,” Dziri said. They dazzled on tasks involving natural language, despite the seeming simplicity of their training. During the training phase, an LLM is shown a fragment of a sentence with the last word obscured (though technically the obscured piece is a “token,” which isn’t always a whole word). The model predicts the missing information and then “learns” from its mistakes.
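As a rough sketch of that training objective, the snippet below performs one step of next-token prediction in PyTorch. The toy embedding-plus-linear model stands in for a real transformer stack, and the random token batch stands in for real text; only the shape of the computation matters here.

```python
# A minimal sketch of next-token prediction, the core training objective
# described above. The toy model stands in for a full transformer.
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64
model = nn.Sequential(nn.Embedding(vocab_size, d_model),  # token embeddings
                      nn.Linear(d_model, vocab_size))      # stand-in for transformer layers
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

tokens = torch.randint(0, vocab_size, (8, 33))   # a batch of 8 toy token sequences
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # at each position, predict the next token

optimizer.zero_grad()
logits = model(inputs)                           # (batch, sequence, vocab)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size),
                                   targets.reshape(-1))
loss.backward()                                  # "learn" from the mistakes
optimizer.step()
```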
The largest LLMs — OpenAI’s o1 and GPT-4, Google’s Gemini, Anthropic’s Claude — train on almost all the available data on the internet. As a result, the LLMs end up learning the syntax of, and much of the semantic knowledge in, written language. Such “pre-trained” models can be further trained, or fine-tuned, to complete sophisticated tasks far beyond simple sentence completion, such as summarizing a complex document or generating code to play a computer game. The results were so powerful that the models seemed, at times, capable of reasoning. Yet they also failed in ways both obvious and surprising.
“On certain tasks, they perform amazingly well,” Dziri said. “On others, they’re shockingly stupid.”
Take basic multiplication. Standard LLMs, such as ChatGPT and GPT-4, fail badly at it. In early 2023, when Dziri’s team asked GPT-4 to multiply two three-digit numbers, it initially succeeded only 59% of the time. When it multiplied two four-digit numbers, accuracy fell to just 4%.
The team also tested the LLMs on tasks like Einstein’s riddle, with similarly limited success. GPT-4 always got the right answer when the puzzle involved two houses with two attributes per house. But the accuracy fell to 10% when the complexity of the puzzle increased to four houses with four attributes per house. For the original version in Life International — five houses, each with five attributes — the success rate was 0%.
Dziri’s team thought that maybe the LLMs simply hadn’t seen enough examples in their training data, so they fine-tuned GPT-3 on 1.8 million examples of multiplying two numbers. Then, when they showed it new problems, the LLM aced them — but only if they were sufficiently similar to what it had seen during training. For example, the training data included multiplying two three-digit numbers and multiplying a two-digit number by a four-digit number, but when the model was asked to multiply a four-digit number by a three-digit number, it succeeded only 2% of the time. “If they are truly reasoning and understanding certain tasks, they should get the implicit algorithm,” Dziri said. That’s not what her team saw. “That raises a lot of questions about how LLMs perform tasks and whether they’re doing true reasoning.”
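The shape of that out-of-distribution check looks roughly like the sketch below. Here query_model is a hypothetical stand-in for calling the fine-tuned model; the actual prompts, sample sizes and scoring used in Dziri’s study are not reproduced.

```python
# A hedged sketch of an out-of-distribution multiplication check, in the
# spirit of the experiment described above. query_model() is a hypothetical
# stand-in for an LLM call; the real study's evaluation details differ.
import random

def query_model(prompt: str) -> str:
    raise NotImplementedError("replace with a call to the fine-tuned model")

def accuracy(a_digits: int, b_digits: int, trials: int = 100) -> float:
    correct = 0
    for _ in range(trials):
        a = random.randint(10 ** (a_digits - 1), 10 ** a_digits - 1)
        b = random.randint(10 ** (b_digits - 1), 10 ** b_digits - 1)
        reply = query_model(f"What is {a} * {b}?")
        correct += reply.strip() == str(a * b)   # exact-match scoring
    return correct / trials

# In-distribution digit lengths (seen during fine-tuning) vs. an unseen combination:
# print(accuracy(3, 3))   # e.g., three-digit times three-digit
# print(accuracy(4, 3))   # e.g., four-digit times three-digit, outside the training mix
```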
The team observed the same pattern with Einstein’s riddle: GPT-3 failed on versions of the puzzle bigger than the ones it had been fine-tuned on. “It’s mimicking something that it has seen, but it doesn’t have full understanding of it,” Dziri said.
Hard Limits
As Dziri and her co-authors were finalizing their results, a different team was taking another approach to understanding why LLMs struggled with compositional tasks. Binghui Peng, at the time a doctoral student at Columbia University, was working with one of his advisers, Christos Papadimitriou, and colleagues to understand why LLMs “hallucinate,” or generate factually incorrect information. Peng, now a postdoctoral researcher at Stanford University, suspected it was because transformers seem to lack the “capability of composition.”
To understand why, imagine we feed an LLM two pieces of information: The father of Frédéric Chopin was Nicolas Chopin, and Nicolas Chopin was born on April 15, 1771. If we then ask it, “What is the birth date of Frédéric Chopin’s father?” the LLM would have to answer by composing, or putting together, the different facts. In effect, it would need to answer the following nested question: “What is the birth date of (Who is the father of (Frédéric Chopin)?)?” If the LLM predicts the wrong words as an answer, it’s said to have hallucinated — in this case, possibly as a result of failing to solve the compositional task.
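Written out explicitly, the question is one lookup nested inside another, as in this toy sketch; the fact table and lookup function are purely illustrative, not anything an LLM actually contains.

```python
# A minimal sketch of the two-hop composition described above, written as
# nested lookups over a toy fact table.
FACTS = {
    ("Frédéric Chopin", "father"): "Nicolas Chopin",
    ("Nicolas Chopin", "birth date"): "April 15, 1771",
}

def lookup(entity: str, relation: str) -> str:
    return FACTS[(entity, relation)]

# "What is the birth date of (Who is the father of (Frédéric Chopin)?)?"
print(lookup(lookup("Frédéric Chopin", "father"), "birth date"))
# -> April 15, 1771
```

An LLM has no such explicit table; it must carry out the equivalent composition implicitly while predicting the next words, which is where Peng suspected the failure lies.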
Peng wanted to test this hunch. His team started by studying the properties of a simple transformer, one with only a single layer, which learns to “pay attention” to the ordering and position of a sentence’s words when trying to predict the next word. (Modern LLMs have scores of such layers.) The team established a link between the complexity of the transformer layer and the “domain size,” or the number of bits required to represent the questions. By focusing on this simple model, they proved a mathematical bound. “If the total number of parameters in this one-layer transformer is less than the size of a domain, then transformers provably cannot solve the compositional task,” Peng said. In other words, an LLM with only one transformer layer was clearly and mathematically limited.
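Stated schematically, with notation that is mine rather than the paper’s, the result has the following shape:

```latex
% Schematic form of the one-layer bound described above.
% P = total number of parameters in the one-layer transformer
% n = domain size, i.e., the number of bits needed to represent the questions
\[
  P < n \;\Longrightarrow\;
  \text{the one-layer transformer cannot solve the compositional task.}
\]
```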
While this was a strong theoretical result, its practical implications weren’t clear, because modern LLMs are so much more complex. “It’s not easy to extend our proof,” Peng said. So his team used a different approach to study the abilities of more complicated transformers: They turned to computational complexity theory, which studies problems in terms of the resources, such as time and memory, needed to solve them.
They ended up using a well-known conjecture to show that the computational power of even multilayer transformers is limited when it comes to solving complicated compositional problems. Then, in December 2024, Peng and colleagues at the University of California, Berkeley posted a proof — without relying on computational complexity conjectures — showing that multilayer transformers indeed cannot solve certain complicated compositional tasks. Basically, some compositional problems will always be beyond the ability of transformer-based LLMs.
“If your model gets larger, you can solve much harder problems,” Peng said. “But if, at the same time, you also scale up your problems, it again becomes harder for larger models.” This suggests that the transformer architecture has inherent limitations.
Pushing the Boundaries
To be clear, this is not the end of LLMs. Wilson of NYU points out that despite such limitations, researchers are beginning to augment transformers to help them better deal with, among other problems, arithmetic. For example, Tom Goldstein, a computer scientist at the University of Maryland, and his colleagues added a twist to how they presented numbers to a transformer that was being trained to add, by embedding extra “positional” information in each digit. As a result, the model could be trained on 20-digit numbers and still reliably (with 98% accuracy) add 100-digit numbers, whereas a model trained without the extra positional embedding was only about 3% accurate. “This suggests that maybe there are some basic interventions that you could do,” Wilson said. “That could really make a lot of progress on these problems without needing to rethink the whole architecture.”
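The general idea, sketched below under my own simplifying assumptions rather than as Goldstein’s exact scheme, is to give every digit a second embedding that encodes its place value within the number, added to the usual token embedding.

```python
# A hedged sketch of digit-level positional information for arithmetic:
# each digit gets an embedding for its place value (counted from the
# least-significant digit) on top of its ordinary token embedding.
import torch
import torch.nn as nn

class DigitEmbedding(nn.Module):
    def __init__(self, d_model: int, max_digits: int = 100):
        super().__init__()
        self.token = nn.Embedding(10, d_model)          # digits 0-9
        self.place = nn.Embedding(max_digits, d_model)  # position within the number

    def forward(self, number: str) -> torch.Tensor:
        digits = [int(d) for d in reversed(number)]     # least-significant digit first
        ids = torch.tensor(digits)
        places = torch.arange(len(digits))
        return self.token(ids) + self.place(places)     # (num_digits, d_model)

emb = DigitEmbedding(d_model=64)
print(emb("1234").shape)   # torch.Size([4, 64])
```

The intuition, as described above, is that tagging each digit with its place value gives the model an explicit handle for lining up corresponding digits when adding numbers far longer than those seen in training.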
Another way to overcome an LLM’s limitations, beyond just increasing the size of the model, is to provide a step-by-step solution to a problem within the prompt, a technique known as chain-of-thought prompting. Empirical studies have shown that this approach can give an LLM such as GPT-4 a newfound ability to solve a wider variety of related tasks. It’s not exactly clear why, which has led many researchers to study the phenomenon. “We were curious about why it’s so powerful and why you can do so many things,” said Haotian Ye, a doctoral student at Stanford University.
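In practice, a chain-of-thought prompt includes a worked, step-by-step solution before posing the new question, as in the illustrative sketch below; the wording is mine, not drawn from any particular study.

```python
# A minimal illustration of chain-of-thought prompting: the prompt contains a
# worked example, nudging the model to decompose the new problem the same way.
prompt = """Q: A store has 3 boxes with 12 apples each. It sells 9 apples.
How many apples are left?
A: Let's think step by step.
Step 1: 3 boxes x 12 apples = 36 apples.
Step 2: 36 - 9 = 27 apples.
The answer is 27.

Q: A library has 4 shelves with 25 books each. It lends out 30 books.
How many books are left?
A: Let's think step by step."""

# response = query_model(prompt)   # hypothetical LLM call, as in the earlier sketches
```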
When Ye was still an undergraduate at Peking University, he and his colleagues modeled the behavior of transformers with and without chain-of-thought prompting. Their proof, using another branch of computer science called circuit complexity theory, established how chain-of-thought prompting essentially turns a large problem into a sequence of smaller problems, making it possible for transformers to tackle more complex compositional tasks. “That means … it can solve some problems that lie in a wider or more difficult computational class,” Ye said.
But, Ye cautions, their result does not imply that real-world models will actually solve such difficult problems, even with chain-of-thought. The work focused on what a model is theoretically capable of; the specifics of how models are trained dictate how they can come to achieve this upper bound.
Ultimately, as impressive as these results are, they don’t contradict the findings from Dziri’s and Peng’s teams. LLMs are fundamentally matching the patterns they’ve seen, and their abilities are constrained by mathematical boundaries. Embedding tricks and chain-of-thought prompting simply extend their ability to do more sophisticated pattern matching. The mathematical results imply that you can always find compositional tasks whose complexity lies beyond a given system’s abilities. Even some newer “state-space models,” which have been touted as more powerful alternatives to transformers, show similar limitations.
On the one hand, these results don’t change anything for most people using these tools. “The general public doesn’t care whether it’s doing reasoning or not,” Dziri said. But for the people who build these models and try to understand their capabilities, it matters. “We have to really understand what’s going on under the hood,” she said. “If we crack how they perform a task and how they reason, we can probably fix them. But if we don’t know, that’s where it’s really hard to do anything.”