Tiny Language Models Come of Age
Introduction
Learning English is no easy task, as countless students well know. But when the student is a computer, one approach works surprisingly well: Simply feed mountains of text from the internet to a giant mathematical model called a neural network. That’s the operating principle behind generative language models like OpenAI’s ChatGPT, whose ability to converse coherently (if not always truthfully) on a wide range of topics has surprised researchers and the public over the past year.
But the approach has its drawbacks. For one thing, the “training” procedure required to transmute vast text archives into state-of-the-art language models is costly and time-intensive. For another, even the people who train large language models find it hard to understand their inner workings; that, in turn, makes it hard to predict the many ways they can fail.
Faced with these difficulties, some researchers have opted to train smaller models on smaller data sets and then study their behavior. “It’s like sequencing the Drosophila genome versus sequencing the human genome,” said Ellie Pavlick, a language model researcher at Brown University.
Now, in a paper recently posted to the scientific preprint server arxiv.org, a pair of Microsoft researchers have introduced a new method for training tiny language models: Raise them on a strict diet of children’s stories.
The two researchers showed that language models thousands of times smaller than today’s state-of-the-art systems rapidly learned to tell consistent and grammatical stories when trained in this way. Their results hint at new research directions that might be helpful for training larger models and understanding their behavior.
“I found this paper very informative,” said Chandra Bhagavatula, a language model researcher at the Allen Institute for Artificial Intelligence in Seattle. “The concept itself is super interesting.”
Once Upon a Time
The neural networks at the heart of language models are mathematical structures loosely inspired by the human brain. Each one contains many artificial neurons arranged in layers, with connections between neurons in adjacent layers. The neural network’s behavior is governed by the strength of these connections, called parameters. In a language model, the parameters control which words the model might spit out next, given an initial prompt and the words it has generated already.
A model only truly comes to life during training, when it repeatedly compares its own output to the text in its training data set and adjusts its parameters to increase the resemblance. An untrained network with random parameters is trivially easy to assemble from a few lines of code, but it will just produce gibberish. After training, it can often plausibly continue unfamiliar text. Larger models often undergo further fine-tuning that teaches them to answer questions and follow instructions, but the bulk of the training is mastering word prediction.
Success at word prediction requires a language model to master many different skills. For example, the rules of English grammar suggest that the next word after the word “going” is likely to be “to,” regardless of the subject of the text. In addition, a system needs factual knowledge to complete “the capital of France is,” and completing a passage containing the word “not” requires a rudimentary grasp of logic.
“Raw language is very complicated,” said Timothy Nguyen, a machine learning researcher at DeepMind. “In order for interesting linguistic capabilities to arise, people have resorted to ‘more data is better.’”
Machine learning researchers have embraced this lesson. GPT-3.5, the large language model that powers the ChatGPT interface, has nearly 200 billion parameters, and it was trained on a data set comprising hundreds of billions of words. (OpenAI hasn’t released the corresponding figures for its successor, GPT-4.) Training such large models typically requires at least 1,000 specialized processors called GPUs running in parallel for weeks at a time. Only a few companies can muster the requisite resources, let alone train and compare different models.
Ronen Eldan, a mathematician who joined Microsoft Research in 2022 to study generative language models, wanted to develop a cheaper and faster way to explore their abilities. The natural way to do that was by using a small data set, and that in turn meant he’d have to train models to specialize in a specific task, so they wouldn’t spread themselves too thin. Initially, he wanted to train models to solve a certain class of math problems, but one afternoon, after spending time with his 5-year-old daughter, he realized that children’s stories were a perfect fit.
“It literally came to me after I read her a story,” he said.
To generate coherent children’s stories, a language model would need to learn facts about the world, keep track of characters and events, and observe the rules of grammar — simpler versions of the challenges facing large models. But large models trained on massive data sets learn countless irrelevant details along with the rules that really matter. Eldan hoped the brevity and limited vocabulary of children’s stories might make learning more manageable for small models — making them both easier to train and easier to understand.
In the world of language models, though, “small” is relative: A data set a thousand times smaller than the one used to train GPT-3.5 would still need to contain millions of stories. “I don’t know how much money you want to spend, but I’m guessing you’re not going to hire professionals to write [a couple million] short stories,” Nguyen said.
It would take an extraordinarily prolific author to satisfy such voracious readers, but Eldan had a few candidates in mind. Who better to write for an audience of small language models than large ones?
Toy Stories
Eldan immediately set out to create a library of synthetic children’s stories generated by large language models. But he soon discovered that even state-of-the-art models aren’t naturally very creative. If you just tell GPT-4 to write stories appropriate for 4-year-olds, Eldan said, “about one-fifth of the stories will be about children going to the park being scared of the slides.” That’s apparently the quintessential preschool story, as far as the internet is concerned.
The solution was to add a bit of randomness into the prompt. First, Eldan used GPT-4 to generate a list of 1,500 nouns, verbs and adjectives that a 4-year-old might know — short enough that he could easily check it himself. Then he wrote a simple computer program that would repeatedly prompt GPT-3.5 or GPT-4 to generate an age-appropriate story that included three random words from the list, along with an additional randomly chosen detail like a happy ending or plot twist. The resulting stories, mercifully, were less focused on scary slides.
Eldan now had a procedure for churning out training data on demand, but he had no idea how many stories he’d need to train a functional model, or how big that model would need to be. That’s when he teamed up with Yuanzhi Li, a machine learning researcher at Microsoft and Carnegie Mellon University, to try different possibilities, taking advantage of the fact that small models could be trained very quickly. Step 1 was deciding how to evaluate their models.
In language model research — as in every classroom — grading is a fraught topic. There’s no perfect rubric that encapsulates everything researchers want to know, and models that excel at some tasks often fail spectacularly at others. Over time, researchers have developed various standard benchmarks based on questions with unambiguous answers, which is a good approach if you’re trying to evaluate specific skills. But Eldan and Li were interested in something more nebulous: How big do language models really need to be if you simplify language as much as possible?
“In order to directly test if the model speaks English, I think the only thing you can do is let the model generate English in an open-ended way,” Eldan said.
There are only two ways to measure a model’s performance on such qualitative questions: Rely on human graders, or turn once again to GPT-4. The two researchers chose the latter route, effectively letting the big models both write the textbooks and grade the essays.
Bhagavatula said he would have liked to see how GPT-4’s evaluations compared to those of human reviewers — GPT-4 may be biased toward models that it helped train, and the opaqueness of language models makes it hard to quantify such biases. But he doesn’t think such subtleties would affect comparisons between different models trained on similar sets of synthetic stories — the main focus of Eldan and Li’s work.
Eldan and Li used a two-step procedure for evaluating each of their small models after training. First, they prompted the small model with the first half of a story distinct from those in the training data set so that it generated a new ending, repeating this process with 50 different test stories. Second, they instructed GPT-4 to grade each of the small model’s endings based on three categories — creativity, grammar and consistency with the beginning of the story. They then averaged the scores in each category, ending up with three final grades per model.
With this procedure in hand, Eldan and Li were finally ready to compare different models and find out which were the star students.
Test Results
After some preliminary exploration, the two researchers settled on a training data set containing roughly 2 million stories. They then used this data set, dubbed TinyStories, to train models ranging in size from 1 million to 30 million parameters, with varying numbers of layers. It was quick work: Using only four GPUs, the largest of these models took no more than a day to train.
The smallest models struggled. For example, one test story begins with a mean-looking man telling a girl he will take her cat. A million-parameter model got stuck in a loop with the girl repeatedly telling the man she wanted to be friends. But the larger ones — still thousands of times smaller than GPT-3.5 — performed surprisingly well. The 28-million-parameter version told a coherent story, though the ending was grim: “Katie started to cry, but the man didn’t care. He took the cat away and Katie never saw her cat again. The end.”
In addition to testing their own models, Eldan and Li presented the same challenge to OpenAI’s GPT-2, a 1.5-billion-parameter model released in 2019. It fared far worse — before the story’s abrupt ending, the man threatens to take the girl to court, jail, the hospital, the morgue and finally the crematorium.
Nguyen said it’s exciting that such tiny models were so fluent, but perhaps not surprising that GPT-2 struggled with the task: It’s a larger model but far from the state of the art, and it was trained on a very different data set. “A toddler training only on toddler tasks, like playing with some toys, might do better than you or I,” he noted. “We didn’t specialize in this simple thing.”
Comparisons between different TinyStories models don’t suffer from the same confounding factors. Eldan and Li observed hints that networks with fewer layers but more neurons per layer were better at answering questions that required factual knowledge; conversely, networks with more layers and fewer neurons per layer were better at keeping track of characters and plot points from earlier in the story. Bhagavatula found this result especially intriguing. If it can be replicated in larger models, he said, “that would be a really cool result that could stem out of this work.”
Eldan and Li also studied how their small models’ abilities depended on the duration of the training period. In every case, models mastered grammar first and consistency later. To Eldan, this pattern illustrates how differences in reward structures lead to differences in language acquisition patterns between neural networks and children. For language models, which learn by predicting words, “the incentive on the words ‘I want to have’ is as big as it is on the words ‘ice cream,’” he said. Children, on the other hand, “don’t care about whether they say ‘I would like to have some ice cream’ or just ‘ice cream, ice cream, ice cream.’”
Quality Versus Quantity
Eldan and Li hope that the research will motivate other researchers to train different models on the TinyStories data set and compare their capabilities. But it’s often hard to predict which characteristics of small models will also appear in larger ones.
“Maybe mouse models of vision are really good proxies of human vision, but are mouse models of depression good models of human depression?” Pavlick said. “For every case it’s a little bit different.”
The success of the TinyStories models also suggests a broader lesson. The standard approach to compiling training data sets involves vacuuming up text from across the internet and then filtering out the garbage. Synthetic text generated by large models could offer an alternative way to assemble high-quality data sets that wouldn’t have to be so large.
“We have more and more evidence that this is very effective, not only in TinyStories-sized models but also in larger models,” Eldan said. That evidence comes from a pair of follow-up papers about billion-parameter models by Eldan, Li and other Microsoft researchers. In the first paper, they trained a model to learn the programming language Python using snippets of code generated by GPT-3.5 along with carefully curated code from the internet. In the second, they augmented the training data set with synthetic “textbooks,” covering a wide range of topics, to train a general-purpose language model. In their tests, both models compared favorably to larger models trained on larger data sets. But evaluating language models is always tricky, and the synthetic training data approach is still in its infancy — more independent tests are necessary.
As state-of-the-art language models grow ever larger, surprising findings from their tiny cousins are reminders that there’s still much we don’t understand about even the simplest models. Nguyen expects to see many more papers exploring the approach pioneered by TinyStories.
“The question is: Where and why does size matter?” he said. “There should be a science of that, and this paper is hopefully the beginning of a rich story.”