The Poetry Fan Who Taught an LLM to Read and Write DNA
![Brian Hie’s profile is reflected in a glass window.](https://www.quantamagazine.org/wp-content/uploads/2025/02/BrianHie-cr.RachelBujalski-Lede-scaled.webp)
The computer scientist Brian Hie led the team behind Evo, a large language model trained on 2.7 million bacterial, archaeal and viral genomes. The AI tool can write original DNA sequences that encode functional biological machines.
Rachel Bujalski for Quanta Magazine
Introduction
DNA is often compared to a written language. The metaphor leaps out: Like letters of the alphabet, molecules (the nucleotide bases A, T, C and G, for adenine, thymine, cytosine and guanine) are arranged into sequences — words, paragraphs, chapters, perhaps — in every organism, from bacteria to humans. Like a language, they encode information. But humans can’t easily read or interpret these instructions for life. We cannot, at a glance, tell the difference between a DNA sequence that functions in an organism and a random string of A’s, T’s, C’s and G’s.
“It’s really hard for humans to understand biological sequence,” said the computer scientist Brian Hie, who heads the Laboratory of Evolutionary Design at Stanford University, based at the nonprofit Arc Institute. This was the impetus behind his new invention, named Evo: a genomic large language model (LLM), which he describes as ChatGPT for DNA.
ChatGPT was trained on large volumes of written English text, from which the algorithm learned patterns that let it read and write original sentences. Similarly, Evo was trained on large volumes of DNA — 300 billion base pairs from 2.7 million bacterial, archaeal and viral genomes — to glean functional information from stretches of DNA that a user inputs as prompts. A more complete understanding of the code for life, Hie said, could accelerate biological design: the creation of better biological tools to improve medicine and the environment.
Hie became interested in using language models for biology during graduate school, when he began building protein LLMs, which can predict how proteins fold and help design new ones. Proteins are molecular machines encoded by DNA in the wordlike segments we call genes. But an organism’s genome — the entire length of its DNA — represents more information than a list of proteins, just as a sentence contains more information than a list of words. Biologists are still struggling to understand the grammar of DNA. What’s more, genomes include many regions that do not code for proteins. Hie wondered: What if machine learning could help make sense of the genetic library?
From its immersion in the language of nucleotides, Evo picks up patterns that humans can’t see. It uses those patterns to predict how changes to DNA affect the function of its downstream products, RNA and proteins. The LLM has also written new sequences for alternative versions of molecules; in some cases, these Evo-crafted complexes perform their task as well as or better than nature’s versions.
“These variations are like alternative paths that could have been taken by evolution, but that were not,” Hie said. “Now we have a model that lets us explore these alternate evolutionary universes.”
![Brian Hie holds a pipette at a lab bench. Surrounding images show lab reagents and pipettes.](https://www.quantamagazine.org/wp-content/uploads/2025/02/BrianHie-cr.RachelBujalski-Lab-mosaic-scaled.webp)
After being trained on 70,000 natural DNA sequences that produce variations of the CRISPR-Cas complex, Evo wrote some of its own. Hie’s team created 11 of its inventions in the lab: One of the AI-written complexes worked.
Rachel Bujalski for Quanta Magazine
The formula for Evo’s success is basic in principle. The model is large, endowed with 7 billion variables, known in computer science as parameters, and trained on loads of data. Its objective is simple: Predict the next base pair in the DNA sequence. From a large model and a simple objective, complex properties arise. “This is a very powerful paradigm that has emerged in machine learning over the past couple years,” Hie said. Under that paradigm, Evo acquires an uncanny knack for divining what sequences are compatible with life and for spinning out useful variations of nature’s molecules. Evo even wrote an entire genome of its own design, though not one that could function in an organism, he said — not yet, anyway.
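In code terms, “predict the next base pair” is ordinary next-token language modeling applied to DNA. The sketch below is a minimal, hypothetical illustration in Python/PyTorch: the tiny recurrent model, the vocabulary and the example sequence are stand-ins chosen for clarity, not Evo’s actual architecture, tokenizer or scale.

```python
# Minimal sketch of next-base-pair prediction, the training objective described above.
# The toy model and sizes are illustrative stand-ins, NOT Evo's architecture or scale.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}  # one token per nucleotide base

class TinyDNAModel(nn.Module):
    def __init__(self, vocab_size=4, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):                  # tokens: (batch, length)
        hidden, _ = self.rnn(self.embed(tokens))
        return self.head(hidden)                # next-base logits at every position

model = TinyDNAModel()
seq = torch.tensor([[VOCAB[b] for b in "ATGCGTACGTTAGC"]])   # hypothetical training sequence
logits = model(seq[:, :-1])                     # predict base i+1 from bases 0..i
loss = F.cross_entropy(logits.reshape(-1, 4), seq[:, 1:].reshape(-1))
loss.backward()                                 # learn by minimizing next-base prediction error
```

The appeal of the paradigm is exactly this simplicity: the loss only rewards guessing the next base, and everything else the model appears to “know” emerges from scale and data.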
“Biological design right now is very artisanal. It’s very random, and it has a really low success rate,” Hie said. “We hope to improve all of these aspects with machine learning.”
Quanta spoke with Hie about the parallels between DNA and human language, what Evo can and can’t do, and the poetry in programming. The interview has been condensed and edited for clarity.
What were you interested in first: computers, biology or language?
I have very broad interests, and I explored a lot of career paths. At one point in my life, I wanted to pursue a Ph.D. in English literature. In high school and college, I learned to appreciate poetry. The type of poetry I really liked had lots of structure and grand concepts, and used language in very new and interesting ways.
The affinity for scanning a sonnet or identifying structure in a well-composed English lyric is similar to wanting to develop models that make genomic or protein sequences more interpretable and reveal their hidden structure. It’s almost like literary criticism on biological sequences. In that way, I’m still doing literary criticism.
What made you think DNA could be treated like a language?
The DNA itself is sequential like human natural language. It’s a sequence of discrete “tokens,” or building blocks. We tokenize human natural language into words, letters in the alphabet or Chinese characters. In biology, a token can correspond to a DNA base pair or an amino acid [the molecular building blocks for proteins].
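As a rough illustration of that parallel, here is a toy Python sketch of the two tokenizations mentioned in the answer, one at the level of DNA bases and one at the level of amino acids. The alphabets and helper function are purely illustrative, not Evo’s actual tokenizer.

```python
# Illustrative only: two ways to tokenize biological sequences into discrete tokens.
DNA_TOKENS = list("ACGT")                       # one token per nucleotide base
PROTEIN_TOKENS = list("ACDEFGHIKLMNPQRSTVWY")   # one token per standard amino acid

def tokenize(sequence, alphabet):
    """Map a sequence string to integer token IDs."""
    lookup = {ch: i for i, ch in enumerate(alphabet)}
    return [lookup[ch] for ch in sequence]

print(tokenize("ATGGCT", DNA_TOKENS))     # DNA-level tokens: [0, 3, 2, 2, 1, 3]
print(tokenize("MA", PROTEIN_TOKENS))     # protein-level tokens: [10, 0]
```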
![Brian Hie walks down a hallway.](https://www.quantamagazine.org/wp-content/uploads/2025/02/BrianHie-cr.RachelBujalski-Office-corridor-scaled.webp)
As an undergrad, Hie studied English literature and poetry. “The affinity for scanning a sonnet,” he said, translates easily to developing models that “reveal the hidden structure” of genomic or protein sequences.
Rachel Bujalski for Quanta Magazine
And like natural language, DNA has a natural structure. The sequences are not random. A lot of structure in natural language is also informal; it can be ambiguous, and it’s changing all the time. In the same way, DNA sequences have some ambiguity. The same sequence in a different context can mean different things.
How did you become interested in applying large language models to DNA?
It was right at the beginning of my current faculty position, in fall of 2023. Something about changing jobs makes one want to reconsider things. I was on vacation with friends in Tokyo. I was jet-lagged, so I woke up early. Since everyone else was asleep, I took a long walk by myself. I was thinking about DNA language modeling.
The central dogma in molecular biology is a very beautiful thing. It states that DNA encodes RNA, which encodes protein. So if you train a model on DNA, and it’s a good model, you get RNA and protein language modeling for free, because there is a direct correspondence between the DNA and the protein sequence.
You also get to train on the genome itself: the genes as they are, next to each other on the genome. When you train a protein language model, you basically take a whole genome and cut out all the portions that code for proteins, and train on all those small portions individually. But you ignore the vast genetic context the proteins are in. In microbial genomes especially, proteins with related functions are directly next to each other on the genome, so the order of these protein-coding regions on the genome matters. You lose that information in a protein language model.
I realized that training a model on a more basic level — going from protein down to DNA — could expand the capabilities of a model.
How did you train Evo to “read” DNA?
One important difference between protein and DNA language models is the length of the sequence that the model uses to make its next-base-pair predictions, which we call “context length.” Context length is akin to the one or two pages of a novel a person can see at one time. Evo was trained on a “novel” consisting of many genomes — the E. coli genome alone is 2 million to 4 million base pairs — but with a maximum context length of 131,000 tokens. By comparison, the original protein language models were trained with a context length of 1,000 amino acids.
This required some technological development because long context lengths consume a lot of computational power. This power requirement, which grew quadratically with context length, limited the original versions of ChatGPT. But by the time we were thinking about Evo, researchers — including, auspiciously, a team at Stanford — had found a way to reduce the computing needed for longer context lengths. A student from that Stanford lab helped us apply those advances to our DNA model.
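The quadratic cost mentioned here is easy to see with a back-of-envelope calculation. This is only a sketch, assuming standard self-attention, in which every token is compared with every other token in the context; the specific context lengths are taken from the numbers quoted above.

```python
# Back-of-envelope illustration of why long contexts are expensive for standard
# self-attention, whose cost grows with the square of the context length.
short_context = 1_000      # roughly the context of early protein language models
long_context = 131_000     # roughly Evo's maximum context length

pairwise_short = short_context ** 2
pairwise_long = long_context ** 2
print(f"relative cost: ~{pairwise_long / pairwise_short:,.0f}x")   # ~17,161x
```

That roughly 17,000-fold blowup is why subquadratic architectures were needed before a genome-scale context became practical.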
Evo’s training data set was also important: 2.7 million genomes from bacteria, archaea and viruses. From my protein language modeling, I learned that sequence diversity matters. It shows the model evolutionary alternatives for life — different ways of expressing the same idea — that the model can use to learn general rules for, say, building proteins that perform a particular function.
We began training Evo in December 2023, a couple weeks before I started my lab. We gave it different DNA prompts and asked it to predict the next token (in this case, a DNA base pair) in a sequence. In January, I decided to test whether it worked.
How did you test it, and how did it do?
I gave it protein-coding DNA sequences that had various mutations: base pairs that differed from the typical gene sequence. The task was to predict the “evolutionary likelihood” of these mutations, the probability that they would exist in nature. Mutations deemed likely should preserve or improve a protein’s function in the lab. Unlikely mutations should correlate with poor function.
Evo did not have any explicit knowledge of function. It only knew what mutations had been used by evolution in the past. Moreover, the model was trained only on DNA without any instruction about which portions of the DNA matched proteins. So it had to figure out how DNA codes for proteins, and where proteins start and stop on the genome.
![Brian Hie talks to his students, who are working at computers.](https://www.quantamagazine.org/wp-content/uploads/2025/02/BrianHie-cr.RachelBujalski-With-people-scaled.webp)
At the nonprofit Arc Institute, Hie advises students from across Stanford University’s science departments, including biophysics, bioengineering and genetics.
Rachel Bujalski for Quanta Magazine
We scored likelihoods from the model using experimental tests of protein function. We found that if a base pair has high likelihood under Evo, then that base pair is likely to preserve or improve the protein’s function. But if that base pair has low likelihood, then putting that base pair into a protein sequence will likely destroy function.
We also compared the model’s results to those of state-of-the-art protein language models. We found that Evo matched the performance of the protein models, despite never having seen a protein sequence. That was the first indication that, OK, maybe we were on to something.
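In code terms, this kind of scoring amounts to comparing the total log-probability an autoregressive model assigns to a mutated sequence against the wild type. Here is a minimal sketch, assuming a next-base model like the toy one above; the function and usage are hypothetical and not Evo’s published evaluation pipeline.

```python
# Sketch of "evolutionary likelihood" scoring under an autoregressive DNA model.
# `model` is assumed to return next-base logits at every position, as in the toy
# example earlier; this is illustrative, not Evo's actual scoring code.
import torch
import torch.nn.functional as F

VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}

def log_likelihood(model, sequence):
    """Sum of log-probabilities the model assigns to each base given its prefix."""
    tokens = torch.tensor([[VOCAB[b] for b in sequence]])
    logits = model(tokens[:, :-1])               # predictions for positions 1..N-1
    logp = F.log_softmax(logits, dim=-1)
    targets = tokens[:, 1:]
    return logp.gather(-1, targets.unsqueeze(-1)).sum().item()

# Hypothetical usage: a mutant scoring well below the wild type is predicted
# to be more likely to damage the protein's function.
# delta = log_likelihood(model, mutant_seq) - log_likelihood(model, wildtype_seq)
```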
What else did you ask Evo to do?
We used it to generate DNA sequences, just as ChatGPT can generate text. One of my students, Brian Kang, helped me fine-tune the Evo model on DNA that coded for a protein as well as at least one RNA molecule; they link together to create a complex called CRISPR-Cas. CRISPR-Cas breaks DNA in specific spots, which helps bacteria defend against viruses. Scientists use these complexes for genome editing.
After training Evo on more than 70,000 natural DNA sequences for the CRISPR-Cas complex, we asked it to generate the complete system in the DNA code. For 11 of its suggestions, we ordered the DNA sequences from a company and used these to create the CRISPR-Cas complexes in the lab and test their function.
One of them worked. We consider that a very successful pilot. With typical protein design workflows, you’d be lucky to find one working protein for every 100 sequences tested.
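Mechanically, prompt-conditioned generation like this is just repeated next-base sampling. The sketch below shows the idea with a hypothetical `generate` helper built on the same kind of toy model as before; it is illustrative only and does not show Evo’s fine-tuning or sampling setup.

```python
# Sketch of prompt-conditioned generation: sample one base at a time from the
# model's next-base distribution, appending each draw to the growing sequence.
import torch
import torch.nn.functional as F

BASES = "ACGT"
VOCAB = {b: i for i, b in enumerate(BASES)}

def generate(model, prompt, n_new_bases, temperature=1.0):
    tokens = [VOCAB[b] for b in prompt]
    for _ in range(n_new_bases):
        logits = model(torch.tensor([tokens]))[0, -1]    # logits for the next base
        probs = F.softmax(logits / temperature, dim=-1)
        tokens.append(torch.multinomial(probs, 1).item())
    return "".join(BASES[t] for t in tokens)

# Hypothetical usage:
# new_sequence = generate(model, prompt="ATG", n_new_bases=500)
```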
How well did the successful sequence work?
It does as well as the state-of-the-art Cas system. If you squint a little bit, maybe it has a little bit faster cleavage [cutting of the DNA strand].
Has this ever been done before?
This is a very complicated task. The Cas enzyme is too long for current protein language models to process. In addition, a protein model could not generate the RNA.
What is the longest DNA sequence Evo has generated?
The model generated a million tokens freely from scratch — essentially, an entire bacterial genome. If you asked ChatGPT to generate a million tokens of text, at some point it would go off the rails. There would be some grammatical structure, but it would not produce Wuthering Heights.
Evo’s genome also had structure. It had a similar density of genes to natural genomes, and proteins that folded like natural proteins. But it fell short of something that could drive an organism because it lacked many genes that we know to be critical to an organism’s survival. To generate a coherent genome, the model needs the ability to edit its product — to correct errors, just as a human writer would do for a longer passage of text.
What are Evo’s other limitations?
It’s only the beginning. Evo is trained only on genomes from the simplest organisms, prokaryotes. We want to expand it to eukaryotes — organisms such as animals, plants and fungi whose cells have a nucleus. Their genomes are much more complicated.
Evo also only reads the language of DNA, and DNA is only part of what determines the characteristics of an organism, its phenotype. The environment also plays a role. So, in addition to having a good model of genotype, we would like to build a really good model of the environment and its connection to phenotype.
I have found LLM chatbots to be error-prone. Is Evo more accurate?
With ChatGPT, you want it to get the facts right. In biology, these hallucinations can almost be a feature rather than a bug. If some crazy new sequence works in the cell, then biologists think it’s novel.
But Evo does make mistakes. It may, for example, predict a protein structure from a sequence that turns out to be wrong when we make the protein in the lab. Still, a human would be almost completely worthless on a task like this. No human could write, from scratch, a DNA sequence that would fold into a CRISPR-Cas complex.
Where do you see this technology leading in five or 10 years?
We are going to push the boundaries of biological design way beyond individual protein molecules to more complex systems involving many proteins, or to proteins bound to RNA or DNA. That’s the message of the Evo paper. We might engineer a synthetic pathway that produces a small-molecule drug with therapeutic value or that degrades discarded plastic or oil from spills.
I also expect the models to aid biological discovery. When you sequence a new organism from nature, you just get DNA. It’s very hard to identify what parts of the genome correspond to different functions. If the models can learn the concept of, say, a phage defense system or a biosynthetic pathway, they will help us annotate and discover new biological systems in sequencing data. The algorithm is fluent in the language, whereas humans are very much not.
Does a model like Evo present any dangers?
If the model were used to design viruses, maybe those viruses could be used for nefarious purposes. We should have some way of ensuring that these models are used for good. But the level of biotechnology is already sufficient to create dangerous things. What biotechnology can’t do yet is protect us from dangerous things.
Nature is creating deadly viruses all the time. I think that if we raise our level of technological capability, it will have a larger impact on our ability to defend ourselves against biological threats than it does on creating new ones.