Why Claude Shannon Would Have Been Great at Wordle
Introduction
If you’ve interacted with another human being this year, you’ve probably heard of Wordle, the addictive word-guessing game the coder Josh Wardle created for his partner and then sold to The New York Times for over $1 million. You might even be one of the millions who enjoy guessing those five-letter words seemingly selected with just the right balance of difficulty and solvability.
Perhaps you have a favorite first word that helps you solve the puzzle in fewer than the six allowed guesses. Or maybe you like to mix it up at the start and play your hunches. However you approach this word game, knowing a bit about a mathematical field called information theory can help you achieve your best scores.
To see how, let’s say you open up Wordle and your first guess at the secret word, after a big breakfast, is BLOAT. Wordle shows you this:
Yellow tells you that A and T are in the secret word but in the wrong positions. Knowing the word contains an A and a T, and running late for school or work, you guess WATCH and get lucky.
The green letters are in the secret word and in those exact positions. You’ve almost got it! So, what’s your next guess? How you continue says something about you both as a Wordle player and as an information theorist.
One approach is to guess a word like MATCH, which could be the answer. But, strange as it may seem, a better strategy is to guess CHIMP. Even though CHIMP can’t be the secret word, it’s the perfect move according to information theory, the field pioneered by Claude Shannon in the 1940s that laid the foundation for the digital revolution. Let’s see why.
Shannon was interested in quantifying the amount of information contained in messages across different contexts, from communication over telephone lines (he worked at Bell Labs, the famous research branch of the American phone company) to the knowledge stored in DNA (he wrote his doctoral thesis on genetics). In defining the concept of “information,” Shannon started from a few basic mathematical principles. First, the amount of information should be inversely related to predictability: A rare event should be more informative than an expected one. And second, information should add up: The amount of information in two messages should be related to the sum of the information from each individual message.
We’ll get to the details of Shannon’s definition of information in a moment, but first let’s finish up our Wordle game. Here’s everything from Wordle’s word list that ends in ATCH: BATCH, CATCH, HATCH, LATCH, MATCH, PATCH and WATCH.
From our first guess, BLOAT, we know the secret word doesn’t contain a B or an L, so that eliminates BATCH and LATCH. We also know WATCH isn’t the word, so we’ve narrowed the list of possible secret words down to: CATCH, HATCH, MATCH or PATCH.
If we try MATCH we might get lucky and win the game. But what if MATCH isn’t the word? We can try CATCH, HATCH and PATCH and eventually win the game by process of elimination, but it could take as many as four guesses to get there.
Now consider what happens when we guess CHIMP. If M or P in CHIMP comes back yellow, we know the word is MATCH or PATCH, respectively. If the C in CHIMP comes back green, we know the word is CATCH. And if none of those things happen, the only other possibility is HATCH. By guessing CHIMP, we’re guaranteed to get the word on our next guess.
Shannon’s second principle helps us understand why CHIMP is the right move. It’s not that MATCH is a bad guess. It’s more that if it isn’t the answer, that’s the only information we gain. Guessing CHIMP leverages the additivity of information from the letters C, H, M and P to give us everything we need to know to solve the puzzle.
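To see that argument spelled out mechanically, here is a minimal Python sketch. The candidate list and the two guesses come from the game above; the feedback function is a simplified stand-in for Wordle's real scoring (it ignores repeated-letter rules, which don't matter for these particular words).

```python
def feedback(guess, secret):
    """Simplified Wordle feedback: 'G' for a letter in the right spot,
    'Y' for a letter that appears elsewhere, '-' for a letter not present."""
    colors = []
    for g, s in zip(guess, secret):
        if g == s:
            colors.append("G")
        elif g in secret:
            colors.append("Y")
        else:
            colors.append("-")
    return "".join(colors)

candidates = ["CATCH", "HATCH", "MATCH", "PATCH"]

for guess in ["MATCH", "CHIMP"]:
    print(guess, {secret: feedback(guess, secret) for secret in candidates})

# MATCH produces the same pattern ('-GGGG') for CATCH, HATCH and PATCH,
# so a miss still leaves three possibilities. CHIMP produces a different
# pattern for each candidate, so one more guess is guaranteed to finish.
```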
Now that our Wordle streak is safe, let’s dig deeper into Shannon’s definition of information. The mathematical principles he started with led him to the following mathematical formula:
$latex I = \log_{2}\frac{1}{p}$.
Here I stands for information, measured in what Shannon deemed “bits”; p is the probability of the event whose information content you are quantifying; and $latex \log_{2}$ represents the base-2 logarithm function. Logarithms tell you the exponent of a number relative to a certain base: For example, $latex \log_{2}16 = 4$ because $latex 2^4 = 16$, and $latex \log_{2}32 = 5$ because $latex 2^5 = 32$. You’d need a calculator (or a logarithm table) to figure out that $latex \log_{2}25 \approx 4.643856$, but you know it’s between 4 and 5 because $latex 16<25<32$. (You can choose any base of the logarithm function to quantify information, but base 2 is standard in a world of digital information stored in binary, or base-2, numbers.)
Let’s see how this definition of information follows from the two guiding principles mentioned above. First, because of the way logarithms work, the quantity
$latex I = \log_{2}\frac{1}{p}$
gets bigger as the probability of an event gets smaller. For example, an event that happens one out of every two times contains one bit of information, because $latex p=\frac{1}{2}$ and so
$latex \log_{2}\frac{1}{\frac{1}{2}}=\log_{2}2=1$.
But an event that happens once every 32 times contains five bits of information, since $latex p=\frac{1}{32}$, and so
$latex \log_{2}\frac{1}{\frac{1}{32}}=\log_{2}32=5$.
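If you'd like to check these numbers yourself, here is a tiny Python sketch of Shannon's formula; the helper name bits is ours, chosen just for this illustration.

```python
from math import log2

def bits(p):
    """Shannon's information content, in bits, of an event with probability p."""
    return log2(1 / p)

print(bits(1 / 2))   # 1.0 -- a 1-in-2 event carries one bit
print(bits(1 / 32))  # 5.0 -- a 1-in-32 event carries five bits
print(bits(1 / 25))  # 4.6438... -- between 4 and 5 bits, matching the log2 25 example above
```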
The less predictable something is, the more information it contains. A digital photo of a polar bear in a snowstorm wouldn’t contain much information because all the pixels would be predictably white. On the other hand, a lot of information would be contained in a picture of two multicolored birds of paradise dancing on the forest floor.
The second principle, the additivity of information, follows from a law of logarithms you might remember learning in algebra class:
$latex \log_{2}AB=\log_{2}A +\log_{2}B$.
In other words, the log of a product is the sum of the numbers’ individual logs. To see the connection to information, imagine that two events occur with individual probabilities $latex p_{1}$ and $latex p_{2}$. If these events are independent — that is, neither event influences the outcome of the other — then the probability of both events happening is the product of those probabilities, $latex p_{1}p_{2}$. For example, each flip of a fair coin is independent of every other flip, so the probability of flipping tails twice in a row on a fair coin is just $latex \frac{1}{2}\times\frac{1}{2}=\frac{1}{4}$, which is the probability of flipping tails on the first toss times the probability of flipping tails on the second.
When two events are independent, and thus don’t influence each other, knowing about one doesn’t give you any information about the other, so it makes sense to add up information in this situation. The total information associated with both events occurring is equal to the sum of the information associated with each event occurring individually. And it’s the clever choice of logarithms in the definition that makes information add up. The probability of the two independent events occurring together is $latex p_{1}p_{2}$, so the associated measure of information is
$latex \log_{2}\frac{1}{p_{1}p_{2}} = \log_{2}\left(\frac{1}{p_{1}}\times \frac{1}{p_{2}}\right)=\log_{2}\frac{1}{p_{1}} + \log_{2}\frac{1}{p_{2}}$,
which is the sum of the information from the individual events.
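As a quick numerical sanity check of that additivity, here is the coin-flip example from above in a few lines of Python:

```python
from math import log2

p1 = p2 = 1 / 2  # two independent flips of a fair coin

summed = log2(1 / p1) + log2(1 / p2)   # information of each flip, added up
joint = log2(1 / (p1 * p2))            # information of both flips together

print(summed, joint)  # 2.0 2.0 -- the logarithm makes the two agree
```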
To see some information theory in action, let’s play a simplified version of Wordle that uses only two-letter words. In this game of “2-Wordle” there are only 16 possible answers:
AD | BE | IF | ME
AM | HI | ON | MY
AS | GO | IS | NO
AT | TV | IT | TO
Suppose you are trying to guess today’s 2-Wordle and you ask me for a hint. How would you feel if I told you, “The word does not contain a J”? You’d probably be annoyed, because none of the possible 2-Wordle words contain a J. In the context of information theory, this is a terrible hint because it contains no information. The probability of a 2-Wordle word not containing a J is 1, so the information associated with that event is $latex \log_{2}\frac{1}{1}=\log_{2}1=0$.
(Notice that the rule of exponents, $latex 2^0=1$, becomes the rule of logarithms, $latex \log_{2}1=0$. Every rule of logarithms is really a rule of exponents in disguise.)
So you ask for another hint, and this time I tell you that the secret 2-Wordle doesn’t contain an A. This is more helpful. Among the 16 possible words, 12 of them don’t contain an A, so the information associated with this event is
$latex \log_{2}\frac{1}{\frac{12}{16}}=\log_{2}\frac{16}{12}=\log_{2}\frac{4}{3} \approx 0.415$
bits.
This hint gives us some information, but how much information do we need? Well, there are 16 possible outcomes, so each word on the list has a $latex \frac{1}{16}$ chance of being the secret word. Plugging this into our formula gives us:
$latex \log_{2}\frac{1}{\frac{1}{16}}=\log_{2}16 = 4$.
This means there are four bits of information associated with knowing the identity of the secret word. How do we get all four bits of information? By adding up information.
Suppose I also tell you that today’s secret word doesn’t contain a T. As with A, there are 12 words that don’t contain a T, so by itself this hint provides
$latex \log_{2}\frac{1}{\frac{12}{16}}=\log_{2}\frac{16}{12}=\log_{2}\frac{4}{3} \approx 0.415$
bits of information. But since these two events are independent — which means the probability of not containing an A and not containing a T is equal to the product of the probabilities of the individual events — we can add the information for a total of 0.415 + 0.415 = 0.83 bits. Combining the knowledge from the different hints means adding information together, which gets us closer to our goal of four bits.
A simple rule of thumb in information theory is that one bit of information is equivalent to cutting the possibilities in half, because half of the possibilities would be equivalent to an event with probability $latex p=\frac{1}{2}$, and this contains $latex \log_{2}\frac{1}{\frac{1}{2}}=\log_{2}2=1$ bit of information. In our current 2-Wordle game we know the word doesn’t contain an A or a T, which reduces the possibilities to nine out of the original 16, so slightly more than half. This corresponds to the 0.83 bits of information we’ve got, which is a little less than 1.
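Here is a short sketch that tallies those hints over the 16-word list; the hint_bits helper is ours, introduced only for this illustration.

```python
from math import log2

words = ["AD", "BE", "IF", "ME", "AM", "HI", "ON", "MY",
         "AS", "GO", "IS", "NO", "AT", "TV", "IT", "TO"]

def hint_bits(keep):
    """Words surviving a hint, and the bits of information the hint carries."""
    survivors = [w for w in words if keep(w)]
    return survivors, log2(len(words) / len(survivors))

_, no_j_bits = hint_bits(lambda w: "J" not in w)      # all 16 words survive
no_a, no_a_bits = hint_bits(lambda w: "A" not in w)   # 12 words survive
no_t, no_t_bits = hint_bits(lambda w: "T" not in w)   # 12 words survive

print(round(no_j_bits, 3), round(no_a_bits, 3), round(no_t_bits, 3))  # 0.0 0.415 0.415

neither = [w for w in no_a if "T" not in w]
print(len(neither), round(no_a_bits + no_t_bits, 2))  # 9 words left, 0.83 bits
```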
Suppose tomorrow brings a new 2-Wordle to solve and I tell you the word does contain an A. Since four of the 16 words contain an A, this amounts to
$latex \log_{2}\frac{1}{\frac{1}{4}}=\log_{2}4=2$
bits of information. Another way to think about it is that this cuts the possibilities in half twice: from 16 to eight to four.
Now watch what happens if I also tell you that the word contains a T. As with the A, the fact that four of the 16 words contain a T conveys
$latex \log_{2}\frac{1}{\frac{1}{4}}=\log_{2}4=2$
bits of information. And since these events are independent — meaning the probability of containing both an A and a T is equal to the product of the individual probabilities — we add the information. This means we now know 2 + 2 = 4 bits of information. As mentioned above, in 2-Wordle four bits of information should be sufficient to identify the secret word, and sure enough the only 2-Wordle word with both an A and a T is AT.
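The same kind of check works for this second puzzle, again over the 2-Wordle list:

```python
from math import log2

words = ["AD", "BE", "IF", "ME", "AM", "HI", "ON", "MY",
         "AS", "GO", "IS", "NO", "AT", "TV", "IT", "TO"]

has_a = [w for w in words if "A" in w]   # 4 of 16 words -> 2 bits
has_t = [w for w in words if "T" in w]   # 4 of 16 words -> 2 bits

print(log2(16 / len(has_a)), log2(16 / len(has_t)))   # 2.0 2.0
print([w for w in has_a if "T" in w])  # ['AT'] -- 2 + 2 = 4 bits pin down one word
```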
The secret to information theory is the clever mathematical choices Claude Shannon made in defining information. Information is high when probabilities are low, and information gets added up when outcomes are independent. The secret to Wordle is to use your guesses strategically to extract the maximum amount of information possible. If your first guess tells you that the secret word ends in S, don’t automatically guess another word that ends in S. Instead, choose a word that ends with a letter you don’t have any information about yet. After all, you don’t need confirmation that S is the last letter, you need more information about the other letters, so choose your guesses so the information adds up. You’ll find that a little information, like a little knowledge, goes a long way.
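If you want to turn that advice into an algorithm, a natural move is to score each possible guess by the information you can expect it to reveal, averaged over the words still in play. Below is a minimal sketch of that idea, reusing the same simplified feedback scorer as in the earlier sketch; real Wordle solvers work from the game's full word lists and handle repeated letters more carefully.

```python
from collections import Counter
from math import log2

def feedback(guess, secret):
    """Simplified Wordle feedback (ignores repeated-letter subtleties)."""
    return "".join(
        "G" if g == s else ("Y" if g in secret else "-")
        for g, s in zip(guess, secret)
    )

def expected_bits(guess, candidates):
    """Average information gained by playing guess, assuming each
    remaining candidate is equally likely to be the secret word."""
    counts = Counter(feedback(guess, secret) for secret in candidates)
    n = len(candidates)
    return sum((c / n) * log2(n / c) for c in counts.values())

candidates = ["CATCH", "HATCH", "MATCH", "PATCH"]
for guess in ["MATCH", "CHIMP"]:
    print(guess, round(expected_bits(guess, candidates), 2))
# MATCH 0.81 -- three of its four possible outcomes look identical
# CHIMP 2.0  -- always extracts the full two bits needed to pick one of four
```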
Exercises
1. In a game of 2-Wordle you guess AM and it comes back
How much information have you gained?
2. In a game of 2-Wordle, knowing that the secret word contains an N is equivalent to
$latex \log_{2}\frac{1}{\frac{2}{16}}=\log_{2}\frac{16}{2}=\log_{2}8=3$
bits of information. Knowing that the word contains an O is equivalent to
$latex \log_{2}\frac{1}{\frac{4}{16}}=\log_{2}\frac{16}{4}=\log_{2}4=2$
bits of information. Why isn’t this apparent 2 + 3 = 5 bits of information enough to uniquely identify the secret word?
3. In a Wordle game where you know that the last four letters of the secret word are ATCH, there’s one situation where you definitely should guess a word like MATCH instead of CHIMP. What is it?
4. Suppose you are playing a Wordle-like game where you are trying to identify a secret two-digit number string (the first digit could be zero). Devise an optimal strategy for guessing the number.