Three Hundred Years Later, a Tool from Isaac Newton Gets an Update

Introduction
Every day, researchers search for optimal solutions. They might want to figure out where to build a major airline hub. Or to determine how to maximize return while minimizing risk in an investment portfolio. Or to develop self-driving cars that can distinguish between traffic lights and stop signs.
Mathematically, these problems get translated into a search for the minimum values of functions. But in all these scenarios, the functions are too complicated to assess directly. Researchers have to approximate the minimal values instead.
It turns out that one of the best ways to do this is by using an algorithm that Isaac Newton developed over 300 years ago. This algorithm is fairly simple. It’s a little like searching, blindfolded, for the lowest point in an unfamiliar landscape. As you put one foot in front of the other, the only information you need is whether you’re going uphill or downhill, and whether the grade is increasing or decreasing. Using that information, you can get a good approximation of the minimum relatively quickly.
Although enormously powerful — centuries later, Newton’s method is still crucial for solving present-day problems in logistics, finance, computer vision and even pure math — it also has a significant shortcoming. It doesn’t work well on all functions. So mathematicians have continued to study the technique, figuring out different ways to broaden its scope without sacrificing efficiency.
Last summer, three researchers announced the latest improvement to Newton’s method. Amir Ali Ahmadi of Princeton University, along with his former students Abraar Chaudhry (now at the Georgia Institute of Technology) and Jeffrey Zhang (now at Yale University), extended Newton’s method to work efficiently on the broadest class of functions yet.
“Newton’s method has 1,000 different applications in optimization,” Ahmadi said. “Potentially our algorithm can replace it.”

In the 1680s, Isaac Newton developed an algorithm for finding optimal solutions. Three centuries later, mathematicians are still using and honing his method.
Godfrey Kneller/Public Domain
A Centuries-Old Technique
Mathematical functions transform inputs into outputs. Often, the most important feature of a function is its minimum value — the combination of inputs that produces the smallest possible output.
But finding the minimum is hard. Functions can have dozens of variables raised to high powers, defying formulaic analysis; their graphs form high-dimensional landscapes that are impossible to explore from a bird’s-eye view. In those higher-dimensional landscapes, said Coralia Cartis of the University of Oxford, “We want to find a valley. Some are local valleys; others are the lowest point. You’re trying to find these things, and the question is: What info do you have to guide you to that?”
In the 1680s, Newton recognized that even when you’re dealing with a very complicated function, you’ll still always have access to at least two pieces of information to help you find its deepest valley. First, you can calculate the function’s so-called first derivative, or slope: the steepness of the function at a given point. Second, you can compute the rate at which the slope itself is changing (the function’s second derivative).

Amir Ali Ahmadi sees optimization problems everywhere he looks.
Archives of the Mathematisches Forschungsinstitut Oberwolfach
Say you’re trying to find the minimum of some complicated function. First, choose a point on the function that you think might be close to the true minimum. Compute the function’s first and second derivatives at that point. These derivatives can be used to construct a special quadratic equation — a parabola if your function lives in a 2D plane, and a cuplike shape called a paraboloid if your function is higher dimensional. This quadratic equation, which is called a Taylor approximation, roughly resembles your function at the point you chose.
Now calculate the minimum of the quadratic equation instead of the original — something you can do easily, using a well-known formula. (That’s because quadratic equations are simple; it’s when equations get more complicated that calculating the minimum becomes prohibitive.) You’ll get a point. Then plug the coordinates of that point back into your original function, and you’ll get a new point on the function that is, hopefully, closer to its true minimum. Start the entire process again.
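In code, the whole loop is only a few lines. The sketch below is a minimal one-variable illustration written for this explanation, not taken from the article: the function f(x) = x² + eˣ and the starting point are made up for the example, and each step simply jumps to the minimum of the current quadratic Taylor model, which in one variable works out to x - f'(x)/f''(x).

```python
import math

def newton_minimize(f_prime, f_double_prime, x0, tol=1e-10, max_iter=50):
    """One-variable Newton's method for minimization.

    Each iteration builds the quadratic Taylor model at the current point
    and jumps to that model's minimum, which is x - f'(x) / f''(x).
    """
    x = x0
    for _ in range(max_iter):
        step = f_prime(x) / f_double_prime(x)
        x -= step
        if abs(step) < tol:  # the jumps have become tiny; we have settled into a valley
            break
    return x

# Illustrative function (not from the article): f(x) = x**2 + exp(x), which has a single valley.
f_prime = lambda x: 2 * x + math.exp(x)       # slope (first derivative)
f_double_prime = lambda x: 2 + math.exp(x)    # curvature (second derivative)

print(newton_minimize(f_prime, f_double_prime, x0=0.0))  # about -0.3517
```

In more variables, the same step uses the vector of first derivatives (the gradient) and the matrix of second derivatives (the Hessian) in place of these two numbers.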
Newton proved that if you keep on repeating this process, you’ll eventually home in on the minimum value of the original, more complicated function. The method doesn’t always work, especially if you start at a point that’s too far away from the true minimum. But for the most part, it does. And it has some desirable attributes.
Mark Belan/Quanta Magazine; Source: arxiv:2305.07512
Other iterative methods, like gradient descent — the algorithm used in today’s machine learning models — converge toward the true minimum at a linear rate. Newton’s method converges toward it much faster: at a “quadratic” rate. In other words, it can identify the minimum value in fewer iterations than gradient descent. (Each iteration of Newton’s method is more computationally expensive than an iteration of gradient descent, which is why researchers prefer gradient descent for certain applications, like training neural networks. But Newton’s method is still enormously efficient, making it useful in all sorts of contexts.)
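That difference in rates is easy to see numerically. The comparison below reuses the illustrative function from the earlier sketch and simply counts iterations until the points stop moving; the fixed step size for gradient descent is an arbitrary choice made for the example.

```python
import math

f_prime = lambda x: 2 * x + math.exp(x)      # same illustrative function as before
f_double_prime = lambda x: 2 + math.exp(x)

def iterations_until_converged(update, x0, tol=1e-12, max_iter=100_000):
    """Apply an update rule until successive points essentially stop moving."""
    x, count = x0, 0
    while count < max_iter:
        x_next = update(x)
        count += 1
        if abs(x_next - x) < tol:
            break
        x = x_next
    return count

gradient_descent = lambda x: x - 0.1 * f_prime(x)              # linear convergence
newton = lambda x: x - f_prime(x) / f_double_prime(x)          # quadratic convergence

print(iterations_until_converged(gradient_descent, x0=0.0))    # dozens of steps
print(iterations_until_converged(newton, x0=0.0))              # a handful of steps
```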
Newton could have designed his method to converge toward the true minimum even faster if, instead of taking just the first and second derivatives at each point, he had also taken, say, the third and fourth derivatives. That would have given him more complicated Taylor approximations, with exponents greater than 2. But the crux of his strategy was to transform a complicated function into a simpler one, and these higher-degree Taylor approximations were more than he could handle mathematically.
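In symbols, the degree-d Taylor approximation of a one-variable function f around a point a is the standard expansion

```latex
T_d(x) = f(a) + f'(a)(x - a) + \frac{f''(a)}{2!}(x - a)^2 + \cdots + \frac{f^{(d)}(a)}{d!}(x - a)^d
```

Newton kept only the terms up to degree 2; taking larger d copies the function more faithfully near a, but it produces a polynomial that is, in general, much harder to minimize.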
“Newton did it for degree 2. He did that because nobody knew how to minimize higher-order polynomials,” Ahmadi said.
In the centuries since, mathematicians have worked to extend his method, to probe how much information they can squeeze out of more complicated Taylor approximations of their functions.
In the 19th century, for instance, the Russian mathematician Pafnuty Chebyshev proposed a version of Newton’s method that approximated functions with cubic equations (which have an exponent of 3). But his algorithm didn’t work when the original function involved multiple variables. Much more recently, in 2021, Yurii Nesterov (now at Corvinus University of Budapest) demonstrated how to approximate functions of any number of variables efficiently with cubic equations. But his method couldn’t be extended to approximate functions using quartic equations, quintics and so on without losing its efficiency. Nevertheless, the proof was a major breakthrough in the field.
Now Ahmadi, Chaudhry and Zhang have taken Nesterov’s result another step further. Their algorithm works for any number of variables and arbitrarily many derivatives. Moreover, it remains efficient for all these cases — something that until now wasn’t possible.
But first, they had to find a way to make a hard math problem a lot easier.
Finding Wiggle Room
There is no fast, general-purpose method for finding the minima of functions that involve high exponents. That has always been the main limitation of Newton’s method. But certain types of functions have characteristics that make them easy to minimize. In the new work, Ahmadi, Chaudhry and Zhang prove that it’s always possible to find approximating equations that have these characteristics. They then show how to adapt these equations to run Newton’s method efficiently.
What properties make an equation easy to minimize? Two things: The first is that the equation should be bowl-shaped, or “convex.” Rather than having many valleys, it has just one — meaning that when you try to minimize it, you don’t have to worry about mistaking an arbitrary valley for the lowest one.

Abraar Chaudhry and two colleagues recently found a way to improve a centuries-old method for finding the minima of functions.
Camille Carpenter Henriquez
The second property is that the equation can be written as a sum of squares. For example, 5x² + 16x + 13 can be written as the sum (x + 2)² + (2x + 3)². In recent years, mathematicians have developed techniques for minimizing equations with arbitrarily large exponents so long as they are both convex and a sum of squares. However, those techniques were of little help when it came to Newton’s method. Most of the time, the Taylor approximation you use won’t have these nice properties.
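The connection to semidefinite programming, described next, can be made concrete with a small numerical check. The sketch below is written for this example rather than drawn from the new paper: it certifies that the article’s polynomial is a sum of squares by exhibiting a positive semidefinite Gram matrix for it, which is exactly the kind of certificate a semidefinite program searches for.

```python
import numpy as np

# The article's example: 5x^2 + 16x + 13 = (x + 2)^2 + (2x + 3)^2.
# A polynomial is a sum of squares exactly when it can be written as
# v^T Q v for a vector of monomials v and a positive semidefinite matrix Q;
# searching for such a Q is what semidefinite programming does.

Q = np.array([[13.0, 8.0],   # Gram matrix for v = [1, x]:
              [8.0, 5.0]])   # v^T Q v = 13 + 16x + 5x^2

print(np.linalg.eigvalsh(Q))   # both eigenvalues are nonnegative, so Q is positive semidefinite

# A factorization Q = L L^T turns the certificate into explicit squares:
# the polynomial equals the sum over i of ((L^T v)_i)^2.
L = np.linalg.cholesky(Q)
print(L)   # each column of L gives the coefficients of one squared term
           # (sum-of-squares decompositions are not unique)
```

The article’s decomposition (x + 2)² + (2x + 3)² comes from a different factorization of this same matrix Q.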
But Ahmadi, Chaudhry and Zhang figured out how to use a technique called semidefinite programming to wiggle the Taylor approximation just enough to make it both a sum of squares and convex, though not so much that it became unmoored from the original function it was supposed to resemble.
They essentially added a fudge factor to the Taylor expansion, turning it into an equation that had the two desired properties. “We can change the Taylor expansion a bit to make it simpler to minimize. Think of the Taylor expansion, but modified a little bit,” Ahmadi said. He and his colleagues then showed that, using this modified version of the Taylor expansion — which involved arbitrarily many derivatives — their algorithm would still converge on the true minimum of the original function. Moreover, the rate of convergence would scale with the number of derivatives used: Just as using two derivatives allowed Newton to approach the true minimum at a quadratic rate, using three derivatives enabled the researchers to approach it at a cubic rate, and so on.
Ahmadi, Chaudhry and Zhang had created a more powerful version of Newton’s method that could reach the true minimum value of a function in fewer iterations than previous techniques.
As with the original version of Newton’s method, each iteration of this new algorithm is still more computationally expensive than an iteration of methods such as gradient descent. As a result, for the moment, the new work won’t change the way self-driving cars, machine learning algorithms or air traffic control systems work. The best bet in these cases is still gradient descent.
“Many ideas in optimization take years before they are made fully practical,” said Jason Altschuler of the University of Pennsylvania. “But this seems like a fresh perspective.”
If, over time, the underlying computational technology needed to run Newton’s method becomes more efficient — making each iteration less computationally expensive — then the algorithm developed by Ahmadi, Chaudhry and Zhang could eventually surpass gradient descent for all sorts of applications, including machine learning.
“Our algorithm right now is provably faster, in theory,” Ahmadi said. He’s hopeful, he added, that in 10 to 20 years, it will also be so in practice.
Correction: March 25, 2025
The graphic in this article has been updated.