Neural Networks Need Data to Learn. Even If It’s Fake.
Introduction
On a sunny day in late 1987, a Chevy van drove down a curvy wooded path on the campus of Carnegie Mellon University in Pittsburgh. The hulking vehicle, named Navlab, wasn’t notable for its beauty or speed, but for its brain: It was an experimental autonomous vehicle, guided by four computers in its cargo area that were powerful for their time.
At first, the engineers behind Navlab tried to control the vehicle with a navigation algorithm, but, like many researchers before them, they found it difficult to account for the huge range of driving conditions with a single set of instructions. So they tried again, this time using an approach to artificial intelligence called machine learning: The van would teach itself how to drive. A graduate student named Dean Pomerleau constructed an artificial neural network, made from small logic-processing units meant to work like brain cells, and set out to train it with photographs of roads under different conditions. But taking enough photographs to cover the enormous variety of potential driving situations was too difficult for the small team, so Pomerleau generated 1,200 synthetic road images on a computer and used those to train the system. The self-taught machine drove as well as anything else the researchers came up with.
Navlab didn’t directly lead to any major breakthroughs in autonomous driving, but the project did show the power of synthetic data to train AI systems. As machine learning leapt forward in subsequent decades, it developed an insatiable appetite for training data. But data is hard to get: It can be expensive, private or in short supply. As a result, researchers are increasingly turning to synthetic data to supplement or even replace natural data for training neural networks. “Machine learning has long been struggling with the data problem,” said Sergey Nikolenko, the head of AI at Synthesis AI, a company that generates synthetic data to help customers make better AI models. “Synthetic data is one of the most promising ways to solve that problem.”
Fortunately, as machine learning has grown more sophisticated, so have the tools for generating useful synthetic data.
One area where synthetic data is proving useful is in addressing concerns about facial recognition. Many facial recognition systems are trained with huge libraries of images of real faces, which raises issues about the privacy of the people in the images. Bias is also a problem, since various populations are over- and underrepresented in those libraries. Researchers at Microsoft’s Mixed Reality & AI Lab have tackled these concerns, releasing a collection of 100,000 synthetic faces for training AI systems. These faces are generated from a set of 500 people who gave permission for their faces to be scanned.
Microsoft’s system takes elements of faces from the initial set to make new and unique combinations, then adds visual flair with details like makeup and hair. The researchers say their data set spans a wide range of ethnicities, ages and styles. “There’s always a long tail of human diversity. We think and hope we’re capturing a lot of it,” said Tadas Baltrušaitis, a Microsoft researcher working on the project.
Another advantage of the synthetic faces is that the computer can label every part of every face, which helps the neural net learn faster. Real photos must instead be labeled by hand, which takes much longer and is never as consistent or accurate.
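To see why those labels come free, note that a renderer already knows exactly which pixels it painted as what, so it can emit a perfect label mask alongside every image. Here is a deliberately toy sketch of the idea in Python; a filled circle stands in for a rendered face, and none of this is Microsoft's actual code:

```python
# A minimal sketch of why synthetic data comes with "free" labels: the
# code that draws the image also knows exactly where every feature is.
import numpy as np

def render_synthetic_sample(size=64, rng=None):
    """Draw a toy 'face' (a filled circle) and return the image together
    with an exact per-pixel label mask; no hand labeling is involved."""
    rng = rng if rng is not None else np.random.default_rng()
    cx, cy = rng.integers(16, size - 16, size=2)   # random face position
    r = rng.integers(8, 14)                        # random face radius
    yy, xx = np.mgrid[:size, :size]
    mask = ((xx - cx) ** 2 + (yy - cy) ** 2) <= r ** 2  # pixel-perfect ground truth
    image = np.where(mask, 0.8, 0.1) + rng.normal(0.0, 0.02, (size, size))
    return image.astype(np.float32), mask.astype(np.uint8)

image, label = render_synthetic_sample()
print(image.shape, int(label.sum()), "pixels labeled 'face', exactly")
```

Scaled up, the same principle lets a full face renderer tag every pixel as skin, hair, eye and so on, with perfect consistency across millions of images.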
The results aren’t photorealistic — the faces look a little like characters from a Pixar movie — but Microsoft has used them to train face recognition networks whose accuracy approaches that of networks trained on millions of real faces.
The ability of computers to generate useful synthetic data has also improved recently, due in part to better GPUs, a type of chip designed for graphics processing that can render more realistic images. Erroll Wood, a researcher currently at Google who also helped create the synthetic faces, relied on GPUs for an eye-tracking project. Eye tracking is a difficult task for computers, since it involves following the minute movements of different-looking eyes under varied lighting conditions, even at extreme angles where the eyeball is only barely visible. Normally it would take thousands of photos of human eyes, which are hard to obtain and prohibitively expensive, for a machine to learn where a person is looking.
Wood’s team showed that a computer equipped with a GPU and running Unity, a software package for producing video games, could generate the necessary pictures, down to the detailed reflections of digital scenes wrapped around the curved, wet surface of the human eyeball. The system took just 23 milliseconds per photo, and only 3.6 of those milliseconds went to actually rendering the image; the rest was spent storing it. The researchers produced 1 million eye images and used them to train a neural network, which performed as well as the same network trained on real photos of human eyes, at a fraction of the cost and in far less time. As with Microsoft’s synthetic faces, the eye-tracking network benefited from the computer’s ability to apply pixel-perfect labels to the training images.
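The pattern here, render labeled images and then train on them, can be made concrete with a short sketch. The following PyTorch code is a schematic stand-in for the pipeline described above, not Wood's actual system: the synth_eye_batch function fakes the Unity renderer's output with random tensors, but it preserves the key property that every synthetic image arrives with the exact gaze angles used to create it.

```python
# Schematic sketch of training a gaze estimator purely on synthetic data.
# Assumption: a real pipeline would render eye images in Unity; here a
# stand-in generator produces placeholder pixels plus exact gaze labels.
import torch
import torch.nn as nn

def synth_eye_batch(n, size=32):
    """Stand-in for the renderer: returns fake 'eye images' along with
    the ground-truth gaze angles (pitch, yaw) that 'produced' them."""
    gaze = torch.rand(n, 2) - 0.5            # exact labels, free of charge
    images = torch.randn(n, 1, size, size)   # placeholder for rendered pixels
    return images, gaze

model = nn.Sequential(                       # tiny gaze-regression network
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(4), nn.Flatten(),
    nn.Linear(16 * 4 * 4, 2),                # predict (pitch, yaw)
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):                      # train on synthetic data only
    images, gaze = synth_eye_batch(64)
    loss = nn.functional.mse_loss(model(images), gaze)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In a real system the generator would be the expensive part to build, but once it exists, labeled training data becomes effectively unlimited.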
Researchers are also using the latest AI systems to create the data needed to train AI systems. In medicine, for example, a long-standing goal has been to create a neural network that can interpret radiological images as well as human radiologists can. But it’s hard to get the data necessary to train these systems, since X-rays and CT scans of real patients are private health information. It’s a burden to get access to the thousands or millions of images necessary to train a truly accurate model.
Earlier this year, Hazrat Ali, a computer scientist at Hamad Bin Khalifa University in Qatar, described his early experiments using DALL·E 2, a popular diffusion model, to create realistic X-ray and CT images of lungs, including representations of specific lung conditions. These images can then be used to train a neural network to detect tumors and other abnormalities. Within a year, he expects diffusion models to set a new benchmark for AI radiology tools. “Once we are able to synthesize more realistic MRIs, CTs and perhaps ultrasounds, this is going to speed up research and, ultimately, clinical translation, without raising concerns about patients’ privacy and data sharing,” he said.
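As a rough illustration of that workflow, the sketch below samples images from an off-the-shelf text-to-image diffusion model using the Hugging Face diffusers library. The checkpoint and prompt are illustrative assumptions only; Ali's experiments used DALL·E 2, and any real radiology application would need a model trained on medical images and expert review of the outputs.

```python
# A hedged sketch of generating synthetic X-ray-like training images with
# a diffusion model. The checkpoint and prompt are placeholder choices,
# not the setup from Ali's experiments (which used DALL·E 2).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",        # illustrative public checkpoint
    torch_dtype=torch.float16,
).to("cuda")

prompt = "frontal chest X-ray showing a small lung nodule, grayscale radiograph"
for i in range(8):                           # a small synthetic batch
    image = pipe(prompt).images[0]           # one generated PIL image
    image.save(f"synthetic_xray_{i:03d}.png")  # later fed to a detector's training set
```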
As Navlab timidly rolled through the CMU campus, onlookers probably didn’t think they were watching the birth of an important technology. But that slow journey helped introduce the world to synthetic data, which has taken on a key role in the development of artificial intelligence. And that role may become truly essential in the future. “Synthetic data is here to stay,” said Marina Ivašić-Kos, a machine learning researcher at the University of Rijeka in Croatia. “The endgame is to completely replace real data with synthetic data.”
Correction: June 20, 2023
A previous version of this article included an older affiliation for Hazrat Ali. He’s currently a researcher at Hamad Bin Khalifa University.