SERIES
Data Driven: The New Big Science

The Future Fabric of Data Analysis

The nature of computing has changed dramatically over the last decade, and more innovation is needed to weather the gathering data storm.

Tang Yau Hoong for Quanta Magazine

Introduction

When subatomic particles smash together at the Large Hadron Collider in Switzerland, they create showers of new particles whose signatures are recorded by four detectors. The LHC captures 5 trillion bits of data — more information than all of the world’s libraries combined — every second. After the judicious application of filtering algorithms, more than 99 percent of those data are discarded, but the four experiments still produce a whopping 25 petabytes (25×1015 bytes) of data per year that must be stored and analyzed. That is a scale far beyond the computing resources of any single facility, so the LHC scientists rely on a vast computing grid of 160 data centers around the world, a distributed network that is capable of transferring as much as 10 gigabytes per second at peak performance.

The LHC’s approach to its big data problem reflects just how dramatically the nature of computing has changed over the last decade. Since Intel co-founder Gordon E. Moore first defined it in 1965, the so-called Moore’s law — which predicts that the number of transistors on integrated circuits will double every two years — has dominated the computer industry. While that growth rate has proved remarkably resilient, for now, at least, “Moore’s law has basically crapped out; the transistors have gotten as small as people know how to make them economically with existing technologies,” said Scott Aaronson, a theoretical computer scientist at the Massachusetts Institute of Technology.

Instead, since 2005, many of the gains in computing power have come from adding more parallelism via multiple cores, with multiple levels of memory. The preferred architecture no longer features a single central processing unit (CPU) augmented with random access memory (RAM) and a hard drive for long-term storage. Even the big, centralized parallel supercomputers that dominated the 1980s and 1990s are giving way to distributed data centers and cloud computing, often networked across many organizations and vast geographical distances.

These days, “People talk about a computing fabric,” said Stanford University electrical engineer Stephen Boyd. These changes in computer architecture translate into the need for a different computational approach when it comes to handling big data, which is not only grander in scope than the large data sets of yore but also intrinsically different from them.

The demand for ever-faster processors, while important, isn’t the primary focus anymore. “Processing speed has been completely irrelevant for five years,” Boyd said. “The challenge is not how to solve problems with a single, ultra-fast processor, but how to solve them with 100,000 slower processors.” Aaronson points out that many problems in big data can’t be adequately addressed by simply adding more parallel processing. These problems are “more sequential, where each step depends on the outcome of the preceding step,” he said. “Sometimes, you can split up the work among a bunch of processors, but other times, that’s harder to do.” And often the software isn’t written to take full advantage of the extra processors. “If you hire 20 people to do something, will it happen 20 times faster?” Aaronson said. “Usually not.”

Researchers also face challenges in integrating very differently structured data sets, as well as the difficulty of moving large amounts of data efficiently through a highly distributed network.

Those issues will become more pronounced as the size and complexity of data sets continue to grow faster than computing resources, according to California Institute of Technology physicist Harvey Newman, whose team developed the LHC’s grid of data centers and trans-Atlantic network. He estimates that if current trends hold, the computational needs of big data analysis will place considerable strain on the computing fabric. “It requires us to think about a different kind of system,” he said.

Memory and Movement

Emmanuel Candes, an applied mathematician at Stanford University, was once able to crunch big data problems on his desktop computer. But last year, when he joined a collaboration of radiologists developing dynamic magnetic resonance imaging — whereby one could record a patient’s heartbeat in real time using advanced algorithms to create high-resolution videos from limited MRI measurements — he found that the data no longer fit into his computer’s memory, making it difficult to perform the necessary analysis.

Addressing the storage-capacity challenges of big data is not simply a matter of building more memory, which has never been more plentiful. It is also about managing the movement of data. That’s because, increasingly, the desired data is no longer at people’s fingertips, stored in a single computer; it is distributed across multiple computers in a large data center or even in the “cloud.”

There is a hierarchy to data storage, ranging from the slowest, cheapest and most abundant memory to the fastest and most expensive, with the least available space. At the bottom of this hierarchy is so-called “slow memory” such as hard drives and flash drives, the cost of which continues to drop. There is more space on hard drives, compared to the other kinds of memory, but saving and retrieving the data takes longer. Next up this ladder comes RAM, which is much faster than slow memory but offers less space and is more expensive. Then there is cache memory — another trade-off of space and price in exchange for faster retrieval speeds — and finally the registers on the microchip itself, which are the fastest of all but the priciest to build, with the least available space. If memory storage were like real estate, a hard drive would be a sprawling upstate farm, RAM would be a medium-sized house in the suburbs, cache memory would be a townhouse on the outskirts of a big city, and the register memory would be a tiny studio in a prime urban location.

Longer commutes for stored data translate into processing delays. “When computers are slow today, it’s not because of the microprocessor,” Aaronson said. “The microprocessor is just treading water waiting for the disk to come back with the data.” Big data researchers prefer to minimize how much data must be moved back and forth from slow memory to fast memory.  The problem is exacerbated when the data is distributed across a network or in the cloud, because it takes even longer to move the data back and forth, depending on bandwidth capacity, so that it can be analyzed.

One possible solution to this dilemma is to embrace the new paradigm. In addition to distributed storage, why not analyze the data in a distributed way as well, with each unit (or node) in a network of computers performing a small piece of a computation? Each partial solution is then integrated to find the full result. This approach is similar in concept to the LHC’s, in which one complete copy of the raw data (after filtering) is stored at the CERN research facility in Switzerland that is home to the collider. A second copy is divided into batches that are then distributed to data centers around the world. Each center analyzes its chunk of data and transmits the results to regional computers before moving on to the next batch.

Boyd’s system is based on so-called consensus algorithms. “It’s a mathematical optimization problem,” he said of the algorithms. “You are using past data to train the model in hopes that it will work on future data.” Such algorithms are useful for creating an effective SPAM filter, for example, or for detecting fraudulent bank transactions.

This can be done on a single computer, with all the data in one place.  Machine learning typically uses many processors, each handling a little bit of the problem. But when the problem becomes too large for a single machine, a consensus optimization approach might work better, in which the data set is chopped into bits and distributed across 1,000 “agents” that analyze their bit of data and each produce a model based on the data they have processed. The key is to require a critical condition to be met: although each agent’s model can be different, all the models must agree in the end — hence the term “consensus algorithms.”

The process by which 1,000 individual agents arrive at a consensus model is similar in concept to the Mechanical Turk crowd-sourcing methodology employed by Amazon — with a twist. With the Mechanical Turk, a person or a business can post a simple task, such as determining which photographs contain a cat, and ask the crowd to complete the task in exchange for gift certificates that can be redeemed for Amazon products, or for cash awards that can be transferred to a personal bank account. It may seem trivial to the human user, but the program learns from this feedback, aggregating all the individual responses into its working model, so it can make better predictions in the future.

In Boyd’s system, the process is iterative, creating a feedback loop. The initial consensus is shared with all the agents, which update their models in light of the new information and reach a second consensus, and so on. The process repeats until all the agents agree. Using this kind of distributed optimization approach significantly cuts down on how much data needs to be transferred at any one time.

The Quantum Question

Late one night, during a swanky Napa Valley conference last year, MIT physicist Seth Lloyd found himself soaking in a hot tub across from Google’s Sergey Brin and Larry Page — any aspiring technology entrepreneur’s dream scenario. Lloyd made his pitch, proposing a quantum version of Google’s search engine whereby users could make queries and receive results without Google knowing which questions were asked. The men were intrigued. But after conferring with their business manager the next day, Brin and Page informed Lloyd that his scheme went against their business plan. “They want to know everything about everybody who uses their products and services,” he joked.

It is easy to grasp why Google might be interested in a quantum computer capable of rapidly searching enormous data sets. A quantum computer, in principle, could offer enormous increases in processing power, running algorithms significantly faster than a classical (non-quantum) machine for certain problems. Indeed, the company just purchased a reportedly $15 million prototype from a Canadian firm called D-Wave Systems, although the jury is still out on whether D-Wave’s product is truly quantum.

“This is not about trying all the possible answers in parallel. It is fundamentally different from parallel processing,” said Aaronson. Whereas a classical computer stores information as bits that can be either 0s or 1s, a quantum computer could exploit an unusual property: the superposition of states. If you flip a regular coin, it will land on heads or tails. There is zero probability that it will be both heads and tails. But if it is a quantum coin, technically, it exists in an indeterminate state of both heads and tails until you look to see the outcome.

A true quantum computer could encode information in so-called qubits that can be 0 and 1 at the same time. Doing so could reduce the time required to solve a difficult problem that would otherwise take several years of computation to mere seconds. But that is easier said than done, not least because such a device would be highly sensitive to outside interference: The slightest perturbation would be equivalent to looking to see if the coin landed heads or tails, and thus undo the superposition.

However, Aaronson cautions against placing too much hope in quantum computing to solve big data’s computational challenges, insisting that if and when quantum computers become practical, they will be best suited to very specific tasks, most notably to simulate quantum mechanical systems or to factor large numbers to break codes in classical cryptography. Yet there is one way that quantum computing might be able to assist big data: by searching very large, unsorted data sets — for example, a phone directory in which the names are arranged randomly instead of alphabetically.

It is certainly possible to do so with sheer brute force, using a massively parallel computer to comb through every record. But a quantum computer could accomplish the task in a fraction of the time. That is the thinking behind Grover’s algorithm, which was devised by Bell Labs’ Lov Grover in 1996. However, “to really make it work, you’d need a quantum memory that can be accessed in a quantum superposition,” Aaronson said, but it would need to do so in such a way that the very act of accessing the memory didn’t destroy the superposition, “and that is tricky as hell.”

In short, you need quantum RAM (Q-RAM), and Lloyd has developed a conceptual prototype, along with an accompanying program he calls a Q-App (pronounced “quapp”) targeted to machine learning. He thinks his system could find patterns within data without actually looking at any individual records, thereby preserving the quantum superposition (and the users’ privacy). “You can effectively access all billion items in your database at the same time,” he explained, adding that “you’re not accessing any one of them, you’re accessing common features of all of them.”

For example, if there is ever a giant database storing the genome of every human being on Earth, “you could search for common patterns among different genes” using Lloyd’s quantum algorithm, with Q-RAM and a small 70-qubit quantum processor while still protecting the privacy of the population, Lloyd said. The person doing the search would have access to only a tiny fraction of the individual records, he said, and the search could be done in a short period of time. With the cost of sequencing human genomes dropping and commercial genotyping services rising, it is quite possible that such a database might one day exist, Lloyd said. It could be the ultimate big data set, considering that a single genome is equivalent to 6 billion bits.

Lloyd thinks quantum computing could work well for powerhouse machine-learning algorithms capable of spotting patterns in huge data sets — determining what clusters of data are associated with a keyword, for example, or what pieces of data are similar to one another in some way. “It turns out that many machine-learning algorithms actually work quite nicely in quantum computers, assuming you have a big enough Q-RAM,” he said. “These are exactly the kinds of mathematical problems people try to solve, and we think we could do very well with the quantum version of that.”

The Future Is Integration

Google’s Alon Halevy believes that the real breakthroughs in big data analysis are likely to come from integration — specifically, integrating across very different data sets. “No matter how much you speed up the computers or the way you put computers together, the real issues are at the data level,” he said. For example, a raw data set could include thousands of different tables scattered around the Web, each one listing crime rates in New York, but each may use different terminology and column headers, known as “schema.” A header of “New York” can describe the state, the five boroughs of New York City, or just Manhattan. You must understand the relationship between the schemas before the data in all those tables can be integrated.

That, in turn, requires breakthroughs in techniques to analyze the semantics of natural language. It is one of the toughest problems in artificial intelligence — if your machine-learning algorithm aspires to perfect understanding of nearly every word. But what if your algorithm needs to understand only enough of the surrounding text to determine whether, for example, a table includes data on coffee production in various countries so that it can then integrate the table with other, similar tables into one common data set? According to Halevy, a researcher could first use a coarse-grained algorithm to parse the underlying semantics of the data as best it could and then adopt a crowd-sourcing approach like a Mechanical Turk to refine the model further through human input. “The humans are training the system without realizing it, and then the system can answer many more questions based on what it has learned,” he said.

Chris Mattmann, a senior computer scientist at NASA’s Jet Propulsion Laboratory and director at the Apache Software Foundation, faces just such a complicated scenario with a research project that seeks to integrate two different sources of climate information: remote-sensing observations of the Earth made by satellite instrumentation and computer-simulated climate model outputs. The Intergovernmental Panel on Climate Change would like to be able to compare the various climate models against the hard remote-sensing data to determine which models provide the best fit. But each of those sources stores data in different formats, and there are many different versions of those formats.

Many researchers emphasize the need to develop a broad spectrum of flexible tools that can deal with many different kinds of data. For example, many users are shifting from traditional highly structured relational databases, broadly known as SQL, which represent data in a conventional tabular format, to a more flexible format dubbed NoSQL. “It can be as structured or unstructured as you need it to be,” said Matt LeMay, a product and communications consultant and the former head of consumer products at URL shortening and bookmarking service Bitly, which uses both SQL and NoSQL formats for data storage, depending on the application.

Mattmann cites an Apache software program called Tika that allows the user to integrate data across 1,200 of the most common file formats. But in some cases, some human intervention is still required. Ultimately, Mattmann would like to fully automate this process via intelligent software that can integrate differently structured data sets, much like the Babel Fish in Douglas Adams’ “Hitchhiker’s Guide to the Galaxy” book series enabled someone to understand any language.

Integration across data sets will also require a well-coordinated distributed network system comparable to the one conceived of by Newman’s group at Caltech for the LHC, which monitors tens of thousands of processors and more than 10 major network links. Newman foresees a computational future for big data that relies on this type of automation through well-coordinated armies of intelligent agents, that track the movement of data from one point in the network to another, identifying bottlenecks and scheduling processing tasks. Each might only record what is happening locally but would share the information in such a way as to shed light on the network’s global situation.

“Thousands of agents at different levels are coordinating to help human beings understand what’s going on in a complex and very distributed system,” Newman said. The scale would be even greater in the future, when there would be billions of such intelligent agents, or actors, making up a vast global distributed intelligent entity. “It’s the ability to create those things and have them work on one’s behalf that will reduce the complexity of these operational problems,” he said. “At a certain point, when there’s a complicated problem in such a system, no set of human beings can really understand it all and have access to all the information.”

This article was reprinted on Wired.com.

Comment on this article