The Big Data Effect

Two years ago, Andreas Weigend, former chief scientist of Amazon, offered the stunning prediction that human beings would generate more data in 2009 than in all of prior human history. Today, that statement seems archaic. According to Eric Schmidt, chairman of Google, we are adding nearly that much information to the human database every two days.

We are now living in what the technorati have dubbed the era of “big data.” Gargantuan data storage systems bulge with the digital detritus of everyday life, inexorably generated by computer system logs, electronic transactions, cellphones, Web search streams, page views, ad impressions, e-mail metadata, search engine queries, social networking activity, UPC bar code reads, RFID scans, GPS location data, industrial plant monitoring, weather and seismic data, automotive sensors, photographs — and more.

Business and nonbusiness enterprises are scrambling to leverage this continual data deluge using cloud technology, which makes data storage and services available on demand from remote computer networks, and software frameworks like Apache Hadoop and Google’s MapReduce, which rapidly analyze data by distributing the task to clusters of hundreds, even thousands, of computers.

Companies like Bank of America, Dell, eBay and Wal-Mart Stores each store petabytes of enterprise data (a petabyte is one quadrillion bytes). The Walt Disney Company uses a Hadoop cluster to analyze patterns across diverse customer data such as theme park attendance, purchases from Disney stores and viewership of Disney’s cable television programming. LinkedIn processes the interactions of more than 100 billion relationships a day.

“It turns out that having 10 times the amount of data allows you to automatically discover patterns that would be impossible with smaller samples,” said Peter Skomoroch, a principal data scientist at LinkedIn.

Big data analysis is also becoming the norm in government, economics, intelligence, meteorology, architecture, air traffic control, engineering and all kinds of scientific and medical research. In health care, for example, huge data streams from electronic medical records, medical imaging and pharmaceutical and genetic research are being analyzed to identify the causes of disease and options for prevention, diagnosis and treatment.

As the volume and availability of data have grown and the analytical tools to parse the data have become faster and more sophisticated, there has been a growing trend toward trying to find patterns in “unstructured” data — information streams that do not conform to any preconceived model and may not even have any immediately apparent relevance to one another. “In the past, all too often what were disregarded as ‘outliers’ on the far edges of a data model turned out to be the telltale signs of a micro-trend that became a major event,” wrote Brett Sheppard, executive director at Zettaforce, in “Putting Big Data to Work: Opportunities for Enterprises,” a recent report published in conjunction with the GigaOM Network. “Now, letting data speak for itself through analysis of entire data sets is eclipsing modeling from subsets.”

Chris Anderson, editor in chief of Wired magazine, was one of the first to put forth the notion of big data “speaking for itself.” In a 2008 article entitled “The End of Theory,” he wrote, “This is the most measured age in history. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot. We can analyze the data without hypotheses about what it might show.”

Anderson’s article drew strong reactions from the scientific community; many scientists suggested data analysis, big or otherwise, without hypothesis or theory is not science at all, but rather, in the words of Douglas Rushkoff, an author and media theorist, “mindless petabyte churn [that] favors industry over consideration, consumption over creation.”

Rushkoff might get no argument from business users of big data, who are generally more interested in what works and what sells than in causation theories. Still, the idea that with enough data and enough computing power the right answers inevitably emerge is likely to change decision making, planning and the very definition of what constitutes talent and leadership in organizations.

“The continued merging of business and technology and the resulting intensity of competition is driving down the market share of intuition and experience as a basis for decision making,” said Andrew McAfee, principal research scientist at The MIT Center for Digital Business, whose work focuses on how technology is changing the structure, behavior and performance of organizations. “We need to apply the best methods possible to our decisions, and those methods will increasingly be algorithmic, automated and data based. But that does not get rid of the obligation to create theory.”

According to McAfee, the skills that will be valued in the coming world of work will blend and balance high literacy with high numeracy — an integrated fluency with both conceptual and statistical reasoning. If so, that would imply that organizations rushing to embrace raw data analysis would do well to also vigilantly retain an ethos of critical and contextual thinking. As Kevin Kelly, author of “Out of Control,” the classic book on hive intelligence, put it: “In the coming world of cloud computing, perfectly good answers will become a commodity. The real value then becomes asking good questions.”

In that world, perhaps we should not be too quick to dismiss intuition and experience as the outmoded results of whim or bias. They are, in fact, the results of our brains engaging in some big data processing of their own, constantly sifting through a welter of conscious and unconscious inputs to our senses, storing them in the cloud of our awareness and processing them rapidly and on demand with parallel and distributed efficiency to help us make decisions. All in real time, no less.

There is no doubt that the availability of prodigious amounts of data along with the statistical tools to slice and dice them will greatly enhance our ability to observe and codify the world. But if those become the only tools we use, then creativity, inspiration and understanding will become little more than the outputs of statistical probability. The role of big data may best be put into perspective by a sign that is said to have hung in Albert Einstein’s office at Princeton University: “Not everything that counts can be counted, and not everything that can be counted counts.”