Making Sense of Big Data: Needles of Insight, Haystacks of Numbers



I am shy and reserved, the analyst tells me. Social events are all right, but I often enjoy a quiet night at home. Not that I am prone to stress. Quite the opposite: “You come across to others as someone who is rarely bothered by things.” More insights are coming: This analyst can tell whether I am gay or straight, whether I smoke, use drugs or drink alcohol, and how happy I am with my life. It can discern my approximate IQ score, my politics, religious views, ethnicity, age, gender and even whether my parents divorced during my childhood. Impressive, considering that the analyst is a computer algorithm working with limited information: All it has to go on is a list of my “likes” on Facebook.

A Web site called youarewhatyoulike.com generated the personality profile in less than a second, by comparing my Facebook information with a vast trove of data on other users. And in a paper published this year in the Proceedings of the National Academy of Sciences, the site’s proprietors—researchers at Cambridge University and Microsoft—demonstrated how Facebook “likes” can predict, with an 80 percent to 95 percent chance of being right, all those other private (and marketable) details. You may feel your life has its distinct and separate parts (Facebook likes, vacation preferences, credit history, kind of childhood you had) but analytic algorithms are getting better at interpreting each data point as part of a whole—and using that single clue to extrapolate the entire person. Indeed, with people leaving so many digital traces—some 2.5 quintillion new bytes of data daily, according to the consultant Marcia Conner—this kind of inference is getting easier and cheaper every day.

“Big Data” is our buzzword for machines combing huge stockpiles of information to find connections that human beings can’t see, but that phrase is something of a misnomer. It isn’t quantity that makes the new tidal wave of data so disruptive—businesses and governments have been dealing with oceans of facts and figures for decades. What matters, rather, is that this data is different. As Conner reports, some 80 percent of the data that consumers now reveal about themselves is unstructured, in those Facebook “likes,” tweets, blogs, YouTube clicks and other forms of self-expression. These can’t be captured by old-school information tools that require structure (like, for instance, the boxes of a census form or the one-through-nine “agree-disagree” scales of a survey).

Today’s data tools don’t need information to be packed into predesigned boxes. Instead, analysts can treat almost any activity, human or machine, as an opportunity to harvest useful information. Meanwhile, data of all kinds has become easier to collect, thanks to improved technology. Sensors can monitor the position, speed and mechanical function of a delivery truck; radio frequency identification tags can log what happens to every item in a supply chain; and the passing thoughts and feelings of consumers, once inaccessible, are being recorded online.

Meanwhile, as the kinds of data available for analytics have multiplied, so have the organizations that use these techniques. Cloud-based data storage and distributed computing have made the analytics affordable. With Hadoop software, for example, a large number of inexpensive computers can work with an enormous amount of data, without sharing any memory or processors. All these computers working in parallel can address massive data-processing challenges that used to be the domain of expensive mainframes. According to a recent report by Thomas H. Davenport and Jill Dyché of the analytics firm SAS, one company estimated that the cost of using one terabyte of data for a year was $37,000 for a conventional relational database, $5,000 for a dedicated combination of hardware and software (a “database appliance”) but only $2,000 for a Hadoop cluster.

In fact, this low barrier to entry is another way in which Big Data is profoundly different from previous technologies. Information is now so cheap to collect and analyze that individuals can and increasingly do avail themselves of the same techniques. So far, many of these uses are purely for fun—on the site weddingcrunchers.com, for example, users can track changes in the language and content of wedding announcements in The New York Times, mapping social change over the decades. (In the 1990’s, for instance, newlyweds in their 30’s began to outnumber newlyweds in their 20’s, a trend that has not reversed.) But other applications may affect consumers’ relationships to brands and their buying behavior. Consider Buycott, a new app for smartphones that scans a product code and tells its user in an instant about the company that made the item, and its parent company as well. Geared to social-change campaigns, the app relates basic corporate facts to social-responsibility and political reports. With the app, a shopper who scans a box at the supermarket can immediately know if the maker (or its parent company, or parent company’s parent company) is one of 36 firms that regularly give money to defeat laws that would require food derived from genetically modified organisms to be identified on labels.

This convergence of three factors—tools that can treat almost any digital signal as useful data, more and more means to gather such information, and ever-cheaper analytics—powers the Big Data revolution, with all its well-publicized successes in cost control, quality assurance and productivity. Davenport and Dyché, for example, cite a health insurance company that now has a better gauge of customer dissatisfaction because it analyzes speech-to-text data from its call center recordings. And United Parcel Service now has sensors on its more than 46,000 vehicles, monitoring and reporting speed, direction, braking and drivetrain performance. Analyses of this information have improved route planning: In 2011, the company saved more than 8.4 million gallons of fuel by shaving 85 million miles off its pickup and delivery routes.

Some companies have avoided the risk of heartbreak by stepping back and leaving the early adoption of Big Data to others. Some retailers, for example, are backing away from loyalty card marketing, happily giving up the chance to gather the heaps of data that those cards provide about customer behavior. Last summer, Supermarket News noted that AB Acquisitions, the parent company of firms that run Albertsons, Acme, Jewel-Osco and other grocery chains, was abandoning the cards (spinning the move as “discounts for everybody” egalitarianism). “We found that tracking individual shopping habits isn’t as critical to our overall strategy as knowing what our customers in our neighborhoods are shopping for,” an Albertsons spokeswoman told Supermarket News editor David Orgel. “Tracking individual purchases can be one way to do it, but it’s not the only way.”

Meanwhile, consumer wariness about giving away data may be causing some to throw away their loyalty cards. A 2012 study by the research firm Colloquy, for example, found that fewer than half of Americans’ loyalty-card memberships were in use, and between 2010 and 2012 the number of American supermarket loyalty accounts declined. Journalist Brian Palmer thinks that’s great. Big Data, he wrote recently, can make retailers unimaginative and lazy. “Would you prefer to shop at a store that increases profits by figuring out what you already do, then tricking you into doing it a little more often?” he asks. “Or a store that thinks creatively, brings you new products and showcases its wares in a novel way?”

It would be bad enough if over-reliance on Big Data caused a business to neglect other crucial skills. But there’s another potential pitfall: Big and diverse data sets can be devilishly hard to analyze in a way that generates useful insights. Such data tend to present many “false positives” (apparent causal relationships that are really just coincidences) and blind alleys (“obvious” connections that are, in fact, statistical dead ends).
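To see why large, varied data sets throw up so many coincidences, consider a small sketch in Python. It is only an illustration under stated assumptions: the variables are pure random noise, and the function names, the 40-by-20 data size and the 0.5 cutoff are all arbitrary choices made for this example, not anything drawn from a real analysis.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_strong_pairs(n_vars, n_obs, threshold, seed=0):
    """Count pairs of pure-noise variables whose |r| meets the threshold."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    columns = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    return sum(
        1
        for a, b in itertools.combinations(columns, 2)
        if abs(pearson(a, b)) >= threshold
    )


if __name__ == "__main__":
    n_vars, n_obs = 40, 20  # hypothetical: 40 unrelated measures, 20 records
    total_pairs = n_vars * (n_vars - 1) // 2
    hits = count_strong_pairs(n_vars, n_obs, threshold=0.5)
    print(f"{hits} of {total_pairs} unrelated pairs look correlated (|r| >= 0.5)")
```

None of these variables has anything to do with any other, yet with hundreds of pairs to compare, some will clear the bar by chance alone. The more variables an analyst collects, the more such pairs there are to stumble over.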

Here’s an example: Since the turn of the millennium the median wage for all Americans has risen by 1 percent. At the same time, as the statistician David Smith recently noted, it is also true that the median wage has fallen since 2000 for high school dropouts, high school graduates, high school graduates with some college, college graduates and employees with advanced degrees. This phenomenon, in which aggregate data show one trend but data on every subgroup show the opposite, is known as Simpson’s paradox. It usually indicates that an important factor was overlooked when the data were collected. In this case, as Smith pointed out, the explanation is that more students in 2013 are graduating from college than did in 2000, and college grads have suffered less wage attrition than those without a bachelor’s degree. Because college grads do better than non-graduates, the higher college graduation rate raises the aggregate wage, even though wages within the college-grad cohort haven’t gone up.
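The arithmetic behind that wage reversal can be sketched in a few lines of Python. The figures below are hypothetical, chosen only to make the effect visible; they are not Smith's data. The sketch also uses two groups and simple averages rather than five groups and medians, purely to keep the sums easy to follow.

```python
# Hypothetical figures chosen only to illustrate the effect; not real wage data.
# Each entry: group -> (percent of workforce, average wage in dollars).
WAGES_2000 = {"no degree": (75, 30_000), "college degree": (25, 60_000)}
WAGES_2013 = {"no degree": (60, 28_000), "college degree": (40, 59_000)}


def aggregate_wage(groups):
    """Workforce-wide average: each group's wage weighted by its share."""
    return sum(percent * wage for percent, wage in groups.values()) / 100


def every_group_fell(before, after):
    """True if the average wage dropped in every group."""
    return all(after[group][1] < before[group][1] for group in before)


if __name__ == "__main__":
    print("Every group's wage fell:", every_group_fell(WAGES_2000, WAGES_2013))
    print("Aggregate wage, 2000:", aggregate_wage(WAGES_2000))
    print("Aggregate wage, 2013:", aggregate_wage(WAGES_2013))
```

In this made-up workforce, both groups earn less in 2013 than in 2000, yet the overall average climbs from $37,500 to $40,400, because more of the workers now sit in the better-paid group.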

In the 1970’s, failure to appreciate Simpson’s paradox led to a lawsuit against the University of California at Berkeley for gender discrimination. The statistics showed that 44 percent of men who applied, but only 35 percent of women, had been admitted to graduate school. Yet a look at individual departments found none favoring men. From Astronomy to Zoology, they admitted both sexes at about the same rate (except for a few, which slightly favored women).

The reason for the perceived disparity in the Berkeley admission statistics turned out to be that women applied for programs with tougher standards. There were more female applicants to the English department, where even highly qualified candidates were rejected, and fewer women seeking admission to the graduate chemistry program, where most qualified applicants were admitted. Resolving the contradictory analyses required a step outside the data: it took cultural knowledge about the society and students the data represented. Only then could analysts return to the data with this knowledge, adding it as a previously “hidden variable.”
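A short Python sketch shows how adding the hidden variable, here the department, changes the picture. The application counts are hypothetical and are not Berkeley's actual figures; the two department labels and the function names are likewise invented for illustration.

```python
# Hypothetical application counts, not Berkeley's actual figures.
# department -> sex -> (applicants, admitted)
APPLICATIONS = {
    "less selective": {"men": (800, 480), "women": (200, 130)},
    "more selective": {"men": (200, 40), "women": (800, 200)},
}


def rate_by_department(data):
    """Admission rate for each sex within each department."""
    return {
        dept: {sex: admitted / applied for sex, (applied, admitted) in by_sex.items()}
        for dept, by_sex in data.items()
    }


def pooled_rate(data):
    """Admission rate for each sex with all departments lumped together."""
    totals = {}
    for by_sex in data.values():
        for sex, (applied, admitted) in by_sex.items():
            prev_applied, prev_admitted = totals.get(sex, (0, 0))
            totals[sex] = (prev_applied + applied, prev_admitted + admitted)
    return {sex: admitted / applied for sex, (applied, admitted) in totals.items()}


if __name__ == "__main__":
    for dept, rates in rate_by_department(APPLICATIONS).items():
        print(dept, {sex: f"{rate:.0%}" for sex, rate in rates.items()})
    print("pooled", {sex: f"{rate:.0%}" for sex, rate in pooled_rate(APPLICATIONS).items()})
```

In these invented numbers, women are admitted at a slightly higher rate than men in each department, but most women apply to the more selective one, so the pooled rate favors men. The data alone cannot say that department is the variable to split on; someone has to know to ask.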

Big Data, therefore, can help you find relationships among variables that you can see. It is no help, however, at getting you to notice what you can’t see. Among the “likes” that were most important in creating the youarewhatyoulike.com profile of me, for example, were Felix Mendelssohn, Modest Mussorgsky, kayaking and Doctors Without Borders. Why would a fondness for paddling and “Boris Godunov” predict that I am not a Type-A personality? If you don’t need to know the answer, you might reply, “Who cares?” But sometimes, to market effectively or find an underlying cause of trouble, you do need to know.

For all its promise, in other words, Big Data isn’t yet a magic wand. “There is a disconnect between the ability to collect data and the ability to base decisions on them,” Eric Bradlow, a professor of marketing at the Wharton School and co-director of the Wharton Customer Analytics Initiative, told Fast Company last year. “People need to take a deep breath. They need to be more thoughtful about it.”

Historians have cautioned that we should not overestimate technologies in their infancies. If you’ve taken a survey course or two, you might think that the reason the Inca Empire fell in 1532 to 168 Spanish adventurers was that the conquistadors had guns and the Inca had arrows and stone throwers. However, as author Charles C. Mann has noted, 16th-century guns were tetchy and hard to aim—after the surprise wore off, the Inca found their weapons were more than adequate against Europeans. (The real “killer app” in the European conquest of the Americas was infectious disease.) It’s important to remember that new technologies are not nearly as powerful as their mature descendants. This is why, as the futurist Roy Amara famously observed, “We tend to overestimate the effect of a technology in the short run and underestimate the effect in the long run.”

So it is with omnipresent, 24/7 Big Data. Its day, with its immense consequences for how we live, socialize and consume, is coming. But it is not here yet.  
