
Discuss what you know and understand about Big Data – how do you define Big Data? Discuss the three Vs presented in this chapter by the Gartner Group – explain each of the three Vs (Volume, Velocity, and Variety).

  • Discuss what you know and understand about Big Data – how do you define Big Data? 
  • Discuss the three Vs presented in this chapter by the Gartner Group – explain each of the three Vs (Volume, Velocity, and Variety).
  • What are four major characteristics of Big Data? Provide examples drawn from current practice of each characteristic.
  • Give at least two examples of MapReduce applications (a minimal word-count sketch appears just after this prompt).
  • Discuss the "Promises" and "Perils" of Big Data (be certain to discuss at least one promise and also one peril).

300 words
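For the MapReduce item above, a minimal sketch may help frame the essay. The snippet below simulates the classic word-count job in plain Python within a single process; no Hadoop cluster is assumed, and the sample documents and function names (map_fn, shuffle, reduce_fn) are illustrative stand-ins for what a real framework would distribute across machines.

```python
from collections import defaultdict
from typing import Iterable, Tuple

# Hypothetical input: each "document" stands in for a file split on a cluster.
DOCUMENTS = [
    "big data needs big storage",
    "map reduce splits work across machines",
    "big clusters reduce processing time",
]

def map_fn(document: str) -> Iterable[Tuple[str, int]]:
    """Map phase: emit (word, 1) for every word in one input split."""
    for word in document.lower().split():
        yield (word, 1)

def shuffle(mapped: Iterable[Tuple[str, int]]) -> dict:
    """Shuffle phase: group all intermediate values by key (word)."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_fn(key: str, values: list) -> Tuple[str, int]:
    """Reduce phase: sum the counts for one word."""
    return (key, sum(values))

# Run the three phases sequentially; a real framework would parallelize them.
mapped = [pair for doc in DOCUMENTS for pair in map_fn(doc)]
grouped = shuffle(mapped)
counts = dict(reduce_fn(k, v) for k, v in grouped.items())
print(counts)   # e.g. {'big': 3, 'reduce': 2, ...}
```

The same map-shuffle-reduce pattern underlies common MapReduce applications such as web-server log analysis and building inverted indexes for search engines.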


The Promise and Peril of Big Data

David Bollier, Rapporteur

Communications and Society Program
Charles M. Firestone, Executive Director
Washington, DC
2010

1762/CSP/10-BK

To purchase additional copies of this report, please contact:

The Aspen Institute
Publications Office
P.O. Box 222
109 Houghton Lab Lane
Queenstown, Maryland 21658
Phone: (410) 820-5326
Fax: (410) 827-9174
E-mail: [email protected]

For all other inquiries, please contact:

The Aspen Institute
Communications and Society Program
One Dupont Circle, NW, Suite 700
Washington, DC 20036
Phone: (202) 736-5818
Fax: (202) 467-0790

Copyright © 2010 by The Aspen Institute

This work is licensed under the Creative Commons Attribution-Noncommercial 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.

The Aspen Institute
One Dupont Circle, NW, Suite 700
Washington, DC 20036

Published in the United States of America in 2010 by The Aspen Institute

All rights reserved

Printed in the United States of America

ISBN: 0-89843-516-1

10-001

Charles M. Firestone, Executive Director
Patricia K. Kelly, Assistant Director

Contents

Foreword, Charles M. Firestone

The Promise and Peril of Big Data, David Bollier
How to Make Sense of Big Data?
Data Correlation or Scientific Models?
How Should Theories be Crafted in an Age of Big Data?
Visualization as a Sense-Making Tool
Bias-Free Interpretation of Big Data?
Is More Actually Less?
Correlations, Causality and Strategic Decision-making
Business and Social Implications of Big Data
Social Perils Posed by Big Data
Big Data and Health Care
Big Data as a Disruptive Force (Which is therefore Resisted)
Recent Attempts to Leverage Big Data
Protecting Medical Privacy
How Should Big Data Abuses be Addressed?
Regulation, Contracts or Other Approaches?
Open Source Analytics for Financial Markets?
Conclusion

Appendix
Roundtable Participants
About the Author
Previous Publications from the Aspen Institute Roundtable on Information Technology
About the Aspen Institute Communications and Society Program

This report is written from the perspective of an informed observer at the Eighteenth Annual Aspen Institute Roundtable on Information Technology. Unless attributed to a particular person, none of the comments or ideas contained in this report should be taken as embodying the views or carrying the endorsement of any specific participant at the Conference.

Foreword

According to a recent report,1 the amount of digital content on the Internet is now close to five hundred billion gigabytes. This number is expected to double within a year. Ten years ago, a single gigabyte of data seemed like a vast amount of information. Now, we commonly hear of data stored in terabytes or petabytes. Some even talk of exabytes or the yottabyte, which is a trillion terabytes or, as one website describes it, “everything that there is.”2

The explosion of mobile networks, cloud computing and new technologies has given rise to incomprehensibly large worlds of information, often described as “Big Data.” Using advanced correlation techniques, data analysts (both human and machine) can sift through massive swaths of data to predict conditions, behaviors and events in ways unimagined only years earlier. As the following report describes it:

Google now studies the timing and location of search-engine queries to predict flu outbreaks and unemployment trends before official government statistics come out. Credit card companies routinely pore over vast quantities of census, financial and personal information to try to detect fraud and identify consumer purchasing trends.

Medical researchers sift through the health records of thousands of people to try to identify useful correlations between medical treatments and health outcomes.

Companies running social-networking websites conduct “data mining” studies on huge stores of personal information in attempts to identify subtle consumer preferences and craft better marketing strategies.

A new class of “geo-location” data is emerging that lets companies analyze mobile device data to make intriguing inferences about people’s lives and the economy. It turns out, for example, that the length of time that consumers are willing to travel to shopping malls—data gathered from tracking the location of people’s cell phones—is an excellent proxy for measuring consumer demand in the economy.

1. See http://www.emc.com/collateral/demos/microsites/idc-digital-universe/iview.htm.

2. See http://www.uplink.freeuk.com/data.html.

But this analytical ability poses new questions and challenges. For example, what are the ethical considerations of governments or businesses using Big Data to target people without their knowledge? Does the ability to analyze massive amounts of data change the nature of scientific methodology? Does Big Data represent an evolution of knowledge, or is more actually less when it comes to information on such scales?

The Aspen Institute Communications and Society Program convened 25 leaders, entrepreneurs, and academics from the realms of technology, business management, economics, statistics, journalism, computer science, and public policy to address these subjects at the 2009 Roundtable on Information Technology.

This report, written by David Bollier, captures the insights from the three-day event, exploring the topic of Big Data and inferential software within a number of important contexts. For example:

• Do huge datasets and advanced correlation techniques mean we no longer need to rely on hypothesis in scientific inquiry?

• When does “now-casting,” the search through massive amounts of aggregated data to estimate individual behavior, go over the line of personal privacy?

• How will healthcare companies and insurers use the correla- tions of aggregated health behaviors in addressing the future care of patients?

The Roundtable became most animated, however, and found the greatest promise in the application of Big Data to the analysis of systemic risk in financial markets.


A system of streamlined financial reporting, massive transparency, and “open source analytics,” they concluded, would serve better than past regulatory approaches. Participants rallied to the idea, furthermore, that a National Institute of Finance could serve as a resource for the financial regulators and investigate where the system failed in one way or another.

Acknowledgements

We want to thank McKinsey & Company for reprising its role as the senior sponsor of this Roundtable. In addition, we thank Bill Coleman, Google, the Markle Foundation, and Text 100 for sponsoring this conference; James Manyika, Bill Coleman, John Seely Brown, Hal Varian, Stefaan Verhulst and Jacques Bughin for their suggestions and assistance in designing the program and recommending participants; Stefaan Verhulst, Jacques Bughin and Peter Keefer for suggesting readings; and Kiahna Williams, project manager for the Communications and Society Program, for her efforts in selecting, editing, and producing the materials and organizing the Roundtable; and Patricia Kelly, assistant director, for editing and overseeing the production of this report.

Charles M. Firestone
Executive Director
Communications and Society Program
Washington, D.C.
January 2010


The Promise and Peril of Big Data

David Bollier

It has been a quiet revolution, this steady growth of computing and databases. But a confluence of factors is now making Big Data a powerful force in its own right.

Computing has become ubiquitous, creating countless new digital puddles, lakes, tributaries and oceans of information. A menagerie of digital devices has proliferated and gone mobile—cell phones, smart phones, laptops, personal sensors—which in turn are generating a daily flood of new information. More business and government agencies are discovering the strategic uses of large databases. And as all these systems begin to interconnect with each other and as powerful new software tools and techniques are invented to analyze the data for valuable inferences, a radically new kind of “knowledge infrastructure” is materializing. A new era of Big Data is emerging, and the implications for business, government, democracy and culture are enormous.

Computer databases have been around for decades, of course. What is new are the growing scale, sophistication and ubiquity of data-crunching to identify novel patterns of information and inference. Data is not just a back-office, accounts-settling tool any more. It is increasingly used as a real-time decision-making tool. Researchers using advanced correlation techniques can now tease out potentially useful patterns of information that would otherwise remain hidden in petabytes of data (a petabyte is a number starting with 1 and having 15 zeros after it).

Google now studies the timing and location of search-engine queries to predict flu outbreaks and unemployment trends before official government statistics come out. Credit card companies routinely pore over vast quantities of census, financial and personal information to try to detect fraud and identify consumer purchasing trends.
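As a rough sketch of the kind of “nowcasting” described above (synthetic numbers and illustrative variable names only, not Google’s actual methodology): fit a simple linear relationship between past query volumes and official case counts, then use this week’s query volume, which is available immediately, to estimate a statistic that will not be published for weeks.

```python
import numpy as np

# Synthetic weekly data (illustrative only): normalized volume of flu-related
# search queries, and the official case counts reported weeks later.
query_volume = np.array([0.8, 1.1, 1.9, 2.7, 3.5, 4.0, 3.2, 2.1])
official_cases = np.array([400, 560, 950, 1350, 1800, 2050, 1600, 1050])

# Fit a one-variable linear model: cases ~ a * query_volume + b.
a, b = np.polyfit(query_volume, official_cases, deg=1)

# Correlation indicates how much of the signal the proxy captures.
r = np.corrcoef(query_volume, official_cases)[0, 1]
print(f"correlation r = {r:.3f}")

# "Nowcast": this week's query volume is known today, while the official
# statistic will not be released for some time.
this_week_volume = 2.9
print(f"estimated cases this week: {a * this_week_volume + b:.0f}")
```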

Medical researchers sift through the health records of thousands of people to try to identify useful correlations between medical treatments and health outcomes.

Companies running social-networking websites conduct “data mining” studies on huge stores of personal information in attempts to identify subtle consumer preferences and craft better marketing strategies.

A new class of “geo-location” data is emerging that lets companies analyze mobile device data to make intriguing inferences about people’s lives and the economy. It turns out, for example, that the length of time that consumers are willing to travel to shopping malls—data gathered from tracking the location of people’s cell phones—is an excellent proxy for measuring consumer demand in the economy.
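One rough sketch of how such a travel proxy could be computed, assuming we already have anonymized location pings; the coordinates and the home/mall labels below are made up, and straight-line distance stands in for travel time.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical data: each shopper's overnight ("home") ping and one mall they visited.
MALL = (39.0458, -76.6413)
home_pings = [(39.2904, -76.6122), (38.9784, -76.4922), (39.4143, -77.4105)]

distances = [haversine_km(lat, lon, *MALL) for lat, lon in home_pings]
avg_km = sum(distances) / len(distances)
print(f"average distance travelled to the mall: {avg_km:.1f} km")
# A rising average travel distance over successive weeks would be read, per the
# proxy described above, as a sign of stronger consumer demand.
```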

The inferential techniques being used on Big Data can offer great insight into many complicated issues, in many instances with remarkable accuracy and timeliness. The quality of business decision-making, government administration, scientific research and much else can potentially be improved by analyzing data in better ways.

But critics worry that Big Data may be misused and abused, and that it may give certain players, especially large corporations, new abilities to manipulate consumers or compete unfairly in the marketplace. Data experts and critics alike worry that potential abuses of inferential data could imperil personal privacy, civil liberties and consumer freedoms.

Because the issues posed by Big Data are so novel and significant, the Aspen Institute Roundtable on Information Technology decided to explore them in great depth at its eighteenth annual conference. A distinguished group of 25 technologists, economists, computer scientists, entrepreneurs, statisticians, management consultants and others were invited to grapple with the issues in three days of meetings, from August 4 to 7, 2009, in Aspen, Colorado. The discussions were moderated by Charles M. Firestone, Executive Director of the Aspen Institute Communications and Society Program. This report is an interpretive synthesis of the highlights of those talks.


How to Make Sense of Big Data?

To understand the implications of Big Data, it first helps to understand the more salient uses of Big Data and the forces that are expanding inferential data analysis. Historically, some of the most sophisticated users of deep analytics on large databases have been Internet-based companies such as search engines, social networking websites and online retailers. But as magnetic storage technologies have gotten cheaper and high-speed networking has made greater bandwidth more available, other industries, government agencies, universities and scientists have begun to adopt the new data-analysis techniques and machine-learning systems.

Certain technologies are fueling the use of inferential data techniques. New types of remote sensors are generating new streams of digital data from telescopes, video cameras, traffic monitors, magnetic resonance imaging machines, and biological and chemical sensors monitoring the environment. Millions of individuals are generating roaring streams of personal data from their cell phones, laptops, websites and other digital devices.

The growth of cluster computing systems and cloud computing facilities is also providing a hospitable context for the growth of inferential data techniques, note computer researcher Randal Bryant and his colleagues.1 Cluster computing systems provide the storage capacity, computing power and high-speed local area networks to handle large data sets. In conjunction with “new forms of computation combining statistical analysis, optimization and artificial intelligence,” writes Bryant, researchers “are able to construct statistical models from large collections of data to infer how the system should respond to new data.” Thus companies like Netflix, the DVD-rental company, can use automated machine-learning to identify correlations in their customers’ viewing habits and offer automated recommendations to customers.
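A minimal sketch of correlation-driven recommendation in the spirit of the Netflix example, assuming a small made-up ratings matrix; the real system is proprietary and vastly more sophisticated.

```python
import numpy as np

# Hypothetical ratings matrix: rows are customers, columns are titles,
# 0 means "not rated". Real systems work with millions of rows.
titles = ["Sci-Fi A", "Sci-Fi B", "Drama C", "Comedy D"]
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
    [5, 5, 1, 1],
], dtype=float)

def recommend(liked_title: str) -> str:
    """Recommend the title whose rating pattern correlates most with liked_title."""
    liked = titles.index(liked_title)
    corr = np.corrcoef(ratings, rowvar=False)  # title-by-title correlation matrix
    corr[liked, liked] = -np.inf               # never recommend the same title back
    return titles[int(np.argmax(corr[liked]))]

print(recommend("Sci-Fi A"))   # "Sci-Fi B", given the pattern above
```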

Within the tech sector, which is arguably the most advanced user of Big Data, companies are inventing new services that give driving directions (MapQuest), provide satellite images (Google Earth) and offer consumer recommendations (TripAdvisor). Retail giants like Wal-Mart assiduously study their massive sales databases—267 million transactions a day—to help them devise better pricing strategies, inventory control and advertising campaigns.


Intelligence agencies must now contend with a flood of data from their own satellites and telephone intercepts as well as from the Internet and publications. Many scientific disciplines are becoming more computer-based and data-driven, such as physics, astronomy, oceanography and biology.

Data Correlation or Scientific Models?

As the deluge of data grows, a key question is how to make sense of the raw information. How can researchers use statistical tools and computer technologies to identify meaningful patterns of information? How shall significant correlations of data be interpreted? What is the role of traditional forms of scientific theorizing and analytic models in assessing data?

Chris Anderson, the Editor-in-Chief of Wired magazine, ignited a small firestorm in 2008 when he proposed that “the data deluge makes the scientific method obsolete.”2 Anderson argued the provocative case that, in an age of cloud computing and massive datasets, the real challenge is not to come up with new taxonomies or models, but to sift through the data in new ways to find meaningful correlations.

At the petabyte scale, information is not a matter of simple three and four-dimensional taxonomy and order but of dimensionally agnostic statistics. It calls for an entirely different approach, one that requires us to lose the tether of data as something that can be visualized in its totality. It forces us to view data mathematically first and establish a context for it later. For instance, Google conquered the advertising world with nothing more than applied mathematics. It didn’t pretend to know anything about the culture and conventions of advertising—it just assumed that better data, with better analytic tools, would win the day. And Google was right.

Physics and genetics have drifted into arid, speculative theorizing, Anderson argues, because of the inadequacy of testable models. The solution, he asserts, lies in finding meaningful correlations in massive piles of Big Data, “Petabytes allow us to say: ‘Correlation is enough.’ We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.”

J. Craig Venter used supercomputers and statistical methods to find meaningful patterns from shotgun gene sequencing, said Anderson. Why not apply that methodology more broadly? He asked, “Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all. There’s no reason to cling to our old ways. It’s time to ask: What can science learn from Google?”

Conference participants agreed that there is a lot of useful information to be gleaned from Big Data correlations. But there was a strong consensus that Anderson’s polemic goes too far. “Unless you create a model of what you think is going to happen, you can’t ask questions about the data,” said William T. Coleman. “You have to have some basis for asking questions.”

Researcher John Timmer put it succinctly in an article at the Ars Technica website, “Correlations are a way of catching a scientist’s attention, but the models and mechanisms that explain them are how we make the predictions that not only advance science, but generate practical applications.”3

Hal Varian, Chief Economist at Google, agreed with that argument, “Theory is what allows you to extrapolate outside the observed domain. When you have a theory, you don’t want to test it by just looking at the data that went into it. You want to make some new prediction that’s implied by the theory. If your prediction is validated, that gives you some confidence in the theory. There’s this old line, ‘Why does deduction work? Well, because you can prove it works. Why does induction work? Well, it’s always worked in the past.’”
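Varian’s point, that a theory earns confidence by predicting data it has not yet seen, can be illustrated with a small sketch on synthetic observations: fit a line on the first part of the data, then judge it only on the held-out remainder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic observations: a linear relationship plus noise.
x = np.linspace(0, 10, 200)
y = 3.0 * x + 2.0 + rng.normal(scale=2.0, size=x.size)

# Build the "theory" (a fitted line) on the first 150 points only.
x_train, y_train = x[:150], y[:150]
x_test, y_test = x[150:], y[150:]
slope, intercept = np.polyfit(x_train, y_train, deg=1)

# In-sample fit error versus out-of-sample prediction error.
train_rmse = np.sqrt(np.mean((slope * x_train + intercept - y_train) ** 2))
test_rmse = np.sqrt(np.mean((slope * x_test + intercept - y_test) ** 2))
print(f"in-sample RMSE:     {train_rmse:.2f}")
print(f"out-of-sample RMSE: {test_rmse:.2f}")
# If the out-of-sample error stays close to the in-sample error, the fitted
# relationship extrapolates; if it blows up, the "theory" merely memorized
# the data that went into it.
```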

Extrapolating from correlations can yield specious results even if large data sets are used. The classic example may be “My TiVo Thinks I’m Gay.” The Wall Street Journal once described a TiVo customer who gradually came to realize that his TiVo recommendation system thought he was gay because it kept recommending gay-themed films. When the customer began recording war movies and other “guy stuff” in an effort to change his “reputation,” the system began recommending documentaries about the Third Reich.4


Another much-told story of misguided recommendations based on statistical correlations involved Jeff Bezos, the founder of Amazon. To demonstrate the Amazon recommendation engine in front of an audience, Bezos once called up his own set of recommendations. To his surprise, the system’s first recommendation was Slave Girls from Infinity—a choice triggered by Bezos’ purchase of a DVD of Barbarella, the Jane-Fonda-as-sex-kitten film, the week before.

Using correlations as the basis for forecasts can be slippery for other reasons. Once people know there is an automated system in place, they may deliberately try to game it. Or they may unwittingly alter their behavior.

It is the “classic Heisenberg principle problem,” said Kim Taipale, the Founder and Executive Director of the Center for Advanced Studies in Science and Technology. “As soon as you put up a visualization of data, I’m like—whoa!—I’m going to ‘Google bomb’ those questions so that I can change the outcomes.” (“Google bombing” describes concerted, often-mischievous attempts to game the search-algorithm of the Google search engine in order to raise the ranking of a given page in the search results.5)

The sophistication of recommendation-engines is improving all the time, of course, so many silly correlations may be weeded out in the future. But no computer system is likely to simulate the level of subtlety and personalization that real human beings show in dynamic social contexts, at least in the near future. Running the numbers and finding the correlations will never be enough.

Theory is important, said Kim Taipale, because “you have to have something you can come back to in order to say that something is right or wrong.” Michael Chui, Senior Expert at McKinsey & Company, agrees: “Theory is about predicting what you haven’t observed yet. Google’s headlights only go as far as the data it has seen. One way to think about theories is that they help you to describe ontologies that already exist.” (Ontology is a branch of philosophy that explores the nature of being, the categories used to describe it, and their ordered relationships with each other. Such issues can matter profoundly when trying to collect, organize and interpret information.)

Jeff Jonas, Chief Scientist, Entity Analytic Solutions at the IBM Software Group, offered a more complicated view. While he agrees that Big Data does not invalidate the need for theories and models, Jonas believes that huge datasets may help us “find and see dynamically changing ontologies without having to try to prescribe them in advance. Taxonomies and ontologies are things that you might discover by observation, and watch evolve over time.”

John Clippinger, Co-Director of the Law Lab at Harvard University, said: “Researchers have wrestled long and hard with language and semantics to try to develop some universal ontologies, but they have not really resolved that. But it’s clear that you have to have some underlying notion of mechanism. That leads me to think that there may be some self-organizing grammars that have certain properties to them—certain mechanisms—that can yield certain kinds of predictions. The question is whether we can identify a mechanism that is rich enough to characterize a wide range of behaviors. That’s something that you can explore with statistics.”

How Should Theories be Crafted in an Age of Big Data?

If correlations drawn from Big Data are suspect, or not sturdy enough to build interpretations upon, how then shall society construct models and theories in the age of Big Data?

Patrick W. Gross, Chairman of the Lovell Group, challenged the either/or proposition that either scientific models or data correlations will drive future knowledge. “In practice, the theory and the data reinforce each other. It’s not a question of data correlations versus theory. The use of data for correlations allows one to test theories and refine them.”

That may be, but how should theory-formation proceed in light of the oceans of data that can now be explored? John Seely Brown, Independent Co-Chair of Deloitte Center for the Edge, believes that we may need to devise new methods of theory formation: “One of the big problems [with Big Data] is how to determine if something is an outlier or not,” and therefore can be disregarded. “In some ways, the more data you have, the more basis you have for deciding that something is an outlier. You have more confidence in deciding what to knock out of the data set—at least, under the Bayesian and correlational-type theories of the moment.”
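Brown’s observation can be made concrete with a small sketch (synthetic numbers, with a plain z-score rule standing in for the “Bayesian and correlational-type theories of the moment”): the same suspicious reading becomes a much firmer candidate for removal once a large sample has pinned down the distribution.

```python
import numpy as np

rng = np.random.default_rng(42)
suspect = 9.0  # the observation we are deciding whether to knock out

for n in (10, 100, 10_000):
    sample = rng.normal(loc=5.0, scale=1.0, size=n)  # well-behaved bulk of the data
    mean, std = sample.mean(), sample.std(ddof=1)
    z = (suspect - mean) / std
    print(f"n={n:>6}: estimated mean={mean:.2f}, std={std:.2f}, z-score of suspect={z:.1f}")

# With only 10 points the estimated spread is noisy, so the z-score is an
# unreliable basis for discarding the point; with 10,000 points the estimate
# is tight, and a z-score near 4 gives far more confidence that the reading
# really is an outlier.
```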


But this sort of theory-formation is fairly crude in light of the keen and subtle insights that might be gleaned from Big Data, said Brown: “Big Data suddenly changes the whole game of how you look at the ethereal odd data sets.” Instead of identifying outliers and “cleaning” datasets, theory formation using Big Data allows you to “craft an ontology and subject it to tests to see what its predictive value is.”

He cited an attempt to see if a theory could be devised to compress the English language using computerized, inferential techniques. “It turns out that if you do it just right—if you keep words as words—you can compress the language by x amount. But if you actually build a theory-formation system that ends up discovering the morphology of English, you can radically compress English. The catch was, how do you build a machine that actually starts to invent the ontologies and look at what it can do with those ontologies?”

Before huge datasets and computing power could be applied to this problem, researchers had rudimentary theories about the morphology of the English language. “But now that we have ‘infinite’ amounts of computing power, we can start saying, ‘Well, maybe there are many different ways to develop a theory.’”
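A crude illustration of the compression point, not the experiment Brown describes: compare the bits per character needed by a character-level unigram model of a text with those needed by a word-level unigram model of the same text. Treating words as units (a first step toward discovering morphology) already shortens the description; the sample string and the neglect of dictionary overhead are simplifications.

```python
import math
from collections import Counter

TEXT = ("the quick brown fox jumps over the lazy dog and the quick brown fox "
        "jumps over the lazy dog again because the dog is very lazy")

def bits_per_char(symbols, total_chars):
    """Empirical entropy of the symbol stream, expressed per character of text."""
    counts = Counter(symbols)
    n = len(symbols)
    total_bits = -sum((c / n) * math.log2(c / n) for c in counts.values()) * n
    return total_bits / total_chars

total_chars = len(TEXT)
char_model = bits_per_char(list(TEXT), total_chars)    # model characters independently
word_model = bits_per_char(TEXT.split(), total_chars)  # model whole words independently

print(f"character-level model: {char_model:.2f} bits/char")
print(f"word-level model:      {word_model:.2f} bits/char")
# Knowing some structure of the language (here, just its word boundaries) lets the
# same text be described in fewer bits, ignoring the cost of the dictionary itself;
# discovering deeper structure such as morphology pushes the number lower still.
```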

In other words, the data once perceived as “noise” can now be reconsidered with the rest of the data, leading to new ways to develop theories and ontologies. Or as Brown put it, “How can you invent the ‘theory behind the noise’ in order to de-convolve it in order to find the pattern that you weren’t supposed to find? The more data there is, the better my chances of finding the ‘generators’ for a new theory.”

Jordan Greenhall suggested that there may be two general ways to develop ontologies. One is basically a “top down” mode of inquiry that applies familiar philosophical approaches, using a priori categories. The other is a “bottom up” mode that uses dynamic, low-level data and builds ontologies based on the contingent information identified through automated processes.

