David Donoho reflects on "50 Years of Data Science"

Abstract

More than 50 years ago, John Tukey called for a reformation of academic statistics. In ‘The Future of Data Analysis’, he pointed to the existence of an as-yet unrecognized science, whose subject of interest was learning from data, or ‘data analysis’. Ten to twenty years ago, John Chambers, Bill Cleveland and Leo Breiman independently once again urged academic statistics to expand its boundaries beyond the classical domain of theoretical statistics; Chambers called for more emphasis on data preparation and presentation rather than statistical modeling; and Breiman called for emphasis on prediction rather than inference. Cleveland even suggested the catchy name “Data Science” for his envisioned field.

A recent and growing phenomenon is the emergence of “Data Science” programs at major universities, including UC Berkeley, NYU, MIT, and most recently the University of Michigan, which on September 8, 2015, announced a $100M “Data Science Initiative” that will hire 35 new faculty. Teaching in these new programs overlaps significantly in curricular subject matter with traditional statistics courses; in general, though, the new initiatives steer away from close involvement with academic statistics departments.

This paper reviews some ingredients of the current “Data Science moment”, including recent commentary about data science in the popular media, and about how/whether Data Science is really different from Statistics.

The now-contemplated field of Data Science amounts to a superset of the fields of statistics and machine learning which adds some technology for ‘scaling up’ to ‘big data’. This chosen superset is motivated by commercial rather than intellectual developments. Choosing in this way is likely to miss out on the really important intellectual event of the next fifty years.

Because all of science itself will soon become data that can be mined, the imminent revolution in Data Science is not about mere ‘scaling up’, but instead the emergence of scientific studies of data analysis science-wide. In the future, we will be able to predict how a proposal to change data analysis workflows would impact the validity of data analysis across all of science, even predicting the impacts field-by-field. Drawing on work by Tukey, Cleveland, Chambers and Breiman, I present a vision of data science based on the activities of people who are ‘learning from data’, and I describe an academic field dedicated to improving that activity in an evidence-based manner. This new field is a better academic enlargement of statistics and machine learning than today’s Data Science Initiatives, while being able to accommodate the same short-term goals.

Based on a presentation at the Tukey Centennial workshop, Princeton, NJ, Sept. 18, 2015.

Read article here. See also this tweet about it.

Contents

1 Today’s Data Science Moment

2 Data Science ‘versus’ Statistics

2.1 The ‘Big Data’ Meme

2.2 The ‘Skills’ Meme

2.3 The ‘Jobs’ Meme

2.4 What here is real?

2.5 A Better Framework

3 The Future of Data Analysis, 1962

4 The 50 years since FoDA

4.1 Exhortations

4.2 Reification

5 Breiman’s ‘Two Cultures’, 2001

6 The Predictive Culture’s Secret Sauce

6.1 The Common Task Framework

6.2 Experience with CTF

6.3 The Secret Sauce

6.4 Required Skills

7 Teaching of today’s consensus Data Science

8 The Full Scope of Data Science

8.1 The Six Divisions

8.2 Discussion

8.3 Teaching of GDS

8.4 Research in GDS

8.4.1 Quantitative Programming Environments: R

8.4.2 Data Wrangling: Tidy Data

8.4.3 Research Presentation: Knitr

8.5 Discussion

9 Science about Data Science

9.1 Science-Wide Meta Analysis

9.2 Cross-Study Analysis

9.3 Cross-Workflow Analysis

9.4 Summary

10 The Next 50 Years of Data Science

10.1 Open Science takes over

10.2 Science as data

10.3 Scientific Data Analysis, tested Empirically

10.3.1 DJ Hand (2006)

10.3.2 Donoho and Jin (2008)

10.3.3 Zhao, Parmigiani, Huttenhower and Waldron (2014)

10.4 Data Science in 2065

11 Conclusion
