This question was recently posted by Larry Wasserman on the Normal Deviate blog (see extract below). Larry is a statistics and machine learning professor at Carnegie Mellon University.
Here is my answer:
Data science is more than statistics: it also encompasses computer science and business concepts, and it's far more than a set of techniques and principles. I could imagine a data scientist not having a degree - this is not possible for a statistician. But the core of the issue, in my opinion, is explained below.
This diagram misses a few key concepts - including business and domain knowledge
Here's the article:
As I see newspapers and blogs filled with talk of “Data Science” and “Big Data” I find myself filled with a mixture of optimism and dread. Optimism, because it means statistics is finally a sexy field. Dread, because statistics is being left on the sidelines.
The very fact that people can talk about data science without even realizing there is a field already devoted to the analysis of data — a field called statistics — is alarming. I like what Karl Broman says:
When physicists do mathematics, they don’t say they’re doing “number science”. They’re doing math.
If you’re analyzing data, you’re doing statistics. You can call it data science or informatics or analytics or whatever, but it’s still statistics.
Well put.
Maybe I am just pessimistic and am just imagining that statistics is getting left out. Perhaps, but I don’t think so. It’s my impression that the attention and resources are going mainly to Computer Science. Not that I have anything against CS of course, but it is a tragedy if Statistics gets left out of this data revolution.
Two questions come to mind:
1. Why do statisticians find themselves left out?
2. What can we do about it?
Comment
When I look at books about statistics, they seem to start with a dataset and end with an inference. They cover neither data acquisition, cleaning and munging nor the communication of results to non-statisticians. I see data science as encompassing this entire workflow, and I think that’s why university statistics departments are changing their names to “data science.”
This being said, taken literally, “data science” is too broad a term, synonymous with computer science. Originally, “statistics” was the collection of data about the state, which was too narrow. Finding the right words is hard.
And as long as statisticians think that data science is statistics and that what computer scientists, biologists, archaeologists, ..., do with data is just statistics, we are not going anywhere. What I see is that while a statistician is waiting for the results of a paper-based survey to come back and form his dataset, a computer scientist simply writes scripts to scrape tons of data, does some analyses and publishes the results.
Big like for this article. I studied statistics and graduated recently. I actually believe that statisticians totally missed the boat. Statisticians had something valuable, but we lost it easily to computer scientists and physicists. All the data scientists I know around me are either computer scientists or physicists. I took a computational course in the physics department, and it was interesting to see how an old professor had updated himself with new programming languages and technologies. He uses github, python, programming forums, .... In my (statistics) department, however, people seem static. They have stayed where they were when they graduated 20 years ago. If they used Matlab for their PhD theses, they know just Matlab, and barely R. They have never touched python, C++ or Java. Even those who study computational biology are better programmers than statisticians. I think it is easy to see how much a graduate student in statistics is worth today. You can line up a graduate student in computer science and a graduate student in statistics for a job interview in data science/big data and see which one gets the position.
I think I want to be a nerd when I grow up. ;)
Good article and thanks for the links!
Dear Lynne, I agree completely. Those who deal with data cleansing for a living develop their own disciplines and tools, but - as you observe - these seem 1) not to get turned into products, and 2) not to get written down. Also, some paradigms - think Google - just use more data to overcome the need to cleanse data. Put differently, lots of smart people with unbounded computing resources can approach problems differently. But, of course, that assumes unbounded data. In healthcare, for instance, there is rarely enough, no matter what.
Two questions come to mind:
1. Why do statisticians find themselves left out?
Because we're not working on the big problem -- computer scientists are. Someone said that the best way to be successful as a researcher is to identify the big problems associated with your discipline and work on those -- small problems are uninteresting and they'll almost never get a pedagogue to the pinnacle of his or her career. The big problem for me in statistics is data. Data hygiene, data integrity, data collection, data manipulation, data volume.
If 75% of any project is consumed by the process of data acquisition, cleansing, matching, denormalization, taxonomy, and other manipulation and only 25% devoted to science and analysis, then the focus SHOULD end up being on the data rather than the analysis.
2. What can we do about it?
If we as statisticians want to be in the light of the sun and bask in the general approval of the world and our peers, then we must solve the data problem and make it possible for our colleagues to spend less than half their time in data acquisition heck and more than half their time solving business problems.
I get 20 calls a year from tool providers telling me that they have yet another tool that will make model building faster -- but almost no calls that promise easy data manipulation and handling. That's backwards, clearly. I have already solved the problem of efficiency of delivery once the data are clean -- I can build dozens or hundreds of models without much effort at all. However, I still spend the bulk of my time designing the data extracts and devising how best to use that data effectively -- until that problem is solved, I fear we will stay out in the cold.
I think it all depends on how you want to apply "Data Science". You need to be immersed in the business where you are applying the techniques before you propose new applications to the company or university. Your knowledge of the applications you need determines which skills you must develop to reach your goals.
The end of statistics? Hardly. I certainly don't see the ASA being heavily influenced by the pharmaceutical industry. I've been a member for well over a quarter of a century, am an ASA Fellow and chair-elect for an ASA Section, and helped start another section 20 years ago. A statistician can be involved with huge data sets, as those in astrostatistics and, to a lesser extent, health outcomes analysis can attest. Or they can focus on small data where exact statistical methods are appropriate. There are those in ecology, in forestry, social science and econometrics, health outcomes and medical statistics, geostatistics, epidemiology, and so on. All are areas where statistical techniques can be applied. Statistics evolves. 20 years ago there were very few Bayesians; now Bayesian methods are taking over many areas of statistics. In fact, statistics has evolved directly with computing power. The highly iterative nature of much of current statistics was barely possible, and comparatively slow, on any PC in 1995 compared to now. We can develop new methods to exploit the new technologies.
Big Data and writing efficient software routines are vital in my area - astrostatistics. The field is in fact one with astroinformatics, which deals with how best to handle truly huge amounts of data. New statistical methods are now being developed to properly analyze this type of data. But it's still statistics. If you are using mathematical models to classify and predict future data or data outside the data used in the model, you are doing statistics. Statistics as a term refers to other types of analyses as well. But just because there is an interest in modeling huge data situations does not mean you are not using statistical techniques, albeit perhaps new and innovative statistical techniques.
Vincent,
Your answers explained some things for me - that "statistics" narrowed its professional focus. I'm working on a paper; if it comes to fruition I will follow up with you on this topic.
Thank you.
© 2020 Data Science Central ®