This post is based on two insightful threads I read online (References below).
Based on these, I address the question of ‘the difference between Statistics and Data Science’. Traditionally, most people, including me, would say that ‘statistics came first and data science builds upon statistics’. This chain of thought is valid but, as you will see below, it misses a much bigger picture: that of emphasis. Note that here we discuss a purist approach for the sake of learning. In practice, the domains and the tools are converging.
The two main differences between a purist statistical approach and a data science approach are:
- The use of big data (common in data science) and
- The use of inferential statistics (common in statistics).
So, with this background, here are some ways in which a purist statistical approach differs from the typical data science approach:
- Small data: We are so used to the world of big data that we do not fully appreciate that another world exists, that of ‘small data’. In some domains, small data is very common, especially in medicine and clinical trials, because the procedures are risky and expensive. So you end up with only 20 or 30 samples (small data). This leads to a greater reliance on inferential statistics.
- The use of inferential statistics: Inferential statistics use a random sample of data taken from a population to describe and make inferences about the population. Inferential statistics are valuable when examination of each member of an entire population is not convenient or possible. For example, to measure the diameter of each nail that is manufactured in a mill is impractical. You can measure the diameters of a representative random sample of nails. You can use the information from the sample to make generalizations about the diameters of all of the nails. (Source: Minitab.) Statistics makes more use of the inferential / frequentist approach because of small data sizes (as above).
- Increased reliance on domain knowledge: The first two points also lead to a greater reliance on domain knowledge in statistics, for example in the choice of features.
- Confirmatory data analysis: Exploratory data analysis is complemented by confirmatory data analysis.
- Increased reliance on statistical tests, many of which are domain-specific.
- Statistics needs interpretable models as opposed to black box models.
- Data science emphasises automation, in contrast to statistics, which involves greater manual intervention due to the above factors (such as the increased use of domain knowledge).
- Handling outliers and imputation: Much greater emphasis on manual correction of outliers and on imputation of missing values.
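To make the nail example from the inferential statistics point concrete, here is a minimal Python sketch of inferring a population mean from a small sample. Everything numeric in it is an assumption made up for illustration: the simulated population of nail diameters (mean 3.0 mm, standard deviation 0.05 mm), the sample size of 20, and the hard-coded critical value 2.093, which is the two-sided 95% value of Student's t distribution with 19 degrees of freedom. It is one common way to build a confidence interval, not the only one.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the illustration is repeatable

# Hypothetical population: diameters (mm) of every nail the mill produced.
# In real life we could not measure all of these; we simulate them here.
population = [random.gauss(3.0, 0.05) for _ in range(100_000)]

# Small data: measure only a random sample of 20 nails.
sample = random.sample(population, 20)

n = len(sample)
mean = statistics.mean(sample)   # point estimate of the population mean
sd = statistics.stdev(sample)    # sample standard deviation (n - 1 denominator)
se = sd / math.sqrt(n)           # standard error of the mean

# Assumed constant: two-sided 95% critical value of Student's t, 19 degrees of freedom.
T_CRIT_95_DF19 = 2.093
margin = T_CRIT_95_DF19 * se
ci_low, ci_high = mean - margin, mean + margin

print(f"sample mean = {mean:.4f} mm")
print(f"95% confidence interval = ({ci_low:.4f}, {ci_high:.4f}) mm")
```

With only 20 measurements the interval is the honest statement of what the sample can say about all of the nails. With millions of rows the interval would shrink to almost nothing, which is one reason this kind of reasoning gets less attention in big data work.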
To conclude, the difference in approaches originates from the use of small data. The above is a purist view; in practice, tools and techniques across the domains are more fluid. References below (including the comments on these threads). Image source: the pioneering statistician George Box and his book An Accidental Statistician, which made me think that we are all accidental statisticians!