
This post is based on two insightful threads I read online (references below).

Based on these, I address the question of the difference between statistics and data science. Traditionally, most people, including me, would say that 'statistics came first and data science builds upon statistics'. This line of thought is valid but, as you will see below, it misses a much bigger picture: that of emphasis. Note that here we discuss a purist approach for the sake of learning. In practice, the domains and the tools are converging.

*The two main differences between a purist statistical approach and a data science approach are:*

- *The use of big data (common in data science)*
- *The use of inferential statistics (common in statistics)*

So, with this background, here are some ways in which a purist statistical approach differs from the typical data science approach:

- **Small data:** We are so used to the world of big data that we do not fully appreciate that another world exists: that of 'small data'. In some domains, especially medicine and clinical trials, small data is very common because the procedures are risky and expensive. So you end up with only 20 or 30 samples. This leads to a greater reliance on inferential statistics.
- **The use of inferential statistics:** *Inferential statistics use a random sample of data taken from a population to describe and make inferences about the population. Inferential statistics are valuable when examination of each member of an entire population is not convenient or possible. For example, to measure the diameter of each nail that is manufactured in a mill is impractical. You can measure the diameters of a representative random sample of nails. You can use the information from the sample to make generalizations about the diameters of all of the nails.* (Source: Minitab.) Statistics makes more use of the inferential / frequentist approach because of small data sizes (as above).
- **Increased reliance on domain knowledge:** The first two points also lead to a greater reliance on domain knowledge in statistics, for example in the choice of features.
- **Confirmatory data analysis:** Exploratory data analysis is complemented by **confirmatory data analysis**.
- Increased reliance on **statistical tests**, many of which are domain specific.
- Statistics needs interpretable models as opposed to **black box models**.
- Data science emphasises **automation**, in contrast to statistics, which involves greater manual intervention due to the above factors (such as the increased use of domain knowledge).
- **Handling outliers and imputation:** Much greater emphasis on manual correction of outliers and imputation (missing values).
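To make the nail example from the inferential statistics point concrete, here is a minimal sketch of estimating a population mean from a small random sample with a 95% confidence interval. Everything specific in it is an assumption for illustration: the simulated population (mean 3.0 mm, standard deviation 0.05 mm), the sample size of 25, and the helper name `mean_confidence_interval`. The critical values are taken from a standard Student's t table rather than computed.

```python
import math
import random
import statistics

# Two-sided 95% critical values of Student's t for a few degrees of freedom,
# copied from a standard t-table (assumption: only these sample sizes are needed).
T_CRIT_95 = {19: 2.093, 24: 2.064, 29: 2.045}


def mean_confidence_interval(sample):
    """Return a 95% t-based confidence interval for the population mean."""
    n = len(sample)
    df = n - 1
    if df not in T_CRIT_95:
        raise ValueError(f"no tabulated critical value for df={df}")
    mean = statistics.mean(sample)
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(sample) / math.sqrt(n)
    margin = T_CRIT_95[df] * sem
    return mean - margin, mean + margin


rng = random.Random(42)

# Hypothetical population: diameters (in mm) of every nail the mill produced.
population = [rng.gauss(3.0, 0.05) for _ in range(100_000)]

# Measure only a small random sample, as in a 'small data' setting.
sample = rng.sample(population, 25)

low, high = mean_confidence_interval(sample)
print(f"sample mean: {statistics.mean(sample):.4f} mm")
print(f"95% CI for the population mean: [{low:.4f}, {high:.4f}] mm")
```

With only 25 measurements the interval is the point of the exercise: it states how far the sample mean could plausibly be from the true mean, which is the kind of statement a small-data analysis has to rely on.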

To conclude, the difference in approaches originates from the use of small data. The above describes a purist approach; in practice, tools and techniques across the domains are more fluid. References are below (including the comments on these threads). Image source: the pioneering statistician George Box and his book *An Accidental Statistician*, which made me think that we are all accidental statisticians!

**References**

Isaac Faber on LinkedIn - If I had to guess, I would say that curre...

Adrian Olszewski on Quora - Why do so many statisticians not want t...


Posted 27 July 2021
