Deep data science is a branch of data science that has little, if any, overlap with closely related fields such as machine learning, computer science, operations research, mathematics, or statistics. Even classical machine learning and statistical techniques such as clustering, density estimation, or tests of hypotheses have model-free, data-driven, robust versions designed for automated processing (as in machine-to-machine communications), and these techniques therefore also belong to deep data science. Note that unlike deep learning, deep data science is not the intersection of data science and artificial intelligence; still, the analogy between deep data science and deep learning is not entirely meaningless, in the sense that both deal with automation.
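To illustrate what "model-free, data-driven" means in practice (a minimal generic sketch, not the specific method from any article referenced in this post), here is a density estimate computed directly from binned counts, with no assumed parametric family such as the normal distribution:

```python
from collections import Counter

def empirical_density(data, bin_width=1.0):
    """Model-free density estimate: bin the data and normalize the counts.
    No distributional assumption is made; the data speaks for itself."""
    counts = Counter(int(x // bin_width) for x in data)
    n = len(data)
    # Normalize so that density * bin_width sums to 1 across bins
    return {b * bin_width: c / (n * bin_width) for b, c in sorted(counts.items())}

density = empirical_density([0.2, 0.5, 0.7, 1.1, 1.3, 2.8], bin_width=1.0)
print(density)  # three bins, values proportional to how many points fell in each
```

Because it is just counting and dividing, this kind of estimate can be reproduced in Excel with a pivot table, and the binning step parallelizes trivially on a distributed architecture.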
Some of the features that characterize deep data science, at least the way I practice it, include:
- No overlap with other fields such as statistics or machine learning, at least not currently. Other names for DDS (deep data science) could be pure data science or core data science.
- Techniques designed for automation, batch processing, black-box tools, or use by non-experts. In some ways, deep data science can be regarded as the automation of data science. The techniques developed must produce results that are easy to interpret, must be very robust, fast, and efficient, and must be resumable on the fly after a computer crash. Robustness, meaning the ability to work with big, messy, unstructured data, is more important than extreme accuracy.
- Also, and this applies to my vision of deep data science but is not a requirement: the techniques must be very simple, generic, data-driven, and model-free, with no advanced mathematics involved. For small data, they can easily be implemented in Excel. Ideally, the techniques can also be implemented on a distributed architecture.
Deep data science: recommended articles
- Model-Free Confidence Intervals
- Tests of Hypotheses
- Hidden Decision Trees
- Jackknife Regression
- Fast Combinatorial Feature Selection
- Debunking Some Statistical Myths
- Variance, Clustering, and Density Estimation Revisited
The picture accompanying this post is from the last article above.
Putting it together
For a robust regression that works even when all the traditional model assumptions are violated, see the Jackknife Regression article above. It is simple (it can be implemented in Excel and is model-free), efficient, and performs comparably to standard regression when the model assumptions are not violated. If you need confidence intervals for the predicted values, you can use the simple model-free confidence intervals (CIs) described in the Model-Free Confidence Intervals article above. These CIs are equivalent to those taught in statistics courses, but you do not need to know statistics to understand how they work or to use them. Finally, to measure goodness-of-fit, instead of R-squared or MSE you can use a metric that is more robust against outliers.
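The two building blocks mentioned above can be sketched in a few lines. This is a generic illustration under my own assumptions (a percentile-style model-free CI and the median of absolute residuals as the robust goodness-of-fit metric), not a reproduction of the exact formulas from the linked articles:

```python
import random

def percentile_ci(values, level=0.95):
    """Model-free confidence interval: sort the values and read off the
    empirical percentiles, with no distributional assumption."""
    s = sorted(values)
    n = len(s)
    alpha = (1 - level) / 2
    return s[int(alpha * n)], s[min(n - 1, int((1 - alpha) * n))]

def median_absolute_error(actual, predicted):
    """Robust goodness-of-fit: the median of the absolute residuals.
    One gross outlier barely moves it, unlike MSE or R-squared."""
    residuals = sorted(abs(a - p) for a, p in zip(actual, predicted))
    n = len(residuals)
    mid = n // 2
    return residuals[mid] if n % 2 else (residuals[mid - 1] + residuals[mid]) / 2

# Bootstrap a CI for a mean from resampled data containing one gross outlier
random.seed(42)
data = [9.8, 10.1, 9.9, 10.4, 10.0, 9.7, 10.2, 35.0]
boot_means = [sum(random.choices(data, k=len(data))) / len(data)
              for _ in range(2000)]
lo, hi = percentile_ci(boot_means)
print(lo, hi)
```

Everything here is sorting, counting, and averaging, so it can be reproduced in Excel for small data, and no statistics background is needed to interpret the interval: 95% of the resampled means fall between the two printed numbers.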