
Deep data science is a branch of data science that has little if any overlap with closely related fields such as machine learning, computer science, operations research, mathematics, or statistics. Even classical machine learning and statistical techniques such as clustering, density estimation, or tests of hypotheses have model-free, data-driven, robust versions designed for automated processing (as in machine-to-machine communications), and these versions also belong to deep data science. Note that unlike deep learning, deep data science is not the intersection of data science and artificial intelligence; however, the analogy between deep data science and deep learning is not completely meaningless, in the sense that both deal with automation.

Some of the features that characterize deep data science, at least the way I do it, include:

  • No overlap with other fields such as statistics or machine learning. Not currently, at least. Other words for DDS (deep data science) could be pure data science or core data science.
  • Techniques designed for automation, batch processing, black-box tools, or for usage by the non-expert. In some ways, deep data science can be regarded as the automation of data science. The techniques developed must produce results that are easy to interpret, must be very robust, fast, and efficient, and must be resumable on the fly after a computer crash. Robustness, meaning the ability to work with big, messy, unstructured data, is more important than extreme accuracy.
  • Also, and this applies to my vision of deep data science but is not a requirement: the techniques must be very simple and generic, data-driven and model-free, with no advanced mathematics involved. For small data, they can easily be implemented in Excel. If possible, they should also be easy to implement under a distributed architecture.
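
As an illustration of the "resumed on the fly after a computer crash" requirement, here is a minimal Python sketch of a batch job that checkpoints its state to disk. The function names, the JSON checkpoint layout, and the `stop_after` parameter (used only to simulate a crash) are assumptions made for this example, not part of any method described in this article.

```python
import json
import os


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_index": 0, "count": 0, "total": 0.0}


def save_checkpoint(path, state):
    """Write to a temporary file, then swap it in, so a crash in the
    middle of a write does not damage the previous checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def running_mean(values, path, every=100, stop_after=None):
    """Compute the mean of `values`, saving progress every `every` items.

    `stop_after` is a hypothetical knob that simulates a crash by
    returning early without saving the unsaved part of the work.
    """
    state = load_checkpoint(path)
    for i in range(state["next_index"], len(values)):
        if stop_after is not None and i >= stop_after:
            return None  # simulated crash
        state["total"] += values[i]
        state["count"] += 1
        state["next_index"] = i + 1
        if state["next_index"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"] / state["count"] if state["count"] else None
```

Calling `running_mean` again with the same checkpoint path after an interruption picks up from the last saved index, so only the work done since the last checkpoint is repeated.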

Deep data science: recommended articles

The picture is from the last article

Putting it together

For a robust regression that will work even if all the traditional model assumptions are violated, click here. It is simple (it can be implemented in Excel and it is model-free), efficient, and gives results closely comparable to standard regression when the model assumptions are not violated. If you need confidence intervals for the predicted values, you can use the simple model-free confidence intervals (CI) described here. These CIs are equivalent to those taught in statistics courses, but you don't need to know statistics to understand how they work or to use them. Finally, to measure goodness-of-fit, instead of R-squared or MSE, you can use this metric, which is more robust against outliers.
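
The linked methods are not reproduced here. As a rough illustration of the same three ingredients, the Python sketch below uses well-known stand-ins: the Theil-Sen estimator (median of pairwise slopes) for the robust regression, a percentile bootstrap for the model-free confidence interval, and the median absolute error as an outlier-resistant goodness-of-fit metric. These are substitutes chosen for this example, not the techniques from the articles above, and all function names are hypothetical.

```python
import random
import statistics
from itertools import combinations


def theil_sen(xs, ys):
    """Robust line fit: median of pairwise slopes, then median intercept."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
        if x2 != x1
    ]
    if not slopes:
        raise ValueError("need at least two distinct x values")
    slope = statistics.median(slopes)
    intercept = statistics.median(y - slope * x for x, y in zip(xs, ys))
    return slope, intercept


def percentile_interval(xs, ys, x0, level=0.90, n_boot=200, seed=0):
    """Percentile-bootstrap interval for the fitted value at x0."""
    if len(set(xs)) < 2:
        raise ValueError("need at least two distinct x values")
    rng = random.Random(seed)
    pairs = list(zip(xs, ys))
    preds = []
    while len(preds) < n_boot:
        sample = [rng.choice(pairs) for _ in pairs]
        if len({x for x, _ in sample}) < 2:
            continue  # degenerate resample: no slope can be computed
        bx, by = zip(*sample)
        slope, intercept = theil_sen(bx, by)
        preds.append(slope * x0 + intercept)
    preds.sort()
    lo = preds[int((1 - level) / 2 * (n_boot - 1))]
    hi = preds[int((1 + level) / 2 * (n_boot - 1))]
    return lo, hi


def median_abs_error(xs, ys, slope, intercept):
    """Median of absolute residuals; one extreme point barely moves it."""
    return statistics.median(
        abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)
    )


# Toy data: y = 2x + 1 with one gross outlier at the last point.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[9] = 100
slope, intercept = theil_sen(xs, ys)
print(slope, intercept)
print(percentile_interval(xs, ys, x0=9))
print(median_abs_error(xs, ys, slope, intercept))
```

Because every step relies on medians or resampling rather than distributional assumptions, a single corrupted observation has little effect on the fitted line, which is the kind of robustness the paragraph above describes.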
